What do you mean it has to be done in a half hour?

Sometimes, you can see problems coming.

A vendor of ours provides security information monthly in a series of text files. Typically, those text files are tens of megabytes in size. Every month, portfolio managers review the files, clean them up, and import them into our system.

The system that does the loading is homegrown, not ETL-based: someone thought users should ‘upload’ the files into an ASP.NET MVC app, which would then render UIs with the data. :eyeroll:

Naturally, as our data requirements got larger, the files got bigger. And bigger.

This last month, the system finally exploded. A 1 GB text file proved to be too much for our cute little “not-quite-an-ETL” tool. The portfolio managers needed a solution, and in order to get ’em one quickly, I present (with relevant bits redacted) a 30-minute “let’s play with F#” script to solve the problem.

#r "FSharp.Data.3.0.0\\lib\\net45\\Fsharp.Data.dll"
open System.IO
open FSharp.Data

// Typed view of the vendor file, inferred from a tab-separated sample.
type BigOleFile = CsvProvider<"C:\\working\\sample.txt", "\t">

// Writes the lines to disk; returns the number of lines written
// (header included) or the exception message.
let writeData (filePath: string) (stringLines: string list) =
    try
        File.WriteAllLines (filePath, Array.ofList stringLines)
        Ok (List.length stringLines)
    with
    | e -> Error e.Message

let loadFile (filePath: string) =
    // Shouldn't these be results too?
    let folderPath = Path.GetDirectoryName filePath
    let fileName = Path.GetFileName filePath
    // Output files are named "<bucket>_<original file name>" in the same folder.
    let getNewFileName v =
        sprintf "%s%c%s_%s" folderPath Path.DirectorySeparatorChar v fileName
    let toUpper (c: char) =
        System.Char.ToUpper c
    let inRange c1 c2 v =
        v >= c1 && v <= c2
    // Buckets a row by the first character of the grouping column.
    // Note: an empty value in that column would throw on the .[0] lookup.
    let groupByFirstCharacterOfIssueColumnName (row: BigOleFile.Row) =
        match toUpper row.ColumnToGroupBy.[0] with
        | x when inRange 'A' 'B' x -> "A-B"
        | x when inRange 'C' 'D' x -> "C-D"
        | x when inRange 'E' 'F' x -> "E-F"
        | 'G' -> "G"
        | x when inRange 'H' 'K' x -> "H-K"
        | 'L' -> "L"
        | _ -> "M-Z"
    let stringifyLine (line: BigOleFile.Row) =
        sprintf "%s" "redacted" // blah blah redacted
    let fileHeader = "" // redacted
    BigOleFile.Load(filePath).Rows
    |> Seq.groupBy groupByFirstCharacterOfIssueColumnName
    |> Seq.iter (fun (groupByKey, groupedRows) ->
        let newFileName = getNewFileName groupByKey
        let res = writeData newFileName (fileHeader :: ((groupedRows |> Seq.map stringifyLine) |> List.ofSeq))
        match res with
        | Ok m -> printfn "Wrote file '%s' with '%d' rows" newFileName m
        | Error x -> printfn "Error writing file '%s'. Error text: %s" newFileName x
    )


Nothing too fancy. The function takes a file path, defines a few helper functions, and then loads the file and splits it into separate files, bucketed by the first letter of one column. The whole thing took 30 minutes to do. Yes, it’s inefficient: Seq.groupBy has to pull every row into memory before the first output file gets written. But when you’ve got panicked users, and all of a half hour to fix it, getting it ‘working’ first is the best way to go.
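
If memory ever becomes the next thing to explode, one possible follow-up is to stream the file instead of grouping it. The sketch below is not what I ran; it is a rough idea under some assumptions: it skips the CSV type provider and splits each line on tabs by hand, it finds the grouping column by a zero-based index (`keyColumn`, a made-up parameter), it treats the first line as the header, and it sends empty keys to the catch-all bucket. `bucketFor` and `splitStreaming` are names invented for this example.

```fsharp
open System
open System.IO
open System.Collections.Generic

// Same bucketing rule as the script, but on a plain string key.
// Assumption: an empty or missing key goes to the catch-all "M-Z" bucket.
let bucketFor (key: string) =
    if String.IsNullOrEmpty key then "M-Z"
    else
        match Char.ToUpperInvariant key.[0] with
        | c when c >= 'A' && c <= 'B' -> "A-B"
        | c when c >= 'C' && c <= 'D' -> "C-D"
        | c when c >= 'E' && c <= 'F' -> "E-F"
        | 'G' -> "G"
        | c when c >= 'H' && c <= 'K' -> "H-K"
        | 'L' -> "L"
        | _ -> "M-Z"

// Reads the input one line at a time and appends each line to the
// writer for its bucket, so only the current line is held in memory.
// Returns (bucket, data row count) pairs, sorted by bucket name.
let splitStreaming (keyColumn: int) (filePath: string) =
    let folder = Path.GetDirectoryName filePath
    let name = Path.GetFileName filePath
    let writers = Dictionary<string, StreamWriter>()
    let counts = Dictionary<string, int>()
    try
        use reader = new StreamReader(filePath)
        // Assumption: the first line is a header copied into every output file.
        let header = reader.ReadLine()
        let mutable line = reader.ReadLine()
        while not (isNull line) do
            let fields = line.Split('\t')
            let key = if keyColumn < fields.Length then fields.[keyColumn] else ""
            let bucket = bucketFor key
            let writer =
                match writers.TryGetValue bucket with
                | true, w -> w
                | false, _ ->
                    // First row for this bucket: create the file and write the header.
                    let path = Path.Combine(folder, sprintf "%s_%s" bucket name)
                    let w = new StreamWriter(path)
                    w.WriteLine(header)
                    writers.[bucket] <- w
                    counts.[bucket] <- 0
                    w
            writer.WriteLine(line)
            counts.[bucket] <- counts.[bucket] + 1
            line <- reader.ReadLine()
        [ for kv in counts -> kv.Key, kv.Value ] |> List.sortBy fst
    finally
        // Close every output file, even if something failed part-way through.
        for w in writers.Values do
            w.Dispose()
```

Calling something like `splitStreaming 3 @"C:\working\bigfile.txt"` (both arguments hypothetical) would write the same `A-B_`, `C-D_`, ... files next to the input. The trade-off is that you lose the typed rows the CSV provider gives you, along with its handling of quoted fields, so this only fits files where a plain tab split is safe.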
