Sometimes, you can see problems coming.
A vendor of ours provides security information monthly in a series of text files. Typically, those text files are tens of megabytes in size. Each month, portfolio managers view the files, clean them up, and import them into our system.
The loading system is homegrown rather than ETL-based: someone decided users should ‘upload’ the files into an ASP.NET MVC app, which then renders UIs from the data. :eyeroll:
Naturally, as our data requirements got larger the files got bigger. And bigger.
This last month, the system finally exploded. A 1 GB text file proved to be too much for our cute little “not-quite-an-ETL” tool. The portfolio managers needed a solution, and in order to quickly get ’em one, I present (with relevant bits redacted) a 30-minute “let’s play with F#” script to solve the problem.
#r "FSharp.Data.3.0.0\\lib\\net45\\Fsharp.Data.dll"
open System.IO
open FSharp.Data

// Row type inferred from a small tab-separated sample of the vendor file
type BigOleFile = CsvProvider<"C:\\working\\sample.txt", "\t">

// Writes the lines to disk; returns the number of lines written or the error message
let writeData filePath stringLines =
    try
        File.WriteAllLines (filePath, Array.ofList stringLines)
        Ok (List.length stringLines)
    with
    | e -> Error e.Message

let loadFile filePath =
    // Shouldn't these be results too?
    let folderPath = Path.GetDirectoryName filePath
    let fileName = Path.GetFileName filePath
    // Output files sit next to the input, prefixed with the group key
    let getNewFileName v =
        sprintf "%s%c%s_%s" folderPath Path.DirectorySeparatorChar v fileName
    let toUpper c =
        (char ((string c).ToUpper ()))
    let inRange c1 c2 v =
        v >= c1 && v <= c2
    let groupByFirstCharacterOfIssueColumnName (row:BigOleFile.Row) =
        match toUpper row.ColumnToGroupBy.[0] with
        | x when inRange 'A' 'B' x -> "A-B"
        | x when inRange 'C' 'D' x -> "C-D"
        | x when inRange 'E' 'F' x -> "E-F"
        | 'G' -> "G"
        | x when inRange 'H' 'K' x -> "H-K"
        | 'L' -> "L"
        | _ -> "M-Z"
    let stringifyLine (line:BigOleFile.Row) =
        sprintf "%s" "redacted" /// blah blah redacted
    let fileHeader = "" // redacted
    BigOleFile.Load(filePath).Rows
    |> Seq.groupBy groupByFirstCharacterOfIssueColumnName
    |> Seq.iter (fun (groupByKey, groupedRows) ->
        let newFileName = getNewFileName groupByKey
        let res = writeData newFileName (fileHeader :: ((groupedRows |> Seq.map stringifyLine) |> List.ofSeq))
        match res with
        | Ok m -> printfn "Wrote file '%s' with '%d' rows" newFileName m
        | Error x -> printfn "Error writing file '%s'. Error text: %s" newFileName x
    )
Nothing too fancy. The function simply takes a file path, creates a couple of functions to help it along, and then loads up the file and splits it up into distinct new files, one per letter range. The whole thing took 30 minutes to do. Yes, it’s wasteful (Seq.groupBy buffers every row in memory before the first file gets written), but when you’ve got panicked users, and all of a half hour to hit it, getting it ‘working’ first is the best way to go.
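If the files keep growing, one possible next step is a single streaming pass that keeps one open writer per letter range instead of grouping everything first. What follows is only a rough sketch, not what we actually ran: it swaps the CsvProvider for a plain StreamReader with naive tab splitting (so no quoted-field handling), and the names bucketFor, splitFile, and keyColumn are made up for illustration. It also assumes the first line is a header that should be copied into every output file, and that blank values can go into "M-Z" (the script above would throw on an empty string).

```fsharp
open System
open System.IO
open System.Collections.Generic

// Same letter ranges as the script above. Assumption: blank values land in "M-Z".
let bucketFor (value: string) =
    if String.IsNullOrEmpty value then "M-Z"
    else
        match Char.ToUpperInvariant value.[0] with
        | c when c >= 'A' && c <= 'B' -> "A-B"
        | c when c >= 'C' && c <= 'D' -> "C-D"
        | c when c >= 'E' && c <= 'F' -> "E-F"
        | 'G' -> "G"
        | c when c >= 'H' && c <= 'K' -> "H-K"
        | 'L' -> "L"
        | _ -> "M-Z"

// Reads the input once, line by line, and appends each row to the file for its bucket.
// keyColumn is a hypothetical zero-based index of the column to group by.
// Returns a map of bucket name to number of data rows written.
let splitFile (keyColumn: int) (filePath: string) =
    let folder = Path.GetDirectoryName filePath
    let name = Path.GetFileName filePath
    let writers = Dictionary<string, StreamWriter>()
    let counts = Dictionary<string, int>()
    try
        use reader = new StreamReader(filePath)
        let header = reader.ReadLine()
        let mutable line = reader.ReadLine()
        while not (isNull line) do
            let fields = line.Split([| '\t' |])
            let key =
                if keyColumn < fields.Length then bucketFor fields.[keyColumn]
                else "M-Z"
            // Open the bucket's file lazily and start it with the header line
            if not (writers.ContainsKey key) then
                let writer = new StreamWriter(Path.Combine(folder, key + "_" + name))
                writer.WriteLine(header)
                writers.[key] <- writer
                counts.[key] <- 0
            writers.[key].WriteLine(line)
            counts.[key] <- counts.[key] + 1
            line <- reader.ReadLine()
    finally
        // Flush and close every output file, even if reading failed part-way
        for w in writers.Values do w.Dispose()
    counts |> Seq.map (fun kv -> kv.Key, kv.Value) |> Map.ofSeq
```

Calling something like splitFile 1 on the vendor file (with 1 standing in for whatever the real column index is) would hold only one line in memory at a time, at the cost of keeping up to seven output files open at once.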