What do you mean it has to be done in a half hour?

Sometimes, you can see problems coming.

A vendor of ours provides security information monthly in a series of text files. Typically, those text files are in the 10s of megabytes large. Those files get viewed cleaned up, and imported into our system by portfolio managers monthly.

The system to do the loading is homegrown, not ETL based (someone thought users should ‘upload’ the files) into an ASP.NET MVC app, and then render UIs with the data. :eyeroll:

Naturally, as our data requirements got larger the files got bigger. And bigger.

This last month, the system eventually exploded. A 1 GB text file proved to be too much for our cute little “not-quite-an-ETL” tool. The portfolio managers needed a solution, and in order to quickly get ’em one, I present (with relevant bits redacted) a 30 minute “lets play with F#” script to solve the problem.

	#r "FSharp.Data.3.0.0\\lib\\net45\\Fsharp.Data.dll"

	open System.IO;
	open FSharp.Data
	type BigOleFile = CsvProvider<"C:\\working\\sample.txt", "\t">

	let writeData filePath stringLines =
	try
	File.WriteAllLines (filePath , Array.ofList stringLines)
	Ok (List.length stringLines)
	with
	\| e -> Error e.Message

	let loadFile filePath =
	// Shouldn't these be results to?
	let folderPath = Path.GetDirectoryName filePath
	let fileName = Path.GetFileName filePath

	let getNewFileName v =
	sprintf "%s%c%s_%s" folderPath Path.DirectorySeparatorChar v fileName
	let toUpper c =
	(char ((string c).ToUpper () ))
	let inRange c1 c2 v =
	v >= c1 && v <= c2

	let groupByFirstCharacterOfIssueColumnName (row:BigOleFile.Row) =
	match toUpper row.ColumnToGroupBy.[0] with
	\| x when inRange 'A' 'B' x -> "A-B"
	\| x when inRange 'C' 'D' x -> "C-D"
	\| x when inRange 'E' 'F' x -> "E-F"
	\| 'G' -> "G"
	\| x when inRange 'H' 'K' x -> "H-K"
	\| 'L' -> "L"
	\| _ -> "M-Z"

	let stringifyLine (line:BigOleFile.Row) =
	sprintf "%s" "redacted" /// blah blah redacted

	let fileHeader = "" // redacted

	BigOleFile.Load(filePath).Rows
	\|> Seq.groupBy groupByFirstCharacterOfIssueColumnName
	\|> Seq.iter (fun (groupByKey, groupedRows) ->
	let newFileName = getNewFileName groupByKey
	let res = writeData newFileName (fileHeader :: ((groupedRows \|> Seq.map stringifyLine) \|> List.ofSeq))
	match res with
	\| Ok m -> printfn "Wrote file '%s' with '%d' rows" newFileName m
	\| Error x -> printfn "Error writing file '%s'. Error text: %s" newFileName x
	)

view raw

SplitTheBadBoyUp.fsx

hosted with ❤ by GitHub

Nothing too fancy. The function simply takes a file path, creates a couple of functions to help it along, and then loads up the file and splits it up into distinct new files. Whole thing took 30 minutes to do. Yes, the complexity is O(n-squared), but when you’ve got panicked users, and all of a half hour to hit it, getting it ‘working’ first is the best way to go.

What do you mean it has to be done in a half hour?

Published by Christopher Brown

Leave a comment Cancel reply

Tell folks about it!

Related

Published by Christopher Brown

Leave a comment Cancel reply