Eeek. Opened a can of worms!

It's 5:30 am where I am (I'm "still up" and not "up early"), so I'll hit 
the highlights, and detail more later if for some reason there's an 
interest.

-- application: data aggregation

-- I get data from county services and school districts. They have no 
way to correlate and aggregate their data. That's what I am doing. I 
take kids who are flagged in county services for whatever reason and 
match them up with school records, health records, judicial records 
where applicable, etc.

-- I get dozens of CSV files on various subjects. Every school 
district's CSV file for any subject (attendance, grades, etc) has 
similar, but not identical, content. Ditto for demographics from 
various agencies.

-- before I can even attempt aggregation, I need to normalize all this 
data so it can be indexed and analyzed, and ensure the data is cleaned 
and standardized for display.

-- there's a top-layer DSL where I describe data sources and how to 
transform any one particular raw data file into a normalized data file. 
There's another set of descriptions to allow any one field to come from 
a number of possible sources. So, something like birthcity might come 
from data source X, but if that's not available, check data source Y, 
etc.
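
-- for a rough idea, a precedence description boils down to something 
like this (field and source names here are made up, and the real DSL is 
richer than a plain hash):

  FIELD_SOURCES = {
    :birthcity => [:county_intake, :school_enrollment, :health_record],
    :last_name => [:school_enrollment, :county_intake]
  }

  # return the first non-blank value, walking the sources in order
  def resolve_field(field, records_by_source)
    FIELD_SOURCES[field].each do |source|
      record = records_by_source[source]
      value  = record && record[field]
      return value unless value.nil? || value.to_s.strip.empty?
    end
    nil
  end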

-- this process has been abstracted into a data aggregation core to do 
the normalization, the indexing, and other tasks, with an 
application-specific layer to handle the definition of transformations, 
indexes that are needed, order of precedence, and the stitching of data 
from various sources into records.

-- So, this particular step I've been talking about is where a raw CSV 
file undergoes normalization: the fields of each record are reorganized 
into a common structure for that data topic, and each field undergoes 
some scrubbing (character case, packing phone numbers, normalizing date 
formats, translating codes into strings, etc).
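
-- the scrubbing itself is nothing exotic; per field it's roughly along 
these lines (simplified stand-ins, not the production code):

  require 'date'

  CODE_STRINGS = { 'F' => 'Female', 'M' => 'Male' }  # made-up code table

  def scrub_name(s)
    s.to_s.strip.split.map(&:capitalize).join(' ')   # character case
  end

  def scrub_phone(s)
    s.to_s.gsub(/\D/, '')                            # pack to digits only
  end

  def scrub_date(s)
    Date.parse(s.to_s).strftime('%Y-%m-%d') rescue nil  # one date format
  end

  def scrub_code(s)
    CODE_STRINGS.fetch(s.to_s.strip, s.to_s)         # codes into strings
  end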

-- raw data files range from a handful of columns to a couple dozen 
columns. From a few hundred rows to a couple million rows.

-- data is an array of hashes

-- by the time we get done normalizing a particular raw source, it can 
hit the 4GB memory limit any one ruby fork has available to play with 
(many forks run in parallel on multiple cores)
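
-- the parallelism is plain old forking, something in the spirit of 
this (the file layout and normalize_file are made up; the real 
scheduling is more involved):

  require 'etc'

  raw_files   = Dir.glob('raw/*.csv')    # made-up layout
  max_workers = Etc.nprocessors

  raw_files.each_slice(max_workers) do |batch|
    batch.each do |path|
      fork do
        # each child is its own process, so each one has to stay
        # under the ~4GB ceiling on its own
        normalize_file(path)             # stand-in for the per-file work
      end
    end
    Process.waitall                      # finish the batch before the next
  end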

-- while most CSV files work out just fine when read into RAM, 
transformed 100% in RAM, and then written out in a single step, many 
files do not.

-- so we have updated the process to load X records from the raw file, 
transform X records in memory, then write X records to disk, and loop 
until all records are done.
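
-- the shape of that chunked pass is basically this (transform_row and 
write_chunk being stand-ins for the real transformation and output 
code):

  require 'csv'

  CHUNK_SIZE = 200_000

  rows_in = CSV.foreach('raw.csv', :headers => true)
  rows_in.each_slice(CHUNK_SIZE) do |chunk|
    rows = chunk.map { |row| transform_row(row.to_hash) }  # array of hashes
    write_chunk(rows)                    # flush to disk, let the chunk go
  end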

-- and we have to deal with indexes, duplicates, and other issues in 
there as well

-- imagine 2,000,000 raw records from one file which get processed in 
200,000 record chunks, but output back to another single file.

-- as I step through each chunk of 200,000 records, I can get the 
longest record length of that 200,000, and I can store that, but I 
can't know what the longest length is for the next 200,000 that I 
haven't loaded yet.

-- having processed and written the first 200,000 results to disk, and 
then determined the length of the second 200,000, I'm not going to go 
back and change the first 200,000 just to make them all the same. 
There's no value in that at all.

-- So, when I get done with each 200,000 chunk, I now have a series of 
integers which tell me the record length in each chunk of 200,000 
rows.

-- the file has already been written, so again, I'm not going to go 
back and move everything to insert this data at the top (which is where 
I would put it if indeed every record were the same length)

-- so, I put this data at the end.
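
-- the write side boils down to something like this (simplified): pad 
each chunk's records out to that chunk's longest length, remember that 
length, and tack the list of per-chunk lengths on as the very last line 
(each_transformed_chunk and serialize are stand-ins):

  chunk_widths = []

  File.open('normalized.dat', 'w') do |out|
    each_transformed_chunk do |rows|            # the chunked loop above
      lines = rows.map { |r| serialize(r) }
      width = lines.map(&:bytesize).max + 1     # +1 for the newline
      chunk_widths << width
      lines.each do |line|
        out.write(line + (' ' * (width - 1 - line.bytesize)) + "\n")
      end
    end
    out.write(chunk_widths.join(',') + "\n")    # footer: one width per chunk
  end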

-- BTW, the row lengths are similar enough that the disk efficiency is 
not an issue.

-- I read this last line using tail, and I strip off the leading empty 
bytes (if any) as I described earlier
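
-- in Ruby terms, that read is roughly this (4096 is just an arbitrary 
"big enough" window; the delete/strip is the byte cleanup):

  last_line = File.open('normalized.dat', 'rb') do |f|
    f.seek([f.size - 4096, 0].max)
    f.read.split("\n").last
  end
  chunk_widths = last_line.delete("\0").strip.split(',').map(&:to_i)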

-- it's a couple of very simple calculations to convert any "index" 
position into the exact byte seek position to find a specific record.
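
-- roughly, with the per-chunk widths from that last line:

  # CHUNK_SIZE is the 200,000 from the chunked pass above
  def byte_offset(index, chunk_widths)
    chunk     = index / CHUNK_SIZE
    preceding = chunk_widths[0...chunk].inject(0, :+) * CHUNK_SIZE
    preceding + (index % CHUNK_SIZE) * chunk_widths[chunk]
  end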

-- from this point on, the records are read as random access from disk 
because otherwise I would need oodles of GB of data in RAM all at once 
during the aggregation process.
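
-- so a single record lookup during aggregation is just a seek and a 
short read, something like:

  def read_record(dat, index, chunk_widths)   # dat is the open data file
    dat.seek(byte_offset(index, chunk_widths))
    dat.read(chunk_widths[index / CHUNK_SIZE]).rstrip  # drop the padding
  end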

-- is doing 200,000 or even 500,000 at a time in chunks really any 
faster than doing them one at a time? That I actually don't know yet; I 
am just now finishing all the code that this "chunking" touches and 
making sure I get the same results I used to. The size of the chunks 
isn't as important for speed as it is for memory management -- making 
sure I stay within 4GB.

-- as for the speed issues, we've done a lot of profiling, and even 
wrote a mini compiler to read our file transformation DSL and output a 
stream of inline variable declarations and commands which gets included 
as a module on the fly for each data source. That trick saved us from 
parsing the DSL for each data row and literally shaved hours off the 
total processing time. We attacked many other levels of optimization 
while working to keep the code as readable as possible, because it's a 
complicated layering of abstractions and processes.
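
-- the general shape of that trick, much simplified, is to turn the 
rules for one data source into straight-line Ruby once and mix it in 
(the rule format and scrub_* helpers here are the made-up ones from the 
sketches above, not our actual DSL):

  def compile_transforms(rules)
    body = rules.map { |f, expr| "out[:#{f}] = #{expr}" }.join("\n")
    src  = <<-RUBY
      def transform_row(row)
        out = {}
        #{body}
        out
      end
    RUBY
    mod = Module.new
    mod.module_eval(src)
    mod
  end

  rules = {
    :phone => "scrub_phone(row['HomePhone'])",
    :dob   => "scrub_date(row['DOB'])"
  }
  transformer = Object.new.extend(compile_transforms(rules))
  # transformer.transform_row(a_row_hash)  # no DSL parsing per row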

-- I will look into cdb

-- gw



