Greg Willits wrote:
> Eeek. Opened a can of worms!
...
> -- by the time we get done normalizing a particular raw source, it can 
> hit the 4GB memory limit any one ruby fork has available to play with 
> (many forks run in parallel on multiple cores)

Ah, so this is a 4GB process limit, not a file limit.

> -- imagine 2,000,000 raw records from one file which get processed in 
> 200,000 record chunks, but output back to another single file.
> 
> -- as I step through each chunk of 200,000 records, I can get the 
> longest length of that 200,000, and I can store that, but I can't know 
> what the longest length is for the next 200,000 that I haven't loaded 
> yet.

Standard "external processing"; this is fine.
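
One way to square "I can't know the longest length for chunks I haven't
loaded yet" is simply a second pass. A minimal sketch, assuming
newline-delimited input, invented file names, and a hypothetical
normalise(line) standing in for your real per-record work (assumed to
return the cleaned record without its newline):

  # Pass 1: stream the file once just to find the longest normalised
  # record -- only one record is ever held in memory.
  max_len = 0
  File.foreach("raw.txt") do |line|
    len = normalise(line).length
    max_len = len if len > max_len
  end

  # Pass 2: stream it again, padding every record to max_len so each
  # output record occupies exactly max_len + 1 bytes (data + newline).
  File.open("normalised.dat", "w") do |out|
    File.foreach("raw.txt") do |line|
      out.puts normalise(line).ljust(max_len)
    end
  end

Your chunking can stay exactly as it is; the only thing pass 1 has to
carry from one chunk to the next is a single integer.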

> -- it's a couple of very simple calculations to convert any "index" 
> position into the exact byte seek position to find a specific record.

OK, so you want your output file to have a property which CSV files 
don't, namely that you can jump to line N with a direct seek.
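
Which is trivial once the records are padded to a fixed width. A sketch,
assuming the layout above (each record occupies max_len + 1 bytes, data
plus trailing newline) and a hypothetical helper name:

  # Fetch record `index` (0-based) from a fixed-width file, given the
  # record width in bytes.
  def read_record(path, index, width)
    File.open(path) do |f|
      f.seek(index * width)      # direct jump, no scanning
      f.read(width).rstrip       # drop the space padding and newline
    end
  end

  # e.g. record 1_500_000 of the file written above:
  #   read_record("normalised.dat", 1_500_000, max_len + 1)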

> -- I will look into cdb

Here's another worm for your can: CouchDB :-)

Output each normalised CSV row as a JSON document, and CouchDB will build 
arbitrary indexes over those documents for you, defined as map functions 
in JavaScript (or Ruby, or ...).
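
Loading the rows looks roughly like this, assuming a stock CouchDB on
localhost:5984, a database I've called "records", invented field names,
and the json gem; the _bulk_docs endpoint takes the rows in batches
rather than one POST per record:

  require 'net/http'
  require 'json'

  Net::HTTP.start("localhost", 5984) do |couch|
    # Create the database (a 412 just means it already exists).
    couch.send_request("PUT", "/records")

    # A batch of normalised rows, one JSON document each:
    docs = [
      { "name" => "Smith", "amount" => 42, "source_line" => 1 },
      { "name" => "Jones", "amount" => 17, "source_line" => 2 }
    ]
    couch.send_request("POST", "/records/_bulk_docs",
                       { "docs" => docs }.to_json,
                       "Content-Type" => "application/json")
  end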

Unfortunately, building the indexes isn't as fast as it should be yet, 
and I note you have a lot of hand-optimised code. But if you ever want 
to be able to search your big files by the content of a field, rather 
than just jump to the Nth record, this might be your saviour.
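
For the search-by-field case, the index is just a design document
holding a JavaScript map function, and the query is a plain GET. Again a
sketch with invented names ("lookups", a "by_name" view keyed on a
"name" field):

  require 'net/http'
  require 'json'
  require 'cgi'

  Net::HTTP.start("localhost", 5984) do |couch|
    # Define the view; CouchDB builds the index from the map function.
    view = { "views" => { "by_name" =>
             { "map" => "function(doc) { emit(doc.name, null); }" } } }
    couch.send_request("PUT", "/records/_design/lookups", view.to_json,
                       "Content-Type" => "application/json")

    # Find every record whose "name" field is "Smith" (keys are JSON):
    res = couch.send_request("GET",
            "/records/_design/lookups/_view/by_name" +
            "?key=#{CGI.escape('"Smith"')}&include_docs=true")
    JSON.parse(res.body)["rows"]
  end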

Maybe MongoDB would be even faster, but I've not played with it yet.

Regards,

Brian.