Greg Willits wrote:
> Yep, already done that. Where this 'holes' business comes in is that, 
> to stay below the 4GB limit, the data has to be processed and the file 
> written out in chunks. Each chunk may have its own maximum line length, 
> so we find the longest line of the chunk and write records at that 
> interval using seek. Each record terminates with a line feed.

To me, this approach smells. For example, it could have *really* bad 
disk usage if one record is much larger than all the others, since 
every other record in that chunk then gets padded out to the same length.

Is the reason for this fixed-space padding just so that you can jump 
directly to record number N in the file, by calculating its offset?
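
If that's all the padding is buying you, the read side presumably 
boils down to something like this sketch (record_width, n and the 
filename are made up for illustration, and I'm assuming one fixed 
width for the whole file rather than a width per chunk):

  # Jump straight to record n by offset arithmetic: every record is
  # padded to record_width bytes, so record n starts at n * record_width.
  record_width = 81                 # longest line + 1 for the terminating "\n"
  n            = 12_345             # record number to fetch

  File.open("records.dat", "rb") do |f|
    f.seek(n * record_width)
    raw    = f.read(record_width).to_s
    record = raw[0, raw.index("\n") || raw.length]   # keep what's before the "\n"
    puts record
  end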

If so, it sounds to me like what you really want is cdb:
http://cr.yp.to/cdb.html

You emit key/value records of the form

+1,50:1->(50 byte record)
+1,70:2->(70 byte record)
+1,60:3->(60 byte record)
...
+2,500:10->(500 byte record)
... etc

then pipe it into cdbmake. The resulting file is built in a single 
pass and includes a hash index, allowing you to jump to the record 
with key 'x' instantly.
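
Generating that input from Ruby is trivial. Here's a minimal sketch 
(the records hash and the emit.rb filename are just placeholders for 
wherever your data really comes from), which you'd pipe into cdbmake 
as e.g. ruby emit.rb | cdbmake data.cdb data.cdb.tmp:

  # Emit one "+klen,dlen:key->data" line per record on stdout,
  # plus the blank line cdbmake expects at the very end of its input.
  records = { "1" => "first record", "2" => "second record" }

  records.each do |key, value|
    print "+#{key.bytesize},#{value.bytesize}:#{key}->#{value}\n"
  end
  print "\n"    # terminating newline for cdbmake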

There's a nice and simple ruby-cdb library available, which wraps djb's 
cdb library.

Of course, with cdb you're not limited to integers as the key to locate 
the records, nor do they have to be in sequence. Any unique key string 
will do - consider it like an on-disk frozen Hash. (The key doesn't 
actually have to be unique, but then when you search for key K you ask 
for all the records matching that key.)
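
Reading it back is then basically hash-like. I don't have the ruby-cdb 
docs in front of me, so treat the constant and method names below as 
assumptions rather than the real API, but the shape is roughly:

  # Assumed-API sketch only: CDB.open and #[] are guesses at the
  # ruby-cdb interface, not checked against its documentation.
  require 'cdb'

  CDB.open('data.cdb') do |db|
    puts db['2']    # the data stored under key "2"
  end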