Greg Willits wrote:
> Yep, already done that. Where this 'holes' business comes in, is that to
> stay below the 4GB limit, the data has to be processed and the file
> written out in chunks. Each chunk may have a unique line length. So, we
> find the longest line of the chunk, and write records at that interval
> using seek. Each record terminates with a line feed.

To me, this approach smells. For example, it could have *really* bad disk
usage if one record in your file is much larger than all the others.

Is the reason for this fixed-space padding just so that you can jump
directly to record number N in the file by calculating its offset? If so,
it sounds to me like what you really want is cdb:

http://cr.yp.to/cdb.html

You emit key/value records of the form

+1,50:1->(50 byte record)
+1,70:2->(70 byte record)
+1,60:3->(60 byte record)
...
+2,500:10->(500 byte record)
... etc

then pipe them into cdbmake. The resulting file is built in a single pass,
with a hash index, allowing you to jump to the record with key 'x'
instantly.

There's a nice and simple ruby-cdb library available, which wraps djb's
cdb library.

Of course, with cdb you're not limited to integers as the keys that locate
the records, nor do they have to be in sequence. Any unique key string
will do - think of it as an on-disk frozen Hash. (The key doesn't actually
have to be unique, but then when you search for key K you would ask for
all records matching that key.)
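As a rough sketch, the emitting side could be as small as this. The keys
and payloads below are invented for illustration, and it assumes djb's
cdbmake binary is installed:

  # sketch: write records to stdout in cdbmake's input format
  records = {
    "1"  => "x" * 50,   # 50-byte record
    "2"  => "x" * 70,   # 70-byte record
    "10" => "x" * 500,  # 500-byte record - no padding cost in cdb
  }

  records.each do |key, data|
    # one cdbmake input line per record: +klen,dlen:key->data
    print "+#{key.bytesize},#{data.bytesize}:#{key}->#{data}\n"
  end

  # cdbmake expects a blank line after the last record
  print "\n"

Run it as something like `ruby emit.rb | cdbmake data.cdb data.tmp` and you
get the indexed file in one pass.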