2009/7/6 Greg Willits <lists / gregwillits.ws>:
> Robert Klemme wrote:

> We're working with large data sets for aggregation which takes a long
> time to run, and second only to the ease and clarity of the top level
> DSL, is the speed of the aggregation process itself so we can afford to
> do more analysis.

Did you actually measure significant differences in time or are you
just assuming there is a significant impact because you write less and
have to do less processing?

>>> Because I don't know, I've gone ahead and packed the holes with a known
>>> character. However, I'd like to avoid that if I can, because it sucks
>>> up time on large files, but it's not super critical.
>>>
>>> At this point I'm more curious than anything. I appreciate the dialog.
>>
>> should probably make sure
>> that your record format allows for easy separation of the data and slack
>> area.  There are various well-established practices, for example
>> preceding the data area with a length indicator or terminating data with
>> a special marker byte.
>
> Yep, already done that. Where this 'holes' business comes in, is that to
> stay below the 4GB limit, the data has to be processed and the file
> written out in chunks. Each chunk may have a unique line length. So, we
> find the longest line of the chunk, and write records at that interval
> using seek. Each record terminates with a line feed.

Errr, I am not sure I fully understand your approach.  What you write
sounds as if you end up with a file containing multiple sections, each
of which has lines of identical length.  So a file with two sections
could look like this:

aaN0000
aaaN000
aN00000
aaaaaaN
aaaaN00
bN00
bbbN
bbN0
bbN0

Basically you are combining two approaches in one file: fixed-length
records and variable-length records with a termination marker.  That
sounds odd to me.  If file size matters, then I do not understand why
you do not just write the file out like a regular text file, i.e. only
use the line termination approach.
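Something like this minimal sketch is what I have in mind (untested;
it assumes records contain no embedded newlines, and the offset index
appended as the last line is my own assumption, modelled on your
lengths trailer, in case you still need to fetch single records):

  # Write variable length records as plain lines and append an index
  # of byte offsets as the final line, so single records can still be
  # fetched via seek without any padding.
  def write_records(path, records)
    offsets = []
    File.open(path, "wb") do |io|
      records.each do |rec|
        offsets << io.pos
        io.puts(rec)
      end
      io.puts(offsets.join(","))
    end
    offsets
  end

  # Fetch record i using the offsets collected above.
  def fetch_record(path, offsets, i)
    File.open(path, "rb") do |io|
      io.seek(offsets[i])
      io.gets
    end
  end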

> Since we don't know the standard length of each chunk until processing
> is done (and the file has already been started), a set of the lengths is
> added to the end of the file instead of the beginning.
>
> When reading data, the fastest way to get the last line, which has my
> line lengths, is to use tail.

Why don't you open the file, seek to N bytes before the end and read
them?  You do not need tail for this, and you keep all the file
handling inside your Ruby program.
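Along those lines - a minimal sketch, untested, assuming 4096 bytes is
a safe upper bound for the size of your lengths line (adjust to taste):

  # Read the last line of a large file without spawning tail: seek
  # close to the end, read the remainder and take the last line.
  def read_trailer(path, max = 4096)
    File.open(path, "rb") do |io|
      io.seek(-[max, io.stat.size].min, IO::SEEK_END)
      io.read.split("\n").last
    end
  end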

> Every other aspect of the file uses the common approach of lines with
> #00 between fields and #10 at the end of the data, followed by zero or
> more fill characters to make each row an equal length of bytes.

It seems either I am missing something or you are doing something
weird for a reason I do not understand.  Can you shed some more light
on the nature of the processing and why you follow this approach?
That would nicely round off the discussion.

Kind regards

robert


-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/