Robert Klemme wrote:
> On 06.07.2009 00:13, Greg Willits wrote:
> 
>> Generally, I wouldn't read in the holes, but I have this one little step 
>> that does end up with some holes, and I know it. What I don't know is 
>> what to expect in those holes: null values, or garbage "A" characters 
>> left over from file X.
>> 
>> Logically I would expect garbage data, but the literal impact of 
>> paragraphs quoted earlier from the Unix book above indicates I should 
>> expect null values. I can't think of any tools I have that would enable 
>> me to test this.
> 
> I would not expect anything in those bytes for the simple reason that
> this reduces portability of your program.

Understood. In this case, I'm making a conscious decision to go with 
whatever is faster. I've already written the code so that it's easy to 
add the packing back in if it's ever needed.

We're working with large data sets whose aggregation takes a long time 
to run. Second only to the ease and clarity of the top-level DSL is the 
speed of the aggregation process itself, so we can afford to do more 
analysis.


>> Because I don't know, I've gone ahead and packed the holes with a known 
>> character. However, if I can avoid that I want to because it sucks up 
>> some time I'd like to avoid in large files, but it's not super critical.
>> 
>> At this point I'm more curious than anything. I appreciate the dialog.
> 
> [You] should probably make sure
> that your record format allows for easy separation of the data and slack
> area.  There are various well established practices, for example
> preceding the data area with a length indicator or terminating data with
> a special marker byte.

Yep, already done that. Where this 'holes' business comes in is that, to 
stay below the 4GB limit, the data has to be processed and the file 
written out in chunks. Each chunk may have a unique line length. So, we 
find the longest line of the chunk, and write records at that interval 
using seek. Each record terminates with a line feed.
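For what it's worth, the seek-and-write step looks roughly like this (a 
hypothetical sketch only -- write_chunk and its arguments are my 
illustration, not the real code; the bytes seek skips over are the 
holes in question):

```ruby
require "stringio"

# Hypothetical sketch of the chunk-writing step: find the chunk's
# longest record, then seek to that fixed interval for each record and
# write it with a trailing LF. Anything seek jumps past is left as a hole.
def write_chunk(io, records, chunk_offset)
  width = records.map(&:bytesize).max + 1   # +1 for the "\n" terminator
  records.each_with_index do |rec, i|
    io.seek(chunk_offset + i * width)
    io.write(rec + "\n")
  end
  width   # each chunk's width is needed later for the trailer line
end
```

On a StringIO the skipped bytes come back as NULs; on a real file what 
you read from the hole is the question under discussion.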

Since we don't know the standard length of each chunk until processing 
is done (and the file has already been started), a set of the lengths is 
added to the end of the file instead of the beginning.
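That last step amounts to something like this (again a hypothetical 
sketch, not the real code -- append_widths is my own name):

```ruby
require "stringio"

# Hypothetical sketch: once every chunk's record width is known, seek
# to EOF and append them as a single trailer line (the file is already
# under way, so the lengths can't go at the front).
def append_widths(io, widths)
  io.seek(0, IO::SEEK_END)
  io.write(widths.join(",") + "\n")
end
```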

When reading the data back, the fastest way to get the last line, which 
holds my line lengths, is to use tail. This returns a string running 
from the last record's EOL marker to the EOF. That "line" is likely to 
include the empty bytes of the last record in front of the actual data 
I want, because of how tail interprets "lines" between EOL markers. I 
need to strip those empty bytes from the start of the line before I get 
to the line-lengths data.
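In other words, something like this (parse_trailer is my name for it; 
I'm assuming NUL fill bytes here -- with a different fill character the 
regex would change accordingly):

```ruby
# Hypothetical sketch: tail -n 1 hands back everything after the last
# record's LF, i.e. the last record's fill bytes plus the trailer line.
# Strip the leading fill (assumed NUL here) before parsing the widths.
def parse_trailer(line)
  line.sub(/\A\x00+/, "").chomp.split(",").map(&:to_i)
end

# e.g. what tail might return: two fill bytes, then the widths line
parse_trailer("\x00\x0012,34\n")   # => [12, 34]
```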

Every other aspect of the file uses the common approach of lines with 
#00 between fields and #10 at the end of the data, followed by zero or 
more fill characters to make each row an equal length of bytes.
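A sketch of that row format (pack_row is my own illustration, and the 
fill character is an assumption):

```ruby
# Hypothetical sketch of the row format: NUL (#00) between fields,
# LF (#10) at the end of the data, then fill bytes out to a fixed width.
def pack_row(fields, width, fill = "\x00")
  row = fields.join("\x00") + "\n"
  raise ArgumentError, "row exceeds width" if row.bytesize > width
  row + fill * (width - row.bytesize)
end
```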

-- gw

-- 
Posted via http://www.ruby-forum.com/.