Robert Klemme wrote:
> On 06.07.2009 00:13, Greg Willits wrote:
>
>> Generally, I wouldn't read in the holes, but I have this one little
>> step that does end up with some holes, and I know it. What I don't
>> know is what to expect in those holes: null values, or garbage 'A'
>> characters left over from file X.
>>
>> Logically I would expect garbage data, but the literal reading of the
>> paragraphs quoted earlier from the Unix book indicates I should
>> expect null values. I can't think of any tools I have that would
>> enable me to test this.
>
> I would not expect anything in those bytes for the simple reason that
> this reduces portability of your program.

Understood. In this case, I'm making a conscious decision to go with
whatever is faster. I've already written the code so that it is easy to
add the packing back in if it's ever needed. We're working with large
data sets for aggregation, which takes a long time to run, and second
only to the ease and clarity of the top-level DSL is the speed of the
aggregation process itself, so we can afford to do more analysis.

>> Because I don't know, I've gone ahead and packed the holes with a
>> known character. However, I'd like to avoid that if I can, because it
>> sucks up some time on large files, but it's not super critical.
>>
>> At this point I'm more curious than anything. I appreciate the
>> dialog.
>
> You should probably make sure that your record format allows for easy
> separation of the data and slack area. There are various well
> established practices, for example preceding the data area with a
> length indicator or terminating data with a special marker byte.

Yep, already done that.

Where this 'holes' business comes in is that, to stay below the 4GB
limit, the data has to be processed and the file written out in chunks.
Each chunk may have a unique line length. So, we find the longest line
of the chunk and write records at that interval using seek. Each record
terminates with a line feed.

Since we don't know the standard length of each chunk until processing
is done (and the file has already been started), a set of the lengths
is added to the end of the file instead of the beginning.

When reading data, the fastest way to get the last line, which holds my
line lengths, is to use tail. This returns a string running from the
last record's EOL marker to the EOF. This "line" has the potential
(likelihood) to include the empty bytes of the last record in front of
the actual data I want, because of how tail interprets "lines" between
EOL markers. I need to strip those empty bytes from the start of the
line before I get to the line lengths.

Every other aspect of the file uses the common approach of lines with
#00 between fields and #10 at the end of the data, followed by zero or
more fill characters to make each row an equal number of bytes.

-- gw
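
P.S. It occurred to me that a few lines of Ruby can test what actually
lands in the holes: seek past EOF, write, and read the gap back. A
minimal sketch (the filename is just an example); on a POSIX filesystem
the gap reads back as NUL bytes:

  File.open("holes_test", "w+b") do |f|
    f.write("start")
    f.seek(100)            # leaves a 95-byte hole after "start"
    f.write("end")
    f.rewind
    p f.read(100)[5, 10]   # => ten NUL bytes on a POSIX filesystem
  end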
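
For reference, the write step looks roughly like this. It's a
simplified sketch, not the production code; the method name, the FILL
byte, and the row layout are stand-ins:

  FILL = "\xFF"   # assumed fill byte, only used when packing the slack

  def write_chunk(file, rows, offset)
    # The longest row in the chunk sets the record interval:
    # fields joined with #00, a #10 terminator, then slack.
    width = rows.map { |fields| fields.join("\0").bytesize + 1 }.max

    rows.each_with_index do |fields, i|
      record = fields.join("\0") + "\n"
      file.seek(offset + i * width)
      file.write(record)
      # Optional packing of the slack; skipping this write is
      # what leaves the holes.
      file.write(FILL * (width - record.bytesize))
    end

    offset + rows.size * width   # where the next chunk begins
  end

After the final chunk, the set of per-chunk widths goes on the end of
the file as one more #00-delimited, #10-terminated line:

  file.seek(0, IO::SEEK_END)
  file.write(widths.join("\0") + "\n")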
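
The read side is the tail call plus the strip. Again just a sketch; it
assumes unpacked holes read back as NULs (or the 0xFF fill byte from
the sketch above if they were packed), and that `path` holds the
file name:

  # tail returns everything after the final #10 of the data area,
  # which can include the last record's slack ahead of the widths.
  last_line = `tail -n 1 #{path}`

  # Drop the leading slack, then split the widths on #00.
  widths = last_line.sub(/\A[\0\xFF]+/n, '').chomp.split("\0").map { |w| w.to_i }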