* Hal E. Fulton (hal9000 / hypermetrics.com) wrote:

> I'm assuming (perhaps incorrectly) that a seek and a read of a
> specified number of bytes is faster than a read of a single line of
> unknown length.

Possibly.  On a file with very large lines (>512 bytes at least,
depending on the filesystem) this is certainly correct.

> A rather theoretical question, since the technique I'm proposing might
> waste almost as much as it saves...
>
> Suppose on the first line (line 0) you store the length of line 1.
>
> Then on each line you store the length of the NEXT line in the same
> way.
>
> You can hop through the file quickly in this way, no?  On each
> intermediate line, you need only a read a single number (perhaps three
> digits fixed-length).

You mean like:

010
ABCDEFGHIJ
013
ABCDEFGHIJKLM
002
AB
026
ABCDEFGHIJKLMNOPQRSTUVWXYZ

The problem is the entries are tiny in this case; the read() is going to
end up block-aligned to the filesystem, so chances are each time you
read your few bytes to get the size of the next record, you're also
getting the next record read in and discarded.

All those seeks in this case will most likely just result in the access
being much slower.

However, if your file's more like:

0015234
...
0042145
...
0734523
...
7453443
...

Then you're probably on a good track.  This is actually how you skip
through a tar file and various other archive formats, provided you can
seek properly on it (difficult if you're using a stream compressor like
gzip).

You could build an off-file index if you were going to need a lot of
entries; scan the entire file once, then spread that cost over a large
number of lookups.  This is what usually happens when you're processing
a mailspool in a MUA; you read the entire thing, then just store the
offsets and a bit of metadata about each part.  If you were to do this
for a fortune style system, you might daemonise a persistant lookup
process :)

-- 
Thomas 'Freaky' Hurst  -  freaky / aagh.net  -  http://www.aagh.net/
-
If you can't measure output then you measure input.