* Hal E. Fulton (hal9000 / hypermetrics.com) wrote: > I'm assuming (perhaps incorrectly) that a seek and a read of a > specified number of bytes is faster than a read of a single line of > unknown length. Possibly. On a file with very large lines (>512 bytes at least, depending on the filesystem) this is certainly correct. > A rather theoretical question, since the technique I'm proposing might > waste almost as much as it saves... > > Suppose on the first line (line 0) you store the length of line 1. > > Then on each line you store the length of the NEXT line in the same > way. > > You can hop through the file quickly in this way, no? On each > intermediate line, you need only a read a single number (perhaps three > digits fixed-length). You mean like: 010 ABCDEFGHIJ 013 ABCDEFGHIJKLM 002 AB 026 ABCDEFGHIJKLMNOPQRSTUVWXYZ The problem is the entries are tiny in this case; the read() is going to end up block-aligned to the filesystem, so chances are each time you read your few bytes to get the size of the next record, you're also getting the next record read in and discarded. All those seeks in this case will most likely just result in the access being much slower. However, if your file's more like: 0015234 ... 0042145 ... 0734523 ... 7453443 ... Then you're probably on a good track. This is actually how you skip through a tar file and various other archive formats, provided you can seek properly on it (difficult if you're using a stream compressor like gzip). You could build an off-file index if you were going to need a lot of entries; scan the entire file once, then spread that cost over a large number of lookups. This is what usually happens when you're processing a mailspool in a MUA; you read the entire thing, then just store the offsets and a bit of metadata about each part. If you were to do this for a fortune style system, you might daemonise a persistant lookup process :) -- Thomas 'Freaky' Hurst - freaky / aagh.net - http://www.aagh.net/ - If you can't measure output then you measure input.