* Thomas Hurst (tom.hurst / clara.net) wrote:

> It's much cheaper, involving generating a single random number, a
> single seek, a single stat, and something like (average_tagline_length
> * 1.5) + 1 readline()s.  Almost certainly the better approach on a
> 300,000 line tagline file than doing on average 150k readline()s and
> rand()s :)

Obviously, this has the disadvantage of being biased towards longer
entries, since a random offset is more likely to land inside them, but
for most cases and without an index, it's probably about optimal.
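
For illustration, here's a rough Python sketch of the seek-based pick.
It assumes one tagline per line and UTF-8 text; the function name and
the wrap-around to the first line when the seek lands in the last one
are my own choices here, not anything fixed by the description above.

```python
import os
import random

def random_line_by_seek(path, rng=random):
    """Pick a line by seeking to a random byte offset.

    Lines that follow long lines are picked more often, so the
    choice is not uniform across entries.
    """
    size = os.stat(path).st_size         # the single stat
    if size == 0:
        raise ValueError("empty file")
    with open(path, "rb") as f:
        f.seek(rng.randrange(size))      # one random number, one seek
        f.readline()                     # discard the partial line we landed in
        line = f.readline()              # take the next complete line
        if not line:                     # landed in the last line: wrap around
            f.seek(0)
            line = f.readline()
    return line.rstrip(b"\n").decode("utf-8")
```

Calling random_line_by_seek("taglines.txt") (a hypothetical filename)
costs a couple of readline()s no matter how big the file is.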

An alternative is to scan the file once and store a list of all the
entry offsets in fixed-width fields, a la:

00001   000000
00002   000224
00003   000568
.....   ......
30000   056434

Since every record in that index is the same size, picking a random one
takes a single rand() and a seek() to (record number * record size);
then you just use the offset you've read to seek() directly to the
random entry in the tagline file.

Even without the index cached, this should be at least comparable to
reading each entry and generating a random number per line, and once the
index is generated it'll be much faster.
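
A rough Python sketch of the index approach, again assuming one tagline
per line.  The function names and field widths are made up for the
example (wider than the ones shown above so big files fit), and unlike
the seek-only version this picks every entry with equal probability.

```python
import os
import random

NUM_WIDTH = 8     # digits for the entry number (assumed width)
OFF_WIDTH = 12    # digits for the byte offset (assumed width)
RECORD_LEN = NUM_WIDTH + 3 + OFF_WIDTH + 1   # two fields, 3 spaces, newline

def build_index(data_path, index_path):
    """Scan the data file once, writing one fixed-width record per line."""
    count = 0
    offset = 0
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        for line in data:
            count += 1
            record = f"{count:0{NUM_WIDTH}d}   {offset:0{OFF_WIDTH}d}\n"
            index.write(record.encode("ascii"))
            offset += len(line)          # byte offset of the next line
    return count

def random_line_by_index(data_path, index_path, rng=random):
    """Pick a uniformly random entry via the fixed-width index."""
    entries = os.stat(index_path).st_size // RECORD_LEN
    if entries == 0:
        raise ValueError("empty index")
    with open(index_path, "rb") as index:
        index.seek(rng.randrange(entries) * RECORD_LEN)
        record = index.read(RECORD_LEN)
    offset = int(record.split()[1])      # second field is the byte offset
    with open(data_path, "rb") as data:
        data.seek(offset)
        return data.readline().rstrip(b"\n").decode("utf-8")
```

build_index() only needs to be rerun when the tagline file changes;
after that each pick is one stat, two seeks and two short reads.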

-- 
Thomas 'Freaky' Hurst  -  freaky / aagh.net  -  http://www.aagh.net/
-
It has long been known that one horse can run faster
than another -- but which one?  Differences are crucial.
		-- Lazarus Long