"Joe Graham" <joe / aiobjects.com> wrote in message
news:z1o67.2262$4v6.336942 / e420r-atl3.usenetserver.com...
> Hello,
> Does anyone have an efficient indexing algorithm or care to point me in the
> right direction.  I am looking to index several terabytes of data across

Not sure how this relates to Ruby, but it is an interesting problem
nonetheless, and Ruby would probably be good for prototyping a solution.

The solution depends heavily on what you want to do. If you want to index
Usenet, then look at the Google whitepaper on indexing. BTW: Google is
already indexing Usenet. Interestingly, the Google algorithm is extremely
straightforward:
Each known word is a key into a relatively small map of known words (use a
hash or whatever). Each entry is the head of a long disk-based list of
text pages, all referring to the indexed word. The main index of known
words easily fits into main memory.
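
In Ruby, a tiny in-memory sketch of that structure might look like the
following (the names are made up, and the real posting lists would of
course live on disk, sorted by page id, with only the word map in memory):

# index: word => sorted array of ids of the pages containing that word
def build_index(pages)                  # pages: { page_id => text }
  index = Hash.new { |h, k| h[k] = [] }
  pages.each do |page_id, text|
    text.downcase.scan(/\w+/).uniq.each { |word| index[word] << page_id }
  end
  index.each_value(&:sort!)             # keep each posting list sorted
  index
end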

To identify pages that match a set of words, just locate the index list of
each word. Read the lists from disk to memory, a block at a time, and strip
out any list entries that do not occur in all lists (one list for each
keyword in the search). Since the lists are stored in sorted order (indexed
by page id), this is a linear-time operation.
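
Roughly like this, assuming the lists have already been read into memory as
sorted arrays (in practice you would stream them block by block):

# keep only the page ids that occur in every posting list; since the
# lists are sorted this is a plain linear merge
def intersect(lists)
  return [] if lists.empty?
  lists = lists.sort_by(&:length)       # start from the shortest list
  lists.reduce do |result, list|
    out, i, j = [], 0, 0
    while i < result.length && j < list.length
      if    result[i] < list[j] then i += 1
      elsif result[i] > list[j] then j += 1
      else  out << result[i]; i += 1; j += 1
      end
    end
    out
  end
end

def search(index, words)
  intersect(words.map { |w| index.fetch(w.downcase, []) })
end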
When a page is modified and needs to be reindexed, or a new page arrives,
the sorted index lists need to be updated. This can be done off-line as a
batch job. If frequent updates of the page lists are necessary, disk-based
B-trees are probably a good choice, but simpler methods will also work.
B-trees trade space and complexity for fast disk-based updates. (B-trees
are, BTW, also nice for in-memory maps because they are CPU-cache friendly.)
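
A simple batch update (no B-trees) could be as little as merging the new
postings into each word's sorted list. This sketch ignores words that
disappear from a re-indexed page, and the names are illustrative only:

def apply_batch(index, changed_pages)   # changed_pages: { page_id => text }
  changed_pages.each do |page_id, text|
    text.downcase.scan(/\w+/).uniq.each do |word|
      list = (index[word] ||= [])
      list.delete(page_id)              # drop a stale posting, if any
      pos = list.bsearch_index { |id| id >= page_id } || list.length
      list.insert(pos, page_id)         # keep the list sorted by page id
    end
  end
end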

Mikkel