Excellent suggestion, thanks. There is a ubiquity to this solution that I certainly not have noticed otherwise. kudos. -----Original Message----- From: MikkelFJ [mailto:mikkelj-anti-spam / post1.dknet.dk] Sent: Saturday, July 21, 2001 5:41 PM To: ruby-talk ML; undisclosed-recipients: Subject: [ruby-talk:18276] Re: Effecient indexing algorithm "Joe Graham" <joe / aiobjects.com> wrote in message news:z1o67.2262$4v6.336942 / e420r-atl3.usenetserver.com... > Hello, > Does anyone have an efficient indexing algorithm or care to point me in the > right direction. I am looking to index several terabytes of data across Not sure how this relates to Ruby, but it is an interesting problem nonetheless, and Ruby would probably be good for prototyping a solution. The solution depends heavily on what you want to do. If you want to index usenet, then look at the Google whitepaper on indexing. BTW: Google are already indexing usenet. Interestingly, the Google algorithm is extremely straighforward: Each known word is a key into a relatively small map of known words (use a hash or whatever). Each entry is the head of a long diskbased list of text-pages all referring to the indexed word. The main index of known words easily fits into main memory. To identify pages that matches a set of words, just locate the index list of each word. Read the list from disk to memory, a block at a time, and strip out any list entries that do not occur in all lists (one for each keyword in the search). Since the lists are stored in sorted order (indexed by page id), this is a linear time operation. When a page is modified and needs to be reindexed, or a new page arrives, the sorted index lists needs to be updated: This can be done off-line as a batch job. If frequent updates of the pagelists are neccessary, diskbased B-trees is probably a good choice, but simpler methods will also work. B-trees trade space and complexity for fast diskbased updates. (B-trees are btw also nice for in-memory maps because they are cpu-cache friendly). Mikkel