MikkelFJ wrote:
> For multi-word searching I also found an article on using vector spaces.
> Each document is given a vector and your query is also given a vector.
> You enumerate all known words. Each word becomes a position in the vector.
> The value in the vector is the number of word occurrences.
> You locate the document with the smallest distance from the query vector
> using some Euclidean or other measure of distance.
> This also handles inexact queries and is supposedly memory efficient.
> The query would now need to look up each word in some hashtable or similar
> to locate the vector index.
> The documents need not be stored in memory, only their vector representation
> (which can be compressed).
> A dumb implementation would now scan all document vectors against the query,
> but I'm sure there is a more clever way.
> Perhaps this is an exercise for a new kata?

Or for the original one... :)

http://pragprog.com/pragdave/Practices/CodeKata

Cheers

Dave
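
For anyone who wants to try the idea quoted above, here is a minimal sketch in Python. It is not from the thread or from the article MikkelFJ mentions. The function names are hypothetical, splitting on whitespace and lowercasing is an assumed tokenizer, and a sparse dict of {index: count} stands in for the "compressed" vector. It does the plain scan over all document vectors, not the more clever lookup.

```python
import math
from collections import Counter


def build_vocabulary(documents):
    """Enumerate all known words; each word gets one position in the vector."""
    vocab = {}
    for text in documents:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def to_vector(text, vocab):
    """Return a sparse count vector {index: occurrences}.

    Words missing from the vocabulary are ignored (an assumption; the
    original post does not say how unknown query words are handled).
    """
    counts = Counter()
    for word in text.lower().split():
        index = vocab.get(word)  # the hashtable lookup from word to index
        if index is not None:
            counts[index] += 1
    return dict(counts)


def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors."""
    keys = a.keys() | b.keys()
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))


def closest_document(query, doc_vectors, vocab):
    """Scan every document vector and return the index of the nearest one."""
    q = to_vector(query, vocab)
    return min(range(len(doc_vectors)),
               key=lambda i: euclidean_distance(q, doc_vectors[i]))
```

Only the vectors are needed at query time, so the document text itself can stay on disk. Note that raw counts with Euclidean distance tend to favour short documents; a cosine measure is one common alternative if that matters.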