MikkelFJ wrote:

> For multi-word searching I also found an article on using vector spaces.
> Each document is given a vector and your query is also given a vector.
> You enumerate all known words. Each word becomes a position in the vector.
> The value in the vector is the number of word occurrences.
> You locate the document with the smallest distance from the query vector,
> using Euclidean distance or some other measure. This also handles inexact
> queries and is supposedly memory efficient.
> The query would now need to look up each word in some hashtable or similar
> to locate the vector index.
> The documents need not be stored in-memory, only their vector representation
> (which can be compressed).
> A dumb implementation would now scan all document vectors against the query,
> but I'm sure there is a more clever way.
> Perhaps this is an exercise for a new kata?

Or for the original one... :)

http://pragprog.com/pragdave/Practices/CodeKata
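
For anyone who wants a starting point, here is a minimal sketch of the
vector-space idea quoted above. It is only one possible reading of the
description: it assumes whitespace tokenisation, raw word counts, and plain
Euclidean distance, and the function names are made up for illustration. It
does the naive scan over every document vector, not anything clever.

```python
import math
from collections import Counter


def build_vocabulary(documents):
    """Enumerate all known words: each word gets a position in the vector."""
    vocab = {}
    for text in documents:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Return a count vector; words outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for word, count in Counter(text.lower().split()).items():
        index = vocab.get(word)  # hashtable lookup: word -> vector index
        if index is not None:
            vec[index] = count
    return vec


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_document(query, documents):
    """Naive linear scan: index of the document nearest to the query."""
    vocab = build_vocabulary(documents)
    vectors = [vectorize(text, vocab) for text in documents]
    q = vectorize(query, vocab)
    return min(range(len(documents)), key=lambda i: euclidean(q, vectors[i]))
```

With the documents "the cat sat on the mat", "dogs chase cats" and
"the quick brown fox", the query "quick fox" picks the third one (index 2).
Raw counts with Euclidean distance tend to favour short documents, so a real
version would probably normalise the vectors or use a different measure.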


Cheers


Dave