On Tue, May 27, 2003 at 02:00:18AM +0900, MikkelFJ wrote:
> For multi-word searching I also found an article on using vector spaces.
> Each document is given a vector and your query is also given a vector.
> You enumerate all known words. Each word becomes a position in the vector.
> The value in the vector is the number of word occurrences.
> You locate the document with the smallest distance from the query vector,
> using Euclidean distance or some other measure. This also handles inexact
> queries and is supposedly memory efficient.
> The query would now need to look up each word in some hashtable or similar
> to locate the vector index.
> The documents need not be stored in-memory, only their vector representation
> (which can be compressed).
> A dumb implementation would now scan all document vectors against the query,
> but I'm sure there is a more clever way.
> Perhaps this is an exercise for a new kata?

In fact Dave sort of solved the Kata already :)
http://pragprog.com/pragdave/Tech/Blog/Searching.rdoc,v
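
For anyone who wants to play with the idea quoted above, here is a minimal
sketch of the "dumb" linear scan in Python. It is not taken from Dave's
write-up; the function names (tokenize, build_vocabulary, to_vector,
euclidean, search) and the simple regex tokenizer are my own assumptions,
and it uses plain Euclidean distance on raw word counts.

```python
import math
import re


def tokenize(text):
    # Assumption: lowercase alphanumeric runs are good enough as "words".
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    # Enumerate all known words: each word gets a position in the vector.
    vocab = {}
    for text in documents:
        for word in tokenize(text):
            vocab.setdefault(word, len(vocab))
    return vocab


def to_vector(text, vocab):
    # The value at each position is the number of occurrences of that word.
    # Words missing from the vocabulary are ignored.
    vec = [0] * len(vocab)
    for word in tokenize(text):
        idx = vocab.get(word)  # hashtable lookup: word -> vector index
        if idx is not None:
            vec[idx] += 1
    return vec


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query, documents):
    # Naive approach: scan every document vector against the query vector
    # and return the index of the closest document.
    vocab = build_vocabulary(documents)
    doc_vectors = [to_vector(d, vocab) for d in documents]
    q = to_vector(query, vocab)
    distances = [euclidean(q, v) for v in doc_vectors]
    return min(range(len(documents)), key=distances.__getitem__)
```

Calling search("ruby language", docs) returns the index of the nearest
document in docs. Note that raw counts with Euclidean distance tend to
favour short documents, so normalizing the vectors (or comparing angles
instead of distances) is a common refinement.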

-- 
 _           _                             
| |__   __ _| |_ ___ _ __ ___   __ _ _ __  
| '_ \ / _` | __/ __| '_ ` _ \ / _` | '_ \ 
| |_) | (_| | |_\__ \ | | | | | (_| | | | |
|_.__/ \__,_|\__|___/_| |_| |_|\__,_|_| |_|
	Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Not only Guinness - Linux is good for you, too.
	-- Banzai on IRC