On Tue, May 27, 2003 at 02:00:18AM +0900, MikkelFJ wrote:
> For multi word searching I also found an article on using vector spaces.
> Each document is given a vector and your query is also given a vector.
> You enumerate all known words. Each word becomes a position in the vector.
> The value in the vector is the number of word occurrences.
> You locate the document with the smallest distance from the query vector
> using some Euclidean or other measure of distance.
> This also handles inexact queries and is supposedly memory efficient.
> The query would now need to look up each word in some hashtable or similar
> to locate the vector index.
> The documents need not be stored in-memory, only their vector representation
> (which can be compressed).
> A dumb implementation would now scan all document vectors against the query,
> but I'm sure there is a more clever way.
> Perhaps this is an exercise for a new kata?

In fact Dave sort of solved the Kata already :)

http://pragprog.com/pragdave/Tech/Blog/Searching.rdoc,v

-- 
 _           _
| |__   __ _| |_ ___ _ __ ___   __ _ _ __
| '_ \ / _` | __/ __| '_ ` _ \ / _` | '_ \
| |_) | (_| | |_\__ \ | | | | | (_| | | | |
|_.__/ \__,_|\__|___/_| |_| |_|\__,_|_| |_|
        Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Not only Guinness - Linux is good for you, too.
        -- Banzai on IRC
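
For reference, the vector-space idea in the quoted message can be sketched roughly as follows. This is only a minimal illustration of the "dumb" linear scan, not Dave's solution; the function names, the whitespace tokenizer, and the sample documents are all made up for the example. Vectors are kept sparse (position -> count), and the distance is plain Euclidean on raw counts, which tends to favor short documents, so another measure may well work better in practice.

```python
import math
from collections import Counter

def build_vocabulary(documents):
    """Enumerate all known words; each word gets a position in the vector."""
    vocab = {}
    for text in documents:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_vector(text, vocab):
    """Sparse count vector {position: occurrences}; unknown words are ignored."""
    counts = Counter(w for w in text.lower().split() if w in vocab)
    return {vocab[w]: n for w, n in counts.items()}

def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors."""
    positions = a.keys() | b.keys()
    return math.sqrt(sum((a.get(i, 0) - b.get(i, 0)) ** 2 for i in positions))

def search(query, doc_vectors, vocab):
    """Naive scan: index of the document vector closest to the query vector."""
    q = to_vector(query, vocab)
    return min(range(len(doc_vectors)),
               key=lambda i: euclidean_distance(q, doc_vectors[i]))

# Hypothetical sample data, only to show the calls fitting together.
DOCS = [
    "the quick brown fox",
    "ruby is a dynamic language",
    "linux is good for you",
]
VOCAB = build_vocabulary(DOCS)
VECTORS = [to_vector(d, VOCAB) for d in DOCS]  # only these need to stay in memory
BEST = search("ruby language", VECTORS, VOCAB)
```

Only the vocabulary hash and the document vectors are needed at query time, which matches the point about not keeping the documents themselves in memory.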