"MikkelFJ" <mikkelfj-anti-spam / bigfoot.com> wrote in message
news:3ed243b6$0$97260$edfadb0f / dread12.news.tele.dk...


> The documents need not be stored in-memory, only their vector
representation
> (which can be compressed).
> A dump implementation would now scan all document vectors against the
query,
> but I'm sure there is more clever way.
> Perhaps this is an excercise for a new kata?

Which makes me think of many keyed indices.
Using space filling curves (a kind of fractals), you can map a vector into a
single linear value.
The linear value is the position on the line of the space filling curve.
The vector is the position in space where the curve has the given distance.
A Piano curve is one such fractal.

The linear index can be stored in B-Trees or similar efficient lookup
datastructures.

The nice thing about space filling curves is that things many curves exhibit
a kind of sensible distance measure.
That is, if the distance between two linear indices is small then the points
are also reasonably close together on the curve.
This is not an exact science but it does work.
Spacefilling curves have been used to index images where each key is some
aspect of the image, such as how much yellow etc.
Given a new image of the same person in the same shirt etc. you should have
a chance of retrieving and old similar image.

If we take this idea and apply it to document storage, we can now easily
find the best document matches by looking up the documents
stored with linear indices within a predetermined distance of the query. The
distance chosen may take into account the nature of the
space filling curve.

A problem is that it can be a bit slow to calculate the linear index.

Do we have a new google?  :-)