Chris,

On 2/9/06, chrisjroos / gmail.com <chrisjroos / gmail.com> wrote:
> Ok, so it took just over 7 hours to build the index of 3000 items.

Something sounds drastically wrong.

> Since then it has been running for a further 12 hours trying to use
> that index to obtain likely matches for the same 3000 items; i.e. for
> each of the 3000 items I am trying to get the best matches from the
> index (using find related).

Again, something seems funny here.  I just ran a benchmark of the dominant
calculation for index building, using somewhere around 3000 documents with
50 unique keywords.  It took on the order of 4 minutes on my 1.3GHz Athlon.
The pure-Ruby version would take dramatically longer.
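(If you want to reproduce a rough timing yourself, something along these
lines should be in the right ballpark.  The synthetic corpus and the
:auto_rebuild option are just my assumptions about a reasonable test, not
your actual data or code.)

  require 'benchmark'
  require 'classifier'

  # Synthetic corpus in the same ballpark as above: ~3000 short documents
  # drawn from a 50-word vocabulary.
  vocab = (1..50).map { |i| "word#{i}" }
  docs  = Array.new(3000) { Array.new(8) { vocab[rand(vocab.size)] }.join(' ') }

  lsi = Classifier::LSI.new(:auto_rebuild => false)  # build the index once, below
  docs.each { |d| lsi.add_item(d) }

  puts Benchmark.measure { lsi.build_index }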

I don't think 3000 one-line items should take that long (how many words
are within a "line"?).  Also, are all the words in a line fairly unique?
The LSI algorithm does rely on an intersection of words.  If the words are
fairly unique and there are only a couple of words in each line, an LSI
algorithm probably won't be the right path.
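(To illustrate what I mean about intersections, here's a toy example; the
exact scores will vary, but the idea holds.)

  require 'classifier'

  lsi = Classifier::LSI.new
  lsi.add_item "ruby is a dynamic language"
  lsi.add_item "ruby is dynamic and expressive"      # overlaps with the first
  lsi.add_item "quarterly sales figures spreadsheet" # no overlap at all

  related = lsi.find_related("ruby is a dynamic language", 2)
  # The two ruby lines end up close together; the spreadsheet line shares
  # no keywords, so LSI has nothing to relate it with.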

In any case, I'll be happy to look into it for you.  Can you send me any
data and your LSI usage code snippet offlist?  From there I might get a
better idea of what's going wrong.

I've used Ruby to do some pretty heavy lifting.  LSI might be "slow", but
those numbers seem quite shocking (if you were talking about 3000 documents
of 10_000 words each where most words are unique, that might take a small
part of the age of the universe ;).

Cameron

p.s. for those more curious about the details:

Classifier::LSI uses the standard LSI technique: it tokenizes the important
keywords and then builds an "index" by performing an SVD (singular value
decomposition) on a [documents] by [unique keywords] matrix.  The SVD is
the slow part I'm referring to.
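Roughly, the kind of matrix it builds looks like this (a toy sketch, not
the gem's actual code; Classifier::LSI hands the SVD itself off to GSL when
it's available, otherwise to the much slower pure-Ruby routine):

  require 'matrix'

  docs = [
    "ruby is dynamic",
    "ruby is expressive",
    "lsi relates documents"
  ]
  tokens   = docs.map { |d| d.downcase.scan(/\w+/) }
  keywords = tokens.flatten.uniq

  # [documents] x [unique keywords] matrix of raw term counts (3 x 7 here).
  term_doc = Matrix.rows(tokens.map { |words|
    keywords.map { |k| words.count(k) }
  })
  # The SVD of term_doc is the expensive step, and its cost grows quickly
  # with both the number of documents and the vocabulary size.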