On Feb 9, 2006, at 7:48, chrisjroos / gmail.com wrote:
> Since then it has been running for a further 12 hours trying to use
> that index to obtain likely matches for the same 3000 items; i.e. for
> each of the 3000 items I am trying to get the best matches from the
> index (using find related).
>
> Should I even bother waiting for it to finish or should I be
> investigating something else to achieve similar results?

Can't comment on the time it takes, but the data you're using doesn't  
seem particularly suited to LSI, in my opinion (and this sort of  
thing is my occupation these days).  LSI's not magic - what it's  
doing is taking advantage of the statistical properties of language.   
So it needs two things to work well: a relatively large set of words  
compared to the number of items, and the items should be (more or  
less) standard language.

Obviously I don't know exactly what the product names are, but as a  
class, product names don't strike me as fitting those constraints  
very well.  Firstly because I expect them to be fairly short (5-6
words, tops?), and secondly because they lack a lot of the syntactic
and semantic relations that you'd find in a sentence (nominals don't
have very much internal structure, in general).

Other approaches that might be promising are standard word/document
search (like ferret, already mentioned), or a language model
approach, which works using relative frequencies of words.  In the
power tool domain, for instance, "grit" might correlate highly with
"sander", and so you could say that anything with "grit" in it is
related to sanding.

That said, I'm not aware of any Ruby libraries which implement this  
sort of thing, so if you wanted to stick with Ruby, you'd be doing it  
yourself (it's not a particularly sophisticated approach, though, so  
it likely wouldn't be that hard).
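
To give a rough idea of what "doing it yourself" could look like,
here is a minimal sketch of the relative-frequency idea.  Everything
in it is an assumption made for illustration: the sample names, the
tokenising regexp, and the method names (tokens, cooccurrence,
related_words) are hypothetical and not taken from any library.  For
a given word it estimates how often each other word shows up in the
same product name.

```ruby
require 'set'

# Hypothetical sample data; real product names would come from your own list.
NAMES = [
  "orbital sander 120 grit",
  "belt sander 80 grit",
  "sanding sheets 240 grit",
  "cordless drill 18v",
  "hammer drill 18v"
]

# Split a name into a set of lowercase word tokens (assumed tokenisation).
def tokens(name)
  name.downcase.scan(/[a-z0-9]+/).to_set
end

# Count how many names each word appears in, and how many names
# contain each ordered pair of distinct words.
def cooccurrence(names)
  word_counts = Hash.new(0)
  pair_counts = Hash.new { |h, k| h[k] = Hash.new(0) }
  names.each do |name|
    words = tokens(name)
    words.each do |w|
      word_counts[w] += 1
      words.each { |v| pair_counts[w][v] += 1 unless v == w }
    end
  end
  [word_counts, pair_counts]
end

# For `word`, list other words with the fraction of names containing
# `word` that also contain them, highest fraction first.
def related_words(word, word_counts, pair_counts, limit = 3)
  total = word_counts[word]
  return [] if total.zero?
  pair_counts[word]
    .map { |other, n| [other, n.to_f / total] }
    .sort_by { |other, p| [-p, other] }
    .first(limit)
end

word_counts, pair_counts = cooccurrence(NAMES)
related = related_words("grit", word_counts, pair_counts)
# "sander" ranks first: it appears in 2 of the 3 names containing "grit".
```

Note that this score doesn't correct for words that are common
everywhere, so on real data you would probably want to weight or
filter very frequent words before trusting the ranking.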

matthew smillie.



----
Matthew Smillie            <M.B.Smillie / sms.ed.ac.uk>
Institute for Communicating and Collaborative Systems
University of Edinburgh