Finding related items (like latent semantic indexing)

I’ve been trying to use Classifier::LSI to provide a means of finding
‘related items’, where each item is a one line description of a product.

Although on small samples the Classifier works great, it completely
baulks on my current dataset of 3000 items.

I’ve started to look at Ferret this morning, following a post on the
Ruby mailing list. I’d guess that the Fuzzy Query would be the thing
that I need, although it doesn’t appear to be as comprehensive as the
LSI stuff in Classifier (I realise they are doing different things).

I’m really just after any thoughts anyone might have…

Thanks in advance,

Chris

Hi Chris,

I plan on adding a “More Like This” function to Ferret but I’m really
swamped (doing other stuff on Ferret) at the moment. If you want to
have a go at implementing it yourself you could have a look at the way
it’s done in Lucene. It’s not too much work, but it could take you a
while to get your head around the Ferret internals, and the current
Ferret codebase is soon to be obsolete. Sorry I can’t be of more help.

Cheers,
Dave

Hi Chris,

I just noticed that you are indexing one line product descriptions.
What I’d suggest doing (I believe this is how the Lucene MoreLikeThis
query works) is just taking the description of your start product and
using that as the query. So if the description is:

"apple ipod nano 4Gb black"

then your query will be:

"description:(apple ipod nano 4Gb black)"

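Building that query string from a description could look roughly like
this in plain Ruby. It is only a sketch: the helper name
more_like_this_query is made up for illustration, and replacing
punctuation with spaces is an assumption about what is safe to hand to
a query parser, not something taken from Ferret itself.

```ruby
# Hypothetical helper: turn a one-line product description into a
# query string restricted to a single field.
def more_like_this_query(description, field = "description")
  # Assumption: keep only letters, digits and whitespace so that stray
  # punctuation in a description is not read as query syntax.
  terms = description.gsub(/[^\p{Alnum}\s]/, " ").split
  return nil if terms.empty?

  "#{field}:(#{terms.join(' ')})"
end

query = more_like_this_query("apple ipod nano 4Gb black")
puts query
# prints: description:(apple ipod nano 4Gb black)
```

The resulting string is what you would pass to the index search; the
start product itself will normally come back as the top hit, so skip it
when listing the related items.
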
Hope that helps,
Dave