Using Classifier::LSI

Hi,

I’ve just spent some time looking at the ruby classifier[1] library.

I need to index 3000 one line product names.

The default behaviour is to index items as they are inserted (at least
it would certainly appear that’s what’s happening based on simple
benchmarking). This causes the time to index new items to rise
exponentially. After some experimentation, I found that turning off
the automatic indexing and then manually building once all items were
added was far quicker (17 secs to add 100 items with auto indexing, 0.6
secs to add 100 items with manual index building).

I also did a quick test to confirm that the output was the same in both
cases.
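
For reference, here's roughly what the manual-build version looks like (a simplified sketch; product_names stands in for however the 3000 strings get loaded):

require 'rubygems'
require 'classifier'

# Disable automatic index rebuilding, add every item, then build the index once.
lsi = Classifier::LSI.new :auto_rebuild => false
product_names.each { |name| lsi.add_item(name) }
lsi.build_index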

I’m really just wondering why the default behaviour seems to be the
least efficient option, and whether I’m missing anything obvious by
using the manual build method?

Any help / pointers would be much appreciated.

Chris

[1] http://classifier.rufy.com

Hey Chris,

On 2/8/06, [email protected] wrote:

The default behaviour is to index items as they are inserted (at least
it would certainly appear that’s what’s happening based on simple
benchmarking). This causes the time to index new items to rise
exponentially.

Right. The index building takes time, and there is a lot of duplicated
work going on.

After some experimentation, I found that turning off
the automatic indexing and then manually building once all items were
added was far quicker (17 secs to add 100 items with auto indexing, 0.6
secs to add 100 items with manual index building).

Great. This is exactly why there is an :auto_rebuild option, so you can
make this more efficient (like you discovered).

I’m really just wondering why the default behaviour seems to be the
most inefficient and whether I’m missing anything obvious using the
manual build method?

I believe the default behavior is directed toward simple usage, to help
give people a feel for what the library does. Given the mathematics
underneath the index building, when you have 3000 items the index
building is slower. (I believe the docs give around 500 items as the
rule of thumb for a speedy return.)

I don’t think you’re missing anything here. Just use
:auto_rebuild => false and build the index manually.

I was actually advocating lazy evaluation as the default (so the
index wouldn’t be built until you needed it, which would work
efficiently for both simple usage and your case). The argument
against this was to prevent users from thinking the searching was
slow. Once the index is built, everything else flies. LSI really
is math intensive, which is why there is a C-based backend for
it. Just make sure you’re using something other than the pure-Ruby
version if speed becomes an issue.

Let us know if anything else is unclear.

I’m doing a project at the moment cataloguing publications where I’ve
been using Ferret (a Ruby port of Lucene). I have to say, this is a
brilliant library. It has really good search capabilities, division of
'documents' into various fields that can be individually searched, and
it’s fast. Here are some basic benchmarks using the simplest entry
mode (which both stores and tokenises all input), using ruby 1.8.4
(2005-10-29) [i686-linux] on a 1.4GHz Celeron M processor with 768MB
memory:

tim@ghostknife:~/programming/ruby> cat bm_ferret.rb
require 'benchmark'
require 'rubygems'
require 'ferret'

# Create (or overwrite) an on-disk index in ./test_index.
$index = Ferret::Index::Index.new :path => './test_index', :create => true
$words = []

# Load the system word list as a source of random words.
File.open('/usr/share/dict/words') do |dict|
  dict.each_line do |line|
    $words << line.chomp
  end
end

# Pick ten random words to act as one "document".
def get_words
  ret = []
  10.times do
    ret << $words[(rand * $words.length).to_i]
  end
  ret
end

Benchmark.bmbm(7) do |x|
  x.report('Filing: ') do
    1000.times do
      $index << { :key  => (rand * 1000000).to_i,
                  :data => get_words.join(' ') }
    end
  end
end

tim@ghostknife:~/programming/ruby> ruby bm_ferret.rb
Rehearsal --------------------------------------------
Filing:   7.920000   0.630000   8.550000 (  8.551508)
----------------------------------- total: 8.550000sec

              user     system      total        real
Filing:   7.680000   0.700000   8.380000 (  8.637888)

Ok, so it took just over 7 hours to build the index of 3000 items.

Since then it has been running for a further 12 hours trying to use
that index to obtain likely matches for the same 3000 items; i.e. for
each of the 3000 items I am trying to get the best matches from the
index (using find_related).
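
The matching loop is roughly this (a simplified sketch; the variable names are mine):

best_matches = {}
product_names.each do |name|
  # ask the LSI index for the five closest items to each product name
  best_matches[name] = lsi.find_related(name, 5)
end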

Should I even bother waiting for it to finish or should I be
investigating something else to achieve similar results?

BTW, I am using the C extension.

Thanks in advance,

Chris

On Feb 9, 2006, at 7:48, [email protected] wrote:

Since then it has been running for a further 12 hours trying to use
that index to obtain likely matches for the same 3000 items; i.e. for
each of the 3000 items I am trying to get the best matches from the
index (using find_related).

Should I even bother waiting for it to finish or should I be
investigating something else to achieve similar results?

Can’t comment on the time it takes, but the data you’re using doesn’t
seem particularly suited to LSI, in my opinion (and this sort of
thing is my occupation these days). LSI’s not magic - what it’s
doing is taking advantage of the statistical properties of language.
So it needs two things to work well: a relatively large set of words
compared to the number of items, and the items should be (more or
less) standard language.

Obviously I don’t know exactly what the product names are, but as a
class, product names don’t strike me as fitting those constraints
very well. Firstly because I expect them to be fairly short (5-6
words, tops?), and secondly because they lack a lot of the syntax and
semantic relations that you’d find in a sentence (nominals don’t have
very much internal structure, in general).

Other approaches that might be promising are standard word/document
search (like Ferret, already mentioned), or a language-model approach,
which works using the relative frequencies of words. In the power-tool
domain, for instance, "grit" might correlate highly with "sander", and
so you could say that anything with "grit" in it is related to sanding.

That said, I’m not aware of any Ruby libraries which implement this
sort of thing, so if you wanted to stick with Ruby, you’d be doing it
yourself (it’s not a particularly sophisticated approach, though, so
it likely wouldn’t be that hard).
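
To give a flavour of the language-model idea, here's a very rough co-occurrence sketch (not from any existing library; it just counts how often words appear together in the same item and uses that as a crude relatedness score):

# Count how often each pair of words co-occurs within a product name.
cooccur = Hash.new { |h, k| h[k] = Hash.new(0) }
product_names.each do |name|          # product_names is your list of one-line strings
  words = name.downcase.split.uniq
  words.each do |a|
    (words - [a]).each { |b| cooccur[a][b] += 1 }
  end
end

# Words most strongly associated with "grit", by raw co-occurrence count.
puts cooccur['grit'].sort_by { |word, count| -count }.first(5).inspect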

matthew smillie.


Matthew S. [email protected]
Institute for Communicating and Collaborative Systems
University of Edinburgh

Chris,

On 2/9/06, [email protected] wrote:

Ok, so it took just over 7 hours to build the index of 3000 items.

Something sounds drastically wrong.

Since then it has been running for a further 12 hours trying to use
that index to obtain likely matches for the same 3000 items; i.e. for
each of the 3000 items I am trying to get the best matches from the
index (using find related).

Again, something seems funny here. I just ran a benchmark of the
dominant calculation for index building, using somewhere around 3000
documents with 50 unique keywords; it took on the order of 4 minutes
on my 1.3GHz Athlon. The pure-Ruby version would take far longer.

I don’t think 3000 one-line items should take that long (how many words
are within a “line”?). Also, are all the words in a line fairly unique?
The LSI algorithm does rely on an intersection of words. If the words
are fairly unique and there are only a couple of words in each line, an
LSI algorithm probably won’t be the right path.

In any case, I’ll be happy to look into it for you. Can you send me any
data and your LSI usage code snippet off-list? From there I might get a
better idea of what’s going wrong.

I’ve used Ruby to do some pretty heavy lifting. LSI might be “slow”,
but those numbers seem quite shocking (if you were talking about 3000
documents of 10_000 words each where most words are unique, that might
take a small part of the age of the universe ;).

Cameron

p.s. for those more curious about the details:

Classifier::LSI uses the standard LSI technique: it tokenizes
important keywords and then builds an “index” by performing a singular
value decomposition (SVD) of a [documents] by [unique keywords] matrix.
This is the slow part I’m referring to.
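
To make the shape concrete, here is a tiny illustration (not the library's actual internals) of the kind of matrix that gets decomposed:

# Each row is a document, each column a unique keyword; the SVD of this
# count matrix is the expensive step.
docs  = ["cordless drill 18v", "orbital sander 240 grit", "100 grit sanding discs"]
vocab = docs.map { |d| d.downcase.split }.flatten.uniq

matrix = docs.map do |doc|
  counts = Hash.new(0)
  doc.downcase.split.each { |w| counts[w] += 1 }
  vocab.map { |w| counts[w] }
end

puts "#{matrix.size} x #{vocab.size}"   # for 3000 one-line items: ~3000 rows by (unique keywords) columns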

Ok, so I’m thinking (taking into account what Matthew S. has said)
that it’s due to the number of ‘unique’ words found in my documents:
4343 unique words found in 3320 product names.
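
(For what it's worth, that count came from a throwaway script along these lines; the file name is just where I happen to keep the names:)

names = File.readlines('product_names.txt').map { |line| line.chomp }
vocab = names.map { |n| n.downcase.scan(/[a-z0-9]+/) }.flatten.uniq
puts "#{names.size} product names, #{vocab.size} unique words"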

I have actually spent the morning looking at ferret and some other text
searching libraries and at the moment I think ferret will do what I
require.

Thanks for the help folks.

Chris

Ok, so it took just over 7 hours to build the index of 3000 items.

Depending on the size of your inputs, it can take longer.
Classifier::LSI has a very specific sweet spot of data volume in its
current incarnation; it’s not currently suited to very large loads. I
am currently working on a version that will work with MPI so that
clusters can attack the problem. LSI just isn’t useful for large
datasets without very hefty hardware.

But if your matrix is only 3000x4000-ish, it shouldn’t take that long.
Make sure that your classifier is binding to GSL, and if you need to,
email me and we can pursue it further. I only see a breakdown in speed
when my unique-words list is in the 15k+ range, with more than 1500
items.
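
A quick way to check whether the GSL bindings are even available on your machine (just a require test; how the classifier library itself selects its backend is an internal detail):

begin
  require 'gsl'
  puts "rb-gsl found: the fast native path should be available"
rescue LoadError
  puts "rb-gsl not found: you are on the pure-Ruby vector code"
end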

Also, remember that LSI wants to work with semantics. A list of
product titles isn’t going to bring very much semantic data into the
system. The way I originally envisioned using LSI was to create
summaries of documents and to work on sets of document abstracts to
rapidly surface relevant ACM papers. :)

On 2/9/06, [email protected] wrote:

I have actually spent the morning looking at ferret and some other text
searching libraries and at the moment I think ferret will do what I
require.

There is also a Classifier::Bayes available that might be more in sync
with what you’re trying to do.
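
A minimal example of the Bayes interface, in case it helps (the categories and training strings are made up for illustration):

require 'rubygems'
require 'classifier'

bayes = Classifier::Bayes.new 'Tools', 'Plumbing'
bayes.train_tools    'cordless drill 18v hammer action'
bayes.train_plumbing 'copper pipe 15mm compression fitting'

puts bayes.classify('sds drill bit set')   # => "Tools" (most likely of the trained categories)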

Cameron

On 2/9/06, [email protected] wrote:

Ok, so I’m thinking (taking into account what Matthew S. has said)
that it’s due to the number of ‘unique’ words found in my documents:
4343 unique words found in 3320 product names.

I have actually spent the morning looking at ferret and some other text
searching libraries and at the moment I think ferret will do what I
require.

Sounds like the right approach; glad you found something that worked.

Cameron