Doing LSI at scale in Ruby

Hi all,

I’m looking to find out whether anyone is doing latent semantic indexing
(LSI) in Ruby at any kind of web scale, and if so, what tools and
techniques you’re using.

Just for context, I’ve been working on this problem for a few days now.
I’ve tried the Classifier gem, both via “gem install” and compiled from
source, as well as at least two forks. I’ve tried compiling various
versions of the GSL library, most of which would not allow the gsl gem
to compile. In the combinations where I can actually get the full set of
libraries to install, I receive an error like the following:

/home/ck1/.rvm/gems/ruby-1.9.2-p180@classifier_test/gems/kitop-classifier-1.4.4/lib/classifier/lsi.rb:316:in `SV_decomp': Ruby/GSL error code 24, svd of MxN matrix, M<N, is not implemented (file svd.c, line 61), the requested feature is not (yet) implemented (GSL::ERROR::EUNIMPL)
    from /home/ck1/.rvm/gems/ruby-1.9.2-p180@classifier_test/gems/kitop-classifier-1.4.4/lib/classifier/lsi.rb:316:in `build_reduced_matrix'
    from /home/ck1/.rvm/gems/ruby-1.9.2-p180@classifier_test/gems/kitop-classifier-1.4.4/lib/classifier/lsi.rb:128:in `build_index'
    from /home/ck1/.rvm/gems/ruby-1.9.2-p180@classifier_test/gems/kitop-classifier-1.4.4/lib/classifier/lsi.rb:66:in `add_item'
    from lsi_test.rb:18:in `block in <main>'
    from lsi_test.rb:18:in `each'
    from lsi_test.rb:18:in `<main>'

This particular stack trace came from running a fork of Classifier, but
the result is essentially the same with the original gem, apart from the
line numbers. It looks as though the error is unrelated to Classifier
and comes from the gsl gem or the underlying GSL library.

Any help or shared experiences will be appreciated. Thanks in advance.

On May 26, 2011, at 11:32 , Chris K. wrote:

implemented (GSL::ERROR::EUNIMPL)
This particular stack trace was when running with a fork of Classifier, but
the result is essentially the same with the original gem with the exception
of the line numbers, and it looks as though the error is unrelated to
Classifier but rather the gsl gem or the underlying GSL library.

You’d be better off contacting the author. There is no guarantee that
they read this list.

On 05/27/11 04:32, Chris K. wrote:

I’m looking to find out whether anyone is doing latent semantic indexing
(LSI) in Ruby at any kind of web scale, and if so, what tools and techniques
you’re using?

The author of Picky http://florianhanke.com/picky/ presented it last
night at the Melbourne Ruby group. Not sure if it’s interesting to you,
but it looks like a different kind of search engine to Sphinx, etc.

Clifford H…

On May 26, 2011, at 20:16 , Karl S. wrote:

Starting about a week ago, ruby is crashing fairly often during rails
development: rails server, console, and during spec runs. But it’s not consistent.

Please don’t thread hijack. Start a new thread properly.

On Fri, May 27, 2011 at 5:16 AM, Karl S. wrote:

At first I was startled because I have never seen ruby crash before on this
machine.
Now that the novelty has worn thin, it’s becoming quite a distraction.

Since everything was working just a few days ago, I’m stumped as to what may be
causing this. I need help tracking down the cause.

Well, what did change in your system environment in the last few days?

Looking at the crash log, and considering the crash happens after an
SQL statement: What database are you using, and what is its version?
And what’s your Rails/ActiveRecord version?

If possible, build an application with the minimum set of external
libraries that still produces a crash.


Phillip G.

A method of solution is perfect if we can foresee from the start,
and even prove, that following that method we shall attain our aim.
– Leibnitz

Starting about a week ago, ruby is crashing fairly often during rails
development: rails server, console, and during spec runs. But it’s not
consistent.

$ ruby -v
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.7.0]
$ rvm -v
rvm 1.6.10 by Wayne E. Seguin ([email protected]) [https://rvm.beginrescueend.com/]

Please take a look at some of the crash logs:

I have tried the following, but none of it helped:

  • remove all gems and re-bundle
  • uninstall ruby 1.9.2-p180 and re-install
  • use ruby 1.9.2-p136 with new gem re-bundle

At first I was startled because I have never seen ruby crash before on
this machine. Now that the novelty has worn thin, it’s becoming quite a
distraction.

Since everything was working just a few days ago, I’m stumped as to what
may be causing this. I need help tracking down the cause.

On May 27, 2011, at 00:57 , Phillip G. wrote:

SQL statement: What database are you using, and what is its version?
And what’s your Rails/ActiveRecord version?

If possible, build an application with the minimum set of external
libraries that still produces a crash.

This seems a lot more relevant:

c:0042 p:---- s:0136 b:0136 l:000135 d:000135 CFUNC :require
c:0041 p:0012 s:0132 b:0132 l:000116 d:000131 BLOCK
blah/gems/activesupport-3.0.7/lib/active_support/dependencies.rb:239
c:0040 p:0005 s:0130 b:0130 l:000121 d:000129 BLOCK
blah/gems/activesupport-3.0.7/lib/active_support/dependencies.rb:225
c:0039 p:0045 s:0128 b:0128 l:000127 d:000127 METHOD
blah/gems/activesupport-3.0.7/lib/active_support/dependencies.rb:596
c:0038 p:0041 s:0122 b:0122 l:000121 d:000121 METHOD
blah/gems/activesupport-3.0.7/lib/active_support/dependencies.rb:225
c:0037 p:0013 s:0117 b:0117 l:000116 d:000116 METHOD
blah/gems/activesupport-3.0.7/lib/active_support/dependencies.rb:239
c:0036 p:0011 s:0112 b:0112 l:000111 d:000111 TOP
blah/gems/ruby_parser-2.0.6/lib/ruby_parser.rb:7
c:0035 p:---- s:0110 b:0110 l:000109 d:000109 FINISH
c:0034 p:---- s:0108 b:0108 l:000107 d:000107 CFUNC :require

Which looks to be:

require ‘racc/parser.rb’

Which ships with ruby… Something is brokey with racc itself? I
dunno…

On May 27, 2011, at 12:57 AM, Phillip G. wrote:

SQL statement: What database are you using, and what is its version?
And what’s your Rails/ActiveRecord version?

If possible, build an application with the minimum set of external
libraries that still produces a crash.

Since I am not doing anything unusual (using common gems and typical
methods for ruby/gem installation), I would expect to see others report
the same issue. I have deleted and re-installed 1.9.2-p180 several
times, tried reverting to 1.9.2-p136, and erased and re-installed all
gems. Still keeps on crashing.

Not 100% sure what has changed. I did update Postgres to 9.0.4 via brew,
so the pg gem would have been compiled against the new version. But
again, I would expect others who have done the same to report issues.

The crashing is common, but not consistent. For example, it took 4 times
running ‘rake -T’ before it would finally work. But eventually it did
work.
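One way to put a number on "common, but not consistent" is to run the same command repeatedly and count how each run ends. The following is only a sketch: `tally_runs` is a hypothetical helper, not something from the thread, and it assumes that an interpreter crash shows up as the process being killed by a signal.

```ruby
require 'rbconfig'

# Hypothetical helper (not from the thread): run a command several times
# and tally how each run ended, to put a number on an intermittent crash.
def tally_runs(command, attempts: 10)
  tally = Hash.new(0)
  attempts.times do
    ran = system(*command, out: File::NULL, err: File::NULL)
    status = $?
    outcome =
      if status && status.signaled? then :killed_by_signal # e.g. SIGSEGV/SIGABRT
      elsif ran.nil? then :could_not_start
      elsif ran then :ok
      else :nonzero_exit
      end
    tally[outcome] += 1
  end
  tally
end

# Stand-in command that always succeeds; replace with %w[rake -T].
p tally_runs([RbConfig.ruby, '-e', 'exit 0'], attempts: 3)
```

Running it before and after a change (for example, rebuilding the pg gem) gives a rough failure rate to compare instead of an impression.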

Because of its inconsistency, could this be a threading or timing issue
with the pg gem?

So for what it’s worth, the particular issue I was running into was not
caused by Classifier at all but rather by the test data I was using.
The application this is for is still under development, so the texts
that I’m indexing are multiple-paragraph blocks generated using
Faker::Lorem. The problem is that this library has a limited vocabulary
of fewer than 200 words, and Classifier::LSI requires the number of
unique words indexed across all texts to be greater than or equal to the
number of texts. (It seems it was also filtering out some words,
probably one- and two-character words that might be considered stop
words.) So as soon as the number of records indexed exceeded the number
of unique words, the underlying GSL library raised the “M<N” error
shown in the stack trace.
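A rough pre-flight check for this condition might look like the sketch below. It is not part of the Classifier gem: `vocabulary` and `enough_vocabulary?` are hypothetical names, and the tokenizer (lowercase, letters only, words of three or more characters) is just a guess at the gem's filtering.

```ruby
require 'set'

# Hypothetical pre-flight check; not part of the Classifier gem.
# Tokenization is a rough stand-in: lowercase, letters only, and words
# shorter than min_length are dropped.
def vocabulary(texts, min_length: 3)
  texts.each_with_object(Set.new) do |text, words|
    text.downcase.scan(/[a-z]+/) { |w| words << w if w.length >= min_length }
  end
end

# True when there are at least as many unique terms as texts, which is
# the condition described above for the SVD to be attempted at all.
def enough_vocabulary?(texts)
  vocabulary(texts).size >= texts.size
end

docs = ["the quick brown fox", "the lazy dog", "quick brown dog"]
puts enough_vocabulary?(docs)
```

Calling this on the test corpus before indexing would flag Faker-style data with too small a vocabulary before GSL gets a chance to raise.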

I’ve now tested this against a set of strings with a richer vocabulary,
and even though indexing slows down sharply as the number of records
grows, it completes successfully. Hope this description helps someone
else out.
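To see how indexing time grows with the number of records, a small timing harness can help. The stack trace earlier in the thread shows add_item calling build_index, so each added record appears to trigger a fresh decomposition, which may account for much of the slowdown. The sketch below uses a hypothetical `time_by_size` helper and a stand-in workload so that it runs without the gem; whether the gem accepts an option such as `:auto_rebuild => false` to defer the rebuild is an assumption to check against the installed version.

```ruby
require 'benchmark'

# Hypothetical harness: yields each corpus size to the block and records
# the wall-clock seconds the block took.
def time_by_size(sizes)
  sizes.map { |n| [n, Benchmark.realtime { yield n }] }
end

# Stand-in workload so the sketch runs without the gem. With Classifier
# installed, the block would instead build an index from the first n texts
# (assumption: an option to defer rebuilding may exist; check the README).
timings = time_by_size([100, 200, 400]) do |n|
  (1..n).map { |i| Math.sqrt(i) }
end

timings.each { |n, secs| puts format('%6d records: %.6f s', n, secs) }
```

Doubling the corpus size at each step makes it easy to see whether the time roughly doubles, quadruples, or grows faster still.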

On May 27, 2011, at 09:15 , Karl S. wrote:

Since I am not doing anything unusual (using common gems and typical methods for
ruby/gem installation)

Well… you’re calling into ruby_parser in a rails app, which I think is
a tad unusual… Still looking to get real information as to why.

Thanks, Ryan. I will do this too; I was just looking to see what the
current de facto standard for this is. Digging a little deeper on both
GitHub and RubyForge, it seems that the gem has been pretty much dormant
for several years, so I’m looking to see whether people have moved on to
another fork or another lib. Will post any findings.

Thanks, Clifford, for the tip. It’s not exactly what I need for this
particular part of the application, as I’m using the Classifier LSI
feature to index documents and detect similar records, but it might be
worth investigating as a replacement for Sphinx in other places in this
app and others I’m working on.