Stop words in queries


#1

I’ve run into an issue that I’m not sure how to address. Basically, I’m
building queries with occur_default Search::BooleanClause::Occur::MUST,
and using the StandardAnalyzer which does stop filtering. The stop
filtering is working beautifully on the indexing side. The problem is
that when the query parser parses through a query with a stop word in
it, say “the oregon trail”, it builds a query that looks something like
this:

MUST title:
MUST title: oregon
MUST title: trail

This unfortunately fails when searching for the previously indexed “The
Oregon Trail”, because the indexed title doesn’t contain a blank term.
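
A minimal plain-Ruby model of the failure (not Ferret code) may make this concrete. It assumes the analyzer lowercases terms and drops "the" at index time, so only "oregon" and "trail" are stored for the title. The constants and the `matches_all?` helper are hypothetical names used for illustration only.

```ruby
require 'set'

# Terms stored for the title "The Oregon Trail" after stop filtering
# (assumption: the analyzer lowercases and drops "the").
INDEXED_TERMS = Set.new(%w[oregon trail])

# What the parser builds: the stop word leaves an empty term behind.
BUGGY_CLAUSES = ["", "oregon", "trail"]

# What we would want instead: the empty clause is dropped entirely.
FIXED_CLAUSES = BUGGY_CLAUSES.reject { |term| term.empty? }

# Hypothetical helper: every MUST clause has to match an indexed term.
def matches_all?(indexed_terms, must_terms)
  must_terms.all? { |term| indexed_terms.include?(term) }
end

puts matches_all?(INDEXED_TERMS, BUGGY_CLAUSES)  # false
puts matches_all?(INDEXED_TERMS, FIXED_CLAUSES)  # true
```

With every clause required, a single empty term is enough to reject a document that otherwise matches.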

Is there a good way to deal with this issue besides filtering stop words
before handing the query string off to the parser?
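
For reference, the pre-filtering workaround mentioned above could look roughly like the sketch below. It is plain Ruby and does not call into Ferret. The stop list is illustrative only (it should mirror whatever the analyzer uses at index time), and `strip_stop_words` is a hypothetical helper that handles only whitespace-separated terms, not field prefixes, phrases, or operators.

```ruby
# Illustrative stop list only; keep it in sync with the analyzer's list.
STOP_WORDS = %w[a an and of the to].freeze

# Hypothetical helper: remove stop words from a raw query string before
# it reaches the query parser. Matching is case-insensitive.
def strip_stop_words(query, stop_words = STOP_WORDS)
  kept = query.split.reject { |word| stop_words.include?(word.downcase) }
  kept.join(" ")
end

puts strip_stop_words("the oregon trail")  # oregon trail
```

The cleaned string can then be passed to the parser in place of the original query.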

Thanks!

Nathaniel

P.S. I’m using the pure Ruby part of Ferret 0.9.0 on Ruby 1.8.4.


#2

Hi Nathaniel,

This is a bug. I might get around to fixing it but I can’t promise
anything. I’m focusing entirely on the C extension version of Ferret
(which doesn’t have this bug).

Cheers,
Dave

PS: Sorry for the slow reply. It’s been a tough few weeks here.


#3

David B. wrote:

> This is a bug. I might get around to fixing it but I can’t promise
> anything. I’m focusing entirely on the C extension version of Ferret
> (which doesn’t have this bug).

Bummer… I’ve got a hack to avoid the problem for the time being, but
it’s really ugly :-/

This brings up another issue, though, that I’ll go ahead and broach…

I’m kind of sad that you’ve made the jump to C so soon. Ferret is
brimming with potential, but it still feels a lot like, well, Java. The
API is still pretty heavy, and when I dig into the underlying code it
feels over-designed. I’m guessing a lot of that is due to the straight
translation from Java. That’s a good first step, but it’s not surprising
that it would initially result in a library that feels pretty alien.

While I understand the performance reasons for using C, doing so also
makes it much harder to refactor and refine the API, and my feeling is
that for most problems, the pure-Ruby performance isn’t a show-stopper.
Putting everything in C also makes it harder for folks such as myself,
who don’t do much C, to hack on the internals ourselves and push patches
back up to you.

I hope this comes off the right way - it’s open-source, and you’re of
course free to take the project where you will. I’m also extremely
grateful for the project - it’s helping me out a lot. I just have doubts
about the long-term viability of Ferret within the Ruby community when
an API (and underlying code) that I find I spend a lot of time fighting
is getting set in stone so early. I’d hate to see you spend a lot of
time on it only to have it be a prototype for a more Ruby-ish library
that comes along later. I want Ferret to be the standard by which other
indexing tools are measured, in Ruby and elsewhere, and I don’t think
that raw benchmarks are going to drive that.

> PS: Sorry for the slow reply. It’s been a tough few weeks here.

No problem! So what exactly do you do? Are you a student? Freelancer?
Employee? Astronaut?

Thanks a ton for the great library,

Nathaniel


#4

On 4/14/06, Nathaniel T. wrote:

> that for most problems, the pure-Ruby performance isn’t a show-stopper.
> Putting everything in C also makes it harder for folks such as myself,
> who don’t do much C, to hack on the internals ourselves and push patches
> back up to you.

The pure ruby version is still there and I’d love for someone to take
over from me. I completely agree with you on the advantages of having
a pure ruby version. I personally want the performance which is why I
have taken the C route. And there is a huge difference. Somewhere
around 100 times. There are people out there who were still using Java
Lucene for indexing because of performance issues so I wasn’t the only
one concerned about the performance. As for refactoring the API, I
understand it is very difficult for some Ruby programmers to get
around the C code but you don’t need to send me a patch. Just let me
know what you think needs to be changed.

> that raw benchmarks are going to drive that.

I want the same thing too. The other advantage to having the C version
is that it won’t be too much work to use Ferret in Perl, Python, Tcl, etc.

>> PS: Sorry for the slow reply. It’s been a tough few weeks here.
>
> No problem! So what exactly do you do? Are you a student? Freelancer?
> Employee? Astronaut?

I’m currently an athlete. I’m practicing Judo in Japan and working on
Ferret whenever I have time.


#5

David B. wrote:

> The pure ruby version is still there and I’d love for someone to take
> over from me. I completely agree with you on the advantages of having
> a pure ruby version. I personally want the performance which is why I
> have taken the C route. And there is a huge difference. Somewhere
> around 100 times. There are people out there who were still using Java
> Lucene for indexing because of performance issues so I wasn’t the only
> one concerned about the performance.

Understood, and I do look forward to improved performance.

> As for refactoring the API, I
> understand it is very difficult for some Ruby programmers to get
> around the C code but you don’t need to send me a patch. Just let me
> know what you think needs to be changed.

My big suggestion would be to cut down on the surface area of the API -
it’s almost overly flexible, and feels over-designed (probably due to
the port from Java). Fewer (documented) classes, simplified options,
etc. Basically, it’s a bit overwhelming to someone coming at it for the
first time, and I don’t think that’s strictly (or even mostly) a
documentation issue. As I use it more I’ll try to come up with specific
examples.

My small suggestion would be to use symbols (and booleans) for
configuration instead of the constants currently being used. For
instance:

Ferret::Document::Field::Store::YES -> true
Ferret::Document::Field::Store::NO -> false
Ferret::Document::Field::Store::COMPRESS -> :compress

and

Ferret::Document::Field::Index::NO -> false
Ferret::Document::Field::Index::TOKENIZED -> :tokenized
Ferret::Document::Field::Index::UNTOKENIZED -> :untokenized

I think this would help Ferret configuration feel much more Rubyish.
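
As a rough sketch of how the mapping above might be wired up, the snippet below translates Ruby-style options into the existing constants. The `Store` and `Index` modules here are stand-ins defined only so the example runs without Ferret installed; in real code they would be the `Ferret::Document::Field::Store` and `Ferret::Document::Field::Index` constants listed above. The `field_options` helper and the two lookup tables are hypothetical names, not part of Ferret.

```ruby
# Stand-in constants so the sketch runs on its own (assumption: the real
# values live under Ferret::Document::Field).
module Store
  YES = :store_yes
  NO = :store_no
  COMPRESS = :store_compress
end

module Index
  NO = :index_no
  TOKENIZED = :index_tokenized
  UNTOKENIZED = :index_untokenized
end

# Lookup tables from the proposed Ruby-style values to the constants.
STORE_OPTIONS = {
  true => Store::YES,
  false => Store::NO,
  :compress => Store::COMPRESS
}.freeze

INDEX_OPTIONS = {
  false => Index::NO,
  :tokenized => Index::TOKENIZED,
  :untokenized => Index::UNTOKENIZED
}.freeze

# Hypothetical helper: returns [store_constant, index_constant].
# Hash#fetch raises KeyError for an option that is not in the table.
def field_options(store, index)
  [STORE_OPTIONS.fetch(store), INDEX_OPTIONS.fetch(index)]
end

p field_options(true, :tokenized)
```

Using `fetch` rather than `[]` means a misspelled option fails loudly instead of silently becoming `nil`.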

>> that raw benchmarks are going to drive that.
>
> I want the same thing too. The other advantage to having the C version
> is that it won’t be too much work to use Ferret in Perl, Python, Tcl, etc.

But why share? (just kidding ;-)

> I’m currently an athlete. I’m practicing Judo in Japan and working on
> Ferret whenever I have time.

Fascinating (and very cool). Best of luck with it!

Thanks again for Ferret,

Nathaniel T.