Bug in search matching?

ahfeel · October 20, 2006, 5:02pm

Hi

Here’s a little code reproducing something that I consider a bug; if
it’s not, please explain :]

http://pastie.caboo.se/18693

Thanks in advance,
Cheers,
Jérémie ‘ahFeel’ BORDIER

ahfeel · October 20, 2006, 6:52pm

Hi Jérémie,

You can get rid of this behaviour by building your own analyzer and
not including the HyphenFilter. This is a tricky issue which I haven’t
quite worked out yet. For example, when you search for “set-up”, do you
want that to match “set up” and “setup”? What if you search for
“setup” or “set up”? Should they match all three versions too? With
the current HyphenFilter, all three versions in a query will
match all three versions in the index. However, this comes at a loss
of precision. The problems occur during phrase queries. To make
“set-up” match both “setup” and “set up”, “set-up” is analyzed
as both “set up” and “setup”, so in the first position there are two
words in the token stream: “set” and “setup”. When I parse the phrase
“set-up files” I get the two phrases:

"set____up__files"
"setup______files"

So as you can see, the second phrase only has two terms, so there is a
gap in between. To get the phrase “setup files” to match this, I need to
give it a slop value.
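The gap described above can be illustrated with a small plain-Ruby model. This is only a sketch under simplifying assumptions, not Ferret's actual implementation: `phrase_variants` and `phrase_match?` are hypothetical helper names, only the all-split and all-joined variants are generated, and slop is modelled as the total number of positions the terms may be displaced.

```ruby
# Simplified model of hyphen expansion and sloppy phrase matching.
# Illustration only; this is not how Ferret is implemented internally.

# Expand a query into phrase variants, each a list of [term, position].
# A hyphenated word "a-b" yields "a b" (two positions) and "ab"
# (one term that leaves the second position empty).
def phrase_variants(text)
  split_variant = []
  joined_variant = []
  pos = 0
  text.split.each do |word|
    parts = word.split("-")
    parts.each_with_index { |part, i| split_variant << [part, pos + i] }
    joined_variant << [parts.join, pos]
    pos += parts.size
  end
  [split_variant, joined_variant].uniq
end

# True if the phrase occurs in doc_terms, allowing the terms to be
# displaced from their expected positions by at most `slop` in total.
def phrase_match?(doc_terms, query, slop = 0)
  first_term, first_pos = query.first
  doc_terms.each_index.any? do |start|
    next false unless doc_terms[start] == first_term
    total = 0
    ok = query.drop(1).all? do |term, pos|
      expected = start + (pos - first_pos)
      hits = doc_terms.each_index.select { |i| doc_terms[i] == term }
      next false if hits.empty?
      total += hits.map { |i| (i - expected).abs }.min
      true
    end
    ok && total <= slop
  end
end

_split, joined = phrase_variants("set-up files")
doc = %w[setup files]
puts phrase_match?(doc, joined, 0) # false: "files" is expected one position later
puts phrase_match?(doc, joined, 1) # true: a slop of 1 bridges the gap
```

Under the same model, a slop of 1 also lets the two-term phrase “foo core” match a document containing “foo bar core”, which would be consistent with the behaviour reported in this thread.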

Now I realize the solution is not ideal. I’ve had to forsake some
precision for a gain in recall but I can’t think of a better way. If
you can come up with a fool-proof way to handle hyphenated terms I’d
love to hear it. I will probably remove the HyphenFilter from the
StandardFilter in a future version if I can’t think of a better way to
do this.

By the way, for the people reading this who think that “setup” is not
a word, I agree so consider “e-mail” and “email” instead.

Cheers,
Dave

PS: I’ve pasted the code below for reference. I’m not sure how long
the pasties stick around for.

require 'rubygems'
require 'ferret'

# Start from an empty on-disk index.
path = "/tmp/index"
system("rm -rf #{path}; mkdir -p #{path}")
index = Ferret::Index::Index.new(:path => path)

# Two documents whose names differ only by an extra hyphenated part.
index << {:type => :bug, :name => 'foo-bar'}
index << {:type => :bug, :name => 'foo-bar-core'}

# Run both queries and print the hits for each.
queries = ['foo-bar', 'foo-core']
queries.each do |name|
  query = "type:bug AND name:#{name}"
  puts "\nquery : #{query}"
  res = index.search(query)
  puts "total hits = #{res.total_hits}"
  res.hits.each { |x| p index[x.doc].load.inspect }
end

ahfeel · October 20, 2006, 7:35pm

Hi Dave !

Thank you for your answer. I’ve understood the issue, and I must
say it’s quite annoying… I guess you won’t satisfy everyone
by removing this feature, because it really depends on the application
you want to run… Someone running something like a wiki would like to get
the same results for “e-mail” and “email”; that’s correct and of
course a good feature. But in my case, I really don’t want my query
“name:package-dev” to return ‘package-foobar-dev’ etc… That’s a
really big problem for me.

Actually, I think using different operators that call different parsing
methods could be a workable solution: “type:e-mail” would match
“email” and “e mail”, and “type=e-mail” would only match “e-mail” (in
regexp: /^e-mail.*/). The ‘=’ operator is quite self-explanatory for
exact pattern matching, so it would be easy to understand…

It could be a way to keep the flexibility of the current search matching
method while adding a stricter matching method for those who need
it…
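The two-operator idea above could be sketched in plain Ruby as follows. Everything here is hypothetical: neither a ‘=’ operator nor the helpers `loose_key` and `field_match?` exist in Ferret, and the strict mode is assumed to mean exact string equality rather than the prefix regexp mentioned above.

```ruby
# Hypothetical sketch of the proposed ":" (loose) and "=" (strict)
# operators. Not part of Ferret's query syntax.

# Loose normalisation: ignore case, hyphens and spaces.
def loose_key(term)
  term.downcase.delete("- ")
end

# ":" compares normalised forms; "=" compares the literal strings.
def field_match?(value, operator, term)
  case operator
  when ":" then loose_key(value) == loose_key(term)
  when "=" then value == term
  else raise ArgumentError, "unknown operator: #{operator}"
  end
end

puts field_match?("email", ":", "e-mail")   # true: loose match
puts field_match?("email", "=", "e-mail")   # false: strict match
puts field_match?("e-mail", "=", "e-mail")  # true: identical strings
```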

Anyway, Thank you for the solution !
Cheers,
Jeremie ‘ahFeel’ BORDIER