[Repost] Problem with url searching

ahfeel · April 3, 2007, 12:04pm

Hi all,

I posted this a few weeks ago but no one answered, and this feature is
REALLY important for us.

I have many objects with a url field, containing standard URLs of
course.
I’m trying to match them, but I’m having problems with that.

Here’s a small piece of code showing what I would like to achieve:
require 'rubygems'
require 'ferret'
require 'fileutils'

class TestAnalyzer
  # Standard tokenizer followed by lower-casing.
  def token_stream(field, str)
    ts = Ferret::Analysis::AsciiStandardTokenizer.new(str)
    Ferret::Analysis::AsciiLowerCaseFilter.new(ts)
  end
end

# Start from an empty index directory on every run.
FileUtils.rm_rf('/tmp/ferret_test')
FileUtils.mkdir_p('/tmp/ferret_test')
INDEX = Ferret::I.new(:path => '/tmp/ferret_test',
                      :analyzer => TestAnalyzer.new)
INDEX << {:type => :url, :url => 'http://google.fr'}
INDEX << {:type => :url, :url => 'http://ferret.davebalmain.com'}
INDEX << {:type => :url, :url => 'http://www.unixaumonde.com'}
INDEX << {:type => :url, :url => 'http://www.rift.fr'}

['type:url AND url:google',
 'type:url AND url:"://foobar"',
 'type:url AND url:"http://goo*"',
 'type:url AND url:"http://goo"*'].each do |q|
  puts "\nSearching #{q}"
  INDEX.search(q).hits.each { |x| p INDEX[x.doc].load }
  puts "\n"
end

I hope Dave or anyone else will be able to give us an hint or a release,
something like this…

Regards,
Jeremie ‘ahFeel’ BORDIER
Rift Technologies

ahfeel · April 3, 2007, 12:41pm

On Tue, Apr 03, 2007 at 12:04:28PM +0200, ahFeel wrote:

> Hi all,
>
> I posted this a few weeks ago but no one answered, and this feature is
> REALLY important for us.
>
> I have many objects with a url field, containing standard URLs of
> course.
> I’m trying to match them, but I’m having problems with that.

Ok, here we go:

First of all, use

INDEX.process_query(query_string)

to see how Ferret sees your queries after the QueryParser has parsed them.

You’ll see that the results ferret gives perfectly match the queries the
parser generated from your query strings - but these are not the results
you want.

So you’ll have to work on the analysis part. Here it seems your problem
is that your analyzer is stripping away the wildcards you use, i.e.

a = TestAnalyzer.new
qp = Ferret::QueryParser.new :analyzer => a
qp.parse 'url:"http://ferret.davebalmain.com"'  # => url:ferret.davebalmain.com
qp.parse 'url:"http://ferret*"'                 # => url:ferret (bad, won't match)

A custom URLAnalyzer that strips away the protocol:// but leaves
wildcards in queries intact could help here. You should also think about
further tokenizing the domain part by splitting at '.' (as a
LetterTokenizer would do), so that url:ferret would match
the ferret.davebalmain.com URL even without a wildcard.
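
To make the suggestion above concrete, here is a minimal pure-Ruby sketch of the tokens such an analyzer might emit. It is not a Ferret analyzer: `url_tokens` is a hypothetical helper, and keeping digits as well as letters is an assumption (a LetterTokenizer keeps letters only). Wiring this into a `token_stream` method is left out.

```ruby
# Hypothetical helper, not part of Ferret: shows the terms a URL analyzer
# along these lines could produce for one url field value.
def url_tokens(url)
  # Drop a leading "protocol://" such as http://, ftp:// or file://.
  rest = url.downcase.sub(%r{\A[a-z][a-z0-9+.-]*://}, '')
  # Split at '.', '/' and any other character that is not a letter or digit
  # (assumption: digits are kept, unlike a pure LetterTokenizer).
  rest.scan(/[a-z0-9]+/)
end

p url_tokens('http://ferret.davebalmain.com')
# prints ["ferret", "davebalmain", "com"]
```

With terms like these in the index, a plain url:ferret query has an exact term to hit, so no wildcard is needed for that case.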

Also keep in mind that you do not have to use Ferret’s Query Parser if
it doesn’t fit your needs - you can always build your own.

Jens

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

ahfeel · April 3, 2007, 2:10pm

Thank you for your useful answer. Even if it’s quite weird behavior
on the part of Ferret’s query parser, I’ll try to go on with that.

Thanks again, Jens, for everything you do for Ferret too!

ahfeel · April 6, 2007, 7:46am

On 4/3/07, ahFeel wrote:

> Thank you for your useful answer. Even if it’s quite weird behavior
> on the part of Ferret’s query parser, I’ll try to go on with that.

I can see why this behaviour may seem a little weird. Unfortunately,
the way phrase queries are implemented, it is impossible to have a
wildcard term within a phrase query. So "http://goo*" treats
http://goo* as a term in a phrase query and runs it through the
analyzer, which then strips the wildcard character '*'.

"http://goo"* is a phrase query with '*' at the end, which doesn’t have
any meaning in the Ferret query language.

http://goo* should work with a WhiteSpaceAnalyzer. The
StandardAnalyzer strips the http:// (or file:/// or ftp://) from the
beginning of terms during analysis. However, when you add a wildcard
character to a query, the term doesn’t get analyzed. So basically the
query http://google.fr will be converted to the query google.fr and
will match. The query http://goo*, on the other hand, is not analyzed,
so it looks for terms starting with http://goo, but there is no
http://google.fr in the index, only google.fr, so you won’t get a
match. Searching for goo* will work, however. What you might like to
try is stripping http:// from your queries with a simple
query.gsub(%r{http://}, '').
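
As a rough illustration of the point above, the following pure-Ruby sketch models the mismatch and the gsub workaround. It does not use Ferret at all: `strip_protocols`, `PROTOCOL_PREFIX` and `wildcard_hits` are hypothetical names, `File.fnmatch` merely stands in for wildcard term matching, and the list of indexed terms is assumed from the description of what the StandardAnalyzer stores.

```ruby
# Hypothetical helper: remove http://, https://, ftp:// and file:///
# prefixes from a query string before handing it to the query parser.
PROTOCOL_PREFIX = %r{\b(?:https?|ftp|file)://+}i

def strip_protocols(query)
  query.gsub(PROTOCOL_PREFIX, '')
end

# Toy stand-in for wildcard matching against the terms in an index.
def wildcard_hits(indexed_terms, pattern)
  indexed_terms.select { |term| File.fnmatch(pattern, term) }
end

# Assumption: terms as indexed once the analyzer has dropped "http://".
indexed_terms = ['google.fr', 'ferret.davebalmain.com']

p wildcard_hits(indexed_terms, 'http://goo*')
# prints []
p wildcard_hits(indexed_terms, strip_protocols('http://goo*'))
# prints ["google.fr"]
puts strip_protocols('type:url AND url:http://goo*')
# prints type:url AND url:goo*
```

Stripping the prefix on the query side keeps the index unchanged; the alternative of indexing the full URL with a whitespace-based analyzer would require reindexing.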

Hope that helps,
Dave