Regexpr. analyzer

hawe · October 27, 2006, 5:15pm

Hi!

I want to index HTML files, but without the tags, so I was thinking I could
either remove them before indexing (expensive) or set up a RegExpAnalyzer.
BTW, when using an analyzer, does that mean that everything it
rejects (i.e. whatever the RegExpAnalyzer doesn’t match) won’t be put into
the index files (i.e. won’t bloat them)?

I came up with a simple test, which didn’t work in acts_as_ferret and
doesn’t work in pure Ferret either. I expected, with the code
below, that only “abc” would be indexed, as only it matches the regexp.
What’s wrong?

@index = Ferret::Index::Index.new(
  :path => 'c:/projects/peter/lib/ferretidx',
  :analyzer => Ferret::Analysis::RegExpAnalyzer.new(/[a-f]/))

@index << {:id => "15", :title => "Programming Ruby",
           :content => "some thing abc"}

@index.search_each('content:"some"') do |id, score|
  puts "Document #{id} found with a score of #{score}"
end

Thanks a lot,
hawe.

hawe · October 27, 2006, 8:32pm

On 27.10.2006, at 17:15, hawe wrote:

I want to index HTML files, but without the tags, so I was thinking I could
either remove them before indexing (expensive) or set up a RegExpAnalyzer.

What’s so expensive about stripping the tags prior to adding the HTML
to the index? I’m not sure which regex engine RegExpAnalyzer uses,
but Ruby’s regex engine is implemented in C, so it shouldn’t make
much of a difference.

BTW, when using an analyzer, does that mean that everything it
rejects (i.e. whatever the RegExpAnalyzer doesn’t match) won’t be put into
the index files (i.e. won’t bloat them)?

Yep. That’s why you should use this analyzer only for the field
that’s used to index the HTML, perhaps by using a PerFieldAnalyzer.

@index << {:id => "15", :title => "Programming Ruby",
           :content => "some thing abc"}

@index.search_each('content:"some"') do |id, score|
  puts "Document #{id} found with a score of #{score}"
end

Consider:

index = Ferret::I.new(
  :analyzer => Ferret::Analysis::RegExpAnalyzer.new(/[a-f]/))

index << "prose"
index << "fade"

index.search("prose").total_hits # -> 2

What happens is that “prose” is reduced to the single token “e”, while
every letter of “fade” survives (as one-letter tokens, since /[a-f]/
matches one character at a time). Ferret uses the same analyzer for
indexing and query parsing. As a consequence, index.search("prose")
becomes index.search("e"), which matches both “fade” and “prose”.
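
To make the effect easier to see without Ferret installed, here is a rough model in plain Ruby. It is only a sketch: regexp_tokens and matches? are hypothetical helpers, String#scan stands in for the tokenizer, and the lowercasing step is an assumption about the analyzer's defaults. It is not Ferret's actual implementation.

```ruby
# Rough stand-in for a regexp-driven tokenizer: every match of the
# pattern becomes one token. Hypothetical helper, not Ferret code.
def regexp_tokens(text, pattern)
  text.downcase.scan(pattern)
end

# A document "matches" when it shares at least one token with the query.
# Hypothetical helper that ignores scoring and phrase handling.
def matches?(doc_tokens, query_tokens)
  !(doc_tokens & query_tokens).empty?
end

PATTERN = /[a-f]/

prose_tokens = regexp_tokens("prose", PATTERN)
fade_tokens  = regexp_tokens("fade", PATTERN)

# The query string goes through the same analysis as the documents.
query_tokens = regexp_tokens("prose", PATTERN)

puts prose_tokens.inspect                 # tokens kept from "prose"
puts fade_tokens.inspect                  # tokens kept from "fade"
puts matches?(prose_tokens, query_tokens) # does the query hit "prose"?
puts matches?(fade_tokens, query_tokens)  # does the query hit "fade"?
```

In this model the query collapses to the same lone “e” token that both documents contain, which is why both count as hits. Using /[a-f]+/ instead would keep “fade” as one token, but “prose” would still shrink to “e”.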

I’d suggest you use a separate tag stripper instead of using
RegExpAnalyzer. Proper tag stripping is not a trivial RegExp,
especially if you’re dealing with non-well-formed documents.
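
For illustration, a deliberately naive stripper in plain Ruby might look like the sketch below. strip_tags is a hypothetical helper name, and the regular expressions assume reasonably tidy markup; as noted above, this approach is not robust against malformed documents.

```ruby
# Naive tag stripper: acceptable for tidy markup, NOT robust against
# malformed HTML. Hypothetical helper for illustration only.
def strip_tags(html)
  text = html.dup
  text.gsub!(/<!--.*?-->/m, " ")                           # drop comments
  text.gsub!(/<(script|style)\b[^>]*>.*?<\/\1\s*>/mi, " ") # drop script/style blocks and their content
  text.gsub!(/<[^>]*>/, " ")                               # drop any remaining tags
  text.gsub(/\s+/, " ").strip                              # collapse whitespace
end

html = "<html><head><style>p { color: red }</style></head>" \
       "<body><p>some <b>thing</b> abc</p></body></html>"

puts strip_tags(html)
```

The stripped string can then be added to the index with the default analyzer, so queries are not distorted. An attribute value that contains a “>” character is already enough to confuse the last-but-one substitution, which is why a real HTML parser is the safer choice for messy input.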

HTH,
Andy