Acts_as_ferret : cannot use a customized Analyzer (as indicated in the AdvancedUsageNotes)

Hi all,

I cannot make aaf (rev. 220) use my custom analyzer, despite following the
instructions at

http://projects.jkraemer.net/acts_as_ferret/wiki/AdvancedUsage

To pinpoint the problem, I created a model + a simple analyzer with 2 stop
words : “fax” and “gsm”.

test 1 : model.rebuild_index + model.find_by_contents(“fax”) # fax is a stop word.
=> I get a result when I should not.

(note : I delete the index directory => I can see the index is recreated,
index/develop ).

test 2 : insert a ‘raise’ in the token_stream() method => it’s never thrown.

test 3 : use the standard analyzer, to exclude the 2 stop words => same
wrong result.

class AccessPointKind2 < ActiveRecord::Base
  set_table_name "access_point_kinds2"

  acts_as_ferret(
    { :remote => true, :fields => { :name => { :store => :yes } } },
    { :analyzer => Ferret::Analysis::StandardAnalyzer.new(["fax", "gsm"]) }
  )
end

Here are the model and the analyzer :
MODEL :

class AccessPointKind2 < ActiveRecord::Base
  set_table_name "access_point_kinds2"

  acts_as_ferret(
    { :remote => true, :fields => { :name => { :store => :yes } } },
    { :analyzer => PlainAsciiAnalyzer.new }
  )
end

ANALYZER
lib : plain_ascii_analyzer.rb

class PlainAsciiAnalyzer < ::Ferret::Analysis::Analyzer
  include ::Ferret::Analysis

  def token_stream(field, str)
    StopFilter.new(
      StandardTokenizer.new(str),
      ["fax", "gsm"]
    )
    # raise <<<----- is never executed when uncommented !!
  end
end
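
For reference, the result expected from the analyzer above can be sketched in plain Ruby, without Ferret. This is only a stand-in under stated assumptions: `simple_tokens`, `stop_filter` and `analyze` are hypothetical helpers, and splitting on non-alphanumeric characters is much cruder than what StandardTokenizer does. It only shows that, if the stop list were applied, a query for “gsm” would be left with no term to search for.

```ruby
# Plain-Ruby stand-in for "tokenizer + stop filter" (not Ferret code).
STOP_WORDS = %w[fax gsm].freeze

# Assumption: a token is any run of alphanumeric characters.
def simple_tokens(str)
  str.scan(/[[:alnum:]]+/)
end

# Drop every token that appears in the stop list (case-sensitive here).
def stop_filter(tokens, stop_words = STOP_WORDS)
  tokens.reject { |token| stop_words.include?(token) }
end

def analyze(str)
  stop_filter(simple_tokens(str))
end

p analyze("gsm")       # => []
p analyze("gsm(werk)") # => ["werk"]
```

With the stop list in effect, the five hits shown in the console session below would not be reachable through the term “gsm”.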

In the console, I rebuild the index + search for a stop word => I get
results when I should not :

reload!; AccessPointKind2.rebuild_index ; AccessPointKind2.find_by_contents("gsm").collect &:name
Reloading…
AccessPointKind2 Columns (0.002963) SHOW FIELDS FROM access_point_kinds2
Asked for a remote server ? true, ENV["FERRET_USE_LOCAL_INDEX"] is nil, looks like we are not the server
Will use remote index server which should be available at druby://localhost:9010
default field list: [:name]
AccessPointKind2 Load (0.002706) SELECT * FROM access_point_kinds2 WHERE (access_point_kinds2.id in ('7','12','13','8','2'))
Query: gsm
total hits: 5, results delivered: 5
=> ["gsm", "gsm", "gsm(werk)", "gsm(privé)", "gsm(privé)"]

I guess it’s obvious, but I cannot see it.
Help.

Thanks in advance.

Alain

Hi,

I just tried and I’m afraid I couldn’t reproduce your problem here (with
aaf trunk). I just committed a testcase using StandardAnalyzer with your
stop word list, and it works as intended. I also tried with your
analyzer class from below, same result.

Could you please try the latest aaf from trunk to see if it fixes your
problem?

Cheers,
Jens

On Tue, Nov 13, 2007 at 01:47:04PM +0100, Alain R. wrote:


Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk


Jens Krämer
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

remark : some spaces were erroneously inserted before the word “the”
when I formatted the email, and are not present in the real code.

So

=> #<Country id: 11, name: "  the">

reload! ; Country.rebuild_index ; Country.find_by_contents("
the")

should read :

=> #<Country id: 11, name: "the">

reload! ; Country.rebuild_index ; Country.find_by_contents("the")

Jens,

I just tried and I’m afraid I couldn’t reproduce your problem here (with
aaf trunk). …
Could you please try the latest aaf from trunk to see if it fixes your
problem?

Same problem after installing the latest version (262) of aaf : the custom
analyzer I pass as an aaf parameter is not used.

As a quick test, I tried using the “No Stop Word” custom analyzer as
documented @
http://projects.jkraemer.net/acts_as_ferret/wiki/AdvancedUsage
on a simple LUT table/model, to no avail.
I tried the new syntax with the same wrong result.

Setup :

  • I’ve installed the latest trunk version of aaf (262)
  • killed + restarted a (new) DrB server
    $ ./script/ferret_server -e production start
  • checked the Ferret version :
    $ gem list ferret ==> ferret (0.11.4)

Test :

I created a record where the name is a default stop word

Country.find 11

  Country Load (0.000388)   SELECT * FROM countries WHERE (countries.id = 11)
=> #<Country id: 11, name: " the">

model, way 1 :

class Country < ActiveRecord::Base
  acts_as_ferret(
    { :fields => [:name] },
    { :analyzer => Ferret::Analysis::StandardAnalyzer.new([]) }
  )
end

model, way 2 :

class Country < ActiveRecord::Base
  acts_as_ferret(
    :fields => [:name],
    :remote => true,
    :ferret => { :analyzer => Ferret::Analysis::StandardAnalyzer.new([]) }
  )
end

PROBLEM : in both cases it doesn’t find any record where the name is
‘the’

reload! ; Country.rebuild_index ; Country.find_by_contents("
the")

reload! ; Country.rebuild_index ; Country.find_by_contents("the")
Reloading…
Asked for a remote server ? true, ENV["FERRET_USE_LOCAL_INDEX"] is nil, looks like we are not the server
Will use remote index server which should be available at druby://localhost:9010
default field list: [:name]
Query: the
total hits: 0, results delivered: 0
=> #<ActsAsFerret::SearchResults:0x324ab3c @per_page=0, @current_page=nil, @total_hits=0, @results=[], @total_pages=0>

I tried with my custom analyzer (from the previous message), with the same
wrong result.

So, it looks like aaf is not using the custom analyzer I declared in the model.
It doesn’t make any sense to me.

Alain R.

I’m one step further :

  • Good : I now know aaf knows about/received the custom analyzer, but
  • Bad : the analyzer is not used by aaf (it stops on words it should
    not stop on)

New test : a “no stop word” analyzer, adapted from the German stemming
analyzer @
http://projects.jkraemer.net/acts_as_ferret/wiki/AdvancedUsage

file: model/country.rb

class Test2Analyzer < ::Ferret::Analysis::Analyzer
  include Ferret::Analysis

  def initialize(stop_words = [])
    @stop_words = stop_words
  end

  def token_stream(field, str)
    StemFilter.new(
      StopFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)), @stop_words),
      'de'
    )
  end
end

class Country < ActiveRecord::Base
  acts_as_ferret(
    :fields => [:name],
    :remote => true,
    :ferret => { :analyzer => Test2Analyzer.new([]) }
  )
end

0°/ delete the ferret index directory
1°/ restart the console and rebuild the index :

./script/console

Country.rebuild_index
Asked for a remote server ? true, ENV["FERRET_USE_LOCAL_INDEX"] is nil, looks like we are not the server
Will use remote index server which should be available at druby://localhost:9010
default field list: [:name]
=> nil

2°/ confirm that aaf knows about my “no_stop_words” custom analyzer :

puts Country.aaf_index.to_yaml
--- !ruby/object:ActsAsFerret::RemoteIndex
config:
  :fields:
  - :name
  :mysql_fast_batches: true
  :name: countries
  :class_name: Country
  :index_dir: /Users/aravet/aaprojets/newgids/newgids_machine/index/development/country
  :remote: druby://localhost:9010
  :reindex_batch_size: 1000
  :store_class_name: false
  :ferret_fields:
    :name:
      :store: :no
      :term_vector: :with_positions_offsets
      :boost: 1.0
      :index: :yes
      :highlight: :yes
  :single_index: false
  :ferret: &id001
    :key: :id
    :auto_flush: true
    :or_default: false
    :path: /Users/aravet/aaprojets/newgids/newgids_machine/index/development/country
    :create_if_missing: true
    :handle_parse_errors: true
    :analyzer: !ruby/object:Test2Analyzer      <<<<----------- Good
      stop_words: []                           <<<<----------- Good
    :default_field:
    - :name
  :enabled: true
ferret_config: *id001
server: !ruby/object:DRb::DRbObject
  ref:
  uri: druby://localhost:9010
=> nil

3°/ confirm that there is a record with name == “the”

Country.find_by_name "the"
Country Load (0.000427) SELECT * FROM countries WHERE (countries.name = 'the') LIMIT 1
=> #<Country id: 11, name: "the">

4°/ try to find it with aaf, searching for “t*”
=> DOES NOT WORK (does not find Country[:name => “the”])

Country.find_by_contents "t*"
Query: t*
total hits: 0, results delivered: 0
=> #<ActsAsFerret::SearchResults:0x31ff754 @per_page=0, @current_page=nil, @total_hits=0, @results=[], @total_pages=0>

5°/ do the same for “f*”, which matches a non stop word
=> IT WORKS (finds Country[:name => “Frankrijk”])

Country.find_by_contents "f*"
Country Load (0.000420) SELECT * FROM countries WHERE (countries.id in ('2'))
Query: f*
total hits: 1, results delivered: 1
=> #<ActsAsFerret::SearchResults:0x31fa4ac @per_page=1, @current_page=nil, @total_hits=1, @results=[#<Country id: 2, name: "Frankrijk">], @total_pages=1>

So, aaf (rev 262)

  • associates the right custom analyzer with the model,
  • but doesn’t seem to use it when finding_by_contents (? and rebuilding
    the index ??)

Alain

Alain R. wrote:

class Country < ActiveRecord::Base
  acts_as_ferret(
    :fields => [:name],
    :remote => true,
    :ferret => { :analyzer => Test2Analyzer.new([]) }
  )
end

Try this:

acts_as_ferret({ :fields => [:name], :remote => true },
               { :analyzer => Test2Analyzer.new([]) })

On Thu, Nov 15, 2007 at 12:24:25AM +0100, Hongli L. wrote:

acts_as_ferret({ :fields => [:name], :remote => true },
{ :analyzer => Test2Analyzer.new([]) })

this won’t help, these are both valid ways to call acts_as_ferret. The
:ferret syntax is the preferred one, however.

Jens


Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
[email protected] | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

Hi Alain,

could you please check the index created by aaf with plain ferret and
your custom analyzer to see if your queries deliver the expected results
then?

That way we should be able to find out if the problem is with indexing
or searching through aaf.

Jens
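
The check suggested above (is the term lost at indexing time or at search time?) can be illustrated with a toy inverted index in plain Ruby. This is a sketch under stated assumptions, not Ferret or aaf code: `ToyIndex`, `build_index`, the three-word stop list and the two analyzers are hypothetical stand-ins, and the documents are the two countries mentioned in this thread.

```ruby
# Toy inverted index, only to illustrate the index-time vs query-time check.
class ToyIndex
  def initialize(analyzer)
    @analyzer = analyzer
    @postings = Hash.new { |hash, token| hash[token] = [] }
  end

  # Analyze the text and record the document id under each resulting token.
  def add(id, text)
    @analyzer.call(text).each { |token| @postings[token] << id }
  end

  # Analyze the query (optionally with a different analyzer) and collect ids.
  def search(query, query_analyzer = @analyzer)
    query_analyzer.call(query).flat_map { |token| @postings.fetch(token, []) }.uniq
  end
end

STOPS      = %w[the a an].freeze # assumption: tiny stand-in for a default stop list
TOKENIZE   = ->(s) { s.downcase.scan(/[[:alnum:]]+/) }
WITH_STOPS = ->(s) { TOKENIZE.call(s) - STOPS }
NO_STOPS   = TOKENIZE
DOCS       = { 11 => "the", 2 => "Frankrijk" }.freeze

def build_index(analyzer)
  index = ToyIndex.new(analyzer)
  DOCS.each { |id, text| index.add(id, text) }
  index
end

good_index = build_index(NO_STOPS)   # indexed without any stop list
bad_index  = build_index(WITH_STOPS) # indexed with stop words removed

p good_index.search("the", NO_STOPS)   # => [11]  both sides agree
p bad_index.search("the", NO_STOPS)    # => []    the term never reached the index
p good_index.search("the", WITH_STOPS) # => []    the query lost its only term
p good_index.search("frankrijk")       # => [2]   non stop words are unaffected
```

In this toy model, querying the existing index with the intended analyzer separates the two failure modes: a hit means the index is fine and the search side is at fault, while no hit means the term was already dropped when the index was built.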

On Thu, Nov 15, 2007 at 12:00:04AM +0100, Alain R. wrote:


Jens K. [email protected] writes:

Try this:

acts_as_ferret({ :fields => [:name], :remote => true },
{ :analyzer => Test2Analyzer.new([]) })

this won’t help, these are both valid ways to call acts_as_ferret. The
:ferret syntax is the preferred one, however.

Just for information, I was using an old or bad syntax for aaf.

I was using acts_as_ferret :fields [], :analyzer => MyAnalyzer.new
and it wasn’t working. (A raise in MyAnalyzer’s initialize did fire, but
one in token_stream did not.)

I’m now using :ferret => {:analyzer => MyAnalyzer} and it works as
expected.
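
To picture why a flat `:analyzer` key could be ignored without any error, here is a purely hypothetical sketch in plain Ruby. It is NOT acts_as_ferret's actual code: `ferret_settings` and `DEFAULT_FERRET_OPTIONS` are invented for illustration, and symbols stand in for analyzer instances. It only assumes a declaration helper that reads Ferret settings from a second hash or from a `:ferret` key, and from nowhere else.

```ruby
# Hypothetical sketch, NOT acts_as_ferret's actual code.
DEFAULT_FERRET_OPTIONS = { :analyzer => :default_analyzer }.freeze

def ferret_settings(options = {}, ferret_options = {})
  # Only these two sources are consulted; other top-level keys are ignored.
  DEFAULT_FERRET_OPTIONS.merge(ferret_options).merge(options[:ferret] || {})
end

custom = :my_analyzer # stands in for MyAnalyzer.new

# Flat style: :analyzer sits next to :fields and is never looked at.
p ferret_settings(:fields => [:name], :analyzer => custom)[:analyzer]
# => :default_analyzer

# Two-hash style and :ferret style both reach the Ferret settings.
p ferret_settings({ :fields => [:name] }, { :analyzer => custom })[:analyzer]
# => :my_analyzer
p ferret_settings(:fields => [:name], :ferret => { :analyzer => custom })[:analyzer]
# => :my_analyzer
```

Under that assumption, the analyzer object is still constructed in the flat style (so a raise in initialize fires), but it never becomes part of the settings used for indexing or searching.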