ANN: acts_as_ferret

weibel · December 14, 2005, 7:37pm

First post here! Here’s my question:

I have several related Category objects that all belong_to a Job
object. When a new Job object is to be created, a user will have to
click on the CSS tabs that I have set up with link_to action methods.
I do not want the data from the forms to be persisted until all the
sections are complete and the user clicks “Create Project”. Also, I
want the controller to dynamically store/update each view’s session
data when any tab is arbitrarily selected.

For Example, the form tabs resemble this:

Art Details | Dev Details | Marketing Details

So when I am finished with “Art Details” and click on “Dev Details”,
I want to store that form data in a session - the same for other tabs
when the new view is selected via clicking on a new tab.

I considered using a pseudo-cart type of object to store the Project’s
“Details” objects and their associated attributes, but this doesn’t
really work for this model because the child objects of Project
will not know about their association or foreign keys until they are
persisted. Moreover, it would seem logical that I just store the
post variables in some object, but then how would I restore those
values in the fields if they go back to a previous tab?

Here’s my Object model

Project
|__
|
ArtDetails belongs_to Project
DevDetails belongs_to Project
MarketingDetails belongs_to Project

Any suggestions? TIA!

weibel · December 14, 2005, 7:49am

On 12/14/05, jennyw wrote:

    end
    alias :ferret_update :ferret_create

Hi Jenny,

Glad to hear you like Ferret.

Note that I’ve added a key option to the index:

@@index ||= Index::Index.new(:key => [:id, :ferret_class],

This will ensure that the index is kept unique for these fields, i.e.
every time I do an update the old document will be automatically
deleted. This only happens when you set the key option.
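
To make the effect of the key option concrete, here is a toy model in plain Ruby. It is not Ferret's implementation or API: `ToyKeyedIndex` and its methods are hypothetical names, and documents are plain hashes. It only illustrates the behaviour described above, where adding a document whose key fields match an existing one replaces the old document.

```ruby
# Toy model of a keyed index (hypothetical, not Ferret's API).
class ToyKeyedIndex
  def initialize(key_fields)
    @key_fields = key_fields   # e.g. [:id, :ferret_class]
    @docs = {}                 # key values => document
  end

  # Adding a document with an existing key overwrites the old one,
  # which mimics "the old document will be automatically deleted".
  def <<(doc)
    @docs[@key_fields.map { |f| doc[f] }] = doc
    self
  end

  def size
    @docs.size
  end

  def docs
    @docs.values
  end
end

index = ToyKeyedIndex.new([:id, :ferret_class])
index << { :id => 1, :ferret_class => "Post", :title => "first draft" }
index << { :id => 1, :ferret_class => "Post", :title => "final version" }
index << { :id => 1, :ferret_class => "Comment", :title => "nice post" }

# The Post was replaced rather than duplicated; the Comment has a
# different key and is kept as a separate document.
puts index.size
```

Without a key, the same three additions would leave three documents behind, which is why an explicit delete is needed before re-adding in that case.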

Also, a question for David – is auto_flush => true supposed to remove
the lock automatically after writes?

Yes, that is the way it is supposed to work.

I ask because I also tried the
code that Kasper originally posted, and I kept getting locking errors
unless I closed the index after updates (and I also wasn’t quite able to
get that code to work before giving up). I was running both a Web
instance and trying to get at it with console, which is similar, I
think, to what would happen with multiple FCGI processes.

Have you tried it with the latest version of Ferret? 3.0 had a few
bugs but 3.1 should be fine. Let me know if you are still getting lock
errors.

Cheers,
Dave

weibel · December 14, 2005, 8:07pm

On 12/15/05, Erik H. wrote:

If you do go the route of searching with ActiveRecord first and using
those results to constrain the Ferret search, consider using a Filter
(not sure how that is implemented in Ferret, but in Java Lucene there
are overloaded search methods that accept a Filter).

Filters are implemented in Ferret the same way as they are in Java.
They’re unit tested but I haven’t used them very much and I don’t
suspect many other people have yet either. But they’re there if you
need them. You pass a filter object as one of the options to any of
the search methods.
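
As a rough illustration of the idea only (this is not Ferret's Filter API, whose classes and option names are not shown in this thread), the sketch below constrains full-text hits to a set of ids that a database query has already selected. `ToyIdFilter`, `toy_search` and the sample data are all hypothetical.

```ruby
require 'set'

# Hypothetical filter: only lets through documents whose id was
# pre-selected, for example by an ActiveRecord query.
class ToyIdFilter
  def initialize(allowed_ids)
    @allowed = Set.new(allowed_ids)
  end

  def allow?(doc)
    @allowed.include?(doc[:id])
  end
end

# Toy full-text search: a substring match over :text, optionally
# narrowed by a filter object passed in the options hash.
def toy_search(docs, term, options = {})
  hits = docs.select { |d| d[:text].include?(term) }
  filter = options[:filter]
  filter ? hits.select { |d| filter.allow?(d) } : hits
end

DOCS = [
  { :id => 1, :text => "ferret indexing in rails" },
  { :id => 2, :text => "ferret filters" },
  { :id => 3, :text => "active record conditions" }
]

ids_from_db = [2, 3]   # stand-in for ids returned by a database query
p toy_search(DOCS, "ferret", :filter => ToyIdFilter.new(ids_from_db))
```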

Dave

weibel · December 14, 2005, 7:49pm

On Dec 14, 2005, at 3:06 AM, Abdur-Rahman A. wrote:

activerecords), really depends on the use of conditions…
My recommendation is to index the fields you want to use as search
criteria into Ferret rather than trying to mix and match Ferret and
ActiveRecord searches. Optimizing the two will be tricky - would it
be quicker to search with Ferret and then pull from the DB or
constrain the set by the DB first then full-text search on those?
My hunch is that no database will have better performance than the
potential fully optimized Ferret. It’s certainly true in the Java
Lucene that it is as fast and usually faster than a relational
database for querying.

If you do go the route of searching with ActiveRecord first and using
those results to constrain the Ferret search, consider using a Filter
(not sure how that is implemented in Ferret, but in Java Lucene there
are overloaded search methods that accept a Filter).

Erik

weibel · December 14, 2005, 8:19pm

On 14-dec-2005, at 19:48, Erik H. wrote:

can’t really think of what would be faster (searching ferret first
database for querying.

If you do go the route of searching with ActiveRecord first and
using those results to constrain the Ferret search, consider using
a Filter (not sure how that is implemented in Ferret, but in Java
Lucene there are overloaded search methods that accept a Filter).

Maybe someone can help me finish
http://www.julik.nl/code/active-search/classes/ActiveSearch/FerretIndexer.html?
I am sorting out the kinks but I am stumbling upon

RuntimeError: could not obtain lock:

and I should admit I am absolutely lost in how to handle concurrency
with Ferret.

–
Julian ‘Julik’ Tarkhanov
me at julik.nl

weibel · December 15, 2005, 3:18am

David B. <dbalmain.ml@…> writes:

Also, I don’t know if you meant to use symbols but you shouldn’t use
‘:’ in a field name as it will throw off the query parser. Get rid
of the ‘"’ around :ferret_class and :id and you’ll be fine.
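
Why a ':' in a field name is a problem can be seen with a toy parser. This is only a sketch, assuming a query syntax of the form field:term like the +id:... queries used elsewhere in this thread; `toy_parse_clause` is a hypothetical helper and not Ferret's query parser.

```ruby
# Hypothetical helper: split a single "field:term" clause at the first colon.
def toy_parse_clause(clause)
  field, term = clause.split(':', 2)
  { :field => field, :term => term }
end

# A plain field name parses as intended: field "id", term "5".
p toy_parse_clause("id:5")

# A field literally named ":id" leaves the parser with an empty field
# name and the rest of the clause as the term.
p toy_parse_clause(":id:5")
```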

Now that I think about it, I was confused for a bit about the keys
defined and was having trouble doing lookups. It turned out to be a
different problem, but in my search for a way to fix it, I changed
those field names to match. (I even tried just using symbols, but it
seems that Ferret didn’t like that too much. Should symbols be an
allowable option for a field name?)

Ferret’s truly a great piece of work and the documentation is already
quite good, but I think there’s a lot more needed to make it fully
accessible. Hopefully as more of us dig in, we can add to what’s there.
I guess that’s a topic for the Ferret mailing list, though. ;~)

Thomas

weibel · December 15, 2005, 3:12am

David B. <dbalmain.ml@…> writes:

Also, I don’t know if you meant to use symbols but you shouldn’t use
‘:’ in a field name as it will throw off the query parser. Get rid
of the ‘"’ around :ferret_class and :id and you’ll be fine.

Yeah, I realized this one a little while after I pasted it. I had them
as strings and had reverted back to the “:” prefixed names in an
attempt to see if that was causing a problem I was having. I guess I
pasted it a little too soon.

I made both these changes on the wiki already.

Great!

One other change you may like to make is to allow Query objects to be
passed to the find_by_contents method as well as Strings, but I’ll
leave that one up to you for the moment.

Yeah, that was the other thing I had started working on but didn’t want
to paste in yet. I had an implementation of it, but it was ugly, so I’m
reworking it a bit and hope to have that in place over the weekend.

Hope that helps,
Dave

Thanks again for developing Ferret. I’ve been waiting for this ever
since I first started playing with Ruby and saw Erik’s registered
(though, sadly, never completed) rlucene project.

Thomas

weibel · December 15, 2005, 3:21am

jennyw <jennyw@…> writes:

It’s so great that people are working on this! Ferret is great and I
look forward to seeing it better integrated with Rails.

Thomas – I tried this code but experienced a few problems with it. I
never got it to work, and gave up since it’s not exactly what I need
(the documents I’m storing in Ferret don’t exactly match my model
objects, but are a composite of them). Still, I have some feedback that
might (or might not) be helpful.

As I (think I) mentioned in my note on the wiki, the code I put there
definitely was buggy. I just wanted to put it out in case anyone else
wanted to start taking a stab at it. I’ll have a newer version sometime
next week, I hope.

In addition to what David mentioned, I noticed that you use the method
class_variable_set in the method acts_as_ferret. This isn’t available in
Ruby 1.8.2. Moreover, I’m not sure why you’re using this here since the
variable names are not dynamic. I just changed these to:
         @@fields_for_ferret = Array.new
         @@class_index_dir = configuration[:index_dir]

I’m not sure why I did that either. :-/ Guess I was just trying to get
anything to work at that point. I’ll implement your fix.
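
For illustration, the two forms end up doing the same thing when the variable name is fixed. The sketch below uses a hypothetical `Widget` class and a made-up index path. Note that it is meant for current Ruby versions, where `class_variable_set` is available, which was not the case on Ruby 1.8.2 as mentioned above.

```ruby
class Widget
  # Direct assignment: enough when the variable name is known in advance.
  @@fields_for_ferret = Array.new

  # Dynamic form: only needed when the name is computed at runtime.
  class_variable_set(:@@class_index_dir, "/tmp/ferret_index")

  def self.fields_for_ferret
    @@fields_for_ferret
  end

  def self.class_index_dir
    @@class_index_dir
  end
end

p Widget.fields_for_ferret   # the empty array assigned directly
p Widget.class_index_dir     # the path set through class_variable_set
```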

Also, I noticed that the indentation on the class method append_features
was a bit off … it looked like super was the beginning of a block.
Just a minor thing.

I fixed a few indentation problems when I added it to the wiki, but must
have missed that one. Thanks.

Also, I’m confused about the name for the SingletonMethods module. What
is the singleton that’s being referred to here?

I adopted that from the plugin howtos on the rails wiki:
http://wiki.rubyonrails.org/rails/pages/HowToWriteAnActsAsFoxPlugin

weibel · December 15, 2005, 6:28am

On 15-dec-2005, at 5:59, David B. wrote:

Using the latest version of ferret and setting :auto_flush => true
should solve this problem. Have you tried that? It only works in
Index::Index though and it’s not necessary for an IndexSearcher. If
you use IndexWriter and IndexReader directly you have to handle it
yourself.

David, thanks for the advice - I’ll try that and report the results.
Basically, it feels sort of odd doing this macro-style Ferret
binding. Ferret is so vast and powerful that this would not be enough
to make use of all of its features. Maybe you can send me some advice
off-list on how I could expand the API of the FerretIndexer to give
more access to the most needed Ferret features in a convenient way
(without making it too big, because the whole idea of the plugin is a
one-liner integration into a model, not a document cluster with 10
million entries in it).

If someone else wants to shed some light (or help with code) I would
be glad to get some help, I am swamped now and won’t be able to get
to it until at least next week.

–
Julian ‘Julik’ Tarkhanov
me at julik.nl

weibel · December 15, 2005, 6:01am

On 12/15/05, Julian ‘Julik’ Tarkhanov wrote:

maybe could someday became a method for rails itself), what is
My hunch is that no database will have better performance than the
search/classes/ActiveSearch/FerretIndexer.html? I am sorting out the
kinks but I am stumbling upon

RuntimeError: could not obtain lock:

and I should admit I am absolutely lost in how to handle concurrency
with Ferret.

Using the latest version of ferret and setting :auto_flush => true
should solve this problem. Have you tried that? It only works in
Index::Index though and it’s not necessary for an IndexSearcher. If
you use IndexWriter and IndexReader directly you have to handle it
yourself.
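
The locking behaviour discussed here can be pictured with a toy model. This is only a sketch in plain Ruby with hypothetical names (`ToyLockedIndex`, a `write.lock` file in a temporary directory), and it is not how Ferret implements its locks. It assumes, as described in this thread, that a writer keeps the lock until it is flushed or closed, and that auto-flushing releases it after each write.

```ruby
require 'tmpdir'

# Toy index that guards writes with a lock file (hypothetical model).
class ToyLockedIndex
  def initialize(dir, options = {})
    @lock_path  = File.join(dir, "write.lock")
    @auto_flush = options[:auto_flush]
    @locked     = false
  end

  def <<(doc)
    obtain_lock unless @locked
    # ... the document would be written here ...
    flush if @auto_flush
    self
  end

  # Release the lock so other processes can write.
  def flush
    File.delete(@lock_path) if @locked
    @locked = false
  end

  private

  def obtain_lock
    # File::EXCL makes the open fail if another writer created the file.
    File.open(@lock_path, File::WRONLY | File::CREAT | File::EXCL) { }
    @locked = true
  rescue Errno::EEXIST
    raise "could not obtain lock: #{@lock_path}"
  end
end

# Returns true if a second writer can write after the first one has.
def second_writer_succeeds?(auto_flush)
  Dir.mktmpdir do |dir|
    first  = ToyLockedIndex.new(dir, :auto_flush => auto_flush)
    second = ToyLockedIndex.new(dir, :auto_flush => auto_flush)
    first << "doc from the web process"
    begin
      second << "doc from the console"
      true
    rescue RuntimeError
      false
    ensure
      first.flush
      second.flush
    end
  end
end

puts second_writer_succeeds?(false)   # the first writer still holds the lock
puts second_writer_succeeds?(true)    # the lock was released after the write
```

In this model the web process and the console session play the roles of the two writers, which matches the situation with several FCGI processes mentioned earlier in the thread.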

weibel · December 15, 2005, 6:55am

Hi Julian,

I’m really busy porting everything in Ferret to C at the moment. Next
year though I should have some time to play around with integrating it
into Rails. Until then I’ll try and be as helpful as possible to
others trying to do the same thing. Good luck!

Cheers,
Dave

weibel · December 15, 2005, 12:05pm

Hello!

I have been following this thread carefully, ferret just got a little
easier to dive into. Kudos to you guys, and especially to the authors of
ferret! This was just what we needed here at our little webdev shop.

Now I have a problem you guys might know a solution to. I have managed
to get the code from the wiki working, with a little bit of tweaking,
but it does not seem to build queries correctly when it gets fed with
UTF-8 characters. Is this a fault on my side or a known issue with
ferret? I looked at the trac but it seemed it should support UTF-8? I
must have overlooked something…

I didn’t dare to touch the wiki, but here is a somewhat altered version
of the plugin, and it should be fully functional. I added some small
things, since we wanted a counter for the Paginator. I know though that
doing a full-out-search just to count might not be the best way to
count, so if anyone has a suggestion to better this, please share!

Oh, and I added a rake task to rebuild the index, but it relies on the
INDEX_PATH being set in the environment.rb

Here it is

CODE for acts_as_ferret.rb

    require 'active_record'
    require 'ferret'

    module FerretMixin
      module Acts #:nodoc:
        module ARFerret #:nodoc:

          def self.append_features(base)
            super
            base.extend(MacroMethods)
          end

          # declare the class level helper methods
          # which will load the relevant instance methods defined below
          # when invoked
          module MacroMethods

            def acts_as_ferret
              extend FerretMixin::Acts::ARFerret::ClassMethods
              class_eval do
                include FerretMixin::Acts::ARFerret::ClassMethods

                after_create :ferret_create
                after_update :ferret_update
                after_destroy :ferret_destroy
              end
            end

          end

          module ClassMethods
            include Ferret
            INDEX_PATH = "#{RAILS_ROOT}/db/ferret"
            def self.reloadable?; false end

            # Finds instances by file contents.
            def find_by_ferret(query, options = {})
              @@index_searcher ||= Search::IndexSearcher.new(INDEX_PATH)
              @@query_parser   ||=
                QueryParser.new(@@index_searcher.reader.get_field_names.to_a)
              query = @@query_parser.parse(query)
              result = []
              conditions = {}
              conditions[:num_docs] = options[:limit] unless options[:limit].blank?
              conditions[:first_doc] = options[:offset] unless options[:offset].blank?

              hits = @@index_searcher.search(query, conditions)
              hits.each do |hit, score|
                id = @@index_searcher.reader.get_document(hit)['id']
                result << self.find(id) unless id.nil?
              end
              return result
            end

            def count_by_ferret(query)
              @@index_searcher ||= Search::IndexSearcher.new(INDEX_PATH)
              @@query_parser   ||=
                QueryParser.new(@@index_searcher.reader.get_field_names.to_a)
              query = @@query_parser.parse(query)
              return @@index_searcher.search(query).total_hits
            end

            # private

            def ferret_create
              # code to update or add to the index
              @@index ||= Index::Index.new(:path => INDEX_PATH,
                                           :auto_flush => true)
              @@index << self.to_doc
            end

            def ferret_update
              @@index ||= Index::Index.new(:path => INDEX_PATH,
                                           :auto_flush => true)
              @@index.query_delete("+id:#{self.id} +ferret_table:#{self.class.table_name}")
              @@index << self.to_doc
            end

            def ferret_destroy
              # code to delete from index
              @@index ||= Index::Index.new(:path => INDEX_PATH,
                                           :auto_flush => true)
              @@index.query_delete("+id:#{self.id} +ferret_table:#{self.class.table_name}")
            end

            def to_doc
              # Churn through the complete Active Record and add it to
              # the Ferret document
              doc = Ferret::Document::Document.new
              doc << Ferret::Document::Field.new('ferret_table',
                       self.class.table_name,
                       Ferret::Document::Field::Store::YES,
                       Ferret::Document::Field::Index::UNTOKENIZED)
              self.attributes.each_pair do |key,val|
                if key == 'id'
                  doc << Ferret::Document::Field.new(key, val.to_s,
                           Ferret::Document::Field::Store::YES,
                           Ferret::Document::Field::Index::UNTOKENIZED)
                else
                  doc << Ferret::Document::Field.new(key, val.to_s,
                           Ferret::Document::Field::Store::NO,
                           Ferret::Document::Field::Index::TOKENIZED)
                end
              end
              return doc
            end
          end
        end
      end
    end

    # reopen ActiveRecord and include all the above to make
    # them available to all our models if they want it
    ActiveRecord::Base.class_eval do
      include FerretMixin::Acts::ARFerret
    end

END acts_as_ferret.rb
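
The plugin structure in the listing above (a macro method that wires class-level finders and instance-level callbacks into a model) can be tried out without Rails or Ferret. The following is a minimal sketch under that assumption: `ToyModel`, `ActsAsToyIndexed`, `acts_as_toy_indexed` and the in-memory `INDEX` hash are hypothetical stand-ins, and it uses the `included` hook rather than the `append_features` override used in the listing.

```ruby
module ActsAsToyIndexed
  INDEX = {}   # in-memory stand-in for the real index: id => text

  # Hook run when the module is included; adds the macro to the class.
  def self.included(base)
    base.extend(MacroMethods)
  end

  module MacroMethods
    # Calling this in a class body opts the class in.
    def acts_as_toy_indexed
      extend ClassMethods
      include InstanceMethods
    end
  end

  module ClassMethods
    # Returns the ids of all indexed records whose text contains the term.
    def find_ids_by_toy_index(term)
      INDEX.select { |_id, text| text.include?(term) }.keys
    end
  end

  module InstanceMethods
    # In Rails this would be registered as an after_create/after_update callback.
    def toy_index_update
      INDEX[id] = title
    end
  end
end

class ToyModel
  include ActsAsToyIndexed
  acts_as_toy_indexed

  attr_reader :id, :title

  def initialize(id, title)
    @id, @title = id, title
  end
end

ToyModel.new(1, "ferret and rails").toy_index_update
ToyModel.new(2, "plain ruby").toy_index_update
p ToyModel.find_ids_by_toy_index("ferret")
```

Splitting class-level and instance-level methods into two modules, as in this sketch, keeps finders off the instances and callbacks off the class; the listing above puts both kinds into a single ClassMethods module instead.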

RAKE TASK in /lib/tasks/indexer.rake

    include FileUtils

    desc "Perform ferret index"
    task :indexer => :environment do
      if !File.exist?(INDEX_PATH)
        puts "Creating index dir in #{INDEX_PATH}"
        FileUtils.mkdir_p(INDEX_PATH)
      end

      classes = []
      Dir.glob(File.join(RAILS_ROOT,"app","models","*.rb")).each do |rbfile|
        bname = File.basename(rbfile,'.rb')
        classname = Inflector.camelize(bname)
        classes.push(classname)
      end
      classes.each do |class_obj|
        c = eval(class_obj)
        if c.respond_to?(:ferret_create)
          puts "REBUILDING #{c.name}"
          c.find_all.each{|cn|cn.save}
        end
      end
    end

weibel · December 15, 2005, 4:00pm

Hi Albert,

Perhaps you could do something like this in the find_by_ferret method
and get rid of your count_by_ferret method. Just an idea.

         total_hits = hits.each do |hit, score|
            id = @@index_searcher.reader.get_document(hit)['id']
            result << self.find(id) unless id.nil?
         end
         return result, total_hits
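
To show how a caller would use the two return values, here is a small self-contained sketch. `ToyHits` is a hypothetical stand-in for the search result object; it assumes, as the snippet above implies, that `each` returns the total number of hits rather than only the number of documents in the current page. `RECORDS` replaces the database lookup.

```ruby
# Hypothetical stand-in for a page of search results.
class ToyHits
  def initialize(page, total_hits)
    @page = page               # [doc_id, score] pairs for this page only
    @total_hits = total_hits   # matches across the whole index
  end

  # Yields each hit of the page and returns the overall hit count
  # (an assumption taken from the snippet above).
  def each
    @page.each { |doc_id, score| yield doc_id, score }
    @total_hits
  end
end

RECORDS = { 7 => "Post 7", 9 => "Post 9" }   # stand-in for self.find(id)

def toy_find_by_ferret(hits)
  result = []
  total_hits = hits.each do |doc_id, _score|
    record = RECORDS[doc_id]
    result << record unless record.nil?
  end
  return result, total_hits
end

# One search gives both the page of records and the count for the paginator.
posts, total = toy_find_by_ferret(ToyHits.new([[7, 0.9], [9, 0.4]], 42))
puts "showing #{posts.size} of #{total} hits"
```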

Cheers,
Dave

weibel · December 15, 2005, 4:03pm

On 12/15/05, albert ramstedt wrote:

ferret? I looked at the trac but it seemed it should support UTF-8? I
must have overlooked something…

The problem is that the analyzer doesn’t understand UTF-8. You need to
write an analyzer that matches the characters in your character set.
Have a look at the analyzers and tokenizers included with Ferret. They’re
quite simple. Basically you just need to come up with a regular
expression that matches what you consider tokens in your data. For
example, the whitespace tokenizer uses /\S+/. The letter tokenizer
uses /[[:alpha:]]+/. This is actually where the problem with UTF-8
handling is. [[:alpha:]] only matches the ASCII alphabet in the current
Ruby regexp engine. That will change in Ruby 2.0.
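
The effect of the token pattern can be seen with `String#scan`, which is not Ferret's tokenizer but applies the same kind of regular expression. This is a sketch, assuming a current Ruby where strings are UTF-8 and `[[:alpha:]]` also matches non-ASCII letters (unlike the regexp engine described above); the explicit `[a-zA-Z]` class stands in for the old ASCII-only behaviour. `toy_tokenize` is a hypothetical helper.

```ruby
# Hypothetical helper: split text into tokens with a given pattern.
def toy_tokenize(text, token_re)
  text.scan(token_re)
end

# "Räksmörgås är gott", written with escapes so that the encoding of
# this file does not matter.
TEXT = "R\u00E4ksm\u00F6rg\u00E5s \u00E4r gott"

# Whitespace tokenizer: every run of non-space characters is a token.
p toy_tokenize(TEXT, /\S+/)

# ASCII-only letters: words are cut apart at the Swedish letters.
p toy_tokenize(TEXT, /[a-zA-Z]+/)

# Unicode-aware letter class: whole words survive.
p toy_tokenize(TEXT, /[[:alpha:]]+/)
```

With the ASCII-only class the first word falls apart into fragments such as "R" and "ksm", which is the kind of broken query term the original question ran into.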

HTH,
Dave

weibel · December 15, 2005, 3:24pm

To answer my own question…

This is a hack to get Unicode to work, and relies on the unicode gem.
Also, this, as opposed to my previous code listing, should work out of
the box… except that the constant INDEX_PATH must be set beforehand,
preferably in environment.rb.

CODE for acts_as_ferret.rb

    require 'active_record'
    require 'ferret'
    require 'unicode'

    class UnicodeLowerCaseFilter < Ferret::Analysis::TokenFilter
      def next()
        t = @input.next()

        if (t == nil)
          return nil
        end

        t.term_text = Unicode::downcase(t.term_text)

        return t
      end
    end

    class SwedishTokenizer < Ferret::Analysis::RegExpTokenizer

      P        = /[_\/.,-]/
      HASDIGIT = /\w*\d\w*/

      def token_re()
        %r([[:alpha:]ÅÖÄåöä]+(('[[:alpha:]ÅÖÄåöä]+)+
          |\.([[:alpha:]ÅÖÄåöä]\.)+
          |(@|\&)\w+([-.]\w+)*
         )
          |\w+(([\-._]\w+)*\@\w+([-.]\w+)+
          |#{P}#{HASDIGIT}(#{P}\w+#{P}#{HASDIGIT})*(#{P}\w+)?
          |(\.\w+)+
          |
         )
          )x
      end

    end

    class SwedishAnalyzer < Ferret::Analysis::Analyzer
      def token_stream(field, string)
        return UnicodeLowerCaseFilter.new(SwedishTokenizer.new(string))
      end
    end

    module FerretMixin
      module Acts #:nodoc:
        module ARFerret #:nodoc:

          def self.append_features(base)
            super
            base.extend(MacroMethods)
          end

          # declare the class level helper methods
          # which will load the relevant instance methods defined below
          # when invoked
          module MacroMethods

            def acts_as_ferret
              extend FerretMixin::Acts::ARFerret::ClassMethods
              class_eval do
                include FerretMixin::Acts::ARFerret::ClassMethods

                after_create :ferret_create
                after_update :ferret_update
                after_destroy :ferret_destroy
              end
            end

          end

          module ClassMethods
            include Ferret
            def self.reloadable?; false end

            # Finds instances by file contents.
            def find_by_ferret(query, options = {})
              index_searcher ||= Search::IndexSearcher.new(INDEX_PATH)
              query_parser   ||=
                QueryParser.new(index_searcher.reader.get_field_names.to_a,
                                {:analyzer => SwedishAnalyzer.new()})
              query = query_parser.parse(query)
              result = []
              conditions = {}
              conditions[:num_docs] = options[:limit] unless options[:limit].blank?
              conditions[:first_doc] = options[:offset] unless options[:offset].blank?

              hits = index_searcher.search(query, conditions)
              hits.each do |hit, score|
                id = index_searcher.reader.get_document(hit)['id']
                result << self.find(id) unless id.nil?
              end
              return result
            end

            def count_by_ferret(query)
              index_searcher ||= Search::IndexSearcher.new(INDEX_PATH)
              query_parser   ||=
                QueryParser.new(index_searcher.reader.get_field_names.to_a,
                                {:analyzer => SwedishAnalyzer.new()})
              query = query_parser.parse(query)
              return index_searcher.search(query).total_hits
            end

            # private

            def ferret_create
              # code to update or add to the index
              index ||= Index::Index.new(:key => [:id, :ferret_table],
                                         :path => INDEX_PATH,
                                         :auto_flush => true,
                                         :analyzer => SwedishAnalyzer.new())
              index << self.to_doc
            end

            def ferret_update
              index ||= Index::Index.new(:key => [:id, :ferret_table],
                                         :path => INDEX_PATH,
                                         :auto_flush => true,
                                         :analyzer => SwedishAnalyzer.new())
              index.query_delete("+id:#{self.id.to_s} +ferret_table:#{self.class.table_name}")
              index << self.to_doc
            end

            def ferret_destroy
              # code to delete from index
              index ||= Index::Index.new(:key => [:id, :ferret_table],
                                         :path => INDEX_PATH,
                                         :auto_flush => true,
                                         :analyzer => SwedishAnalyzer.new())
              index.query_delete("+id:#{self.id.to_s} +ferret_table:#{self.class.table_name}")
            end

            def to_doc
              # Churn through the complete Active Record and add it to
              # the Ferret document
              doc = Ferret::Document::Document.new
              doc << Ferret::Document::Field.new('ferret_table',
                       self.class.table_name,
                       Ferret::Document::Field::Store::YES,
                       Ferret::Document::Field::Index::UNTOKENIZED)
              self.attributes.each_pair do |key,val|
                if key == 'id'
                  doc << Ferret::Document::Field.new("id", val.to_s,
                           Ferret::Document::Field::Store::YES,
                           Ferret::Document::Field::Index::UNTOKENIZED)
                else
                  doc << Ferret::Document::Field.new(key, val.to_s,
                           Ferret::Document::Field::Store::NO,
                           Ferret::Document::Field::Index::TOKENIZED)
                end
              end
              return doc
            end
          end
        end
      end
    end

    # reopen ActiveRecord and include all the above to make
    # them available to all our models if they want it
    ActiveRecord::Base.class_eval do
      include FerretMixin::Acts::ARFerret
    end

END acts_as_ferret.rb

And the rake task:

    include FileUtils

    desc "Perform ferret index"
    task :indexer => :environment do
      if !File.exist?(INDEX_PATH)
        puts "Creating index dir in #{INDEX_PATH}"
        FileUtils.mkdir_p(INDEX_PATH)
      end

      classes = []
      Dir.glob(File.join(RAILS_ROOT,"app","models","*.rb")).each do |rbfile|
        bname = File.basename(rbfile,'.rb')
        classname = Inflector.camelize(bname)
        classes.push(classname)
      end
      classes.each do |class_obj|
        c = eval(class_obj)
        if c.respond_to?(:ferret_create)
          puts "REBUILDING #{c.name}"
          c.find_all.each{|cn|cn.save}
        end
      end
    end

weibel · December 15, 2005, 4:06pm

On 12/15/05, albert ramstedt wrote:

require ‘unicode’
def token_re()
 end
end

class SwedishAnalyzer < Ferret::Analysis::Analyzer
def token_stream(field, string)
return UnicodeLowerCaseFilter.new(SwedishTokenizer.new(string))
end
end

Oh, very cool. Sorry, I just replied to your other email before I saw
this. Do you mind if I put it on the Ferret Wiki in the howtos
section? Even better if you could do it.

Thanks for posting this Albert. Hope my other code snippet helped.

Cheers,
Dave

weibel · December 15, 2005, 7:19pm

On 12/16/05, Fabien F. wrote:

Nice to see this addition. I’m wondering whether this will work for other
European languages besides Swedish though. Is there a way to make it
more universal?

Hi Fabien,
As far as I know this will work for any European language, or any
language for that matter. You just need to include the required
characters in the regular expression. Once the data is split into
tokens, Ferret doesn’t care what the string looks like. You can even
store binary data like images in a Ferret index if you want to. Now we
just need people to add the necessary characters for all the different
European languages.
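
As a sketch of what including the required characters could look like, the helper below builds a letter-token pattern from a list of extra characters per language. The names `EXTRA_LETTERS` and `toy_token_re`, and the character lists themselves (which are not complete alphabets), are illustrative assumptions and not part of Ferret.

```ruby
# Extra letters per language, written as escapes (illustrative, incomplete).
EXTRA_LETTERS = {
  :swedish => "\u00E5\u00E4\u00F6\u00C5\u00C4\u00D6",   # å ä ö Å Ä Ö
  :french  => "\u00E9\u00E8\u00EA\u00E0\u00E7\u00C9"    # é è ê à ç É
}

# Build a pattern matching runs of ASCII letters plus the extra ones.
def toy_token_re(language)
  extra = Regexp.escape(EXTRA_LETTERS.fetch(language))
  Regexp.new("[a-zA-Z#{extra}]+")
end

p "sm\u00F6rg\u00E5s och kaffe".scan(toy_token_re(:swedish))
p "caf\u00E9 cr\u00E8me".scan(toy_token_re(:french))
```

A pattern built for one language still splits words at letters it does not list, so each language (or a combined list) needs its own set of characters.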

Dave


weibel · December 15, 2005, 8:22pm

Hi

Of course you can add it to the wiki! The mail seems to have scrambled
the UTF-8 characters, so keep that in mind if you intend to use the
Swedish tokenizer.

Albert

weibel · December 16, 2005, 12:16pm

On Dec 16, 2005, at 12:14 AM, hui wrote:

It’s so cool!
I am just looking for the CJK solutions,
Here is “JavaCC code for the Nutch lexical analyzer.”
Included in the Nutch source code, so could anyone port it to Ferret?

There are several other Analyzers in Lucene that can deal with CJK
(and actually Korean doesn’t really fit with Chinese and Japanese).
Lucene’s StandardAnalyzer recognizes the CJK range just as the Nutch
one does, and there are also these additional ones (in the cjk and cn
directories):

<http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/>

Erik

weibel · December 15, 2005, 4:24pm

albert ramstedt <albert@…> writes:

To answer my own question…

This is a hack to get unicode to work, and relies on the unicode gem.
Also, this, as opposed to my previous code listing, should work out of
the box… except that the constant INDEX_PATH must be set before,
preferably in environment.rb

Nice to see this addition. I’m wondering whether this will work for other
European languages besides Swedish though. Is there a way to make it
more universal?

Thanks.