Forum: Ruby on Rails ANN: acts_as_ferret

weibel (Guest)
on 2005-12-02 20:43
(Received via mailing list)
Hi all

This week I have worked with Rails and Ferret to test Ferret's (and Lucene's)
capabilities. I decided to make a mixin for ActiveRecord as it seemed the
simplest possible solution, and I ended up making this into a plugin.

For more info on Ferret see:
http://ferret.davebalmain.com/trac/

The plugin is functional but could easily be refined. Anyway, I want to share
it with you. Regard it as a basic solution. Most of the ideas and code are
taken from these sources:

Howtos and help on Ferret with Rails:
# http://wiki.rubyonrails.com/rails/pages/HowToInteg...
# http://article.gmane.org/gmane.comp.lang.ruby.rails/26859
# http://ferret.davebalmain.com/trac
# http://aslakhellesoy.com/articles/2005/11/18/using...
# http://rubyforge.org/pipermail/ferret-talk/2005-No...

Howtos on creating plugins:
# http://wiki.rubyonrails.com/rails/pages/HowToWrite...
# http://www.jamis.jamisbuck.org/articles/2005/10/11...
# http://lesscode.org/2005/10/27/rails-simplest-plug...
# http://wiki.rubyonrails.com/rails/pages/HowTosPlugins


The result is the acts_as_ferret mixin for ActiveRecord.

Use it as follows:
In any model.rb add acts_as_ferret

class Foo < ActiveRecord::Base
  acts_as_ferret
end

All CRUD operations will be performed on both ActiveRecord (as usual) and a
Ferret index for further searching.

The following method is available in your controllers:

ActiveRecord::find_by_contents(query) # query is a string representing your query

The plugin follows the usual plugin structure and consists of 2 files:

{RAILS_ROOT}/vendor/plugins/acts_as_ferret/init.rb
{RAILS_ROOT}/vendor/plugins/acts_as_ferret/lib/acts_as_ferret.rb

The Ferret DB is stored in:

{RAILS_ROOT}/db/index.db

Here follows the code:

# CODE for init.rb
require 'acts_as_ferret'
# END init.rb

# CODE for acts_as_ferret.rb
require 'active_record'
require 'ferret'

module FerretMixin #(was: Foo)
   module Acts #:nodoc:
      module ARFerret #:nodoc:

         def self.append_features(base)
            super
            base.extend(MacroMethods)
         end

         # declare the class level helper methods
         # which will load the relevant instance methods defined below
         # when invoked

         module MacroMethods

            def acts_as_ferret
               extend FerretMixin::Acts::ARFerret::ClassMethods
               class_eval do
                  include FerretMixin::Acts::ARFerret::ClassMethods

                  after_create :ferret_create
                  after_update :ferret_update
                  after_destroy :ferret_destroy
               end
            end

         end

         module ClassMethods
            include Ferret

            INDEX_DIR = "#{RAILS_ROOT}/db/index.db"

            def self.reloadable?; false end

            # Finds instances by file contents.
            def find_by_contents(query, options = {})
               index_searcher ||= Search::IndexSearcher.new(INDEX_DIR)
               query_parser   ||=
                  QueryParser.new(index_searcher.reader.get_field_names.to_a)
               query = query_parser.parse(query)

               result = []
               index_searcher.search_each(query) do |doc, score|
                  id = index_searcher.reader.get_document(doc)["id"]
                  res = self.find(id)
                  result << res if res
               end
               index_searcher.close()
               result
            end

            # private

            def ferret_create
               index ||= Index::Index.new(:key => :id,
                                       :path => INDEX_DIR,
                                       :create_if_missing => true,
                                       :default_field => "*")
               index << self.to_doc
               index.optimize()
               index.close()
            end

            def ferret_update
               #code to update index
               index ||= Index::Index.new(:key => :id,
                                       :path => INDEX_DIR,
                                       :create_if_missing => true,
                                       :default_field => "*")
               index.delete(self.id.to_s)
               index << self.to_doc
               index.optimize
               index.close()
            end

            def ferret_destroy
               # code to delete from index
               index ||= Index::Index.new(:key => :id,
                                       :path => INDEX_DIR,
                                       :create_if_missing => true,
                                       :default_field => "*")
               index_writer.delete(self.id.to_s)
               index_writer.optimize()
               index_writer.close()
            end

            def to_doc
               # Churn through the complete Active Record and add it to the Ferret document
               doc = Ferret::Document::Document.new
               self.attributes.each_pair do |key,val|
                  doc << Ferret::Document::Field.new(key, val.to_s,
                     Ferret::Document::Field::Store::YES,
                     Ferret::Document::Field::Index::TOKENIZED)
               end
               doc
            end
         end
      end
   end
end

# reopen ActiveRecord and include all the above to make
# them available to all our models if they want it

ActiveRecord::Base.class_eval do
   include FerretMixin::Acts::ARFerret
end

# END acts_as_ferret.rb
obiefernandez (Guest)
on 2005-12-02 21:15
(Received via mailing list)
+1 great work
ezra (Guest)
on 2005-12-02 22:01
(Received via mailing list)
Very nice Kasper-

	Thanks for sharing!

Cheers-
-Ezra

-Ezra Z.
Yakima Herald-Republic
WebMaster
http://yakimaherald.com
509-577-7732
James R (Guest)
on 2005-12-02 22:18
Thanks... one problem. I believe that I'm doing everything correctly
except I keep getting this error on any CRUD operation:

undefined local variable or method `document' for #<Region:0xb7124c50>

(where #<Region:....> is the name of my model)


any ideas? The index is created and I've been able to test Ferret from a
command line script just fine.
listbox (Guest)
on 2005-12-02 22:45
(Received via mailing list)
On 2-dec-2005, at 19:22, Kasper W. wrote:

> Hi all
>
> This week I have worked with Rails and Ferret to test Ferrets (and
> Lucenes)
> capabilities. I decided to make a mixin for ActiveRecord as it
> seemed the
> simplest possible solution and I ended up making this into a plugin.

I recently finished a simple search plugin, which works like this

class Page < ActiveRecord::Base
	indexes_columns :title, :body, :into=>'somecolumn'
end

It's here: http://julik.textdriven.com/svn/tools/rails_plugins/simple_search/
(just finished the tests)

Maybe we can join the two plugins and get a nice search hook for AR
searching? Along the lines of

class Page < ActiveRecord::Base
    # if you pass a Ferret index it gets hooked instead of a column for LIKE
    indexes_columns :title, :body, :into=>MainFerretIndex
end

Or even maintain named Ferret indexes if the user has Ferret and
resort to LIKE queries if he doesn't?
--
Julian 'Julik' Tarkhanov
me at julik.nl
weibel (Guest)
on 2005-12-03 01:04
(Received via mailing list)
James R <adamjroth@...> writes:

>
> Thanks... one problem. I believe that I'm doing everything correctly
> except I keep getting this error on any CRUD operation:

The following in acts_as_ferret.rb should be one line (almost at the end of
the file):

# Churn through the complete Active Record and add it to the Ferret document

Take care with those line breaks :-)

Kasper
dbalmain.ml (Guest)
on 2005-12-03 03:10
(Received via mailing list)
Hi Kasper,

Nice work. Do you mind if I put this on the Ferret Wiki?

A few minor points. And a disclaimer, I haven't had time to use Rails
since I started working on Ferret so I could be wrong about a few
things here. I noticed in ferret_destroy you have index_writer. I
think this is meant to be just index. Also, where you have the lines;

              index.optimize()
              index.close()

I would replace these with;

              index.flush()

Optimizing the index every time is not necessary and can be quite slow
for large indexes. Also, if you close the index, the next time you try
to use it you should get an error. I'm not sure why it works for you.
It might be a bug. I'll have to check it out. Better to leave the
index open. If you are optimizing every time because you are really
concerned about search speed, it is better just to set the merge
factor to 2. ie;

               index ||= Index::Index.new(:key => :id,
                                      :path => INDEX_DIR,
                                      :merge_factor => 2)

Remember that there is generally a trade-off between indexing speed and
search speed. Also note that I removed the :default_field and
:create_if_missing options. They were set to the defaults anyway.

Another thing, since you are setting the key to :id, there is no need
to do the delete when you do the update. This will happen
automatically.

Lastly, and most importantly, I think this will only work if you only
apply it to one object or you'll get conflicting ids from two
different tables. To make this available to more than one object,
there are two solutions I can think of. You could have a separate
index directory for each object. Or you can set the key like this;

               index ||= Index::Index.new(:key => [:id, :table],
                                      :path => INDEX_DIR)

And your to_doc method would need to store the name of the table in
the :table field in the document.
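
A minimal sketch of the composite-key idea, written in plain Ruby so that it runs without Ferret. KeyedIndexSketch and to_doc_sketch are hypothetical stand-ins and not Ferret's API; the only behaviour assumed is the one described above, namely that adding a document whose key matches an existing one replaces it.

```ruby
# Stand-in for a keyed index; NOT the Ferret API. Adding a document whose
# key fields match an existing one replaces the old document.
class KeyedIndexSketch
  def initialize(key_fields)
    @key_fields = Array(key_fields)
    @docs = {}
  end

  def <<(doc)
    @docs[key_for(doc)] = doc   # same key => old document is replaced
    self
  end

  def size
    @docs.size
  end

  private

  def key_for(doc)
    @key_fields.map { |field| doc.fetch(field).to_s }
  end
end

# Hypothetical to_doc: every document also carries its table name.
def to_doc_sketch(table, attributes)
  attributes.merge(:table => table)
end

by_id   = KeyedIndexSketch.new(:id)
by_both = KeyedIndexSketch.new([:id, :table])

[by_id, by_both].each do |index|
  index << to_doc_sketch("pages", :id => 1, :title => "Home")
  index << to_doc_sketch("users", :id => 1, :name => "Kasper")
end

puts by_id.size    # the user with id 1 overwrote the page with id 1
puts by_both.size  # both records are kept
```

With :id alone as the key, the two tables collide; with [:id, :table] they do not.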

I hope all this information helps. When I get some time to use Rails
I'll post my own code.

Cheers,
Dave

PS: I just released Ferret 0.3.0 so gem update and enjoy. :)
weibel (Guest)
on 2005-12-03 04:23
(Received via mailing list)
David B. <dbalmain.ml@...> writes:

>
> Hi Kasper,
>
> Nice work. Do you mind if I put this on the Ferret Wiki?

Thanks David

This is really quality input!

It's my first week with Ferret and I'm still working my way into it. I hope
I'll get time to reflect on your comments before Monday.

Feel free to put it on the wiki!

Kasper
erik (Guest)
on 2005-12-03 13:17
(Received via mailing list)
CC'ing ferret-talk also.

Nice work, Kasper!   You've beaten me to it - this was something I
was planning on tackling in the near future.

I've got some additional feedback for you inlined below.  Keep in
mind that I'm being highly detailed in my feedback, in order to help
this extension become the best it can be given Lucene best
practices.  Your work is a great start, and I want to see this
evolve.  All comments below are constructive, not even 'criticism'.
Thanks for getting this started!

On Dec 2, 2005, at 1:22 PM, Kasper W. wrote:
> The result is the acts_as_ferret mixin for ActiveRecord.
>
> Use it as follows:
> In any model.rb add acts_as_ferret
>
> class Foo < ActiveRecord::Base
>   acts_as_ferret
> end

Ideally there will be many options desired besides just enabling a
table to be indexed fully.  More on that in a moment.

> All CRUD operations will be performed on both ActiveRecord (as
> usual) and a
> ferret index for further searching.

The toughest issue to deal with here is transactions.  Suppose a
database operation rolls back - then what happens to the index?  It's
out of sync.  I don't have any easy solutions though, and it is an
issue that pops up regularly in the Java Lucene community as well.
There is quite a mismatch between a relational database and a
full-text index when it comes to how updates and additions are handled.

At the very least, a warning should be included mentioning the
transactional issue.

Another facility that is desirable with Lucene is the ability to
rebuild the entire index from scratch.  Why?  If you change the
analyzer, for example, you will need to re-index all documents to
have them re-analyzed.
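
As a rough illustration of why a full rebuild is needed after an analyzer change, here is a plain-Ruby sketch. rebuild_index_sketch, the record hashes and the two lambda "analyzers" are all hypothetical; a real rebuild would walk the ActiveRecord tables and write into a Ferret index instead of a Hash.

```ruby
# Rebuild a hypothetical index from scratch. Every record is analyzed
# again, so a changed analyzer affects all documents, not just new ones.
def rebuild_index_sketch(records, analyzer)
  fresh = {}
  records.each do |record|
    fresh[[record[:table], record[:id]]] = analyzer.call(record[:text])
  end
  fresh
end

records = [
  { :table => "pages", :id => 1, :text => "Hello Ferret" },
  { :table => "pages", :id => 2, :text => "Rails and Lucene" }
]

whitespace = lambda { |text| text.split }
lowercase  = lambda { |text| text.downcase.split }

index = rebuild_index_sketch(records, whitespace)
index = rebuild_index_sketch(records, lowercase)  # analyzer changed: re-index everything
p index[["pages", 1]]
```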

> The following method is available in your controllers:
>
> ActiveRecord::find_by_contents(query) # Query is a string
> representing you query

Dave mentioned this, but you're currently only indexing "id", but not
the table name.  Thus you could get documents matching the query
from other tables, and get an id that doesn't exist for the current
table or one from a different table.  Table name needs to be
considered somehow, either by building a separate index for each
table, or adding the table name as an indexed, untokenized field.

> The Ferret DB is stored in:
>
> {RAILS_ROOT}/db/index.db

Please consider NOT calling it a "DB".  Ferret is Lucene.  What it
builds is an "index", not a "database" in the traditional sense.  I
think it would be best to avoid "db" terminology to prevent confusion.

>          module ClassMethods
>             include Ferret
>
>             INDEX_DIR = "#{RAILS_ROOT}/db/index.db"

I'm not sure how to parameterize "acts_as" extensions, but making the
index location more configurable would be good.

>             # Finds instances by file contents.
>             def find_by_contents(query, options = {})
>                index_searcher ||= Search::IndexSearcher.new(INDEX_DIR)
>                query_parser   ||=
> QueryParser.new(index_searcher.reader.get_field_names.to_a)
>                query = query_parser.parse(query)

QueryParser is only one (and often crude) way to formulate a Query.
Ideally there would be a couple of methods to search with, one that
takes a QueryParser-friendly expression like "foo AND bar NOT baz"
and another that takes a Query instance allowing a developer to
formulate sophisticated queries via the Ferret query API rather than
parsing an expression.   There are many good reasons for this, most
importantly from a user interface perspective where the application
makes more sense to have separate fields that build up a query rather
than the one totally free-form Google-esque text box.  Many
applications need full-text search, but not in a way that users need
to know query expression operators like +/-/AND/OR.

Back to the table name issue, here you'll want to wrap the query with
a BooleanQuery AND'd with a TermQuery for table:<table name> so that
you're sure the only hits returned will be for the current table.

>                result = []
>                index_searcher.search_each(query) do |doc, score|
>                   id = index_searcher.reader.get_document(doc)["id"]
>                   res = self.find(id)
>                   result << res if res
>                end

Some handling of paging needs to be added here.  It is unlikely that
all hits are needed, and accessing the Document for every hit will be
an enormous performance bottle-neck with lots of data.  It is very
important to choose the hits enumeration carefully.  Doing a database
query for every hit is also likely to be a huge bottleneck.  Perhaps
doing a SQL "IN" query for all id's after narrowing the set of
hits (by page) is feasible, though I'm not sure what limits exist on
how many items you can have with an "IN" clause.  I've not delved
into Ferret in much depth yet, but in Java Lucene a HitCollector
would possibly be a good way to handle this.
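
The paging-plus-"IN" idea could look roughly like this in plain Ruby. The helper names (page_of_ids, in_query_sql, in_hit_order) and the shape of the hits array are assumptions made for illustration; nothing here calls Ferret or ActiveRecord.

```ruby
# hits: [[record_id, score], ...] already ordered by score, as a search
# would yield them. page is 1-based.
def page_of_ids(hits, page, per_page)
  offset = (page - 1) * per_page
  (hits[offset, per_page] || []).map { |id, _score| id }
end

# Build one SQL statement for the whole page instead of one per hit.
# Integer() makes sure only numeric ids end up in the SQL string.
def in_query_sql(table, ids)
  return nil if ids.empty?
  "SELECT * FROM #{table} WHERE id IN (#{ids.map { |id| Integer(id) }.join(', ')})"
end

# The database returns rows in arbitrary order; restore relevance order
# and silently drop ids that no longer exist in the table.
def in_hit_order(rows, ids)
  by_id = {}
  rows.each { |row| by_id[row[:id]] = row }
  ids.map { |id| by_id[id] }.compact
end

hits = [[7, 0.9], [3, 0.8], [12, 0.5], [1, 0.2]]
ids  = page_of_ids(hits, 2, 2)
puts in_query_sql("pages", ids)
```

Only the documents on the requested page are touched, and the database is asked once per page rather than once per hit.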

>                index_searcher.close()
>                result
>             end

It is definitely unwise to close the IndexSearcher instance for every
search - leaving it open allows for field caches to warm  up and
speeds up successive searches.

>             # private
>
>             def ferret_create
>                index ||= Index::Index.new(:key => :id,
>                                        :path => INDEX_DIR,
>                                        :create_if_missing => true,
>                                        :default_field => "*")

Dave mentioned the key thing, and I'll reiterate the need to add the
table name to it.

>                index << self.to_doc
>                index.optimize()
>                index.close()
>             end

Reiterating Dave, but just to be thorough, optimizing and closing an
index is not a good thing to do on every document operation as it can
be slow.  And definitely heed his advice about using flush.  There
does need to be a facility to optimize the index on demand, which
developers may choose to do as a nightly batch process, or
periodically as the index becomes segmented.

>             def ferret_update
>                #code to update index
>                index ||= Index::Index.new(:key => :id,
>                                        :path => INDEX_DIR,
>                                        :create_if_missing => true,
>                                        :default_field => "*")

I recommend centralizing the Index constructor, so as to not
duplicate all of those parameters and allowing them to be changed in
one spot.
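
One way to centralize the constructor is a single memoized accessor. The sketch below uses a FakeIndex stand-in so it runs without Ferret; FerretIndexSketch, DEFAULT_OPTIONS and the "index" path are hypothetical names chosen for the example.

```ruby
# Stand-in for Ferret's index class so the sketch runs without Ferret.
class FakeIndex
  attr_reader :options

  def initialize(options)
    @options = options
  end
end

module FerretIndexSketch
  # All constructor options live here and nowhere else.
  DEFAULT_OPTIONS = { :key => [:id, :table], :path => "index" }

  # Memoized: the index is built once and then reused, rather than being
  # reopened and closed for every create, update and destroy.
  def self.index(index_class = FakeIndex)
    @index ||= index_class.new(DEFAULT_OPTIONS)
  end
end

first  = FerretIndexSketch.index
second = FerretIndexSketch.index
puts first.equal?(second)
```

ferret_create, ferret_update and ferret_destroy would then all call the same accessor instead of repeating the option hash.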

>                                        :create_if_missing => true,
>                                        :default_field => "*")
>                index_writer.delete(self.id.to_s)
>                index_writer.optimize()
>                index_writer.close()
>             end

Again, the table name should be part of the key for all operations
above.

>             end

This to_doc is where a lot of fun can be had.  There are many options
that need to be parameterized by the developer at the model level.
For example, how a field is indexed is crucial.  You're storing and
tokenizing every field, including the "id" field.  You definitely do
not want to tokenize the "id" field.  Adding the table name is needed
also, untokenized.  Each field should allow flexibility on how it is
(or is not) indexed, including whether to store/tokenize the field or
not.  Storing fields is unnecessary in the ActiveRecord sense, since
what you're returning from the search method are records from the
database, not documents from the index.  Making the analyzer
controllable is necessary at a global level for the index, and
overridable on a per-field level too.

A common technique with Lucene when field-level searching granularity
is not relevant is to create an aggregate field, say "contents" where
all text is indexed.  With Ferret, you could do this by iterating
over all fields that should be indexed/tokenized using the "contents"
as the field name for all fields of the record.  Then searches would
occur only against "contents".  While Dave likes the default field to
be "*", I personally find distributing a query expression across all
fields tricky and error-prone, especially given that different fields
may be analyzed differently.  Consider a query for "foo bar".  With
two fields "title" and "body", how do you expand that query across
all fields?  Not trivial.  This is why I like the aggregate
"contents" field technique, which can work in conjunction with fields
indexed individually also, so a query for "foo bar" would search the
"contents" field by default, but someone could do "title:foo
body:bar" to refine things.
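
A plain-Ruby sketch of the aggregate "contents" technique combined with per-field settings. FIELD_OPTIONS_SKETCH and to_doc_with_contents are hypothetical, the document is an ordinary Hash rather than a Ferret document, and "tokenizing" is reduced to downcasing and splitting on whitespace.

```ruby
# Hypothetical per-field settings; :tokenize decides whether a value is
# split into terms or kept as one exact term.
FIELD_OPTIONS_SKETCH = {
  :id    => { :tokenize => false },
  :title => { :tokenize => true },
  :body  => { :tokenize => true }
}

def to_doc_with_contents(table, attributes)
  doc = { :table => table }            # table name, kept as one exact term
  contents = []
  attributes.each do |name, value|
    options = FIELD_OPTIONS_SKETCH.fetch(name, { :tokenize => true })
    if options[:tokenize]
      terms = value.to_s.downcase.split
      doc[name] = terms                # still searchable as title:foo
      contents.concat(terms)           # and through the aggregate field
    else
      doc[name] = value.to_s
    end
  end
  doc[:contents] = contents
  doc
end

doc = to_doc_with_contents("pages", :id => 42, :title => "Foo Bar", :body => "baz")
p doc[:contents]
p doc[:id]
```

A default search would then look only at :contents, while a field-qualified query can still target :title or :body.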

I think this is enough, and perhaps too much(!), feedback for
now :)   Sorry if it seems overly picky, but I think this is a very
important addition to Rails and ActiveRecord.  The magic that is
Lucene is very special, and I'm thrilled that it has now entered the
Ruby world.  I want to help Ferret and its integration into places
like ActiveRecord go as smoothly as possible and keep the
outstanding reputation that Lucene has in the Java (and C# and
Python, etc) world.  There are many ways to use Lucene inefficiently
- I'll be here doing what I can to help ensure that things are done
in the best possible way.

	Erik
tlockney (Guest)
on 2005-12-05 06:44
(Received via mailing list)
Great job on this, Kasper. I took a look at this a few days ago and started
playing with it this weekend. I've taken a few of Erik's suggestions and
started trying to implement them. I don't know if you've already started
working on enhancing it, but I'd be very interested in contributing my
changes. It'll probably be a few days before I can get back in and finish
things up, though. (The Portland Ruby Brigade has their monthly meeting on
Tuesday, so that's one night's work missed. ;~)

Here are the changes I've started working on:

1. Adding configuration

    The notation I'm working on is something like this:

        acts_as_ferret :index_dir => "#{RAILS_ROOT}/index/", :fields => {...}

    Still playing with the configuration of the fields. I've also written it
    so that the default is to index all fields with the default settings. In
    addition, it should be possible to simply pass an array to the fields
    parameter and default the settings for Storable, etc.

2. Adding the ability to pass Query objects to the find_by_contents method.
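
The fields option from item 1 (nothing given, an array of names, or a hash with settings) could be normalized along these lines. This is only a sketch: normalize_fields_sketch, DEFAULT_FIELD_SETTINGS and the :store / :tokenize setting names are assumptions, not the notation the plugin ended up with.

```ruby
# Assumed defaults; the real plugin may choose different ones.
DEFAULT_FIELD_SETTINGS = { :store => false, :tokenize => true }

# Accepts nil (index every column), an Array of names (default settings),
# or a Hash of name => settings (merged over the defaults).
def normalize_fields_sketch(fields, all_columns)
  case fields
  when nil
    normalize_fields_sketch(all_columns.to_a, all_columns)
  when Array
    result = {}
    fields.each { |name| result[name.to_sym] = DEFAULT_FIELD_SETTINGS.dup }
    result
  when Hash
    result = {}
    fields.each do |name, settings|
      result[name.to_sym] = DEFAULT_FIELD_SETTINGS.merge(settings)
    end
    result
  else
    raise ArgumentError, "fields must be nil, an Array or a Hash"
  end
end

columns = [:id, :title, :body]
p normalize_fields_sketch(nil, columns).keys
p normalize_fields_sketch([:title], columns)
p normalize_fields_sketch({ :title => { :store => true } }, columns)
```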

I've been doing some refactoring along the way, too, and hope to add some
unit tests eventually. One final suggestion: perhaps the name should be
changed to acts_as_indexed?

Anyway, this is great work. I hope I can make worthwhile contributions to
this.

--
Thomas L.
dbalmain.ml (Guest)
on 2005-12-05 07:16
(Received via mailing list)
Hi Thomas,

For additional ideas look here;

http://ferret.davebalmain.com/trac/wiki/FerretOnRails

And of course, please feel free to add your improvements.

Cheers,
Dave
erik (Guest)
on 2005-12-05 11:39
(Received via mailing list)
On Dec 4, 2005, at 11:39 PM, Thomas L. wrote:
> (The Portland Ruby Brigade has their monthly meeting on Tuesday, so
> that's one
> nights work missed.
> ;~)

You Portland Rubyists really know how to party!   I went to the event
during OSCON in August - what a blast.

> 1. Adding configuration
>
>     The notation I'm working on is something like this:
>
>         acts_as_ferret :index_dir => "#{RAILS_ROOT}/index/", :fields => {...}

So you're thinking that each model may have its own index?   I wasn't
sure if one index per model made sense or whether a single index,
globally configured through environment.rb and friends, made the most
sense.  Using one index would allow some future clever things such as
querying without the table name allowing results to come back with
objects spanning multiple models.

I'm leaning towards preferring a single index, such that
the :index_dir configuration would be done via environment.rb
globally, not per model.

> 2. Adding the ability to pass Query objects to the find_by_contents
> method.

Cool.  Maybe this should be renamed to find_by_ferret?  If a String
is passed in, it gets parsed (with the options hash allowing control
over the parsing), and if a Query is passed in then it is used as-is.
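
The String-or-Query dispatch might be sketched like this. QuerySketch and ParserSketch are toy stand-ins so the example runs without Ferret; they are not Ferret's Query or QueryParser classes, and the :default_field option name is assumed.

```ruby
# Stand-ins so the sketch runs without Ferret; neither is Ferret's API.
class QuerySketch
  attr_reader :terms

  def initialize(terms)
    @terms = terms
  end
end

class ParserSketch
  def initialize(options = {})
    @default_field = options.fetch(:default_field, "contents")
  end

  # Turns "foo title:bar" into [field, term] pairs, using the default
  # field for words without a field prefix.
  def parse(expression)
    terms = expression.split.map do |word|
      field, term = word.split(":", 2)
      term ? [field, term] : [@default_field, field]
    end
    QuerySketch.new(terms)
  end
end

# A String is parsed (the options hash controls the parsing);
# a query object is used as-is.
def build_query_sketch(query, options = {})
  case query
  when QuerySketch then query
  when String      then ParserSketch.new(options).parse(query)
  else raise ArgumentError, "expected a String or a query object"
  end
end

p build_query_sketch("foo title:bar").terms
```

find_by_ferret would call something like build_query_sketch first and hand the resulting query to the searcher.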

> I've been doing some refactoring along the way, too, and hope to
> add some unit
> tests eventually. One final suggestion, perhaps the name should be
> changed to
> acts_as_indexed?

I like it being acts_as_ferret personally.  "indexed" is overloaded
within the relational database domain, so it could be construed as
having to do with DB indexes.

> Anyway, this is great work. I hope I can make worthwhile
> contributions to this.

Thanks for your efforts!   I'm glad to see this all coming together.

	Erik
weibel (Guest)
on 2005-12-05 13:05
(Received via mailing list)
Hi all

First of all I'd like to take the opportunity to thank you all for the great
response. Personally I feel that this approach to Ferret/Rails integration
will be a good thing to investigate further. People need quality search.

I think that we should agree on where to put the input for this project. The
page on David B.'s wiki is a good start - thanks for that, David.
http://ferret.davebalmain.com/trac/wiki/FerretOnRails

I needed this code for a specific task at my job and there are still many
things to do to make it generally usable.

I will comment on different people's input below.

Thanks to David for giving direct input for enhancing the quality of the code
and explaining index.flush() to me. It's good to have the author of Ferret
giving direct input, as I'm not really sure where the pitfalls in the
implementation are, speed- and quality-wise.

As both David and Erik Hatcher have pointed out, the current implementation
will only index one model per application. My view on this issue is that I
would like to have one index for all models as opposed to multiple index
files; that is, ONE Ferret index per application.

I will also need to implement a method for rebuilding the index. This will
come in handy both in development mode and probably also in production.

Erik pointed out that there will be problems with transactions, and I must
admit that I don't have any viable ideas of how to approach this issue. I have
thought of turning transactions off for the SQL tables in question - if that's
possible at all.

Erik also had problems with the name index.db. Instead I suggest index.frt

The current search method should be worked on. At the moment it fires quite a
few SQL select statements. There is also a need for the implementation of
pagination.

The to_doc method is one way to approach things when building the index. I
actually thought of Erik's suggestion about an aggregate field, which sounds
practical. There should be a way of configuring which fields go where.

I have had many ideas of what other things to implement. One of them is that
hard core Lucene folks will probably not put up with the limitations of a
specific implementation if it makes things difficult. One of the things I like
about Active Record in Rails is the find_by_sql() method, which lets you do
whatever you want on the SQL side. A similar approach could be implemented
with Ferret: find_by_fql() - if there is such a term as Ferret Query Language.

Also, the many possibilities for fine tuning should not be forgotten in favour
of simplicity. There should always be a way to make the configuration exactly
as you would like it. I favour the configuration approach Thomas L. has
suggested.

Lastly: I really appreciate your contributions and I feel that with our
combined efforts it will be possible to build a quality solution. In time
acts_as_ferret could become the preferred choice for Ferret/Rails integration.

Kasper
tlockney (Guest)
on 2005-12-05 18:25
(Received via mailing list)
Erik H. <erik@...> writes:

>
> On Dec 4, 2005, at 11:39 PM, Thomas L. wrote:
> > (The Portland Ruby Brigade has their monthly meeting on Tuesday, so
> > that's one
> > nights work missed.
> > ;~)
>
> You Portland Rubyists really know how to party!   I went to the event
> during OSCON in August - what a blast.

Well, that was my first PRX.rb event since I had just moved here, so I can't
take credit for all that...

> >     The notation I'm working on is something like this:
> >
> >         acts_as_ferret :index_dir => "#{RAILS_ROOT}/index/", :fields => {...}
>
> So you're thinking that each model may have its own index?

Actually, I guess I didn't indicate very well what was going to be optional
configuration and what was fixed. I only put that there to indicate that you
*could* have one index per model. I left out the part that would allow you to
configure it globally. I tend to agree with you, in fact, that one global
index makes the most sense.

>
> > 2. Adding the ability to pass Query objects to the find_by_contents
> > method.
>
> Cool.  Maybe this should be renamed to find_by_ferret?

sounds reasonable to me.

> If a String is passed in, it gets parsed (with the options hash allowing
> control over the parsing), and if a Query is passed in then it is used
> as-is.

That's pretty much what I was aiming for.

> I like it being acts_as_ferret personally.  "indexed" is overloaded
> within the relational database domain, so it could be construed as
> having to do with DB indexes.

Seems reasonable to me.

Thomas
Thomas L. (Guest)
on 2005-12-14 02:43
(Received via mailing list)
Since it's been over a week and I've only had time to tinker here and there on
my proposed changes to the acts_as_ferret plugin, I thought it was time to
just post what I had so far and let others weigh in on it or take their own
stab at making it more complete. I've posted my updated version along with
some brief notes at the bottom of the Ferret wiki page here:
http://ferret.davebalmain.com/trac/wiki/FerretOnRails

I'm still actively working on this, but I've only been able to do it in fits
and spurts so far. I apologize for the ugliness of some of the code; I'm still
trying to figure out how to do all the dynamic "magic" necessary for this sort
of thing.
David B. (Guest)
on 2005-12-14 04:29
(Received via mailing list)
Great work Thomas,

I just noticed two things in my quick glance. Firstly, you need to
change Document::Field::Index::NO to
Document::Field::Index::UNTOKENIZED for the :ferret_class and :id
fields. My fault as I made the same mistake in my code above.

Also, I don't know if you meant to use symbols but you shouldn't use
':' in a field name as it will throw off the query parser. Get rid
of the '"' around :ferret_class and :id and you'll be fine.

I made both these changes on the wiki already.

One other change you may like to make is to allow Query objects to be
passed to the find_by_contents method as well as Strings, but I'll
leave that one up to you for the moment.

Hope that helps,
Dave
jennyw (Guest)
on 2005-12-14 08:04
(Received via mailing list)
It's so great that people are working on this! Ferret is great and I
look forward to seeing it better integrated with Rails.

Thomas -- I tried this code but experienced a few problems with it. I
never got it to work, and gave up since it's not exactly what I need
(the documents I'm storing in Ferret don't exactly match my model
objects, but are a composite of them). Still, I have some feedback that
might (or might not) be helpful.

In addition to what David mentioned, I noticed that you use the method
class_variable_set in the method acts_as_ferret. This isn't available in
Ruby 1.8.2. Moreover, I'm not sure why you're using this here since the
variable names are not dynamic. I just changed these to:

            @@fields_for_ferret = Array.new
            @@class_index_dir = configuration[:index_dir]

Also, I noticed that the indentation on the class method append_features
was a bit off ... it looked like super was the beginning of a block.
Just a minor thing.

Also, I'm confused about the name for the SingletonMethods module. What
is the singleton that's being referred to here? This isn't a criticism
-- I'm just confused, since it seems to me that these methods get added
to your model classes and are available to each instance. Are they named
such because each model has a single instance of the index?

Also, I was wondering -- since ferret_create is aliased as
ferret_update, shouldn't it first call a delete before adding itself to
the index? For example, something like:

        def ferret_create
          begin
            ferret_delete
          rescue StandardError
            # deliberately ignore a failed delete
          end
          ferret_index << self.to_doc
        end
        alias :ferret_update :ferret_create

Also, a question for David -- is auto_flush => true supposed to remove
the lock automatically after writes?  I ask because I also tried the
code that Kasper originally posted, and I kept getting locking errors
unless I closed the index after updates (and I also wasn't quite able to
get that code to work before giving up). I was running both a Web
instance and trying to get at it with console, which is similar, I
think, to what would happen with multiple FCGI processes.

Thanks to everyone for your efforts, especially David for Ferret itself!

Jen
David B. (Guest)
on 2005-12-14 08:49
(Received via mailing list)
On 12/14/05, jennyw <removed_email_address@domain.invalid> wrote:
>         end
>         alias :ferret_update :ferret_create

Hi Jenny,

Glad to hear you like Ferret.

Note that I've added a key option to the index:

    @@index ||= Index::Index.new(:key => [:id, :ferret_class],

This will ensure that the index is kept unique for these fields, ie
every time I do an update the old document will be automatically
deleted. This only happens when you set the key option.
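
The effect of a key can be pictured with a toy stand-in. This is a sketch only: `KeyedIndex` is a hypothetical Hash-backed class written for this example, not Ferret's implementation or API. It just shows the idea that a document with the same key fields replaces the old one instead of piling up next to it.

```ruby
# Toy stand-in for an index with a :key option. This is NOT Ferret's API,
# just a Hash-backed illustration of "same key replaces the old document".
class KeyedIndex
  def initialize(key_fields)
    @key_fields = key_fields
    @docs = {}
  end

  # Adding a document whose key fields match an existing one replaces it.
  def <<(doc)
    key = @key_fields.map { |f| doc[f] }
    @docs[key] = doc
    self
  end

  def size
    @docs.size
  end

  def docs
    @docs.values
  end
end

index = KeyedIndex.new([:id, :ferret_class])
index << { :id => 1, :ferret_class => "Foo", :title => "first draft" }
index << { :id => 1, :ferret_class => "Foo", :title => "second draft" }
index << { :id => 1, :ferret_class => "Bar", :title => "other model" }

puts index.size
puts index.docs.map { |d| d[:title] }.inspect
```

Using both `:id` and the class name in the key matters because two different models can share the same numeric id.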

> Also, a question for David -- is auto_flush => true supposed to remove
> the lock automatically after writes?

Yes, that is the way it is supposed to work.

> I ask because I also tried the
> code that Kasper originally posted, and I kept getting locking errors
> unless I closed the index after updates (and I also wasn't quite able to
> get that code to work before giving up). I was running both a Web
> instance and trying to get at it with console, which is similar, I
> think, to what would happen with multiple FCGI processes.

Have you tried it with the latest version of Ferret? 3.0 had a few
bugs but 3.1 should be fine. Let me know if you are still getting lock
errors. :-)

Cheers,
Dave
jennyw (Guest)
on 2005-12-14 09:22
(Received via mailing list)
jennyw wrote:

> Also, a question for David -- is auto_flush => true supposed to remove
> the lock automatically after writes?  I ask because I also tried the
> code that Kasper originally posted, and I kept getting locking errors
> unless I closed the index after updates (and I also wasn't quite able
> to get that code to work before giving up). I was running both a Web
> instance and trying to get at it with console, which is similar, I
> think, to what would happen with multiple FCGI processes.

Oops! Never mind about the locking problem ... it turns out I had an
older version of Ferret installed that probably didn't support
auto_flush.

Jen
Abdur-Rahman A. (Guest)
on 2005-12-14 10:07
(Received via mailing list)
I am rewriting parts of the plugin (I'll contribute it around next week). I
want a search method that takes some special arguments for Ferret plus the
usual arguments for find, so that once the search is done it calls find with
the found ids and the conditions/include etc., and returns what's needed. I am
hesitating between ferret_search (no risk of being reimplemented by
someone else) and search (very common, and might someday become a
method in Rails itself). What is your opinion? I was thinking of
running the Ferret query first and then fetching the database entries (from
MySQL, for example). But I can't really tell which would be faster
(searching Ferret first or ActiveRecord first); it really depends on the use
of conditions...
Finn S. (Guest)
on 2005-12-14 19:43
(Received via mailing list)
Thomas L. wrote:
> of thing.
It's great that you guys are working on this. I have been following the
developments with a fair amount of interest and am hoping to integrate
some of this work with my own code on a project I am working on. A
couple of questions:

Has anyone considered a universal search across multiple models yet? How
would this work considering the fact that currently the code is per
model?

What about indexing fields that are not contained in the model? For
example: say I have an Article model with a belongs_to relationship to
an Author model. I would like the author's name to be indexed along with
the contents of the article in the ferret document. I guess this may be
more of a ruby programming issue than a ferret issue. It seems that the
general practice is to keep track of fields to be used/indexed/inspected
as an array of symbols. In my notional article example that might be:

[:title, :document]

I'd prefer it to look more like:

[:title, :document, :author.name]

but ":author.name" is going to be problematic, is it not?
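
One possible way around this, sketched in plain Ruby: Ruby parses `:author.name` as a method call on the symbol `:author`, not as a path, so the path could be given as a dotted string instead and walked one call at a time. The `field_value` helper and the Struct-based stand-ins for the models are assumptions made up for this example, not part of the plugin.

```ruby
# Stand-in objects; in a Rails app these would be ActiveRecord models.
Author  = Struct.new(:name)
Article = Struct.new(:title, :document, :author)

# Resolve a field spec against a record. A spec is either a symbol
# (:title) or a dotted string ("author.name") that is followed one
# method call at a time. Returns nil if any step along the way is nil.
def field_value(record, spec)
  spec.to_s.split(".").inject(record) do |obj, method_name|
    obj.nil? ? nil : obj.send(method_name)
  end
end

fields  = [:title, :document, "author.name"]
article = Article.new("Ferret and Rails", "Some body text", Author.new("Finn"))

fields.each do |spec|
  puts "#{spec}: #{field_value(article, spec)}"
end
```

The nil check means an article without an author simply contributes an empty value rather than raising.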

Any thoughts on these issues? Let me know if I have not been clear
enough.

-F
unknown (Guest)
on 2005-12-14 20:37
(Received via mailing list)
First post here!  Here's my question:

I have several related Category objects that all belong_to a Job
object.  When a new Job object is to be created a user will have to
click on the CSS tabs that I have setup with link_to Action Methods.
I do not want the data from the forms to be persisted until all the
sections are complete and the user clicks "Create Project". Also, I
want the Controller to dynamically store/update each view's session
when any tab is arbitrarily selected.

For Example, the form tabs resemble this:

Art Details   |   Dev Details   |   Marketing Details

So when I am finished with "Art Details" and click on "Dev Details",
I want to store that form data in a session - the same for other tabs
when the new view is selected via clicking on a new tab.

I considered using a pseudo-cart type of object to store the Project's
"Details" objects and their associated attributes, but this doesn't
really work for this model because the child objects of Project
will not know about their association or foreign keys until they are
persisted.  Moreover, it would seem logical that I just store the
post variables in some object, but then how would I restore those
values in the fields if they go back to a previous tab?

Here's my Object model

   Project
     |__
       |
     ArtDetails belongs_to Project
     DevDetails belongs_to Project
     MarketingDetails belongs_to Project

Any suggestions?  TIA!
Erik H. (Guest)
on 2005-12-14 20:49
(Received via mailing list)
On Dec 14, 2005, at 3:06 AM, Abdur-Rahman A. wrote:
> activerecords), really depends on the use of conditions...
My recommendation is to index the fields you want to use as search
criteria into Ferret rather than trying to mix and match Ferret and
ActiveRecord searches.  Optimizing the two will be tricky - would it
be quicker to search with Ferret and then pull from the DB or
constrain the set by the DB first then full-text search on those?
My hunch is that no database will have better performance than the
potential fully optimized Ferret.  It's certainly true in the Java
Lucene that it is as fast and usually faster than a relational
database for querying.

If you do go the route of searching with ActiveRecord first and using
those results to constrain the Ferret search, consider using a Filter
(not sure how that is implemented in Ferret, but in Java Lucene there
are overloaded search methods that accept a Filter).
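
Conceptually, constraining one result set by the other boils down to something like the following. This is a plain-Ruby sketch with made-up id lists; it is not Ferret's or Lucene's Filter class, just the underlying idea of keeping the relevance order of the full-text hits while dropping ids the database conditions exclude. The `filter_hits` helper is hypothetical.

```ruby
require "set"

# Toy data: ids returned by a full-text search (in relevance order) and
# ids that satisfy the database conditions. Both lists are made up.
index_hit_ids  = [42, 7, 19, 3, 88]
allowed_db_ids = Set.new([3, 7, 88, 100])

# Keep the relevance order of the index hits, drop everything the
# database conditions would have excluded.
def filter_hits(hit_ids, allowed_ids)
  hit_ids.select { |id| allowed_ids.include?(id) }
end

puts filter_hits(index_hit_ids, allowed_db_ids).inspect
```

A Set is used for the allowed ids so each membership check stays cheap even when the database returns many rows.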

	Erik
David B. (Guest)
on 2005-12-14 21:07
(Received via mailing list)
On 12/15/05, Erik H. <removed_email_address@domain.invalid> wrote:
> If you do go the route of searching with ActiveRecord first and using
> those results to constrain the Ferret search, consider using a Filter
> (not sure how that is implemented in Ferret, but in Java Lucene there
> are overloaded search methods that accept a Filter).

Filters are implemented in Ferret the same way as they are in Java.
They're unit tested but I haven't used them very much and I don't
suspect many other people have yet either. But they're there if you
need them. You pass a filter object as one of the options to any of
the search methods.

Dave
Julian 'Julik' Tarkhanov (Guest)
on 2005-12-14 21:19
(Received via mailing list)
On 14-dec-2005, at 19:48, Erik H. wrote:

>> can't really think of what would be faster (searching ferret first
> database for querying.
>
> If you do go the route of searching with ActiveRecord first and
> using those results to constrain the Ferret search, consider using
> a Filter (not sure how that is implemented in Ferret, but in Java
> Lucene there are overloaded search methods that accept a Filter).

Maybe someone can help me finish this:
http://www.julik.nl/code/active-search/classes/ActiveSearch/FerretIndexer.html
I am sorting out the kinks but I am stumbling upon

RuntimeError: could not obtain lock:

and I should admit I am absolutely lost in how to handle concurrency
with Ferret.



--
Julian 'Julik' Tarkhanov
me at julik.nl
Thomas L. (Guest)
on 2005-12-15 04:12
(Received via mailing list)
David B. <dbalmain.ml@...> writes:

> Also, I don't know if you meant to use symbols but you shouldn't use
> ':' in a field name as it will through off the query parser. Get rid
> of the '"' around :ferret_class and :id and you'll be fine.

Yeah, I realized this one a little while after I pasted it. I had them as
strings and had reverted back to the ":" prefixed names in an attempt to see
if that was causing a problem I was having. I guess I pasted it a little
too soon.

> I made both these changes on the wiki already.

Great!

>
> One other change you may like to make is to allow Query objects to be
> passed to the find_by_contents method as well as Strings, but I'll
> leave that one up to you for the moment.

Yeah, that was the other thing I had started working on but didn't want to
paste in yet. I had an implementation of it, but it was ugly, so I'm reworking
it a bit and hope to have that in place over the weekend.

>
> Hope that helps,
> Dave

Thanks again for developing Ferret. I've been waiting for this ever since I
first started playing with Ruby and saw Erik's registered (though, sadly,
never completed) rlucene project.

Thomas
Thomas L. (Guest)
on 2005-12-15 04:18
(Received via mailing list)
David B. <dbalmain.ml@...> writes:

> Also, I don't know if you meant to use symbols but you shouldn't use
> ':' in a field name as it will through off the query parser. Get rid
> of the '"' around :ferret_class and :id and you'll be fine.

Now that I think about it, I was confused for a bit about the keys defined and
was having trouble doing lookups. It turned out to be a different problem, but
in my search for a way to fix it, I changed those field names to match. (I even
tried just using symbols, but it seems that Ferret didn't like that too much.
Should symbols be an allowable option for a field name?)

Ferret's truly a great piece of work and the documentation is already quite
good, but I think there's a lot more needed to make it fully accessible.
Hopefully as more of us dig in, we can add to what's there. I guess that's a
topic for the ferret mailing list, though. ;~)

Thomas
Thomas L. (Guest)
on 2005-12-15 04:21
(Received via mailing list)
jennyw <jennyw@...> writes:

>
> It's so great that people are working on this! Ferret is great and I
> look forward to seeing it better integrated with Rails.
>
> Thomas -- I tried this code but experienced a few problems with it. I
> never got it to work, and gave up since it's not exaclty what I need
> (the documents I'm storing in Ferret don't exactly match my model
> objects, but are a composite of them). Still, I have some feedback that
> might (or might not) be helpful.

As I (think I) mentioned in my note on the wiki, the code I put there
definitely was buggy. I just wanted to put it out in case anyone else wanted
to start taking a stab at it. I'll have a newer version sometime next week, I
hope.

> In addition to what David mentioned, I noticed that you use the method
> class_variable_set in the method acts_as_ferret. This isn't available in
> Ruby 1.8.2. Moreover, I'm not sure why you're using this here since the
> variable names are not dynamic. I just changed these to:
>
>             @@fields_for_ferret = Array.new
>             @@class_index_dir = configuration[:index_dir]

I'm not sure why I did that either. :-/ Guess I was just trying to get
anything to work at that point. I'll implement your fix.

>
> Also, I noticed that the indentation on the class method append_features
> was a bit off ... it looked like super was the beginning of a block.
> Just a minor thing.

I fixed a few indentation problems when I added it to the wiki, but must have
missed that one. Thanks.

>
> Also, I'm confused about the name for the SingletonMethods module. What
> is the singleton that's being referred to here?

I adopted that from the plugin howtos on the rails wiki:
http://wiki.rubyonrails.org/rails/pages/HowToWrite...
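
For what it's worth, the name probably comes from Ruby's own terminology: methods that live on one particular object (here, the model class itself) are called singleton methods. A minimal plain-Ruby sketch of the pattern follows; the module and method names (`Searchable`, `acts_as_searchable`) are invented for this example and it uses the `included` hook rather than `append_features`.

```ruby
module Searchable
  # Called when the module is included; adds the macro to the class.
  def self.included(base)
    base.extend(MacroMethods)
  end

  module MacroMethods
    def acts_as_searchable
      extend SingletonMethods   # become methods on the class object
      include InstanceMethods   # become methods on each instance
    end
  end

  module SingletonMethods
    def find_by_contents(query)
      "searching #{name} for #{query}"
    end
  end

  module InstanceMethods
    def to_doc
      "doc for #{self.class.name}"
    end
  end
end

class Foo
  include Searchable
  acts_as_searchable
end

puts Foo.find_by_contents("ferret")
puts Foo.new.to_doc
puts Foo.new.respond_to?(:find_by_contents)
```

So the "singleton" methods are callable on `Foo` itself, while instances of `Foo` only see what was mixed in with `include`.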
David B. (Guest)
on 2005-12-15 07:01
(Received via mailing list)
On 12/15/05, Julian 'Julik' Tarkhanov <removed_email_address@domain.invalid> wrote:
> >> maybe could someday became a method for rails itself), what is
> > My hunch is that no database will have better performance than the
> search/classes/ActiveSearch/FerretIndexer.html? I am sotring out the
> kinks but I am stumbling upon
>
> RuntimeError: could not obtain lock:
>
> and I should admit I am absolutely lost in how to handle concurrency
> with Ferret.

Using the latest version of ferret and setting :auto_flush => true
should solve this problem. Have you tried that? It only works in
Index::Index though and it's not necessary for an IndexSearcher. If
you use IndexWriter and IndexReader directly you have to handle it
yourself.
Julian 'Julik' Tarkhanov (Guest)
on 2005-12-15 07:28
(Received via mailing list)
On 15-dec-2005, at 5:59, David B. wrote:

>
> Using the latest version of ferret and setting :auto_flush => true
> should solve this problem. Have you tried that? It only works in
> Index::Index though and it's not necessary for and IndexSearcher. If
> you use IndexWriter and IndexReader directly you have to handle it
> yourself.

David, thanks for the advice - I'll try that and report the results.
Basically, it feels sort of _odd_ - doing this macro-style Ferret
binding. Ferret is so vast and powerful that
this would not be enough to make use of all of its features. Maybe
you can send me some advice off-list on how I could
expand the API of the FerretIndexer to give more access to
the most needed Ferret features in a convenient way (without making
it too big, because the whole idea of the plugin is a one-liner
integration into a model, not a document cluster with 10 million
entries in it).

If someone else wants to shed some light (or help with code) I would
be glad to get some help, I am swamped now and won't be able to get
to it until at least next week.

--
Julian 'Julik' Tarkhanov
me at julik.nl
David B. (Guest)
on 2005-12-15 07:55
(Received via mailing list)
Hi Julian,

I'm really busy porting everything in Ferret to C at the moment. Next
year though I should have some time to play around with integrating it
into Rails. Until then I'll try and be as helpful as possible to
others trying to do the same thing. Good luck! :-)

Cheers,
Dave
albert ramstedt (Guest)
on 2005-12-15 13:05
(Received via mailing list)
Hello!

I have been following this thread carefully, ferret just got a little
easier to dive into. Kudos to you guys, and especially to the authors of
ferret! This was just what we needed here at our little webdev shop.

Now I have a problem you guys might know a solution to. I have managed
to get the code from the wiki working, with a little bit of tweaking,
but it does not seem to build queries correctly when it gets fed with
UTF-8 characters. Is this a fault on my side or a known issue with
ferret? I looked at the trac but it seemed it should support UTF-8? I
must have overlooked something...

I didn't dare to touch the wiki, but here is a somewhat altered version
of the plugin, and it should be fully functional. I added some small
things, since we wanted a counter for the Paginator. I know though that
doing a full-out-search just to count might not be the best way to
count, so if anyone has a suggestion to better this, please share! :)

Oh, and I added a rake task to rebuild the index, but it relies on the
INDEX_PATH being set in the environment.rb

Here it is

# CODE for acts_as_ferret.rb
require 'active_record'
require 'ferret'

module FerretMixin
  module Acts #:nodoc:
     module ARFerret #:nodoc:

        def self.append_features(base)
           super
           base.extend(MacroMethods)
        end

        # declare the class level helper methods
        # which will load the relevant instance methods defined below
        # when invoked

        module MacroMethods

           def acts_as_ferret
              extend FerretMixin::Acts::ARFerret::ClassMethods
              class_eval do
                 include FerretMixin::Acts::ARFerret::ClassMethods

                 after_create :ferret_create
                 after_update :ferret_update
                 after_destroy :ferret_destroy
              end
           end

        end

        module ClassMethods
           include Ferret
           INDEX_PATH = "#{RAILS_ROOT}/db/ferret"
           def self.reloadable?; false end

           # Finds instances by file contents.
           def find_by_ferret(query, options = {})
              @@index_searcher ||= Search::IndexSearcher.new(INDEX_PATH)
              @@query_parser   ||= QueryParser.new(@@index_searcher.reader.get_field_names.to_a)
              query = @@query_parser.parse(query)
              result = []
              conditions = {}
              conditions[:num_docs] = options[:limit] unless options[:limit].blank?
              conditions[:first_doc] = options[:offset] unless options[:offset].blank?

              hits = @@index_searcher.search(query, conditions)
              hits.each do |hit, score|
                 id = @@index_searcher.reader.get_document(hit)['id']
                 result << self.find(id) unless id.nil?
              end
              return result
           end

           def count_by_ferret(query)
              @@index_searcher ||= Search::IndexSearcher.new(INDEX_PATH)
              @@query_parser   ||= QueryParser.new(@@index_searcher.reader.get_field_names.to_a)
              query = @@query_parser.parse(query)
              return @@index_searcher.search(query).total_hits
           end

           # private

           def ferret_create
              # code to update or add to the index
              @@index ||= Index::Index.new(:path => INDEX_PATH,
                                         :auto_flush => true)
              @@index << self.to_doc
           end
           def ferret_update
              @@index ||= Index::Index.new(:path => INDEX_PATH,
                                         :auto_flush => true)
              @@index.query_delete("+id:#{self.id} +ferret_table:#{self.class.table_name}")
              @@index << self.to_doc
           end

           def ferret_destroy
              # code to delete from index
              @@index ||= Index::Index.new(:path => INDEX_PATH,
                                         :auto_flush => true)
              @@index.query_delete("+id:#{self.id} +ferret_table:#{self.class.table_name}")
           end

           def to_doc
              # Churn through the complete Active Record and add it to
              # the Ferret document
              doc = Ferret::Document::Document.new
              doc << Ferret::Document::Field.new('ferret_table',
                                                 self.class.table_name,
                                                 Ferret::Document::Field::Store::YES,
                                                 Ferret::Document::Field::Index::UNTOKENIZED)
              self.attributes.each_pair do |key,val|
                 if key == 'id'
                    doc << Ferret::Document::Field.new(key, val.to_s,
                                                       Ferret::Document::Field::Store::YES,
                                                       Ferret::Document::Field::Index::UNTOKENIZED)
                 else
                    doc << Ferret::Document::Field.new(key, val.to_s,
                                                       Ferret::Document::Field::Store::NO,
                                                       Ferret::Document::Field::Index::TOKENIZED)
                 end
              end
              return doc
           end
        end
     end
  end
end

# reopen ActiveRecord and include all the above to make
# them available to all our models if they want it
ActiveRecord::Base.class_eval do
  include FerretMixin::Acts::ARFerret
end

# END acts_as_ferret.rb

RAKE TASK in /lib/tasks/indexer.rake

include FileUtils

desc "Perform ferret index"
task :indexer => :environment do
    if !File.exist?(INDEX_PATH)
          puts "Creating index dir in #{INDEX_PATH}"
          FileUtils.mkdir_p(INDEX_PATH)
    end

    classes = []
    Dir.glob(File.join(RAILS_ROOT,"app","models","*.rb")).each do |rbfile|
            bname = File.basename(rbfile,'.rb')
            classname = Inflector.camelize(bname)
            classes.push(classname)
    end
    classes.each do |class_obj|
        c = eval(class_obj)
        if c.respond_to?(:ferret_create)
            puts "REBUILDING #{c.name}"
            c.find_all.each{|cn|cn.save}
        end
    end
end
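
A side note on the rake task: the `eval` on a name derived from a file could probably be replaced with a constant lookup. Here is a plain-Ruby sketch of that idea. `simple_camelize` is a simplified, hypothetical stand-in for Rails' `Inflector.camelize`, `class_for` is a made-up helper, and `ArtDetail` is a dummy class for the example.

```ruby
# Simplified stand-in for Inflector.camelize: "art_detail" -> "ArtDetail".
# It ignores namespaces and acronyms, which the Rails version handles.
def simple_camelize(basename)
  basename.split("_").map { |part| part.capitalize }.join
end

# Look a class up by name without eval; returns nil if it is not defined.
def class_for(basename)
  name = simple_camelize(basename)
  Object.const_defined?(name) ? Object.const_get(name) : nil
end

class ArtDetail; end   # made-up model class for the example

puts simple_camelize("art_detail")
puts class_for("art_detail").inspect
puts class_for("no_such_model").inspect
```

Returning nil for unknown names lets the task skip stray files in the models directory instead of raising a NameError.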
albert ramstedt (Guest)
on 2005-12-15 16:24
(Received via mailing list)
To answer my own question...

This is a hack to get unicode to work, and relies on the unicode gem.
Also, this, as opposed to my previous code listing, should work out of
the box... except that the constant INDEX_PATH must be set beforehand,
preferably in environment.rb

# CODE for acts_as_ferret.rb
require 'active_record'
require 'ferret'
require 'unicode'

class UnicodeLowerCaseFilter < Ferret::Analysis::TokenFilter
     def next()
       t = @input.next()

       if (t == nil)
         return nil
       end

       t.term_text = Unicode::downcase(t.term_text)

       return t
     end
end

class SwedishTokenizer < Ferret::Analysis::RegExpTokenizer

    P     =     /[_\/.,-]/
    HASDIGIT     =     /\w*\d\w*/


    def token_re()
     %r([[:alpha:]ÅÖÄåöä]+(('[[:alpha:]ÅÖÄåöä]+)+
       |\.([[:alpha:]ÅÖÄåöä]\.)+
       |(@|\&)\w+([-.]\w+)*
      )
       |\w+(([\-._]\w+)*\@\w+([-.]\w+)+
       |#{P}#{HASDIGIT}(#{P}\w+#{P}#{HASDIGIT})*(#{P}\w+)?
       |(\.\w+)+
       |
      )
       )x
     end
end

class SwedishAnalyzer < Ferret::Analysis::Analyzer
    def token_stream(field, string)
      return UnicodeLowerCaseFilter.new(SwedishTokenizer.new(string))
    end
end

module FerretMixin
  module Acts #:nodoc:
     module ARFerret #:nodoc:

        def self.append_features(base)
           super
           base.extend(MacroMethods)
        end

        # declare the class level helper methods
        # which will load the relevant instance methods defined below
        # when invoked

        module MacroMethods

           def acts_as_ferret
              extend FerretMixin::Acts::ARFerret::ClassMethods
              class_eval do
                 include FerretMixin::Acts::ARFerret::ClassMethods

                 after_create :ferret_create
                 after_update :ferret_update
                 after_destroy :ferret_destroy
              end
           end

        end

        module ClassMethods
           include Ferret
           def self.reloadable?; false end

           # Finds instances by file contents.
           def find_by_ferret(query, options = {})
              index_searcher ||= Search::IndexSearcher.new(INDEX_PATH)
              query_parser   ||= QueryParser.new(index_searcher.reader.get_field_names.to_a,
                                                 {:analyzer => SwedishAnalyzer.new()})
              query = query_parser.parse(query)
              result = []
              conditions = {}
              conditions[:num_docs] = options[:limit] unless options[:limit].blank?
              conditions[:first_doc] = options[:offset] unless options[:offset].blank?

              hits = index_searcher.search(query, conditions)
              hits.each do |hit, score|
                 id = index_searcher.reader.get_document(hit)['id']
                 result << self.find(id) unless id.nil?
              end
              return result
           end

           def count_by_ferret(query)
              index_searcher ||= Search::IndexSearcher.new(INDEX_PATH)
              query_parser   ||= QueryParser.new(index_searcher.reader.get_field_names.to_a,
                                                 {:analyzer => SwedishAnalyzer.new()})
              query = query_parser.parse(query)
              return index_searcher.search(query).total_hits
           end

           # private

           def ferret_create
              # code to update or add to the index
              index ||= Index::Index.new(:key => [:id, :ferret_table],
                                         :path => INDEX_PATH,
                                         :auto_flush => true,
                                         :analyzer => SwedishAnalyzer.new())
              index << self.to_doc
           end
           def ferret_update
              index ||= Index::Index.new(:key => [:id, :ferret_table],
                                         :path => INDEX_PATH,
                                         :auto_flush => true,
                                         :analyzer => SwedishAnalyzer.new())
              index.query_delete("+id:#{self.id.to_s} +ferret_table:#{self.class.table_name}")
              index << self.to_doc
           end

           def ferret_destroy
              # code to delete from index
              index ||= Index::Index.new(:key => [:id, :ferret_table],
                                         :path => INDEX_PATH,
                                         :auto_flush => true,
                                         :analyzer => SwedishAnalyzer.new())
              index.query_delete("+id:#{self.id.to_s} +ferret_table:#{self.class.table_name}")
           end

           def to_doc
              # Churn through the complete Active Record and add it to
              # the Ferret document
              doc = Ferret::Document::Document.new
              doc << Ferret::Document::Field.new('ferret_table',
                                                 self.class.table_name,
                                                 Ferret::Document::Field::Store::YES,
                                                 Ferret::Document::Field::Index::UNTOKENIZED)
              self.attributes.each_pair do |key,val|
                 if key == 'id'
                    doc << Ferret::Document::Field.new("id", val.to_s,
                                                       Ferret::Document::Field::Store::YES,
                                                       Ferret::Document::Field::Index::UNTOKENIZED)
                 else
                    doc << Ferret::Document::Field.new(key, val.to_s,
                                                       Ferret::Document::Field::Store::NO,
                                                       Ferret::Document::Field::Index::TOKENIZED)
                 end
              end
              return doc
           end
        end
     end
  end
end

# reopen ActiveRecord and include all the above to make
# them available to all our models if they want it
ActiveRecord::Base.class_eval do
  include FerretMixin::Acts::ARFerret
end

# END acts_as_ferret.rb

And the rake task:

include FileUtils

desc "Perform ferret index"
task :indexer => :environment do
    if !File.exist?(INDEX_PATH)
          puts "Creating index dir in #{INDEX_PATH}"
          FileUtils.mkdir_p(INDEX_PATH)
    end

    classes = []
    Dir.glob(File.join(RAILS_ROOT,"app","models","*.rb")).each do |rbfile|
            bname = File.basename(rbfile,'.rb')
            classname = Inflector.camelize(bname)
            classes.push(classname)
    end
    classes.each do |class_obj|
        c = eval(class_obj)
        if c.respond_to?(:ferret_create)
            puts "REBUILDING #{c.name}"
            c.find_all.each{|cn|cn.save}
        end
    end
end
David B. (Guest)
on 2005-12-15 17:00
(Received via mailing list)
Hi Albert,

Perhaps you could do something like this in the find_by_ferret method
and get rid of your count_by_ferret method. Just an idea.

             total_hits = hits.each do |hit, score|
                id = @@index_searcher.reader.get_document(hit)['id']
                result << self.find(id) unless id.nil?
             end
             return result, total_hits

Cheers,
Dave
David B. (Guest)
on 2005-12-15 17:03
(Received via mailing list)
On 12/15/05, albert ramstedt <removed_email_address@domain.invalid> wrote:
> ferret? I looked at the trac but it seemed it should support UTF-8? I
> must have overlooked something...

The problem is that the analyzer doesn't understand UTF-8. You need to
write an analyzer that matches the characters in your character set.
Have a look at the analyzers and tokenizers included with Ferret. They're
quite simple. Basically you just need to come up with a regular
expression that matches what you consider tokens in your data. For
example, the whitespace tokenizer uses /\S+/. The letter tokenizer
uses /[[:alpha:]]+/. This is actually where the problem with UTF-8
handling is. [:alpha:] only matches the ASCII alphabet in the current
Ruby regexp engine. That will change in Ruby 2.0.
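
To make the tokenizing idea concrete, here is a small plain-Ruby sketch using `String#scan`. It is not Ferret's tokenizer code; the constant names and the sample sentence are made up, and it assumes a UTF-8 source file. It shows how an ASCII-only letter pattern chops Swedish words apart, and how listing the extra letters explicitly keeps them whole.

```ruby
SAMPLE_TEXT = "Räksmörgås är gott"

# Whitespace-style tokens: every run of non-space characters.
WHITESPACE_TOKEN = /\S+/

# ASCII-only letters: the Swedish words are cut apart at å, ä and ö.
ASCII_LETTERS = /[a-zA-Z]+/

# Explicitly listing the extra letters keeps the words whole.
SWEDISH_LETTERS = /[a-zA-ZåäöÅÄÖ]+/

puts SAMPLE_TEXT.scan(WHITESPACE_TOKEN).inspect
puts SAMPLE_TEXT.scan(ASCII_LETTERS).inspect
puts SAMPLE_TEXT.scan(SWEDISH_LETTERS).inspect
```

The explicit character list is the same trick a language-specific tokenizer can use when the regexp engine's character classes are ASCII-only.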

HTH,
Dave
David B. (Guest)
on 2005-12-15 17:06
(Received via mailing list)
On 12/15/05, albert ramstedt <removed_email_address@domain.invalid> wrote:
> require 'unicode'
>
>     def token_re()
>      end
> end
>
> class SwedishAnalyzer < Ferret::Analysis::Analyzer
>     def token_stream(field, string)
>       return UnicodeLowerCaseFilter.new(SwedishTokenizer.new(string))
>     end
> end

Oh, very cool. Sorry, I just replied to your other email before I saw
this. Do you mind if I put it on the Ferret Wiki in the howtos
section? Even better if you could do it. ;-)

Thanks for posting this Albert. Hope my other code snippet helped.

Cheers,
Dave
Fabien F. (Guest)
on 2005-12-15 17:24
(Received via mailing list)
albert ramstedt <albert@...> writes:

>
> To answer my own question...
>
> This is a hack to get unicode to work, and relies on the unicode gem.
> Also, this, as opposed to my previous code listing, should work out of
> the box... except that the constant INDEX_PATH must be set before,
> preferable in environment.rb

Nice to see this addition. I'm wondering whether this will work for other
European languages besides Swedish though. Is there a way to make it
more universal?

Thanks.
David B. (Guest)
on 2005-12-15 20:19
(Received via mailing list)
On 12/16/05, Fabien F. <removed_email_address@domain.invalid> wrote:
> Nice to see this addition. I'm wondering wether this will work for other
> European languages besides Swedish though. Is there a way to make it
> more universal?

Hi Fabien,
As far as I know this will work for any European language, or any
language for that matter. You just need to include the required
characters in the regular expression. Once the data is split into
tokens, Ferret doesn't care what the string looks like. You can even
store binary data like images in a Ferret index if you want to. Now we
just need people to add the necessary characters for all the different
European languages. :-)
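
As a rough illustration of "just add the characters", here is a plain-Ruby sketch of a tokenizer plus lowercase step where the extra letters are a parameter. `SimpleAnalyzer` is a hypothetical class for this example and does not use Ferret's Analyzer classes; it also assumes Ruby 2.4 or later, where `String#downcase` handles non-ASCII letters, and a UTF-8 source file.

```ruby
# Plain-Ruby sketch of "tokenizer + lowercase filter"; not Ferret's classes.
class SimpleAnalyzer
  # extra_letters: characters beyond a-z/A-Z that count as part of a word.
  def initialize(extra_letters = "")
    @token_re = /[a-zA-Z#{Regexp.escape(extra_letters)}]+/
  end

  # Split text into tokens, then lowercase each one.
  # String#downcase handles non-ASCII letters from Ruby 2.4 on.
  def tokens(text)
    text.scan(@token_re).map { |t| t.downcase }
  end
end

swedish = SimpleAnalyzer.new("åäöÅÄÖ")
french  = SimpleAnalyzer.new("àâçéèêëîïôùûüÀÂÇÉÈÊËÎÏÔÙÛÜ")

puts swedish.tokens("Ärlig ÖRN åker").inspect
puts french.tokens("Été à Évian").inspect
```

Keeping the letter list as a constructor argument means one class can serve several languages, with only the character set changing.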

Dave



albert ramstedt (Guest)
on 2005-12-15 21:19
(Received via mailing list)
Hi David,

The problem is that I need that query for the paginator, i.e. I need
the hit count before I do the actual search with the limit and offset, and
since that query also translates hits into model objects, it hits the
database when it doesn't actually need to. But I agree, my solution is
not really that nice either.
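
One way to picture getting the count without loading records, sketched in plain Ruby: this assumes the index search can hand back bare document ids cheaply, which is an assumption on my part and not something I have checked against Ferret. `page_of_hits` is a made-up helper and the id list is invented.

```ruby
# hit_ids: all matching ids from the index, in relevance order (made up).
# Returns the ids for one page plus the total number of hits, so the
# paginator gets its count without any record being loaded.
def page_of_hits(hit_ids, limit, offset)
  page = hit_ids[offset, limit] || []
  [page, hit_ids.size]
end

hit_ids = [11, 4, 8, 15, 16, 23, 42]
page, total = page_of_hits(hit_ids, 3, 3)

puts page.inspect
puts total
```

Only the ids on the current page would then need to be turned into model objects.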

Albert
Albert R. (Guest)
on 2005-12-15 21:22
(Received via mailing list)
Hi

Of course you can add it to the wiki! The mail seems to have scrambled
the UTF-8 characters, so keep that in mind if you intend to use the
Swedish tokenizer.

Albert
hui (Guest)
on 2005-12-16 07:15
(Received via mailing list)
It's so cool!
I am just looking for a CJK solution.
Here is the "JavaCC code for the Nutch lexical analyzer",
included in the Nutch source code. Could anyone port it to Ferret?
====================================================
/**
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/** JavaCC code for the Nutch lexical analyzer. */

options {
  STATIC = false;
  USER_CHAR_STREAM = true;
  OPTIMIZE_TOKEN_MANAGER = true;
  UNICODE_INPUT = true;
//DEBUG_TOKEN_MANAGER = true;
}

PARSER_BEGIN(NutchAnalysis)

package org.apache.nutch.analysis;

import org.apache.nutch.searcher.Query;
import org.apache.nutch.searcher.QueryFilters;
import org.apache.nutch.searcher.Query.Clause;

import org.apache.lucene.analysis.StopFilter;

import java.io.*;
import java.util.*;

/** The JavaCC-generated Nutch lexical analyzer and query parser. */
public class NutchAnalysis {

  private static final String[] STOP_WORDS = {
    "a", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "s", "such",
    "t", "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
  };

  private static final Set STOP_SET = StopFilter.makeStopSet(STOP_WORDS);

  private String queryString;

  /** True iff word is a stop word.  Stop words are only removed from
   * queries.  Every word is indexed.  */
  public static boolean isStopWord(String word) {
    return STOP_SET.contains(word);
  }

  /** Construct a query parser for the text in a reader. */
  public static Query parseQuery(String queryString) throws IOException {
    NutchAnalysis parser =
      new NutchAnalysis(new FastCharStream(new StringReader(queryString)));
    parser.queryString = queryString;
    return parser.parse();
  }

  /** For debugging. */
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    while (true) {
      System.out.print("Query: ");
      String line = in.readLine();
      System.out.println(parseQuery(line));
    }
  }

}

PARSER_END(NutchAnalysis)

TOKEN_MGR_DECLS : {

  /** Constructs a token manager for the provided Reader. */
  public NutchAnalysisTokenManager(Reader reader) {
    this(new FastCharStream(reader));
  }

}

TOKEN : {					  // token regular expressions

  // basic word -- lowercase it
<WORD: ((<LETTER>|<DIGIT>|<WORD_PUNCT>)+ | <IRREGULAR_WORD>)>
  { matchedToken.image = matchedToken.image.toLowerCase(); }

  // special handling for acronyms: U.S.A., I.B.M., etc: dots are removed
| <ACRONYM: <LETTER> "." (<LETTER> ".")+ >
    {                                             // remove dots
      for (int i = 0; i < image.length(); i++) {
        if (image.charAt(i) == '.')
          image.deleteCharAt(i--);
      }
      matchedToken.image = image.toString().toLowerCase();
    }

  // chinese, japanese and korean characters
| <SIGRAM: <CJK> >

   // irregular words
| <#IRREGULAR_WORD: (<C_PLUS_PLUS>|<C_SHARP>)>
| <#C_PLUS_PLUS: ("C"|"c") "++" >
| <#C_SHARP: ("C"|"c") "#" >

  // query syntax characters
| <PLUS: "+" >
| <MINUS: "-" >
| <QUOTE: "\"" >
| <COLON: ":" >
| <SLASH: "/" >
| <DOT: "." >
| <ATSIGN: "@" >
| <APOSTROPHE: "'" >

| <WHITE: ~[] >                                   // treat unrecognized chars
                                                  // as whitespace
// primitive, non-token patterns

| <#WORD_PUNCT: ("_"|"&")>                        // allowed anywhere in words

| < #LETTER:					  // alphabets
    [
        "\u0041"-"\u005a",
        "\u0061"-"\u007a",
        "\u00c0"-"\u00d6",
        "\u00d8"-"\u00f6",
        "\u00f8"-"\u00ff",
        "\u0100"-"\u1fff"
    ]
    >

|  <#CJK:                                        // non-alphabets
      [
       "\u3040"-"\u318f",
       "\u3300"-"\u337f",
       "\u3400"-"\u3d2d",
       "\u4e00"-"\u9fff",
       "\uf900"-"\ufaff"
      ]
    >

| < #DIGIT:					  // unicode digits
      [
       "\u0030"-"\u0039",
       "\u0660"-"\u0669",
       "\u06f0"-"\u06f9",
       "\u0966"-"\u096f",
       "\u09e6"-"\u09ef",
       "\u0a66"-"\u0a6f",
       "\u0ae6"-"\u0aef",
       "\u0b66"-"\u0b6f",
       "\u0be7"-"\u0bef",
       "\u0c66"-"\u0c6f",
       "\u0ce6"-"\u0cef",
       "\u0d66"-"\u0d6f",
       "\u0e50"-"\u0e59",
       "\u0ed0"-"\u0ed9",
       "\u1040"-"\u1049"
      ]
  >

}


/** Parse a query. */
Query parse() :
{
  Query query = new Query();
  ArrayList terms;
  Token token;
  String field;
  boolean stop;
  boolean prohibited;

}
{
  nonOpOrTerm()                                   // skip noise
  (
    { stop=true; prohibited=false; field = Clause.DEFAULT_FIELD; }

                                                  // optional + or - operator
    ( <PLUS> {stop=false;} | (<MINUS> { stop=false;prohibited=true; } ))?

                                                  // optional field spec.
    ( LOOKAHEAD(<WORD><COLON>(phrase(field)|compound(field)))
      token=<WORD> <COLON> { field = token.image; } )?

    ( terms=phrase(field) {stop=false;} |         // quoted terms or
      terms=compound(field))                      // single or compound term

    nonOpOrTerm()                                 // skip noise

    {
      String[] array = (String[])terms.toArray(new String[terms.size()]);

      if (stop
          && field == Clause.DEFAULT_FIELD
          && terms.size()==1
          && isStopWord(array[0])) {
        // ignore stop words only when single, unadorned terms in
        // default field
      } else {
        if (prohibited)
          query.addProhibitedPhrase(array, field);
        else
          query.addRequiredPhrase(array, field);
      }
    }
  )*

  { return query; }

}

/** Parse an explicitly quoted phrase query.  Note that this may return a
 * single term, a trivial phrase. */
ArrayList phrase(String field) :
{
  int start;
  int end;
  ArrayList result = new ArrayList();
  String term;
}
{
  <QUOTE>

  { start = token.endColumn; }

  (nonTerm())*                                    // skip noise
  ( term = term() { result.add(term); }           // parse a term
    (nonTerm())*)*                                // skip noise

  { end = token.endColumn; }

  (<QUOTE>|<EOF>)

  {
    if (QueryFilters.isRawField(field)) {
      result.clear();
      result.add(queryString.substring(start, end));
    }
    return result;
  }

}

/** Parse a compound term that is interpreted as an implicit phrase query.
 * Compounds are a sequence of terms separated by infix characters.  Note that
 * this may return a single term, a trivial compound. */
ArrayList compound(String field) :
{
  int start;
  ArrayList result = new ArrayList();
  String term;
}
{
  { start = token.endColumn; }

  term = term() { result.add(term); }
  ( LOOKAHEAD( (infix())+ term() )
    (infix())+
    term = term() { result.add(term); })*

  {
    if (QueryFilters.isRawField(field)) {
      result.clear();
      result.add(queryString.substring(start, token.endColumn));
    }
    return result;
  }

}

/** Parse a single term. */
String term() :
{
  Token token;
}
{
  ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM>)

  { return token.image; }
}


/** Parse anything but a term or a quote. */
void nonTerm() :
{}
{
  <WHITE> | infix()
}

void nonTermOrEOF() :
{}
{
  nonTerm() | <EOF>
}

/** Parse anything but a term or an operator (plus or minus or quote). */
void nonOpOrTerm() :
{}
{
  (LOOKAHEAD(2) (<WHITE> | nonOpInfix() | ((<PLUS>|<MINUS>) nonTermOrEOF())))*
}

/** Characters which can be used to form compound terms. */
void infix() :
{}
{
  <PLUS> | <MINUS> | nonOpInfix()
}

/** Parse infix characters except plus and minus. */
void nonOpInfix() :
{}
{
  <COLON>|<SLASH>|<DOT>|<ATSIGN>|<APOSTROPHE>
}
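
As a starting point for a port, here is a rough Ruby sketch of just the
WORD, ACRONYM and SIGRAM token rules from the grammar above. It is plain Ruby
and does not use any Ferret API; the module name NutchLikeTokenizer is made
up. It is also simplified: only ASCII digits are handled, and the query syntax
(plus, minus, quotes, fields) is ignored. JavaCC picks the longest match,
which is approximated here by trying the acronym and irregular-word rules
before the plain word rule.

```ruby
# Hypothetical sketch, not part of Ferret or Nutch.
module NutchLikeTokenizer
  # Character ranges copied from the #LETTER and #CJK rules of the grammar.
  LETTER = /[A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff\u0100-\u1fff]/
  CJK    = /[\u3040-\u318f\u3300-\u337f\u3400-\u3d2d\u4e00-\u9fff\uf900-\ufaff]/

  TOKEN = /
      (?<acronym>   #{LETTER}\.(?:#{LETTER}\.)+ )  # U.S.A., I.B.M.
    | (?<irregular> [Cc](?:\+\+|\#) )              # C++ and C#
    | (?<word>      (?:#{LETTER}|[0-9_&])+ )       # letters, digits, _ and &
    | (?<sigram>    #{CJK} )                       # one token per CJK character
  /x

  # Returns an array of token strings; anything unmatched is skipped,
  # like the WHITE rule that treats unrecognized chars as whitespace.
  def self.tokenize(text)
    tokens = []
    text.scan(TOKEN) do
      m = Regexp.last_match
      tokens << if m[:acronym]
                  m[:acronym].delete('.').downcase  # remove dots, lowercase
                elsif m[:sigram]
                  m[:sigram]
                else
                  m[0].downcase                     # basic word, lowercased
                end
    end
    tokens
  end
end

p NutchLikeTokenizer.tokenize("I.B.M. ships C++ and c# tools")
p NutchLikeTokenizer.tokenize("\u4e2d\u6587 search")
```

Emitting each CJK character as its own token is what the SIGRAM rule does;
whether that is good enough for searching Chinese or Japanese text is a
separate question.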
Erik H. (Guest)
on 2005-12-16 13:16
(Received via mailing list)
On Dec 16, 2005, at 12:14 AM, hui wrote:
> It's so cool!
> I am just looking for the CJK solutions,
> Here is "JavaCC code for the Nutch lexical analyzer."
> Included in Nutch source code, so could anyone port it into ferret?

There are several other Analyzers in Lucene that can deal with CJK
(and actually Korean doesn't really fit with Chinese and Japanese).
Lucene's StandardAnalyzer recognizes the CJK range just as the Nutch
one does, and there are also these additional ones (in the cjk and cn
directories):

	<http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/>

Erik