Forum: Ferret Road map of ferret

Fernando Parisotto (Guest)
on 2008-08-19 21:29
(Received via mailing list)
Hi all,

I'm new to the list, and glad to participate.
I would like to ask some questions about the Ferret project...
- Is http://ferret.davebalmain.com/ the official page of the project?
(I'm always getting 502 Bad Gateway)
- Where can I find the road map of the project?
- On http://rubyforge.org/projects/ferret/ I see the last release was
on November 28, 2007; is that true?
- Is Ferret discontinued?

Please don't take these questions as offensive; I would really like to
know how reliable Ferret is for a long-lived product.
Here at my company we are planning to build a big product with an
indexing engine, and I would like to know if Ferret is "alive".
Thanks for the answers!

--
Atenciosamente - Best regards,

Fernando Luiz Parisotto
Nathan Li (nasi)
on 2008-08-20 04:51
I think many people here have the same questions.

Fernando Parisotto wrote:
> - Is http://ferret.davebalmain.com/ the official page of the project?
> (I'm always getting 502 Bad Gateway)
> - Where can I find the road map of the project?
> - On http://rubyforge.org/projects/ferret/ I see the last release was
> on November 28, 2007; is that true?
> - Is Ferret discontinued?
Nicholas Stewart (nick1123)
on 2008-08-20 14:48
I have the same questions about ferret.
Paul Lynch (plynchnlm)
on 2008-08-27 16:33
(Received via mailing list)
I've been using Ferret in a project still under development, and it
works pretty well.  As far as I can tell, the project is dying, if not
already dead.  David Balmain is still the only listed developer, and
he seems to have moved on to other things.  However, since the
software is still meeting my project's needs, I am not terribly
bothered by that.  I suppose that eventually (in a few years?)
something will change enough that Ferret will stop working, and then
we'll have to find something else.

If you can find an alternative that has active development, I would
recommend you go with that.  (And if you find one, please post about
it.)  But, if you can't, Ferret will probably be good enough for a
while.

--
Paul Lynch
Aquilent, Inc.
National Library of Medicine (Contractor)
Eric Schulte (eschulte)
on 2008-08-27 17:21
(Received via mailing list)
I would also be interested in Ferret alternatives for IR in Ruby. A
simple search on RubyForge returned mainly a bunch of projects that
look to be abandoned...

- Rise (does not appear to be actively developed)
- rubylucene (looks to be a dead project)
- Ruby Simple Indexer (also looks dead)
- Ruby Odeum (simple ruby-bindings for a fast inverted index)

If anyone knows of any ruby IR projects which are mature, and are
being actively developed I would love to hear about them.

Thanks -- Eric

Marvin Humphrey (Guest)
on 2008-08-27 17:58
(Received via mailing list)
On Aug 27, 2008, at 8:20 AM, Eric Schulte wrote:

> If anyone knows of any ruby IR projects which are mature, and are
> being actively developed I would love to hear about them.

FWIW, I recently finished porting all module code in KinoSearch to C.
If we write binding code and port the test suite, it will be usable
from Ruby.

KinoSearch is sort of a sister project to Ferret.  The dev branch
implements many of the ideas that Dave Balmain and I designed together
for the Lucy project.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
arvind gautam (Guest)
on 2008-08-27 18:07
(Received via mailing list)
How about Sphinx?
William Morgan (Guest)
on 2008-08-27 18:12
(Received via mailing list)
Reformatted excerpts from Eric Schulte's message of 2008-08-27:
> If anyone knows of any ruby IR projects which are mature, and are
> being actively developed I would love to hear about them.

sphinxsearch.com

It has a much less usable API than Ferret, and you have to run it as a
separate server process, but it's fast, stable, and actively maintained.
Jeremy Hopple (Guest)
on 2008-08-27 18:22
(Received via mailing list)
As far as I know, Sphinx can only index tables that have a unique
numeric id (e.g. an auto-incrementing int)...  I looked at using it,
but we use md5 hashes for the id/primary key on the tables I want to
index... so we were out of luck.

For what it's worth, I use Ferret 0.11.6 and love it.  I re-index about
90 million rows (and growing) worth of "stuff" (title, description,
author, etc...) every night...  works like a champ.  Searching is fast
(provided you don't want to sort on something other than relevance) and
accurate.
Eric Schulte (eschulte)
on 2008-08-27 20:27
(Received via mailing list)
Thanks for all the info. I just found a very good related discussion
from ruby-forum, which I thought I'd share:

http://www.ruby-forum.com/topic/137629

Eric Schulte (eschulte)
on 2008-08-27 20:37
(Received via mailing list)
On Wednesday, August 27, at 08:34, Marvin Humphrey wrote:
 > KinoSearch is sort of a sister project to Ferret.  The dev branch
 > implements many of the ideas that Dave Balmain and I designed
 > together for the Lucy project.

What is the status of the Lucy project?  A Ruby API into the venerable
Lucene library seems to be the obvious first step towards developing a
truly stable, effective IR solution for Ruby.  The last update on the
Lucy webpage http://lucene.apache.org/lucy/ seems to be from 2006.

Also, I may be missing something obvious here, but I don't understand
why there is no Ruby API directly to the Lucene Java library.  Why
would the only Lucene/Ruby API be to the C port of Lucene?

Much Thanks -- Eric
Marvin Humphrey (Guest)
on 2008-08-27 21:28
(Received via mailing list)
On Aug 27, 2008, at 11:36 AM, Eric Schulte wrote:

> What is the status of the Lucy project?

The dev branch of KinoSearch is basically Lucy.  When Dave became
unavailable, I didn't really have anyone else to bounce ideas off of
for Lucy (since it was a from-scratch project without a community), so
I returned to the established KS community -- but took the code base
in the direction that Dave and I had worked out.

My current plan is to make an official KinoSearch release for Perl,
write some experimental bindings for other languages, achieve
stability, then make KinoSearch the "maint" branch and Lucy the "dev"
branch.

> Also, I may be missing something obvious here, but I don't understand
> why there is no ruby API directly to the Lucene Java library,

If you want to use Lucene, just go with Solr.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
John Leach (Guest)
on 2008-08-28 12:12
(Received via mailing list)
On Wed, 2008-08-27 at 08:34 -0700, Marvin Humphrey wrote:
>
> FWIW, I recently finished porting all module code in KinoSearch to C.
> If we write binding code and port the test suite, it will be usable
> from Ruby.
>
> KinoSearch is sort of a sister project to Ferret.  The dev branch
> implements many of the ideas that Dave Balmain and I designed together
> for the Lucy project.

Hi Marvin,

In my experience the Ruby community is crying out for a "drop-in"
replacement for Ferret.  Sphinx is great, but different.  Xapian looks
good but doesn't have the Ruby maturity of Ferret yet (especially
considering acts_as_ferret).  I keep coming across people who use Ferret
successfully but have little niggles here and there.

Is KinoSearch something that could be a Ferret replacement?  Or the
foundations of a Ferret replacement?  What are the differences between
it and Ferret?

Out of interest, what are the differences between it and the planned
Lucy project (would be good to hear more about what your plans were for
Lucy. Maybe it'll inspire somebody else?)

Do you happen to know if Dave is likely to work on Ferret again someday?
I think we've seen some commits from him recentlyish but no word I've
seen. Hope all is well.

Thanks,

John.
--
http://johnleach.co.uk
lists.jc.michel@symetrie.com (Guest)
on 2008-08-28 12:27
(Received via mailing list)
Hi,

On 28 Aug 2008, at 12:11, John Leach wrote:
> In my experience the Ruby community is crying out for a "drop-in"
> replacement for Ferret.  Sphinx is great, but different.  Xapian looks
> good but doesn't have the Ruby maturity of Ferret yet (especially
> considering acts_as_ferret).  I keep coming across people using Ferret
> successfully but have little niggles here and there.

The best would probably be to have some of us dig into Ferret and
help to fix the remaining bugs!

I'd like to experiment with beanstalkd
http://xph.us/software/beanstalkd/
which - I've been told - is a better alternative to DRb for
background indexing.

I'm still using Ferret on many websites, and it's so simple to use, so
why use something else?
Erik Hatcher (Guest)
on 2008-08-28 14:28
(Received via mailing list)
On Aug 27, 2008, at 11:20 AM, Eric Schulte wrote:
> If anyone knows of any ruby IR projects which are mature, and are
> being actively developed I would love to hear about them.

disclaimer: highly opinionated response follows....  :)

Solr is the way to go for Ruby projects*.  solr-ruby, if I do say so
myself, ain't half bad.  It's downright beautiful to interact with
Solr via Ruby: <http://wiki.apache.org/solr/solr-ruby>.  I have plenty
of wishes for where solr-ruby could still evolve, so it's not done
yet.   * pragmatically I realize that another moving piece, especially
a JVM, isn't a good fit for many current production deployment
environments.  See below for my answer to that...

Ferret is awesome, let me be clear about that!   I have always loved
its power, even beyond Lucene Java in some cases.  But I've stuck
with Lucene through the tough times and it's always been good to me.
Solr's goodness on top of Lucene Java makes it extremely compelling for
every environment, be it Ruby, Python, Java itself, what have you.
I've always been fonder of the JVM than native C stuff, and when
Ferret went that direction I stuck with Java.

acts_as_solr, however, hasn't yet reached its potential - and my
little hack that kick started it wasn't really beneficial to the
community, my apologies - since I basically "abandoned" it.  But it
ain't half bad either thanks to Thiago's hard work, and does make cake
work out of RDBMS <-> Solr, whereas it takes something this ugly to do
it in Java: <http://wiki.apache.org/solr/DataImportHandler> (oh Ruby
how I love you!).

Solr is incredibly powerful, beyond the features I think almost all of
the other open source search engines offer.  Its scalability evolves
almost daily, as does its pluggability.

And for those JRuby folks out there.... well, I guess there aren't
(m)any of those on the ferret list, but think about the
possibilities... SolrJRuby!  Wow.

  Erik
Erik Hatcher (Guest)
on 2008-08-28 15:51
(Received via mailing list)
On Aug 27, 2008, at 3:28 PM, Marvin Humphrey wrote:
> On Aug 27, 2008, at 11:36 AM, Eric Schulte wrote:
>
>>
>> Also, I may be missing something obvious here, but I don't understand
>> why there is no ruby API directly to the Lucene Java library,

Mainly because Ruby has been too slow to have something pure.  Ferret
is about as close as it gets to Lucene Java compatibility, and really
only diverged from the file format for wise practical reasons.


>> If you want to use Lucene, just go with Solr.

+1

Solr is great in Ruby environments too.  Really it is.  Sure, there's
this JVM beast, and deployment issues, and all that, but they
generally aren't that painful.  And the benefits are totally worth it.

  Erik
Jens Krämer (jkraemer)
on 2008-08-28 16:06
(Received via mailing list)
Hi!

On 27.08.2008, at 20:20, Eric Schulte wrote:
> Thanks for all the info, I just found a very good related discussion
> from ruby-forum which I thought I'd share
>
> http://www.ruby-forum.com/topic/137629

well, in this discussion there are (besides some useful information)
some pretty biased statements from several people who obviously must
have had a frustrating time with Ferret, or just didn't get it working
right out of the box and decided it was cheaper to make their clients
switch search technology (and possibly lose features) than to fix
their deployment. I never had somebody from Engine Yard contact me
regarding their massive Ferret deployment problems, so I'm not sure how
hard they really tried to get over them.

Imho it's not very likely that it's Ferret's fault that, while all
around the world people are running ferret based apps fine, *every*
client of engine yard experiences the same set of problems...

So here's my very own biased opinion just to complete the picture :)

I use Ferret in several production projects with several customers,
and also choose it for new projects like the soon-to-be-released new
full text search for the German selfhtml.org portal or the search
feature at www.fahrrad-xxl.de, which tightly integrates aaf with rdig
(shameless plug: selfhtml.org search will be powered by Stellr [1] ;-).

I have absolutely no problem with Ferret not being very actively
maintained, because it works for me just like it is. Honestly, I
*never* had Ferret segfault in any one of my own production apps.
(But I admit I saw it segfault in other places, maybe I just don't do
the right things to make it crash...)

So why do I stick to Ferret while others declare it a 'dead' project?
Ferret's flexibility and feature set plus the level of Rails
integration it offers by means of aaf are very unlikely to be reached
by any other combination of search engine lib + Rails plugin in the
near future.
That said, I'm really interested in how the KinoSearch/Lucy stuff
will go on...

Solr, while being an interesting project without doubt, won't ever
reach the level of Rails integration that's possible with
acts_as_ferret, simply because its server doesn't run in the context
of the Rails app with model classes and all that stuff. It's an
independent server indexing whatever you throw over the fence via
http+xml. That framework independence is a great plus under some
circumstances (and my Stellr project scratches exactly that itch in a
much more lightweight and undoubtedly less scalable manner), but
sometimes it's also a bad thing.

How to use a custom analyzer with solr? You have to code it in Java
(or you do your analysis before feeding the data into java land, which
I wouldn't consider good app design). But even if you do that then you
have
a) half a java project (I don't want that)
and b) no way to use your existing rails classes in that custom
analyzer (I *have* analyzers using rails models to retrieve synonyms
and narrower terms for thesaurus based query expansion)

Not to speak of Sphinx here, which offers even less integration with
your Rails application because it's tied directly to the database and
doesn't support stuff like real incremental indexing. It's easy to be
several times faster when you leave out most of the features...

Of course there are lots of use cases where Sphinx or Solr are
perfectly valid choices, because their feature set suits the
requirements and/or you're comfortable with running a servlet
container in your production env and spreading your application logic
across several languages.

Here's what I would do *if* I experienced severe problems with Ferret
in any of my projects:

Take aaf, replace Ferret with Lucene or even make it modular to decide
at run time which one to use, run the DRb server (or the whole app,
that depends) under JRuby and call it acts_as_lucene :-)
Et voila - great Rails integration plus Lucene's maturity. But as long
as Ferret's working fine for me that's really unlikely to happen...
Unless somebody wants to sponsor that project, of course ;)


Cheers,
Jens

[1] http://rubyforge.org/projects/stellr

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/     - The new free film database
Erik Hatcher (Guest)
on 2008-08-28 17:22
(Received via mailing list)
On Aug 28, 2008, at 9:52 AM, Jens Kraemer wrote:
> So here's my very own biased opinion just to complete the picture :)

Hey, software should be opinionated!   That's totally fair :)

> (shameless plug: selfhtml.org search will be powered by Stellr
> [1] ;-).

Stellr - great name.  Interesting... that's pretty sweet.

> Solr, while being an interesting project without doubt, won't ever
> reach the level of Rails integration that's possible with
> acts_as_ferret, simply because its server doesn't run in the
> context of the Rails app with model classes and all that stuff.

What advantage does Ferret have in terms of ActiveRecord integration
that Solr wouldn't have?

If you're talking about custom analyzers being in Ruby, more on that
below.

> It's an independent server indexing whatever you throw over the
> fence via http+xml.

Solr can now index CSV as well as a relational database directly (with
the new DataImportHandler).

It also responds with Ruby hash structure (just add &wt=ruby to the
URLs, or use solr-ruby which does that automatically and hides all
server communication from you anyway).
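
As a rough illustration of the "&wt=ruby" suggestion, the sketch below only builds the request URL. The host, port, path and the `solr_ruby_url` helper name are assumptions made for the example, not taken from solr-ruby, and the keyword-argument syntax needs a reasonably modern Ruby.

```ruby
require "uri"

# Hypothetical Solr location; adjust host, port and path to your own setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# Build a select URL that asks Solr for its Ruby-formatted response (wt=ruby).
def solr_ruby_url(query, rows: 10)
  params = URI.encode_www_form("q" => query, "rows" => rows, "wt" => "ruby")
  "#{SOLR_SELECT_URL}?#{params}"
end

url = solr_ruby_url("title:ferret")
puts url
```

The body returned for such a URL is text in Ruby hash syntax; evaluating it is only sensible when the Solr server is fully trusted, which is one reason to let a client library handle that step.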

> How to use a custom analyzer with solr? You have to code it in Java
> (or you do your analysis before feeding the data into java land,
> which I wouldn't consider good app design).

Most users would not need to write a custom analyzer.  Many of the
built-in ones are quite configurable.  Yes, Solr does require schema
configuration via an XML file, but there have been acts_as_solr
variants (good and bad thing about this git craze) that generate that
for you automatically from an AR model.

> But even if you do that then you have
> a) half a java project (I don't want that)

That's totally fair, and really the primary compelling reason for a
Ferret over Solr for pure Ruby/Rails projects.  I dig that.

But isn't Ferret like 60k lines of C code too?!

> and b) no way to use your existing rails classes in that custom
> analyzer (I *have* analyzers using rails models to retrieve synonyms
> and narrower terms for thesaurus based query expansion)

You could leverage client-side query expansion with Solr... just take
the user's query, massage it, and send whatever query you like to
Solr.   Solr also has synonym and stop word capability.
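
A minimal sketch of what client-side query expansion could look like, assuming a small in-memory synonym table (in a Rails app it might come from a model instead). The table contents and the `expand_query` name are made up for illustration, and the sketch does not escape query-syntax special characters.

```ruby
# Hypothetical synonym table; illustrative data only.
QUERY_SYNONYMS = {
  "bike" => ["bicycle", "cycle"],
  "film" => ["movie"]
}.freeze

# Expand each whitespace-separated term into an OR group of the term and
# its synonyms, then require all groups with AND.
def expand_query(user_query, synonyms = QUERY_SYNONYMS)
  user_query.downcase.split.map do |term|
    alternatives = [term] + synonyms.fetch(term, [])
    alternatives.size > 1 ? "(#{alternatives.join(' OR ')})" : term
  end.join(" AND ")
end

puts expand_query("Bike film festival")
# prints: (bike OR bicycle OR cycle) AND (film OR movie) AND festival
```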

However, there is also no reason (and I have this on my
copious-free-time-TODO-list) that JRuby couldn't be used behind the
scenes of a Solr analyzer/tokenizer/filter or even request handler...
and do all the cool Ruby stuff you like right there.  Heck, you could
even send the Ruby code over to Solr to execute there if you like ;)

> Here's what I would do *if* I experienced severe problems with
> Ferret in any of my projects:
>
> Take aaf, replace Ferret with Lucene or even make it modular to
> decide at run time which one to use, run the DRb server (or the
> whole app, that depends) under JRuby and call it acts_as_lucene :-)
> Et voila - great Rails integration plus Lucene's maturity. But as
> long as Ferret's working fine for me that's really unlikely to
> happen... Unless somebody wants to sponsor that project, of course ;)

Just using Solr and fixing up acts_as_solr to meet your needs (if it
doesn't) would be even easier than all that :)  Solr really is a
better starting point than Lucene directly, for caching, scalability,
replication, faceting, etc.

I'd be curious to see scalability comparisons between Ferret and Solr
- or perhaps more properly between Stellr and Solr - as it boils down
to number of documents, queries per second, and faceting and
highlighting speed.  I'm betting on Solr myself (by being so into it
and basing my professional life on it).

  Erik
Sheldon Maloff (sheldonmaloff)
on 2008-08-28 17:34
(Received via mailing list)
That is one awesome rebuttal, Jens. I read that forum topic below, and
while I have a great respect for Ezra (from his fine book Deploying
Rails Applications), I must say I disagree with him with respect to
Ferret/AAF combination.

We run Ferret/AAF as a DRb server in production and on our staging
servers and I've never seen a Ferret segfault. That said, we're not
high search load like Google, but even when hit with heavy load
testing, I haven't experienced a Ferret segfault, nor corrupt indexes.

Now, corrupt indexes in development is another issue. In development,
you are not running a DRb server. Each mongrel is hitting the index
directly. You typically have only one mongrel running in development.
But if you open an interactive script/console session, and play with
your models side by side with a running mongrel, you WILL corrupt your
Ferret index. That's because both the mongrel and the script/console
will be writers to the same index, something that Ferret doesn't
support. Heck, running a rake db:migrate alongside a running mongrel
will cause index corruption, for the same reason: multiple writers.
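
One generic way to keep cooperating processes from writing at the same time is an advisory file lock. The sketch below is not how Ferret or acts_as_ferret handle it (the thread's production answer is the DRb server); the lock file path and the `with_index_write_lock` helper are assumptions for illustration, and it only helps if every writer goes through the same wrapper.

```ruby
require "tmpdir"

# Run the block while holding an exclusive lock on a lock file, so that only
# one cooperating process at a time performs index writes.
def with_index_write_lock(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |lock_file|
    lock_file.flock(File::LOCK_EX) # blocks until no other process holds the lock
    yield
  end # closing the file releases the lock
end

lock_path = File.join(Dir.tmpdir, "ferret_index.lock") # hypothetical location
result = with_index_write_lock(lock_path) do
  # index-writing code would go here (e.g. saving a model that updates the index)
  :written
end
puts result
```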

I'm wondering if that's why so many people experience Ferret indexing
problems in development? It's not always immediately obvious that
you're in a multiple-writer scenario.

For now, I'm sticking with the Ferret/AAF combination until one or the
other falls over completely.

Sheldon Maloff
Developer
http://ideas.veer.com
Marvin Humphrey (Guest)
on 2008-08-28 18:25
(Received via mailing list)
On Aug 28, 2008, at 3:11 AM, John Leach wrote:

> Is KinoSearch something that could be a Ferret replacement?

Yes.  The projects are roughly comparable.

I'd be happier if Ferret's ultimate successor was named "Lucy",
though, because then more credit would flow to Dave.

> What are the differences between it and Ferret?

 From a high level, they're pretty similar.  Analyzer, QueryParser,
IndexReader, and all that.

There are superficial differences in the implementations of individual
classes.  For instance, Ferret provides several different Tokenizer
classes; KinoSearch provides one, based on a regex pattern matching
one token.

     # KinoSearch version of WhiteSpaceTokenizer
     tokenizer = Tokenizer.new(:pattern => "\\S+")
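
To make "a regex pattern matching one token" concrete, here is a plain-Ruby illustration using only the standard library. It is not KinoSearch code; the `tokenize` helper and its return format are assumptions for the example.

```ruby
# Plain-Ruby illustration (not KinoSearch code): every match of the pattern
# becomes one token, recorded with its start and end character offsets.
def tokenize(text, pattern = /\S+/)
  tokens = []
  text.scan(pattern) do
    match = Regexp.last_match
    tokens << [match[0], match.begin(0), match.end(0)]
  end
  tokens
end

p tokenize("quick  brown fox")
# prints: [["quick", 0, 5], ["brown", 7, 12], ["fox", 13, 16]]
```

Swapping the pattern (for example `/\w+/`) changes what counts as a token without needing a separate tokenizer class.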

At a low level, things start to diverge.  For instance, all metadata
in the KinoSearch index file format is encoded as JSON, so it's human-
readable for easy spelunking and debugging.  Also, it's easier to
override methods in KinoSearch, so you can do things like implement
SearchServer/SearchClient or MockScorer or KSx::Highlight::Summarizer
in pure Perl; I believe the mechanism will work similarly with Ruby
bindings.

> what are the differences between it and the planned Lucy project

Personally, I think of them as the same project.  KinoSearch is at
version 0.x and will soon become version 1.0.  Lucy will be version 2
-- KinoSearch's successor.

Lucy has never had a high-level API -- the work Dave and I did was all
on the low-level core.  That core has now been fully implemented in
the KinoSearch dev branch.

What happens between version 1 and 2 depends on how the rollout of
version 1 goes.

> Do you happen to know if Dave is likely to work on Ferret again
> someday?

I know he would like to.  However, I hope to persuade him to return to
his work on Lucy.  :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Jens Krämer (Guest)
on 2008-08-28 19:31
(Received via mailing list)
Hi!

On 28.08.2008, at 18:24, Marvin Humphrey wrote:
[..]
> There are superficial differences in the implementations of
> individual classes.  For instance, Ferret provides several different
> Tokenizer classes; KinoSearch provides one, based on a regex pattern
> matching one token.
>
>    # KinoSearch version of WhiteSpaceTokenizer
>    tokenizer = Tokenizer.new(:pattern => "\\S+")

That's pretty simple ;) With Ferret I can use custom tokenizers to
inject additional terms at the same offset (e.g. synonyms). Is there
another way to achieve that with KinoSearch?
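
For readers unfamiliar with the idea, a plain-Ruby sketch of "additional terms at the same offset": each synonym is emitted with the same character offsets as the original token and a position increment of 0, which is how Lucene-family engines commonly represent stacked tokens. The `SynToken` struct, the synonym table and the method name are hypothetical and are not Ferret's or KinoSearch's API.

```ruby
# Hypothetical token record; pos_inc is the position increment relative to the
# previous token. An increment of 0 stacks a token on the previous position.
SynToken = Struct.new(:text, :start_offset, :end_offset, :pos_inc)

TOKEN_SYNONYMS = { "bike" => ["bicycle"] }.freeze # illustrative data only

def tokens_with_synonyms(text, synonyms = TOKEN_SYNONYMS)
  tokens = []
  text.downcase.scan(/\S+/) do
    m = Regexp.last_match
    tokens << SynToken.new(m[0], m.begin(0), m.end(0), 1)
    # Each synonym shares the offsets of the original term and does not advance the position.
    synonyms.fetch(m[0], []).each do |syn|
      tokens << SynToken.new(syn, m.begin(0), m.end(0), 0)
    end
  end
  tokens
end

tokens_with_synonyms("red bike").each do |t|
  puts "#{t.text} [#{t.start_offset},#{t.end_offset}) +#{t.pos_inc}"
end
```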

[..]
>> Do you happen to know if Dave is likely to work on Ferret again
>> someday?
>
> I know he would like to.  However, I hope to persuade him to return
> to his work on Lucy.  :)

whatever, as long as it's as powerful and easy to use as Ferret and
has ruby bindings I'm all for it :)

Cheers,
Jens

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49351467660 | Telefax +493514676666
kraemer@webit.de | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold
Jens Krämer (jkraemer)
on 2008-08-28 19:31
(Received via mailing list)
On 28.08.2008, at 17:17, Erik Hatcher wrote:

> If you're talking about custom analyzers being in Ruby, more on that
> below.

It's not only custom analyzers, but the fact that acts_as_ferret's DRb
server runs with the full Rails application loaded, so e.g. to bulk
index a number of records aaf just hands the server the ids and class
name of the records to index, and the server does the rest. It's
debatable if one approach is better than the other; in terms of index
server load it might even be better to do as much as possible on the
client side, but still it's a much tighter coupling than you get with
the application-agnostic interfaces of Solr or Stellr.
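
A rough sketch of the "hand over the class name and ids" idea, with no Rails or DRb involved. The `Article` stand-in, the `IndexServer` class and the `bulk_index` method are hypothetical names for illustration and are not acts_as_ferret's actual code.

```ruby
# Stand-in for an ActiveRecord model (hypothetical; no Rails needed here).
class Article
  RECORDS = { 1 => "Ferret road map", 2 => "Sphinx notes", 3 => "Solr tips" }.freeze

  def self.find_titles(ids)
    ids.map { |id| [id, RECORDS.fetch(id)] }
  end
end

# Server side: it has the application's classes loaded, so the client only
# needs to send a class name and a list of ids.
class IndexServer
  attr_reader :indexed

  def initialize
    @indexed = {}
  end

  def bulk_index(class_name, ids)
    model = Object.const_get(class_name)
    model.find_titles(ids).each do |id, title|
      @indexed["#{class_name}:#{id}"] = title # a real server would add to the index here
    end
    ids.size
  end
end

server = IndexServer.new # with DRb, the client would hold a remote reference instead
puts server.bulk_index("Article", [1, 3])
```

The point of the design is that only a string and a few integers cross the process boundary; loading the records and analyzing them happens where the application code already lives.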

I must admit that I have a hard time coming up with another example
besides my synonym/thesaurus analysis stuff where this might be useful,
but I think there are more use cases where such a tight integration
might come in handy.

>> It's an independent server indexing whatever you throw over the
>> fence via http+xml.
>
> Solr can index CSV as well, and now a relational database directly
> (with the new DataImportHandler).
>
> It also responds with Ruby hash structure (just add &wt=ruby to the
> URLs, or use solr-ruby which does that automatically and hides all
> server communication from you anyway).

Yeah, I know, but anyway there is a strict line between your
application and Solr, which doesn't know a thing about the application
using it.
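
For anyone who hasn't seen the wt=ruby format mentioned in the quote, here is a small sketch of consuming it without solr-ruby. The host, port, path and the sample response body are assumptions for illustration (no request is actually sent); the body only mimics the general shape of a Solr Ruby response.

```ruby
require "uri"

# Build a select URL asking Solr for its Ruby response format.
# Host, port and path are assumptions for a default local install.
params = { q: "title:ferret", wt: "ruby" }
url = URI::HTTP.build(host: "localhost", port: 8983, path: "/solr/select",
                      query: URI.encode_www_form(params))
puts url

# Illustrative response body; a real one would come from an HTTP GET on url.
body = "{'responseHeader'=>{'status'=>0,'QTime'=>1}," \
       "'response'=>{'numFound'=>1,'start'=>0," \
       "'docs'=>[{'id'=>'42','title'=>'Ferret roadmap'}]}}"

# The body is a Ruby hash literal, so eval turns it into a Hash.
# Only do this with a server you trust: eval runs whatever it is given.
result = eval(body)
puts result["response"]["numFound"]
puts result["response"]["docs"].first["title"]
```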

>> How to use a custom analyzer with solr? You have to code it in Java
>> (or you do your analysis before feeding the data into java land,
>> which I wouldn't consider good app design).
>
> Most users would not need to write a custom analyzer.  Many of the
> built-in ones are quite configurable.  Yes, Solr does require schema
> configuration via an XML file, but there have been acts_as_solr
> variants (good and bad thing about this git craze) that generate
> that for you automatically from an AR model.

Glad you mentioned this ;) I don't want to configure an analyzer via
xml when I can throw my own together with 4 or 5 lines of easy to read
ruby code. Same for index structure. Philosophical mismatch between
the Java and Ruby worlds I think :)

>> But even if you do that then you have
>> a) half a java project (I don't want that)
>
> That's totally fair, and really the primary compelling reason for a
> Ferret over Solr for pure Ruby/Rails projects.  I dig that.
>
> But isn't Ferret like 60k lines of C code too?!

true, but I don't have to compile that every time I deploy my app...

>> and b) no way to use your existing rails classes in that custom
>> analyzer (I *have* analyzers using rails models to retrieve
>> synonyms and narrower terms for thesaurus based query expansion)
>
> You could leverage client-side query expansion with Solr... just
> take the users query, massage it, and send whatever query you like
> to Solr. Solr also has synonym and stop word capability too.

yeah, I could do that. But that's moving analysis stuff into my
application, which is quite contrary to the purpose of analyzers -
encapsulate this logic and make it pluggable into the search engine
library. So less style points for this solution...

> However, there is also no reason (and I have this on my
> copious-free-time-TODO-list) that JRuby couldn't be used behind the
> scenes of a Solr analyzer/tokenizer/filter or even request handler...
> and do all the cool Ruby stuff you like right there.  Heck, you could
> even send the Ruby code over to Solr to execute there if you like ;)

that sounds sexy ;)

> Just using Solr and fixing up acts_as_solr to meet your needs (if it
> doesn't) would be even easier than all that :)  Solr really is a
> better starting point than Lucene directly, for caching,
> scalability, replication, faceting, etc.

Depends on whether you need these features or not. From my experience,
lots of projects don't need these things anyway, because they're
running on a single host and nearly every other part of the
application is slower than search... Maybe it's because I'm quite
involved with the topic and am familiar with lucene's API, but to me
Solr looks like an additional layer of abstraction and complexity
which I only want to have when it really gives me a feature I need.
Plus the last time I checked Lucene didn't need xml configuration
files ;)

In development environments and especially when it comes to automated
tests / CI it's also quite comfortable not having to run a separate
server but using the short cut directly to the index, which isn't
possible with Solr.

> I'd be curious to see scalability comparisons between Ferret and
> Solr - or perhaps more properly between Stellr and Solr - as it
> boils down to number of documents, queries per second, and faceting
> and highlighting speed.  I'm betting on Solr myself (by being so
> into it and basing my professional life on it).

This would be interesting, but I wouldn't be that disappointed with
Stellr ending up second given the little amount of time I've spent
building it so far. Just out of curiosity, do you have some kind of
performance testing suite for Solr which I could throw at Stellr?


Cheers,
Jens

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/     - The new free film database
Erik Hatcher (Guest)
on 2008-08-28 20:08
(Received via mailing list)
On Aug 28, 2008, at 1:02 PM, Jens Kraemer wrote:
>> What advantage does Ferret have in terms of ActiveRecord
>> integration that Solr wouldn't have?
>>
>> If you're talking about custom analyzers being in Ruby, more on
>> that below.
>
> It's not only custom analyzers, but the fact that acts_as_ferret's
> DRb runs with the full Rails application loaded, so i.e. to bulk
> index a number of records aaf just hands the server the ids and
> class name of the records to index, and the server does the rest.

Gotcha.  Meaning the search server is pulling from the DB directly.
That's what the DataImportHandler in Solr does as well.  It'd be a
simple single HTTP request to Solr (once the DB stuff is configured,
of course) to have it do full or incremental DB indexing.

>
> Glad you mentioned this ;) I don't want to configure an analyzer via
> xml when I can throw my own together with 4 or 5 lines of easy to
> read ruby code. Same for index structure. Philosophical mismatch
> between the Java and Ruby worlds I think :)

Don't get me wrong... I'm a Ruby fanatic myself!   XML makes me ill,
generally speaking (it has its uses, but for configuration it is just
plain wrong).

For using the built-in tokenizer/filters, a smarter acts_as_solr could
generate the right config based on a model specifying parameters for
analysis.

>>> But even if you do that then you have
>>> a) half a java project (I don't want that)
>>
>> That's totally fair, and really the primary compelling reason for a
>> Ferret over Solr for pure Ruby/Rails projects.  I dig that.
>>
>> But isn't Ferret like 60k lines of C code too?!
>
> true, but I don't have to compile that every time I deploy my app...

My point was that Ferret isn't just Ruby, just a counter point to your
"half a java project".  No one has to recompile Solr either.

> encapsulate this logic and make it pluggable into the search engine
> library. So less style points for this solution...

I was just saying :)   It's debatable exactly where in the client-
server spectrum synonym expansion belongs... and it really depends on
the needs of the project.  Nothing wrong with a client doing some user
input massaging before a query hits the search server.
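
A minimal sketch of that kind of client-side massaging, assuming a hypothetical in-memory THESAURUS (in a Rails app it might be backed by a model) and a hypothetical expand_query helper. It produces a query string in Lucene syntax and does no escaping of special characters.

```ruby
# Hypothetical synonym table; in a Rails app this might come from a model.
THESAURUS = { "car" => %w[automobile auto], "fast" => %w[quick] }.freeze

# Expands each term of the user's query into an OR group of the term and
# its synonyms, then requires all groups with AND.
def expand_query(user_query, thesaurus = THESAURUS)
  user_query.split.map do |term|
    alternatives = [term, *thesaurus.fetch(term.downcase, [])]
    alternatives.size > 1 ? "(#{alternatives.join(' OR ')})" : term
  end.join(" AND ")
end

puts expand_query("fast car")
puts expand_query("red car")
```

The expanded string is what would be sent to the search server in place of the raw user input.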

>> However, there is also no reason (and I have this on my
>> copious-free-time-TODO-list) that JRuby couldn't be used behind the
>> scenes of a Solr analyzer/tokenizer/filter or even request
>> handler... and do all the cool Ruby stuff you like right there.
>> Heck, you could even send the Ruby code over to Solr to execute
>> there if you like ;)
>
> that sounds sexy ;)

Should be fairly trivial to wire JRuby in.  The DataImportHandler
already has scripting language support for data transformation:
<http://wiki.apache.org/solr/DataImportHandler#head...>
(shield your eyes from the XML wrapping it!), so I believe JRuby
should already work in that context.  This is sort of like the Mapper
stuff I built into solr-ruby, transforming data from domain to search
engine "documents".

>>
> Solr looks like an additional layer of abstraction and complexity
> which I only want to have when it really gives me a feature I need.
> Plus the last time I checked Lucene didn't need xml configuration
> files ;)

I hear ya about the XML config files.  And always to be fair to Solr
here, you really only need to set things up from a basic example
configuration that covers most scenarios already - so it really isn't
necessary to even touch XML config except for tweaking little things.

But Solr's advantages over plain Lucene are things that most Lucene
projects eventually build anyway.  Caching - really important for
faceting, which every project I touch these days needs.  Replication -
really, really important for scaling under massive query load.  It's
really not such a big chunk over Lucene to bite off... and in almost
all respects it is even simpler to use Solr than Lucene anyway.

> In development environments and especially when it comes to
> automated tests / CI it's also quite comfortable not having to run a
> separate server but using the short cut directly to the index, which
> isn't possible with Solr.

Not true.  Solr can work embedded.  There is a base SolrServer
abstraction, with an implementation that runs embedded (inside the
same JVM) versus over HTTP.  Exactly the same interface for both
operations, using a very simple API (SolrJ, much like Lucene's basic
API actually).

>> I'd be curious to see scalability comparisons between Ferret and
>> Solr - or perhaps more properly between Stellr and Solr - as it
>> boils down to number of documents, queries per second, and faceting
>> and highlighting speed.  I'm betting on Solr myself (by being so
>> into it and basing my professional life on it).
>
> This would be interesting, but I wouldn't be that disappointed with
> Stellr ending up second given the little amount of time I've spent
> building it so far. Just out of curiosity, do you have some kind of
> performance testing suite for Solr which I could throw at Stellr?

No, I don't have those kinds of tests myself.   While I can speak to
Solr's performance based on what I hear from our clients and the
reports in the mailing lists, I don't consider myself a
performance-savvy person.

I'm curious - what are the numbers of documents being put into Ferret
indexes out there?   millions?   hundreds of millions?  billions?  And
are folks doing faceting?  Does Ferret have faceting support?

  Erik
Marvin Humphrey (Guest)
on 2008-08-28 20:11
(Received via mailing list)
On Aug 28, 2008, at 10:10 AM, Jens Krämer wrote:

> With Ferret I can use custom tokenizers to inject additional terms
> at the same offset (e.g., synonyms). Is there another way to achieve
> that with KinoSearch?

Synonym support isn't part of the public API right now, but since the
basic principle is the same in KinoSearch as it is in Ferret and
Lucene, it shouldn't be hard to add.

I don't think we'd do this by extending Tokenizer; I think we'd want
SynonymFilter/SynonymMap classes akin to the ones provided by Solr.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Jens Krämer (jkraemer)
on 2008-08-28 21:06
(Received via mailing list)
On 28.08.2008, at 20:03, Erik Hatcher wrote:

>> index a number of records aaf just hands the server the ids and
>> class name of the records to index, and the server does the rest.
>
> Gotcha.  Meaning the search server is pulling from the DB directly.
> That's what the DataImportHandler in Solr does as well.  It'd be a
> simple single HTTP request to Solr (once the DB stuff is configured,
> of course) to have it do full or incremental DB indexing.

With the slight difference that custom logic defined in the Rails
model class is still involved: to preprocess data, to index values
calculated at indexing time, or even to have certain records refuse
being indexed based on their current state. Having per-document boosts
depending on some value from the database (e.g. record popularity) is
also a classic... Aaf never just pulls data from the db, it always
uses Rails model objects. Doesn't make indexing faster of course...
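
Roughly what I mean, as a plain-Ruby sketch. Post, indexable?, word_count, boost and to_doc are hypothetical names for illustration, not aaf's actual API, and the boost formula is just an assumption to show the idea.

```ruby
# Hypothetical model; indexable?, boost and to_doc are illustrative names,
# not acts_as_ferret's actual API.
Post = Struct.new(:title, :body, :published, :views) do
  # Records can opt out of indexing based on their current state.
  def indexable?
    published
  end

  # A value calculated at indexing time rather than stored in the database.
  def word_count
    body.split.size
  end

  # Per-document boost derived from popularity, capped at 2.0.
  def boost
    1.0 + [views, 1000].min / 1000.0
  end

  def to_doc
    { title: title, word_count: word_count, boost: boost }
  end
end

posts = [
  Post.new("Draft", "not ready yet", false, 10),
  Post.new("Hello", "full text search for my blog", true, 500)
]

# Only records that agree to be indexed are turned into documents.
docs = posts.select(&:indexable?).map(&:to_doc)
docs.each { |doc| puts doc.inspect }
```

Since every record goes through the model object, all of this logic runs for each document, which is exactly why it is slower than pulling rows straight from the database.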

[..]
> XML makes me ill, generally speaking (it has its uses, but for
> configuration it is just plain wrong).

FULL ACK :)

>>> But isn't Ferret like 60k lines of C code too?!
>>
>> true, but I don't have to compile that every time I deploy my app...
>
> My point was that Ferret isn't just Ruby, just a counter point to
> your "half a java project".  No one has to recompile Solr either.

but the custom analyzer implemented in Java... By saying 'half a java
project' I didn't mean solr, but the parts of my application logic
that have to be implemented in Java in order to be plugged into solr.
But the JRuby route looks promising here of course.

>> encapsulate this logic and make it pluggable into the search engine
>> library. So less style points for this solution...
>
> I was just saying :)   It's debatable exactly where in the client-
> server spectrum synonym expansion belongs... and it really depends
> on the needs of the project.  Nothing wrong with a client doing some
> user input massaging before a query hits the search server.

[..]

>>>
>> API, but to me Solr looks like an additional layer of abstraction
>> and complexity which I only want to have when it really gives me a
>> feature I need. Plus the last time I checked Lucene didn't need xml
>> configuration files ;)
>
> I hear ya about the XML config files.  And always to be fair to Solr
> here, you really only need to set things up from a basic example
> configuration that covers most scenarios already - so it really
> isn't necessary to even touch XML config except for tweaking little
> things.

But I still have to read it in order to see if it fits my needs. Okay,
I'll stop whining about that xml now ;)

[..]
>> In development environments and especially when it comes to
>> automated tests / CI it's also quite comfortable not having to run
>> a separate server but using the short cut directly to the index,
>> which isn't possible with Solr.
>
> Not true.  Solr can work embedded.  There is a base SolrServer
> abstraction, with an implementation that runs embedded (inside the
> same JVM) versus over HTTP.  Exactly the same interface for both
> operations, using a very simple API (SolrJ, much like Lucene's basic
> API actually).

cool, but that won't work for Rails projects running on MRI and
accessing solr via solr-ruby.

>
> No, I don't have those kinds of tests myself.   While I can speak to
> Solr's performance based on what I hear from our clients and the
> reports in the mailing lists, I don't consider myself a
> performance-savvy person.
>
> I'm curious - what are the numbers of documents being put into
> Ferret indexes out there?   millions?   hundreds of millions?
> billions?  And are folks doing faceting?  Does Ferret have faceting
> support?

not sure about the billions, but afair an earlier message in this
thread stated an index size of 90 million documents with aaf.
Altlaw.org has reported an index size of > 4GB with around 700k
documents last fall. The selfhtml.org index has approximately 1
million forum entries indexed, index size around 2GB. Stellr doesn't
ever use more than around 50MB of RAM during indexing and searching
this index. I know RAM is cheap and all, but RAM size still has a
quite large influence on the price of the server you rent for your
app, at least here in Germany.

Without doubt Solr has many more references in the area of such large
installations than ferret/aaf. I for myself never saw aaf as a drop-in
solution for indexes of this size, but more as an easy to use out of
the box solution for the average Rails app with maybe several
thousands or tens of thousands of records, but I'm happy to see it
still works in larger scale setups.

Heck, it all began with a simple full text search for my blog ;)

Regarding the faceting - it's not built into ferret, and aaf doesn't
support it either since I didn't need it yet, and nobody else
requested this feature so far. All in all I think the average usage
scenarios of solr and aaf are quite different atm...

I'll try to find the time to benchmark the selfhtml.org data set with
solr and stellr. I'll report my findings here.

Cheers,
Jens

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/     - The new free film database
Erik Hatcher (Guest)
on 2008-08-28 22:15
(Received via mailing list)
On Aug 28, 2008, at 3:02 PM, Jens Kraemer wrote:
> popularity) is also a classic... Aaf never just pulls data from the
> db, it always uses rails model objects. Doesn't make indexing faster
> of course...

All great points.  ActiveRecord is much more pleasant than any other
database access that I've ever worked with.  I don't generally work
with databases personally, though.  The bulk of my full-text searching
experiences don't involve databases at all.

I suppose the Java counterpart would be Hibernate Search - surely
involving a lot more hideous XML and @annotations - ewww.

>> basic API actually).
>
> cool, but that won't work for Rails projects running on MRI and
> accessing solr via solr-ruby.

Fair point.

Again, the answer comes back to JRuby ;)  Forget MRI.   Good point
about solr-ruby - it is specifically designed for Solr over HTTP.  It
wouldn't take much to refactor it to work with embedded Solr via JRuby
though.  But if JRuby is a given, it'd be just as easy to work with
SolrJ's API directly.

Though for testing purposes, solr-ruby is easily mocked.  solr-ruby
touts great (98% or something like that) code coverage with unit
tests; many of those tests run against solr-ruby's API with Solr
itself mocked.  And there are tests that fire up Solr in the
background and test that way too for full functional tests.   So for
unit testing purposes, having Solr running isn't needed, but it
launches plenty fast enough for testing end-to-end if desired.

> ever use more than around 50MB of RAM during indexing and searching
> this index. I know RAM is cheap and all, but RAM size still has a
> quite large influence on the price of the server you rent for your
> app, at least here in Germany.

90 million is impressive for sure.

RAM - well, when Ferret/Stellr does faceting we'll revisit that
discussion :)   Solr loves RAM!  It still can run in modest
environments, but the more RAM you can give it to use for caches
(depending on your needs) the better it is.

> Without doubt Solr has much more references in the area of such
> large installations than ferret/aaf. I for myself never saw aaf as a
> drop-in solution for indexes of this size, but more as an easy to
> use out of the box solution for the average rails app with maybe
> several thousands or tens of thousands of records, but I'm happy to
> see it still works in larger scale setups.

Indeed!  ferret: +1 - no question!

> Heck, it all began with a simple full text search for my blog ;)

Same for me (though I abandoned it when I realized that regular
blogging and server maintenance weren't for me).

> Regarding the faceting - it's not built into ferret, and aaf doesn't
> support it either since I didn't need it yet, and nobody else
> requested this feature so far. All in all I think the average usage
> scenarios of solr and aaf are quite different atm...

I'm really surprised by that.  Faceting is the major feature that
attracts folks to Solr.  It's critical for all of our customers.

But yeah, no question that Lucene/Solr and Ferret/Stellr can happily
coexist and aren't necessarily competition for every project.  But
there definitely are those areas of overlap where a project could go
with either solution.  And I would definitely not try to shoehorn Solr
into a project where it didn't fit and Ferret worked fine.  I'm
pragmatic like that.

> I'll try to find the time to benchmark the selfhtml.org data set
> with solr and stellr. I'll report my findings here.

Awesome.  If you have the data in some easily digestible format, I'd
be happy to toss it into Solr and report back numbers from my
development machine.  Drop me a line offline if you'd like.

  Erik