Ferret vs. mysql fulltext


#1

Hi,
With the current state of Ferret, can anyone compare the speed of MySQL
fulltext search with Ferret’s index search? And do I have to query the
database after getting results from Ferret?
Thanks in advance.


#2

Hi,

Ferret is not only faster as the data gets larger (I have benchmarked
this a few times), it’s also more accurate because of its query analyser
(you can use Google-like search queries). There are two options: you can
store everything in Ferret (and no longer need a database), or store
only the index (the fields you need to index) and retrieve the other
values from MySQL.
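
A minimal sketch of the second option, written in Python rather than
against Ferret's real API. The in-memory SQLite table stands in for
MySQL, the `index` dictionary stands in for the Ferret index, and the
`articles` table and `search` function are hypothetical names made up
for the illustration. The index holds only ids, so every search ends
with one query against the database:

```python
import sqlite3
from collections import defaultdict

# Stand-in for MySQL: an in-memory SQLite table holds the full rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
db.executemany(
    "INSERT INTO articles (id, title, body) VALUES (?, ?, ?)",
    [
        (1, "Ferret intro", "ferret is a search library for ruby"),
        (2, "MySQL tips", "fulltext search in mysql"),
        (3, "Lucene notes", "lucene is a search library for java"),
    ],
)

# Stand-in for the search index: term -> set of row ids. Only ids are kept.
index = defaultdict(set)
for row_id, body in db.execute("SELECT id, body FROM articles").fetchall():
    for term in body.lower().split():
        index[term].add(row_id)


def search(query):
    """Return (id, title) rows for documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    ids = set.intersection(*(index.get(term, set()) for term in terms))
    if not ids:
        return []
    # Second step: fetch the non-indexed fields from the database.
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT id, title FROM articles WHERE id IN ({placeholders}) ORDER BY id"
    return db.execute(sql, sorted(ids)).fetchall()
```

Calling `search("search library")` looks up both terms in the index,
intersects the id sets, and only then asks the database for the titles.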

At the moment I am trying to write a better plugin for Ferret so that
you can specify what needs to be indexed and use find (instead of a
special method) with additional options. It would also automatically
query the database for additional fields.


#3

On 12/13/05, Abdur-Rahman A. removed_email_address@domain.invalid wrote:

Hi,

Ferret is not only faster as the data gets larger (I have benchmarked
this a few times), it’s also more accurate because of its query analyser
(you can use Google-like search queries).

This is great to know. I’m surprised. Ferret is going to be much, much
faster soon. I’m rewriting it all in C.

At the moment I am trying to write a better plugin for Ferret so that
you can specify what needs to be indexed and use find (instead of a
special method) with additional options. It would also automatically
query the database for additional fields.

Please keep us updated as to how this is going. I’d like to add more
stuff like this to the Ferret Wiki. You might like to look at this
page if you haven’t already;

http://ferret.davebalmain.com/trac/wiki/FerretOnRails

Far from a perfect solution so please feel free to add to it. :)

Cheers,
Dave


#4

thanks all for the great work.


#5

Hi David,

I think you should be careful about using Ferret as a replacement for a
database until it’s really mature. (Indexes can be recreated at any time
from the original data.) MySQL has proven itself as a mature database
solution and has many tools for maintenance and management. Ferret, in
my opinion, can’t replace that (I don’t even think Lucene can). It lacks
certain management tools that a database needs. Current databases, on
the other hand, lack advanced query parsers (and that’s good, because
they would only make the database more complex). I know of Lucene being
linked to existing databases with very good results; this should be
possible with Ferret too, or not?


#6

I think storing data only in Ferret is a bad idea, as tables have
relations with other tables, etc.


#7

On 12/13/05, Abdur-Rahman A. removed_email_address@domain.invalid wrote:

databases with very good result, this should be possible with ferret or not?

Sure. I wouldn’t replace a database with Ferret in most instances and
probably not in a Rails app since Rails makes it so easy to use a
database. I was just trying to say it was possible to use Ferret or
Lucene as a data store. :)


#8

David,

Are you trying to make a Lucene-compatible project, or a similar
project? Because I think that with the possibilities of Ruby, in time it
would be possible to go beyond what’s possible in Java…
Really great project. I hope to be able to contribute; my C skills are a
little old (10 years or so), but maybe I can help you out on the Ruby
end with improvements…


#9

Agreed. I meant it’s probably not worth storing the data in Ferret.
Just use it for the indexing and keep your data in the database.

((On a side note, it is possible for some applications to do away with
the database and use Ferret as the only data store. I think that’s how
Erik H.'s blog software Blogscene works.))


#10

On Dec 13, 2005, at 9:30 AM, Abdur-Rahman A. wrote:

Are you trying to make a Lucene-compatible project, or a similar
project? Because I think that with the possibilities of Ruby, in time it
would be possible to go beyond what’s possible in Java…

Could you elaborate in what ways you feel Ferret could go beyond what
is possible with Java Lucene? How does Java hold Lucene back?

Genuinely curious,
Erik


#11

On Dec 13, 2005, at 8:04 AM, David B. wrote:

((On a side note, it is possible for some applications to do away with
the database and use Ferret as the only data store. I think that’s how
Erik H.'s blog software Blogscene works.))

If only I had that e-mail-to-blog gateway, I’d be blogging all the time!

Yes, http://www.blogscene.org/erik is powered entirely by a Lucene
index, a servlet, and some Velocity templates. The original blog
entries reside in blosxom-style text files, but at runtime only
Lucene is used.

It really depends on the scenario, but in general I don’t recommend
using Lucene (or Ferret) as the definitive data source. The primary
reason is that an index is optimized for how it is going to be
searched, and you may later want to change how text is tokenized and
thus what terms are indexed. Having the original data around to be
able to re-index with different settings is a good thing. It’s also
possible to store the original data in Lucene and pull it out for
reindexing purposes - but that is trickier.
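
A small illustration of that point, as a Python sketch rather than
Lucene or Ferret code. The two tokenizers, the `documents` dictionary and
the `build_index` helper are hypothetical names made up for the example.
Because the original text is still available, changing how terms are
produced only requires building the index again:

```python
import re
from collections import defaultdict

# Original data, kept outside the index (text files, database rows, ...).
documents = {
    1: "Re-indexing is easy",
    2: "The index is optimized",
}


def whitespace_tokens(text):
    """First tokenizer: lowercase and split on whitespace only."""
    return text.lower().split()


def word_tokens(text):
    """Second tokenizer: split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, tokenize):
    """Build a term -> set of document ids mapping from the original text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


old_index = build_index(documents, whitespace_tokens)
# Changing the tokenization later is just a rebuild from the same source.
new_index = build_index(documents, word_tokens)
```

With the first tokenizer the term `indexing` does not exist at all (only
`re-indexing` does), so a search for it finds nothing until the index is
rebuilt with the second one.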

Erik

#12

Erik,

I am sorry, I am just excited about Ruby in general. But with a language
like Ruby and a project like Lucene, it’s my personal opinion that LOC
makes a difference. Things like mixins and the way you program in Ruby
make things just a bit easier. It took me 4 or 5 days to understand and
work with Lucene (great book, by the way) and it only took me 10 days to
learn most of edge Rails and many other plugins by reading code (yes,
not docs, code, LOL)…

Lucene is a great product, and will continue on Java (you can’t kill
Java, it’s really usable for many things). But Ruby just makes it easy
to program, and with the C integration, well, things are optimized. I
have only been using Ruby for 20 days or so, but it amazes me how much
of a difference a language can make…

So I have to revise my statement a bit, but I think that, in time,
melding Ferret and ActiveRecord together could make it a better product
than Lucene : ) But that’s future talk…

Well, I am amazed to see you here : ) What is your opinion?

Abdur-Rahman


#13

I just got done reviewing some of the info in the Ferret wiki. It looks
like some great work - thanks!

I’m building an app that is going to have some search capability and I
was planning on using MySQL with fulltext searches, but looking at
Ferret has got me wondering if there might not be a better way.

Specifically, I was wondering about the idea of using an in-memory index
to increase the speed of searches.

The data I’m storing will be used most when it is relatively new. After
it’s a few days old, people won’t need it as much. So putting all this
data in the same database may not make sense (if it’s relatively easy to
split it into ‘fresh’ and ‘stale’ databases).

Would it make sense to consider using an in-memory cache of documents
for the newest data while having a disk-based index for when people want
to search for older documents? Or would the performance gains not be
worth the effort?
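
To make the fresh/stale idea concrete, here is a rough Python sketch. It
is not Ferret code: the `TieredIndex` class, its method names and the
three-day cutoff are all assumptions made up for the illustration, and
the "stale" dictionary only stands in for a disk-based index.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class TieredIndex:
    """Toy two-tier index: new documents in one map, older ones in another."""

    def __init__(self, fresh_for=timedelta(days=3)):
        self.fresh_for = fresh_for
        self.fresh = defaultdict(set)  # in-memory tier, searched by default
        self.stale = defaultdict(set)  # stand-in for a disk-based index
        self.fresh_added = {}          # doc_id -> time it was added
        self.doc_terms = {}            # doc_id -> set of its terms

    def add(self, doc_id, text, now):
        terms = set(text.lower().split())
        self.doc_terms[doc_id] = terms
        self.fresh_added[doc_id] = now
        for term in terms:
            self.fresh[term].add(doc_id)

    def age_out(self, now):
        """Move documents older than the cutoff from the fresh to the stale tier."""
        expired = [d for d, t in self.fresh_added.items() if now - t > self.fresh_for]
        for doc_id in expired:
            del self.fresh_added[doc_id]
            for term in self.doc_terms[doc_id]:
                self.fresh[term].discard(doc_id)
                self.stale[term].add(doc_id)

    def search(self, term, include_stale=False):
        """Search the fresh tier, and the stale tier only when asked to."""
        term = term.lower()
        hits = set(self.fresh.get(term, ()))
        if include_stale:
            hits |= self.stale.get(term, set())
        return sorted(hits)
```

A default search touches only the fresh tier, and a caller opts in to
the slower tier with `include_stale=True`. Whether the extra bookkeeping
pays off depends on the size of the document set.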

-kevin


#14

Hi Onur,

I can’t offer any input on speed comparisons between Ferret and MySQL
fulltext search. I will say this though. If the results that MySQL
fulltext search returns are good enough then use it. But if you care
about the relevancy of your results and you want to be able to run
advanced queries like boolean queries or phrase queries, you’ll want
to go with Ferret, and it should be fast enough.
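
For anyone unsure what those query types mean, here is a toy Python
sketch (not Ferret's API or query syntax). The `docs` data and the
`boolean_query` and `phrase_query` functions are hypothetical names made
up for the example. The index records word positions, which is what
makes phrase matching possible:

```python
from collections import defaultdict

docs = {
    1: "quick brown fox",
    2: "brown quick fox",
    3: "quick red fox",
}

# Positional index: term -> {doc_id: [positions of the term in that doc]}
positions = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        positions[term].setdefault(doc_id, []).append(pos)


def docs_with(term):
    """Ids of all documents containing the term."""
    return set(positions.get(term, {}))


def boolean_query(must, must_not=()):
    """Documents containing every term in `must` and none in `must_not`."""
    hits = set(docs)
    for term in must:
        hits &= docs_with(term)
    for term in must_not:
        hits -= docs_with(term)
    return sorted(hits)


def phrase_query(words):
    """Documents where the words appear next to each other, in order."""
    first, rest = words[0], words[1:]
    hits = []
    for doc_id, starts in positions.get(first, {}).items():
        for start in starts:
            if all(
                start + offset + 1 in positions.get(word, {}).get(doc_id, [])
                for offset, word in enumerate(rest)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)
```

Documents 1 and 2 contain the same words, so a boolean query for `quick`
and `brown` matches both, while the phrase query for "quick brown"
matches only document 1.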

As for having to query the database, that will depend how you want to
use Ferret. You can store the data in the Ferret index if you like, in
which case you won’t have to query the database. I think it’s better
just to keep the data in one spot though.

HTH,
Dave


#15

I just wanted to add that I think the ideal solution would be for me to
be able to define a single index that did both – that is, that would
cache documents in memory while keeping the full index on disk.

It would be great as well if I could specify how I wanted the cache to
work – say, by giving it a regular expression or some query to tell it
what should be cached in memory. Maybe I could also specify a limit on
the total memory it should use for the cache.

I might, for example, want to have it cache documents based on a certain
user or customer id rather than cache them by date. Maybe whenever a new
user logs in I modify the cache settings to include their documents in
the cache – and whenever someone logs out I flush theirs.
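
Roughly what I have in mind, as a Python sketch. Nothing here is an
existing Ferret feature: `UserDocumentCache`, the `load_documents`
callback and the `max_documents` limit are names invented for the
illustration, and a document count is used as a crude stand-in for a
real memory limit.

```python
from collections import OrderedDict


class UserDocumentCache:
    """Keeps the documents of logged-in users in memory, up to a size limit."""

    def __init__(self, load_documents, max_documents=1000):
        # load_documents is a hypothetical callback: user_id -> list of documents
        self.load_documents = load_documents
        self.max_documents = max_documents
        self.by_user = OrderedDict()  # user_id -> documents, oldest login first

    def _size(self):
        return sum(len(user_docs) for user_docs in self.by_user.values())

    def login(self, user_id):
        """Load the user's documents into the cache, evicting others if needed."""
        self.by_user[user_id] = list(self.load_documents(user_id))
        self.by_user.move_to_end(user_id)
        # Evict the users who logged in longest ago until the cache fits.
        while self._size() > self.max_documents and len(self.by_user) > 1:
            self.by_user.popitem(last=False)

    def logout(self, user_id):
        """Flush the user's documents from the cache."""
        self.by_user.pop(user_id, None)

    def cached(self, user_id):
        """Return the cached documents, or None if the user is not cached."""
        return self.by_user.get(user_id)
```

A miss in `cached` would fall through to the full on-disk index, so the
cache only ever affects speed, not results.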

The value of this is that it hides the complexity from developers/users
and makes it easy to use.

Sorry for the ‘stream of consciousness’ design reqs – I’m just dumping
the idea now since I was thinking about it…


#16

Erik H. wrote:

It’s not quite comparable the difference between a full-text search
engine and a web framework.

Lucene is optimized heavily - its code is more C-like than Java-
like. Making Lucene more OO or taking advantage of all the fancy
Ruby ways of method trickery is likely to slow things down. The
entire idea of a full-text search engine is to be fast! (oh, and to
be easy on resources as well)

The Java version is really heavy at the moment (just to mention it ;)),
but you’re quite right, search queries can’t be cached very easily. So
writing optimized code is very important.

Well, in all fairness to Lucene, it is orthogonal to the database
concern entirely. Of course Ferret + ActiveRecord > just Lucene, but
to make the comparison more fair, how about Lucene + Hibernate?
There are hooks for Hibernate to index with Lucene, even using Java
annotations to mark the fields to be indexed, and how they are to be
indexed. I see ActiveRecord + Ferret to be a great path to go, and
the acts_as_ferret initial implementation is on the right track. I
hope to delve into this area more myself in the future (though my
work does not currently involve relational databases, but will soon).

I am busy at the moment creating a plugin for Rails that will be easy to
use and extends ActiveRecord. I am trying to combine the database and
Ferret with a new method that builds upon find (search): it searches
Ferret only if a query is present, and then fetches the rows using find.

to avoid porting every time Java Lucene changes (which is where the
guru creator Doug Cutting spends his effort), it would be a simple
recompilation (and perhaps some API glue).

That’s a very good idea, but compiling Java sounds weird :). David, have
you considered this? I wonder how well it would integrate…


#17

On Dec 13, 2005, at 6:15 AM, David B. wrote:

existing
databases with very good result, this should be possible with
ferret or not?

Sure. I wouldn’t replace a database with Ferret in most instances and
probably not in a Rails app since Rails makes it so easy to use a
database. I was just trying to say it was possible to use Ferret or
Lucene as a data store. :)

I treat the data I store in the Ferret index as a denormalized table
tuned for the queries it answers.
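
As a rough Python illustration of what that means (not Ferret code; the
`authors` and `posts` data and the `to_index_document` helper are
hypothetical names made up for the example), each indexed document is a
flattened copy of a row plus whatever related data the search needs:

```python
# Normalized source data, as it might sit in two database tables.
authors = {1: {"name": "Dave"}, 2: {"name": "Erik"}}
posts = [
    {"id": 10, "author_id": 1, "title": "Ferret in C", "tags": ["ferret", "c"]},
    {"id": 11, "author_id": 2, "title": "Lucene in Action", "tags": ["lucene", "java"]},
]


def to_index_document(post):
    """Flatten a post and its related rows into one flat document."""
    return {
        "id": post["id"],
        "title": post["title"],
        # Copied in from the authors table so searches need no join.
        "author_name": authors[post["author_id"]]["name"],
        # Tags collapsed into one searchable text field.
        "tags": " ".join(post["tags"]),
    }


index_documents = [to_index_document(post) for post in posts]
```

The database stays the source of truth, and the flattened documents can
be regenerated from it at any time.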

jeremy


#18

On 12/14/05, Abdur-Rahman A. removed_email_address@domain.invalid wrote:

the acts_as_ferret initial implementation is on the right track. I
the rubylucene (formerly rucene) project at RubyForge once upon a

That’s a very good idea, but compiling Java sounds weird :). David, have
you considered this? I wonder how well it would integrate…

Yes, Erik and I have discussed it already. It might be a better way to
do it but I can’t find the motivation. It’s a lot more interesting and
motivating for me trying to create something that runs faster than
Lucene. Besides being slightly faster, C is also lighter on resources
and makes for a much smaller download. I was and still am interested
in desktop search so these are all important to me. Speaking of Doug
Cutting, he has some words to say on this too;

http://nutch.sourceforge.net/blog/2005/02/open-source-desktop-search.html

So those are my reasons for taking the route I am, and since I’m
currently doing the work, I get to choose. ;) If anyone wants to get
stuck into porting the PyLucene stuff I’m more than willing to lend
a hand. It’s definitely worth doing but it’s not really my cup of
tea.


#19

Hi Kevin,

I can’t quite tell from your description. Do you actually want to
store and retrieve the documents from a Ferret index? Or do you just
want to run the search on the index and then retrieve the results from
the database? Also, how large a document set are you expecting? If you
still have to retrieve the documents from the database I think Ferret
should be fine as is without the caching. If you are running into
performance problems after it’s implemented I could certainly help you
set up some caching.

Cheers,
Dave


#20

On 12/14/05, Jan P. removed_email_address@domain.invalid wrote:

Even in this early stage the rails community owes a great deal of
compliment to the ongoing efforts on ferret.

Especially the logos. ;)
Thanks.