Ferret vs. mysql fulltext

Onur_T · December 13, 2005, 4:17pm

On 12/13/05, Abdur-Rahman A. wrote:

David,

Are you trying to make a Lucene-compatible project, or a similar
project? Because I think that with the possibilities of Ruby, in time it
would be possible to go beyond what is possible in Java…

Very good question. At the moment I’m trying to stay compatible. But
if I get enough contributors I’ll consider forking off. Lucene is
quite a large project with a lot of contributors so it might be hard
to push ahead of them.

Really great project, I hope to be able to contribute. My C skills are a
little old (10 years or so); maybe I can help you out on the Ruby end for
improvements…

Any help is appreciated. Just recommending Ferret is going to help the
project in the long run so I thank you for that. Also contributing to
the wiki is very important.

Thanks,
Dave

Onur_T · December 13, 2005, 8:46pm

On Dec 13, 2005, at 12:54 PM, Abdur-Rahman A. wrote:

The Java version is really heavy a.t.m. (just to mention it ;)),
but you’re quite right, search queries can’t be cached very easily.
So writing optimized code is very important.

What do you mean by “heavy”? I guess I’m being a bit defensive
about Java Lucene. I’m not understanding your negatives to Java
Lucene other than your preference for Ruby. It still remains to be
seen how performant and optimized Ferret can be compared to Java
Lucene. My hunch is that porting to C will make it slightly faster
in spots, but whether it is worth the headaches of maintaining the
port is my question.

Pythonic API. In order to avoid porting every time Java Lucene
changes (which is where the guru creator Doug Cutting spends his
effort), it would be a simple recompilation (and perhaps some API
glue).

That’s a very good idea, but compiling Java sounds weird :). David,
have you considered this? I wonder how well it would integrate…

PyLucene is fast. Super fast.

Erik

Onur_T · December 13, 2005, 9:13pm

What do you mean by “heavy”? I guess I’m being a bit defensive
about Java Lucene. I’m not understanding your negatives to Java
Lucene other than your preference for Ruby. It still remains to be
seen how performant and optimized Ferret can be compared to Java
Lucene. My hunch is that porting to C will make it slightly faster
in spots, but whether it is worth the headaches of maintaining the
port is my question.

I think I am sounding more negative than I am : ) I repeat, I like Lucene
for most projects, but for something like a large scale search
engine, it’s maybe better, I think, to have a C implementation. On some
project we used CLucene or lucene4c (I don’t remember which, I was
project leader) and it was much faster than using Lucene. I was only
mentioning making the C port as it may be faster to implement this.

PyLucene is fast. Super fast.

Erik, you are the expert, I am just trying to learn as I go along…
thnx for your feedback : )

Onur_T · December 13, 2005, 8:12pm

I’m not sure yet what’s best. I haven’t built that part of my app yet and am
still working through the design. I’m just trying to think through the best
approach for now. Do you have pointers to docs that can provide some basic
‘rules of thumb’ for design - like when to store docs in a database and run a
search on the index -v- when to store docs in the index directly?

I used Verity for search on an e-commerce site I helped build a few years ago.
We stored the actual docs in a database (product descriptions, actually) but
used verity for searching - it worked fine, but was a pain since updating the
product catalog tables and the verity search index had to be closely
coordinated or you’d find search results for products that weren’t in the
database…

Also, regarding creating an index in memory -v- creating it on disk – are there
significant performance differences (eg, 20% - 50% faster or more) when using an
in-memory index? Has anyone published test results?

Thanks again for your help and your efforts. My needs aren’t pressing, I’m just
trying to figure out how using Ferret might benefit the app I’m building.

-k

Onur_T · December 13, 2005, 9:37pm

On 12/14/05, Kevin B. wrote:

However, I need to allow the ability to search for all results – both new
data and old. Once the database is large, then this “new data” may be only 1%
or less of the overall database. The new data may consist of several thousand
documents.

So if I do the math, you’re expecting to have several hundred thousand
documents? Ok, you’ve got my attention now.

I’m wondering if it might be useful to store all data in a disk-based index
while also storing the newest data in an in-memory index. This would allow me
to offer faster results when searching only the new data (which is what most
people will likely use) while still allowing people to search the entire
dataset if they want to.

In-memory or not, it will certainly be faster to search a smaller
document set so splitting the index in two might not be a bad idea.
Perhaps you could have a daily process which reindexes the recent
document set.

Of course, this is only a good idea if it provides a significantly faster
response time for searching the in-memory index.

The in-memory part won’t make the big difference. Having a smaller
index might. I’d recommend doing the simplest thing possible and
refactoring if necessary. It shouldn’t be hard to add a second
in-memory index later. Up to you though.

Dave
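
A minimal sketch of the split discussed above, assuming a toy in-process inverted index rather than Ferret itself. The names `TinyIndex`, `SplitSearch` and `rebuild_recent` are hypothetical and are not part of any Ferret API. The idea shown is only that every document goes into the full index, new documents also go into a small "recent" index, searches default to the small one, and a periodic job rebuilds the recent set.

```ruby
require 'set'

# Toy inverted index (term => set of document ids).
# A stand-in for a real search index, NOT Ferret's API.
class TinyIndex
  def initialize
    @postings = Hash.new { |hash, term| hash[term] = Set.new }
  end

  # Tokenize on word characters and record the document under each term.
  def add(doc_id, text)
    text.downcase.scan(/\w+/) { |term| @postings[term] << doc_id }
  end

  # Return the sorted ids of documents containing the term.
  def search(term)
    @postings.fetch(term.downcase, Set.new).to_a.sort
  end
end

# Keeps a full index plus a smaller index holding only recent documents.
class SplitSearch
  def initialize
    @recent = TinyIndex.new
    @full = TinyIndex.new
  end

  # Every document goes into the full index; new ones also go into @recent.
  def add(doc_id, text, recent: true)
    @full.add(doc_id, text)
    @recent.add(doc_id, text) if recent
  end

  # Search the small index by default, the full one on request.
  def search(term, scope: :recent)
    (scope == :all ? @full : @recent).search(term)
  end

  # What a daily job might do: rebuild the recent index from scratch,
  # given a hash of doc_id => text for the documents that still count as new.
  def rebuild_recent(docs)
    @recent = TinyIndex.new
    docs.each { |doc_id, text| @recent.add(doc_id, text) }
  end
end

searcher = SplitSearch.new
searcher.add(1, 'old ferret news', recent: false)
searcher.add(2, 'fresh ferret release')
p searcher.search('ferret')              # recent documents only
p searcher.search('ferret', scope: :all) # entire dataset
```

Whether the recent index lives in memory or on disk is a separate decision; as noted above, the smaller document set is the part expected to matter.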

Onur_T · December 14, 2005, 12:54am

What you could consider is using something like cacheAR for the latest
queries or for popular queries…

Onur_T · December 13, 2005, 8:00pm

Free Search | Ramblings about Lucene, Nutch, Hadoop & other stuff

So those are my reasons for taking the route I am, and since I’m
currently doing the work, I get to choose. If anyone wants to get
stuck into porting the PyLucene stuff I’m more than willing to lend
a hand. It’s definitely worth doing but it’s not really my cup of
tea.

Haha : ) Well, you’re doing a great job; I’ll continue to use Ferret! I
don’t have a client request a.t.m. for taking on such a project. Maybe
after a couple of months…

Onur_T · December 14, 2005, 1:07am

On 12/14/05, Abdur-Rahman A. wrote:

What you could consider is using something like cacheAR for the latest
queries or for popular queries…

I’m not really sure but I think you’d probably just use cacheAR to
cache the popular documents. I don’t know if I mentioned already but I
haven’t had enough time to work with much of the rails stuff yet.
Soon.
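
For illustration only, here is a sketch of caching the results of popular queries with a small least-recently-used cache. It does not use cacheAR (whose API is not shown in this thread); `QueryCache` and its `fetch` method are hypothetical names, and the block passed to `fetch` stands in for whatever actually runs the search.

```ruby
# Small LRU cache keyed by query string. Hypothetical helper, not cacheAR.
# Relies on Ruby hashes preserving insertion order.
class QueryCache
  def initialize(max_entries)
    @max_entries = max_entries
    @entries = {}
  end

  # Return cached results for the query, or run the block and cache its value.
  def fetch(query)
    results = if @entries.key?(query)
                @entries.delete(query) # removed here, re-inserted below as newest
              else
                yield(query)
              end
    @entries[query] = results
    @entries.shift if @entries.size > @max_entries # drop least recently used
    results
  end

  # Must be called whenever the index changes, or stale results are served.
  def clear
    @entries.clear
  end
end

cache = QueryCache.new(100)
run_search = ->(query) { "results for #{query}" } # placeholder for a real search
puts cache.fetch('ferret') { |q| run_search.call(q) }
puts cache.fetch('ferret') { |q| run_search.call(q) } # served from the cache
```

The sketch leaves open when to call `clear`; with a frequently updated index, cached results go stale quickly.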

Onur_T · December 13, 2005, 8:00pm

So those are my reasons for taking the route I am, and since I’m
currently doing the work, I get to choose. If anyone wants to get
stuck into porting the PyLucene stuff I’m more than willing to lend
a hand. It’s definitely worth doing but it’s not really my cup of
tea.

My kudos for these honest words!! A motivated developer is often the
most important thing.

Even in this early stage the rails community owes a great deal of
compliment to the ongoing efforts on ferret.

regards
Jan

Onur_T · December 13, 2005, 6:30pm

On Dec 13, 2005, at 11:28 AM, Abdur-Rahman A. wrote:

I am sorry, I’m just excited about Ruby in general. But I think, with a
language like Ruby and a project like Lucene, it’s my personal
opinion that LOC makes a difference. Things like mixins and the way
you program in Ruby make things just a bit easier. It took me
4/5 days to understand and work with Lucene (great book, b.t.w.) and
it only took me 10 days to learn most of edge Rails and many
other plugins by reading code (yes, not docs, code LOL)…

A full-text search engine and a web framework are not quite
comparable.

Lucene is optimized heavily - its code is more C-like than Java-like.
Making Lucene more OO or taking advantage of all the fancy
Ruby ways of method trickery is likely to slow things down. The
entire idea of a full-text search engine is to be fast! (oh, and to
be easy on resources as well)

Lucene is a great product, and will continue on Java (you can’t
kill Java, it’s really usable for many things). But Ruby just makes
it easy to program, and with the integration with C, well, things
are optimized. I have only been rubying for a day or 20. But it
amazes me how much a language can make a difference…

The folks that would be coding under the covers of Ferret or Lucene
are a highly specialized group of folks. Likewise with the core code
of Rails. Most users don’t need to see what is underneath - it just
works.

Indeed the language makes a difference, but also the goal of the
effort. A full-text search engine has some very specialized needs
and even the most basic data structures in high level languages like
Hash and Array are only used if they are fast enough, otherwise
alternatives are created. This is definitely the case with Lucene.

So I have to revise my statement a bit, but I think, in time,
melting Ferret and ActiveRecord together could make it a better
product than Lucene : ) But that’s future talk…

Well, in all fairness to Lucene, it is orthogonal to the database
concern entirely. Of course Ferret + ActiveRecord > just Lucene, but
to make the comparison more fair, how about Lucene + Hibernate?
There are hooks for Hibernate to index with Lucene, even using Java
annotations to mark the fields to be indexed, and how they are to be
indexed. I see ActiveRecord + Ferret to be a great path to go, and
the acts_as_ferret initial implementation is on the right track. I
hope to delve into this area more myself in the future (though my
work does not currently involve relational databases, but will soon).

Well, I am amazed to see you here : ) what is your opinion?

I’ve been a Ruby fan for ages, ever since catching a Dave T.
presentation in '02. I’ve dreamed of RubyLucene for years, creating
the rubylucene (formerly rucene) project at RubyForge once upon a
time but not doing much with it beyond some low-level I/O proof of
concept tests.

I’m ecstatic that Ferret exists! I do have some reservations on the
effort to port it all to C, as I’d really like the effort to aim
towards the architecture PyLucene has, where it uses GCJ against Java
Lucene, and then wraps it, using SWIG, into a Pythonic API. In order
to avoid porting every time Java Lucene changes (which is where the
guru creator Doug Cutting spends his effort), it would be a simple
recompilation (and perhaps some API glue).

Erik

Onur_T · December 13, 2005, 8:49pm

On Dec 13, 2005, at 1:58 PM, Jan P. wrote:

Even in this early stage the rails community owes a great deal of
compliment to the ongoing efforts on ferret.

Hear hear! Kudos to Dave for Ferret and I fully encourage him to
choose the development path he wants to go on. I hope he succeeds in
making a faster Lucene, for sure, regardless of what language he
creates it for.

Erik

Onur_T · December 13, 2005, 8:49pm

On 12/14/05, Kevin B. wrote:

I’m not sure yet what’s best. I haven’t built that part of my app yet and am
still working through the design. I’m just trying to think through the best
approach for now. Do you have pointers to docs that can provide some basic
‘rules of thumb’ for design - like when to store docs in a database and run a
search on the index -v- when to store docs in the index directly?

I don’t know if you caught the other thread on Ferret but as we were
discussing, it’s usually better to store the documents in the database
and use Ferret for finding the relevant documents. In Rails, the way
to go is probably to use something like this:

http://ferret.davebalmain.com/trac/wiki/FerretOnRails

The main reason you’d store stuff in the index is to allow result
searching. For example, if you wanted to sort your search results by
create_date then you’d need to store create_date in the index. There
are a few other times I can think of that you might want to store
documents in an index but they don’t apply to a rails app.
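
A rough sketch of that division of labour, assuming plain Ruby hashes in place of both the database and the index (nothing here is Ferret or ActiveRecord API, and the article data is made up). The index keeps only the document id plus the one field needed for ordering, `created_on`; the full records are then loaded from the "database" by id.

```ruby
require 'date'

# Stand-in for a database table (id => row). All data here is invented.
articles = {
  1 => { title: 'Intro to Ferret',  body: 'Ferret is a search library',
         created_on: Date.new(2005, 11, 2) },
  2 => { title: 'Ferret and Rails', body: 'Using Ferret from a Rails app',
         created_on: Date.new(2005, 12, 10) },
  3 => { title: 'Lucene notes',     body: 'Java Lucene internals',
         created_on: Date.new(2005, 10, 5) }
}

# Toy index: term => entries holding only the id and the stored sort field.
def build_index(rows)
  index = Hash.new { |hash, term| hash[term] = [] }
  rows.each do |id, row|
    terms = "#{row[:title]} #{row[:body]}".downcase.scan(/\w+/).uniq
    terms.each { |term| index[term] << { id: id, created_on: row[:created_on] } }
  end
  index
end

# Find matching ids in the index, order them newest first using the stored
# field, then load the full rows from the database stand-in.
def search(index, rows, term)
  hits = index.fetch(term.downcase, [])
  hits.sort_by { |hit| hit[:created_on] }.reverse.map { |hit| rows.fetch(hit[:id]) }
end

index = build_index(articles)
search(index, articles, 'ferret').each { |row| puts row[:title] }
# prints "Ferret and Rails", then "Intro to Ferret"
```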

I used Verity for search on an e-commerce site I helped build a few years ago.
We stored the actual docs in a database (product descriptions, actually) but
used verity for searching - it worked fine, but was a pain since updating the
product catalog tables and the verity search index had to be closely
coordinated or you’d find search results for products that weren’t in the
database…

You need to be careful of this with Ferret too. This is the problem
the acts_as_ferret ActiveRecord hooks are trying to solve. It still
requires a bit of work. I haven’t played with Rails for a while now
but when I get the chance I’ll try and come up with something better.
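
To make the coordination problem concrete, here is a minimal sketch in which every write to the "database" updates the index in the same step, so a search can never return a product that no longer exists. `ProductCatalog` is a hypothetical class that uses hashes for both stores; it only illustrates the idea behind save/destroy hooks and is not how acts_as_ferret is implemented.

```ruby
# Hypothetical stand-in: one Hash plays the product table, another the index.
class ProductCatalog
  def initialize
    @rows = {}                                         # id => description
    @index = Hash.new { |hash, term| hash[term] = [] } # term => ids
  end

  # Insert or update a product and re-index it in the same step.
  def save(id, description)
    unindex(id)
    @rows[id] = description
    terms(description).each { |term| @index[term] << id }
  end

  # Remove a product from both the table and the index.
  def destroy(id)
    unindex(id)
    @rows.delete(id)
  end

  # Return [id, description] pairs for products containing the term.
  def search(term)
    @index.fetch(term.downcase, []).map { |id| [id, @rows.fetch(id)] }
  end

  private

  def terms(text)
    text.downcase.scan(/\w+/).uniq
  end

  # Drop the old index entries for a product, if it exists.
  def unindex(id)
    description = @rows[id]
    return unless description
    terms(description).each { |term| @index[term].delete(id) }
  end
end

catalog = ProductCatalog.new
catalog.save(1, 'red ferret toy')
catalog.save(2, 'blue ferret cage')
catalog.destroy(1)
p catalog.search('ferret') # only product 2 remains
```

With a real database and a separate index on disk the two updates are not atomic, which is why this still "requires a bit of work" in practice.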

Also, regarding creating an index in memory -v- creating it on disk – are there
significant performance differences (eg, 20% - 50% faster or more) when using an
in-memory index? Has anyone published test results?

Thanks again for your help and your efforts. My needs aren’t pressing, I’m just
trying to figure out how using Ferret might benefit the app I’m building.

This is kind of a catch-22. If you can store your index in memory then
it is probably small enough that it won’t need to be stored in memory.
With the C version I’m working on the difference is only about 20%-30%
so not worth worrying about in my opinion.

HTH,
Dave

Onur_T · December 13, 2005, 9:22pm

Thanks - all this info is right on. Great!

Quoting David B.:

This is kind of a catch-22. If you can store your index in memory then
it is probably small enough that it won’t need to be stored in memory.
With the C version I’m working on the difference is only about 20%-30%
so not worth worrying about in my opinion.

My situation is potentially different. The data I am storing is text-based and
somewhat time sensitive. That is, the newest data is what most users will be
interested in.

However, I need to allow the ability to search for all results – both new
data and old. Once the database is large, then this “new data” may be only 1%
or less of the overall database. The new data may consist of several thousand
documents.

I’m wondering if it might be useful to store all data in a disk-based index
while also storing the newest data in an in-memory index. This would allow me
to offer faster results when searching only the new data (which is what most
people will likely use) while still allowing people to search the entire
dataset if they want to.

Of course, this is only a good idea if it provides a significantly faster
response time for searching the in-memory index.

-k

Onur_T · December 13, 2005, 9:28pm

On Dec 13, 2005, at 3:11 PM, Abdur-Rahman A. wrote:

scale search engine, it’s maybe better, I think, to have a C
implementation. On some project we used CLucene or lucene4c (I
don’t remember which, I was project leader) and it was much faster than
using Lucene. I was only mentioning making the C port as it may be
faster to implement this.

Java Lucene is powering search in some very very heavy duty places,
not to mention some top secret ones.

For example, Doug is using Nutch (an open source “Google”, with
Lucene as a core component) to revamp the infrastructure behind The
Internet Archive. Yahoo Research Labs and others have funded Doug’s
Nutch efforts. I just want to be clear about Java Lucene being as
“enterprise” savvy as anyone needs. CLucene was a valiant effort,
and supposedly is slightly speedier in some cases, but also not up to
date with the latest Java Lucene API. lucene4c hasn’t gotten off the
ground.

Java Lucene is the most up to date version available and has many
features not found in the ports that haven’t kept up. PyLucene just
released a version up to date with Java Lucene’s Subversion trunk
(mostly by just recompiling, though there were some tweaks to the
GCJ/SWIG pieces apparently as well). All the ports, Ferret included,
will always be playing catch-up with Java Lucene. If the maintainers
of the ports take a break, they will be behind.

I don’t want to discourage folks from porting Lucene at all. But I’m
guardedly optimistic about a port being as good as Java Lucene. It
truly is one of the few gems in the Java open source world with very
little quality competition.

Erik