In Search of Search!

vanquisher · May 30, 2007, 12:59pm

I am implementing search on my site and was wondering what the best
way to go about it would be.
We want a full-text search and an advanced search.

We will have huge amounts of data to search, spread across multiple
tables.
I went through the plugins acts_as_ferret and acts_as_solr,
but Ferret seems to have a locking problem at high load and Solr needs
a Java server.
So what do you recommend? How do we go about this?
What will be the overheads of Solr? Extra costs/complexity?

Details of my database: My site is about surveys on colleges, and
hence will have surveys_table, comments_table, testimonial_table, etc.
I want to give an option for the users to search these surveys!

Please suggest!
I know this topic has been discussed many times, but I couldn’t find
the answer I wanted.

Thanks in advance!

vanquisher · May 30, 2007, 3:34pm

I’m just theorizing here, but these are my 2 cents.

Because Ferret is stored in file system-based tables, you’ll always
have locking problems at high load. But it’s still a much more
elegant solution than anything else I can think of. My question is
how high of a load until you start to get real problems with this? I
know that in data warehousing they’ve gotten around some of these
bottlenecks by caching searches. They take the top 10% of searches
and store these on a separate system and the rest go through the main
system. Would a similar approach be fruitful? Say the search cache
is updated every hour or so and it contains 10% of the searches, maybe
50-60% of the search load. All this is stored in the RDBMS, so you
don’t have locking issues on those.
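
To make the caching idea a bit more concrete, here is a rough sketch in plain Ruby. Everything in it is made up for illustration (the PopularSearchCache name, the backend lambda, the 10% cutoff), and it keeps the cache in memory only to stay short; the approach described above would store the counts and results in the RDBMS instead.

```ruby
# Hypothetical sketch: serve the most popular queries from a cache that is
# rebuilt periodically, and send everything else to the real search backend.
class PopularSearchCache
  def initialize(backend, top_fraction: 0.1)
    @backend = backend            # any object responding to call(query)
    @top_fraction = top_fraction  # share of distinct queries to keep cached
    @counts = Hash.new(0)         # how often each query was seen since the last refresh
    @cache = {}                   # normalized query => cached results
  end

  # Normalize so that " Colleges " and "colleges" count as the same search.
  def normalize(query)
    query.strip.downcase.squeeze(' ')
  end

  # Cached queries are answered directly; everything else hits the backend.
  def search(query)
    key = normalize(query)
    @counts[key] += 1
    @cache.fetch(key) { @backend.call(key) }
  end

  # Rebuild the cache from the most frequent queries. This would be run
  # from cron or a background job, for example once an hour.
  def refresh!
    keep = [(@counts.size * @top_fraction).ceil, 1].max
    top = @counts.sort_by { |_, n| -n }.first(keep).map(&:first)
    @cache = top.to_h { |q| [q, @backend.call(q)] }
    @counts.clear
  end

  def cached?(query)
    @cache.key?(normalize(query))
  end
end
```

The backend can be anything callable, so the same wrapper would sit in front of Ferret, Solr, or a plain SQL query. Cached results are as stale as the last refresh, which is the trade-off for taking that load off the index.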

Leaving that paradigm, I wonder if there could be anything else non-db
specific. I don’t know. To get around the locking issue, you’d
probably be best off keeping things in a database. So, you could use something
like MATCH in MySQL. If you were going to do this, however, maybe
you’d want to stem your searches:

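sudo gem install stemmer

in the controller:
require 'stemmer'  # adds String#stem

# Stem each word and append * (the prefix wildcard in MySQL's
# boolean full-text mode), then drop duplicate terms.
params[:q] = params[:q].to_s.split.map { |word|
  word.downcase.stem + '*'
}.uniq.join(' ')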

I don’t know though. Taking this kind of approach leaves you open to
all the gotchas that you’ll have to build from scratch. You need to
start getting tricky with finding phrases, etc.
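
As one example of those gotchas, here is a hedged sketch of phrase handling for MySQL's boolean full-text mode, where a double-quoted string matches as a phrase and a trailing * is a prefix wildcard. The helper names, the toy stemmer, and the title/body columns are all assumptions for illustration; with the stemmer gem loaded you could pass ->(w) { w.stem } instead.

```ruby
# Toy stand-in for a real stemmer so the sketch runs without extra gems.
NAIVE_STEM = ->(word) { word.sub(/(ing|ed|s)\z/, '') }

# Hypothetical helper: build a BOOLEAN MODE search string that keeps
# "quoted phrases" intact and stems + wildcards the remaining words.
def boolean_query(input, stem: NAIVE_STEM)
  # Pull out the text inside each pair of double quotes.
  phrases = input.scan(/"([^"]+)"/).flatten
  # Whatever is left over is treated as individual words.
  rest = input.gsub(/"[^"]*"/, ' ')
  words = rest.downcase.scan(/[a-z0-9]+/)
              .map { |w| stem.call(w) }
              .reject(&:empty?)
              .uniq
              .map { |w| "#{w}*" }
  (phrases.map { |p| %("#{p.strip}") } + words).join(' ')
end

# The search string is bound as a parameter, never interpolated into SQL.
# The column names here are assumptions.
def search_conditions(input)
  ['MATCH(title, body) AGAINST (? IN BOOLEAN MODE)', boolean_query(input)]
end
```

Keeping phrases unstemmed is a design choice: a phrase search usually means the user wants those exact words, while loose words benefit from the stem-plus-wildcard trick.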

Sorry if I’m not leading you anywhere with these musings. Good luck

vanquisher · May 30, 2007, 3:52pm

On May 30, 3:58 am, vanquisher wrote:

What will be the overheads of solr? extra costs/complexity?

Details of my database: My site is about surveys on colleges, and
hence will have surveys_table, comments_table, testimonial_table, etc.
I want to give an option for the users to search these surveys!

Please suggest!
I know this topic has been discussed many a times…but I couldnt find
the answer I wanted.

Thanks in advance!

I don’t have specific answers, but you can read about the DRb server and
A-A-F here. It’s a great list; the authors of ferret and acts_as are
quite active:

http://rubyforge.org/pipermail/ferret-talk/2007-May/

Full archive:
http://rubyforge.org/pipermail/ferret-talk/

vanquisher · May 30, 2007, 5:10pm

Thanks for your insights!
My guess is that it will take some time to hit high load. Also, I am
not sure what ‘high load’ means.
I mean how much is high load for ferret?

I will consider these suggestions and get back to you in case I need
more answers!

Also, can anyone give me insights on solr?

vanquisher · May 30, 2007, 5:28pm

vishwas

I mean how much is high load for ferret?

FYI, the Ferret DRb server is used at Technorati for one of their
projects.
See their comments:

FWIW, I’m running this in production with about 5 updates/sec
and 20-30 searches/second without problems.
src: [AAF] remote indexing via DRb with acts_as_ferret - Ferret - Ruby-Forum

Note: the DRb server is part of AAF (the acts_as_ferret plugin), and
it’s a no-brainer to use.
It
1/ lets you launch a Ferret server (like a DB server, for example), and
2/ redirects all the AAF calls (e.g. User.find_by_content("John")) to
the server.
See:
http://projects.jkraemer.net/acts_as_ferret/wiki/DrbServer
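
For anyone wondering why a separate server helps with locking, here is a minimal sketch of the general DRb pattern using Ruby's standard drb library. It is not the acts_as_ferret server itself: TinyIndex is a made-up in-memory stand-in for a real index, and the point is only that one process owns the index while every app process talks to it remotely.

```ruby
require 'drb/drb'

# Hypothetical stand-in for a search index. Only the server process owns it,
# so there is a single writer and no cross-process file locking.
class TinyIndex
  def initialize
    @docs = {}
    @lock = Mutex.new   # DRb handles requests in separate threads
  end

  def add(id, text)
    @lock.synchronize { @docs[id] = text.downcase }
    true
  end

  # Return the ids of all documents containing the term (case-insensitive).
  def search(term)
    @lock.synchronize do
      @docs.select { |_, text| text.include?(term.downcase) }.keys
    end
  end
end

# Server side (normally its own long-running process).
# Port 0 asks the OS for any free port; a real server would use a fixed one.
DRb.start_service('druby://127.0.0.1:0', TinyIndex.new)
SERVER_URI = DRb.uri

# Client side (each app process would do this with the server's URI).
index = DRbObject.new_with_uri(SERVER_URI)
index.add(1, 'Survey of engineering colleges')
index.add(2, 'Testimonials from students')
found = index.search('survey')
```

Server and client share one process here only so the sketch is runnable as a single file; in a real deployment they would be separate processes, with the app servers all pointing at the same URI.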

For more specific questions, there is a dedicated Ferret mailing list :

http://www.ruby-forum.com/forum/5

Alain R.

http://blog.ravet.com

vanquisher · May 30, 2007, 10:13pm

On May 30, 2007, at 3:58 AM, vanquisher wrote:

So what do you recommend? how do we go about this?
Thanks in advance!

I’ve had too many problems with ferret in production to recommend
using it. But I have had great luck with sphinx and acts_as_sphinx.
sphinx is really fast and builds indexes in seconds not hours. And I
have not had any locking issues with it.

Cheers-
– Ezra Z.
– Lead Rails Evangelist
– Engine Y., Serious Rails Hosting
– (866) 518-YARD (9273)

vanquisher · May 31, 2007, 12:41am

Ezra

I’ve had too many problems with ferret in production to recommend
using it.

Was it with recent versions of Ferret/AAF, and the Drb server?

Alain

vanquisher · May 31, 2007, 1:40am

On May 30, 2007, at 3:41 PM, Alain R. wrote:

Ezra

I’ve had too many problems with ferret in production to recommend
using it.

Was it with recent versions of Ferret/AAF, and the Drb server?

Alain

Alain-

Unfortunately yes. The DRb server does help though. But I’ve had
tons of segfaults and index corruptions with the latest ferret. It
works great in development or with small amounts of data, but once
apps start to get a serious amount of data, the indexes get
corrupted randomly and cause segfaults. It’s happened in close to 10
apps.

I really like ferret’s integration with rails via acts_as_ferret
though, and would use it again if the segfaults and index corruption
were fixed.

Cheers-
– Ezra Z.
– Lead Rails Evangelist
– Engine Y., Serious Rails Hosting
– (866) 518-YARD (9273)

vanquisher · May 31, 2007, 7:02am

Ok, thanks a ton!
So, I was thinking I would start with Ferret and see how it goes
before shifting to Solr.

But how easy is it to transition between these? What about the
indexing and other data specific to each search engine?

Also, one of the Rails experts suggests
HyperEstraier [acts_as_searchable], as it seems to be built for speed
and scalability.
So is this recommended?

vanquisher · May 31, 2007, 3:04am

If the requirement to run Java isn’t insurmountable, then Solr is a
really great solution. The way you posed your data set, it sounds
like scalability is going to be a major issue. Which means you
probably are going to need to put search on its own server anyway.
So going the Solr route isn’t so bad.

Also, I was blown away with how easy Solr is to get up and running.
It really is out of the box. We as a community have heard the line
“Java is heavy” so much, that we forget how good Java can be in
certain instances. Java to build webapps is a heavy stack, but Java to
run Solr is very easy. Just fire up Jetty and you are done. I
wouldn’t just discount Solr because of requiring Java. We are using
it with a Rails front end called Solr Flare:
Flare - Solr - Apache Software Foundation.
And we did another project where we wanted to search files uploaded
into a Joomla based CMS. We tweaked Solr to parse .pdf, .xls, .doc,
and .ppt files for search, and it was all very easy. And the
interface between Joomla and Solr was very simple to do.

I used acts_as_ferret for search on a community content site, and it
works okay. It didn’t blow me away, but it was super easy to set up
and use. If you are just trying to get something in place, then it
might be the baby step you may need.

Eric P.