Full text search using tsearch2


#1

Hi,

I implemented a full text search using postgres > 8.3.
My performance is very good to index (using triggers/after_filter or
crontab time based index update) and to search for something. With
Postgres i can index only lexemes, other solutions, like sphinx based
or aaf creates the index based in length. So everything bigger than
2/3 chars is indexed. I think to use a lexeme is better to avoid
falses positives in a real world.

My questions are:

  • Any good reason to dont use tsearch and choose a aaf/*sphinx
    solution ?
  • There is any way to make tsearch index the text by length and not by
    lexeme (i know, probably this question i should ask in a postgres
    forum)

Regards,

Victor


#2

On Tue, Dec 2, 2008 at 11:43 AM, VP removed_email_address@domain.invalid wrote:

My questions are:

  • Any good reason to dont use tsearch and choose a aaf/*sphinx
    solution ?
  • There is any way to make tsearch index the text by length and not by
    lexeme (i know, probably this question i should ask in a postgres
    forum)

Victor,

As a long-time PostgreSQL advocate… I’d encourage you to look at
using Sphinx and/or UltraSphinx.

Why? Configuration is way less complicated for your development and
production environments. I like knowing that I have the advanced
features of PostgreSQL available to me… but I’ll side with working
within Ruby as much as possible on our projects. Sphinx will allow you
to evolve your indexing without needing to muck around the database,
which should ease pains with deployments as well.

If you get to a point where Sphinx is falling over from not being to
handle your load, you can evaluate using Tsearch2, but in the
meantime, I’d just consider it premature optimization without much
benefits int he short-term.

Having said that… I haven’t worked with tsearch2 in 3+ years. :wink:

Good luck!

Robby

Robby R.
Chief Evangelist, Partner

PLANET ARGON, LLC
design // development // hosting


http://www.robbyonrails.com/
aim: planetargon

+1 503 445 2457
+1 877 55 ARGON [toll free]
+1 815 642 4068 [fax]


#3

Hey Robby,

Thanks for your feedback,

Victor,

As a long-time PostgreSQL advocate… I’d encourage you to look at
using Sphinx and/or UltraSphinx.

In fact, I used ultrasphinx, and with an average load, it wasnt
uncommon to have to restart ultrasphinx daemons. Another point that,
i tried to explain in my question, is the way that the data is
indexed. In ultrasphinx, the document is broken in tokens and
everything bigger then 2 chars is indexed. So in my opinion, it
generates too much “false positives” (sorry guys, i could not find
another word)… there is any way to make *sphinx index lexeme ?

Why? Configuration is way less complicated for your development and
production environments. I like knowing that I have the advanced
features of PostgreSQL available to me… but I’ll side with working
within Ruby as much as possible on our projects. Sphinx will allow you
to evolve your indexing without needing to muck around the database,
which should ease pains with deployments as well.

I think when i remove an extra daemon, i’m drying my system
architecture, right ?
I mean, if you dont really spend some hours reading (ultra|think)
sphinx code, it will be always a black box in your system that you
dont understand. So am I loosing something ? maybe the learn courve to
tsearch and all postgres caveats are bigger then really learn how
*sphinx based solutions works ?

Best regards,

Victor


#4

Whow, so you read Postgres code before using it? :wink:
Same goes for Ruby, Rails, and 99% of what tools we use.

lol… no, i mean at least the contrib/tsearch folder i read! ok the
http://www.postgresql.org/docs/8.3/static/textsearch.html as well, and
too be honest its better documented then the sphinx plugins :slight_smile:

Thanks for the tip about ultrasphinx, but i would like to know if its
possible to do opposite, to make it work with lexemes. Like for
example if you look for run, you will find run, runs, ran and running.
All of these words are forms of the same lexeme: RUN

I dont wanna be troll or blame the plugins, i heard a lot of good
things about thinksphinx, not so good things about ultrasphinx and
almost nothing about tsearch, that for me is working fine, except that
some developers here comes with the questions: “if i type “te” it dont
bring me my “test” post and ultrasphinx does”. But well in real world
users wont search for “te” and if they search for te, they will find
everything like te,teahupo, tea, teta, tendency, etc, etc, etc.

Regards,

Victor


#5

Thanks for the tip about ultrasphinx, but i would like to know if its
possible to do opposite, to make it work with lexemes. Like for
example if you look for run, you will find run, runs, ran and running.
All of these words are forms of the same lexeme: RUN

hmm, I don’t know about “ran”
but if you enable star search, then ru* wil at least find
run, runs, and running.
This may be possible too, but I can’t say for sure, since I didn’t
need this kind of search
Maybe have a look at the docs:
http://www.sphinxsearch.com/docs/current.html#searching

I dont wanna be troll or blame the plugins, i heard a lot of good
things about thinksphinx, not so good things about ultrasphinx and
almost nothing about tsearch, that for me is working fine, except that
some developers here comes with the questions: “if i type “te” it dont
bring me my “test” post and ultrasphinx does”. But well in real world
users wont search for “te” and if they search for te, they will find
everything like te,teahupo, tea, teta, tendency, etc, etc, etc.

lo, I agree. It was hard enough to read through lots of tutorials
and documents. We started with ultrasphinx at some time back and
it worked good enough for our needs. For new projects I would
think about changing that to thinksphinx. Especially if I expect
the index to be combined from several tables.
But for basic search both should do well enough.


#6

On 4 Dec 2008, at 13:39, VP wrote:

possible to do opposite, to make it work with lexemes. Like for

sphinx can do stemming. as for searching for prefixes, that’s optional.

Fred


#7

In ultrasphinx, the document is broken in tokens and
everything bigger then 2 chars is indexed. So in my opinion, it
generates too much “false positives” (sorry guys, i could not find
another word)… there is any way to make *sphinx index lexeme ?

Sure, you can configure min_infix_len, min_word_len
and whatever in your sphinx configuration.

I mean, if you dont really spend some hours reading (ultra|think)
sphinx code, it will be always a black box in your system that you
dont understand.

Whow, so you read Postgres code before using it? :wink:
Same goes for Ruby, Rails, and 99% of what tools we use.


#8

thanks for the tips. As documentation thinksphinx has nice railscast
about. But geocoding searches for example are not documented until now
(or at least, until last week hehe)

Regards,

Victor

On Dec 4, 3:02 pm, Frederick C. removed_email_address@domain.invalid