[VOTE] Should stop-words be filtered by default?

david · April 6, 2007, 9:46am

Hey folks,

A lot of confusion has been caused by having stop-words filtered by
the default analyzer. There have been a few suggestions to remove this
feature so I thought I’d put it to a vote. Making this change would
not be backwards compatible and would require users to either rebuild
their indexes or change their code to keep the same stop-words
settings. However, it would save a lot of confusion for people
starting out with Ferret.

So what do people think? Should stop-words be filtered by default?

david · April 6, 2007, 10:50am

On Apr 6, 2007, at 09:45, David B. wrote:

So what do people think? Should stop-words be filtered by default?

yes… but i guess the problem is that most people don’t know about
analyzers and therefore will not see the relation between stop-words
and the standard analyzer.

afaic, the default behavior should be with stop-word-filtering, because
searching is about full-text-search in the first place. as soon as you
want to do special things like searching for names, you should
understand ferret’s fields and analyzers, because if you don’t, any
result will be coincidental anyway.

Ben

david · April 6, 2007, 12:15pm

+1

Mostly it’s the typical ‘it cannot be so hard, there is a plugin for it’
problem - if you do text searches, you should know the basics (and
stop-words). This problem even has the first place in the gotchas list.
I hate to answer with ‘RTFM’ but this is one of the cases where it
applies. We are talking about removing a sane default just because no
one gets it.

Greetings
Florian

david · April 6, 2007, 3:04pm

Hello David,

2007/4/6, David B.:

So what do people think? Should stop-words be filtered by default?

I too would prefer that stop words be filtered by default. If you
don’t know what you’re doing, then it’s just normal that you would be
bitten by your work.

Bye !

François Beausoleil
http://blog.teksol.info/
http://piston.rubyforge.org/

david · April 6, 2007, 3:34pm

On Apr 6, 2007, at 12:45 AM, David B. wrote:

Should stop-words be filtered by default?

-1

Queries return more relevant documents on average if you don’t filter
stopwords. The default setting should be the one that produces the
best search results. Adding a stop filter should be part of
performance tuning.

Marvin H.
Rectangular Research
http://www.rectangular.com/

david · April 6, 2007, 3:39pm

I concur with Marvin on this point. It is very often confusing even
for me when using Lucene and the StandardAnalyzer with stop words
removed by default. The list is English and thus biased and often
wrong anyway. Less magic by default

Erik

david · April 6, 2007, 4:25pm

The list is English and thus biased and often
wrong anyway. Less magic by default

+1

Jonathan

david · April 6, 2007, 7:14pm

Excerpts from David B.'s message of Fri Apr 06 00:45:42 -0700 2007:

So what do people think? Should stop-words be filtered by default?

I also vote to turn them off by default. Their usefulness to retrieval
performance is limited to specific and uncommon situations, whereas
their ability to confuse people is not.

david · April 6, 2007, 6:51pm

Erik H. wrote:

I concur with Marvin on this point. It is very often confusing even
for me when using Lucene and the StandardAnalyzer with stop words
removed by default. The list is English and thus biased and often
wrong anyway. Less magic by default

Stop-words are a form of optimisation. Premature optimisation is evil.
Therefore applying stopwords by default (that is, before one knows
anything about the performance and space constraints of the specific
application context) is evil.

Having them available, should one need to reduce the index size, is
extremely useful, but my vote goes to switching them off by default.
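
To make the index-size point concrete, here is a toy sketch in Python. It is not Ferret's implementation: the stop-word list, the `build_index` and `posting_count` helpers, and the sample documents are all made up for illustration. It counts the (term, document) pairs stored in a tiny inverted index with and without a stop filter.

```python
from collections import defaultdict

# Hypothetical, tiny English stop-word list, for illustration only.
STOP_WORDS = frozenset({"a", "and", "be", "is", "not", "of", "or", "the", "to"})

def build_index(docs, stop_words=frozenset()):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            if term not in stop_words:
                index[term].add(doc_id)
    return index

def posting_count(index):
    """Total number of (term, document) pairs stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())

docs = [
    "to be or not to be",
    "the index of the search engine",
    "ferret is a search library",
]

full = build_index(docs)                  # no stop filter
filtered = build_index(docs, STOP_WORDS)  # stop filter applied

print(posting_count(full), posting_count(filtered))
```

With these three made-up documents the filtered index stores far fewer postings, but the first document disappears from it entirely because every one of its words is on the list. The ratio here is exaggerated by the tiny sample and says nothing about savings on a real corpus.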

david · April 8, 2007, 1:52pm

On Apr 6, 2007, at 6:57 PM, William M. wrote:

Excerpts from David B.'s message of Fri Apr 06 00:45:42 -0700 2007:

So what do people think? Should stop-words be filtered by default?

I also vote to turn them off by default. Their usefulness to retrieval
performance is limited to specific and uncommon situations, whereas
their ability to confuse people is not.

Very well put.

I, too, vote for off.
(gee, wasn’t I the one who started this? =)

david · April 6, 2007, 7:55pm

I also believe the default behaviour should be filtering the stop words.
It’s not like it’s hard to change it.

Tiago M.

Ferret-talk mailing list
http://rubyforge.org/mailman/listinfo/ferret-talk

david · April 8, 2007, 6:28pm

Florian G. wrote:

Every fulltext search I use
has a stopword-list by default. MySQL FULLTEXT for example even needs to
be recompiled if you want to change them.

This is a massive, massive drawback. For web-apps on shared hosts, in
the past I’ve had to resort to appending characters to each word to
evade stop-word and minimum-length filtering, precisely because of this
inane default, and you can imagine what that does to performance.

I also want to argue that the
use of stopwords is very common.

That doesn’t make it correct. I see enough queries on this list alone
from people surprised by the stop-word behaviour, or needing to change
it because they need to support a language other than English, to
believe that they should be dropped by default.

For example, if I have an index of
1,000 English documents and search for ‘and’, chances are high that I
get a result set of 1,000 hits - which is unusable.

So what? The inverse isn’t usable either - if ‘and’ is a stop-word, and
you only search for ‘and’, you’ll get no results at all.

Stopwords are more of a result than a performance optimization.

That’s just not the case - stop-words exist primarily to reduce the
index size. Their effect on the result set is a product of the way you
construct a stop-word list - by picking the words which impart the
smallest amounts of information.

I cannot find it at the moment, but there was the point that ‘premature’
optimization is bad. This may be wise for your own application, but the
libraries in use should be a) mature and b) optimized.

I believe that point was mine. However, I was not referring to
performance - traditionally stop-words have been used as a storage space
reduction strategy, with typical results being a reduction in index size
of between 20 and 30 percent. There may well be a correlated
performance bump, but that’s tangential.

I’m not arguing that stop-words should not be available if you want
them. I’m not even arguing against supplying a decent set of stop-words
for as many different languages as possible. I am trying to argue that
they should not be turned on by default.
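
The two cases argued over above (a search for ‘and’ that hits nearly everything, versus one that hits nothing) can be sketched with a toy search in Python. This is only an illustration, not Ferret's API: the `analyze` and `search` helpers, the three-word stop list, and the sample documents are hypothetical, and the sketch assumes that a query which analyzes to no terms matches no documents.

```python
# Hypothetical stop-word list; real analyzers ship much longer ones.
STOP_WORDS = frozenset({"and", "of", "the"})

def analyze(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, and drop any stop words."""
    return [term for term in text.lower().split() if term not in stop_words]

def search(docs, query, stop_words=frozenset()):
    """Return ids of documents that contain every analyzed query term."""
    terms = analyze(query, stop_words)
    if not terms:
        # Assumption: a query that analyzes to nothing matches nothing.
        return []
    hits = []
    for doc_id, text in enumerate(docs):
        doc_terms = set(analyze(text, stop_words))
        if all(term in doc_terms for term in terms):
            hits.append(doc_id)
    return hits

docs = ["salt and pepper", "rock and roll", "the art of search"]

unfiltered_hits = search(docs, "and")                    # stop filter off
filtered_hits = search(docs, "and", STOP_WORDS)          # stop filter on
mixed_hits = search(docs, "rock and roll", STOP_WORDS)   # 'and' is dropped

print(unfiltered_hits, filtered_hits, mixed_hits)
```

Without the filter, the query matches every document containing ‘and’. With it, the same query is silently emptied, which is the surprise newcomers keep reporting. A query that mixes stop words with other terms still works in both modes.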

david · April 8, 2007, 10:22pm

I vote to take it out as a default. I will surely continue to use it
but only in certain places and it would be nicer to switch it on rather
than have to remember to switch it off.

david · April 8, 2007, 3:08pm

William M. wrote:

Excerpts from David B.'s message of Fri Apr 06 00:45:42 -0700 2007:

So what do people think? Should stop-words be filtered by default?

I also vote to turn them off by default. Their usefulness to retrieval
performance is limited to specific and uncommon situations, whereas
their ability to confuse people is not.

Do you have any proof for this assumption? Every fulltext search I use
has a stopword-list by default. MySQL FULLTEXT for example even needs to
be recompiled if you want to change them. I also want to argue that the
use of stopwords is very common. For example, if I have an index of
1,000 English documents and search for ‘and’, chances are high that I
get a result set of 1,000 hits - which is unusable. I am unable to see
the corner-case in this scenario. We are not talking about performance
here - we are talking about sane results. Stopwords are more of a result
than a performance optimization.
If you want to query phrases, it would be wise to use Ferret’s
phrase-query instead of killing the stopwords.

I cannot find it at the moment, but there was the point that ‘premature’
optimization is bad. This may be wise for your own application, but the
libraries in use should be a) mature and b) optimized.

Greetings
Florian

david · April 9, 2007, 6:42am

On 4/9/07, Alex Y. wrote:

I’m not arguing that stop-words should not be available if you want
them. I’m not even arguing against supplying a decent set of stop-words
for as many different languages as possible. I am trying to argue that
they should not be turned on by default.

Well, it looks like the people who want stop-word filtering off by
default have the slight edge (at 7 to 5 by my count). I will probably
change this default one day; however, I don’t think it is important
enough to change now as it would force a lot of Ferret users to rebuild
their indexes. I’ll wait until there is a more important update
already forcing users to rebuild their indexes. Perhaps Ferret 2.0?

Thanks for your input guys.
Dave

david · April 9, 2007, 3:39am

David B. wrote:

Hey folks,

A lot of confusion has been caused by having stop-words filtered by
the default analyzer. There have been a few suggestions to remove this
feature so I thought I’d put it to a vote. Making this change would
not be backwards compatible and would require users to either rebuild
their indexes or change their code to keep the same stop-words
settings. However, it would save a lot of confusion for people
starting out with Ferret.

So what do people think? Should stop-words be filtered by default?

I suggest that we keep stop-word filtering as the default.
Users who try this plugin in their app will want to experiment with it,
and they will learn about the feature through self-study or the FAQ
here :). If you turn the filtering off by default, some of them may not
realize it exists and will implement it themselves to clean up queries.

david · April 9, 2007, 5:27pm

On Apr 8, 2007, at 6:24 PM, Alex Y. wrote:

I’m not arguing that stop-words should not be available if you want
them. I’m not even arguing against supplying a decent set of stop-words
for as many different languages as possible. I am trying to argue that
they should not be turned on by default.

Full f****** ack!

– Andy