William M. wrote:
Excerpts from David B.'s message of Fri Apr 06 00:45:42 -0700 2007:
So what do people think? Should stop-words be filtered by default?
I also vote to turn them off by default. Their usefulness to retrieval
performance is limited to specific and uncommon situations, whereas
their ability to confuse people is not.
Do you have any proof for this assumption? Every fulltext search I use
has a stopword list by default. MySQL FULLTEXT, for example, even needs
to be recompiled if you want to change them. I would also argue that the
use of stopwords is very common. For example, if I have an index of
1,000 English documents and search for ‘and’, chances are high that I
get a result set of 1,000 hits - which is unusable. I fail to see how
this scenario is a corner case. We are not talking about performance
here - we are talking about sane results. Stopwords are more of a
result-quality optimization than a performance optimization.
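
To make the ‘and’ example concrete, here is a minimal sketch in plain
Python. It is not Ferret's API: the toy inverted index, the sample
documents and the STOP_WORDS set are all made up for illustration.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"and", "the", "a", "of"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs, stop_words=frozenset()):
    """Build a toy inverted index: term -> set of document ids."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            if term in stop_words:
                continue  # stop words never enter the index
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, term):
    """Return the sorted ids of all documents containing the term."""
    return sorted(index.get(term.lower(), set()))

docs = [
    "Ruby and Rails",
    "Search and retrieval",
    "Cats and dogs",
]

unfiltered = build_index(docs)
filtered = build_index(docs, STOP_WORDS)

print(search(unfiltered, "and"))  # every document matches
print(search(filtered, "and"))    # nothing matches
print(search(filtered, "ruby"))   # ordinary terms are unaffected
```

Without filtering, a query for ‘and’ returns the whole collection; with
filtering it returns nothing, while ordinary terms behave the same in
both indexes.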
If you want to query phrases, it would be wise to use Ferret's
phrase query instead of killing the stopwords.
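
As a rough sketch of why phrase matching and stopword filtering can
coexist, assume the index keeps token positions and a filtered stopword
still consumes a position. This is plain Python under that assumption,
not a description of Ferret's internals; all names are hypothetical.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"and", "the", "a", "of", "or"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_positional_index(docs, stop_words):
    """term -> {doc_id -> set of token positions}.

    Stop words are not indexed, but they still consume a position,
    so the gaps they leave behind are preserved.
    """
    index = {}
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(tokenize(text)):
            if term in stop_words:
                continue
            index.setdefault(term, {}).setdefault(doc_id, set()).add(pos)
    return index

def phrase_search(index, phrase, stop_words):
    """Return ids of documents where the non-stop terms of the phrase
    occur at the same relative offsets as in the phrase."""
    terms = [(i, t) for i, t in enumerate(tokenize(phrase))
             if t not in stop_words]
    if not terms:
        return []
    first_offset, first_term = terms[0]
    hits = []
    for doc_id, positions in index.get(first_term, {}).items():
        for start in positions:
            base = start - first_offset
            if all(base + off in index.get(t, {}).get(doc_id, ())
                   for off, t in terms):
                hits.append(doc_id)
                break
    return sorted(hits)

phrase_docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "cats or dogs",
]
phrase_index = build_positional_index(phrase_docs, STOP_WORDS)

print(phrase_search(phrase_index, "cats and dogs", STOP_WORDS))
```

In this sketch the slot left by a stopword matches any filtered word, so
"cats or dogs" also matches the phrase "cats and dogs". Whether that
trade-off is acceptable depends on the application.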
I cannot find it at the moment, but someone made the point that
‘premature’ optimization is bad. That may be wise for your own
application, but the libraries you use should be a) mature and
b) optimized.