Problem with stop words

I am seeing trouble with searches for ‘you’ not returning anything. It
appears that ‘you’ is a stop word to the standard analyzer:

require 'rubygems'
require 'ferret'

index = Ferret::I.new(:or_default => false)
index << 'you'
puts index.search('you')

returns no hits.

I assumed from the docs that StandardAnalyzer was using stop words
as defined by:

Ferret::Analysis::ENGLISH_STOP_WORDS

but when I print that to the console I get:

["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
"into", "is", "it", "no", "not", "of", "on", "or", "s", "such", "t", "that",
"the", "their", "then", "there", "these", "they", "this", "to", "was",
"will", "with"]

I don’t see ‘you’ in there.

Supplying my own stop words seems to fix the problem:

STOP_WORDS = ["a", "the", "and", "or"]
index = Ferret::I.new(:or_default => false, :analyzer =>
  Ferret::Analysis::StandardAnalyzer.new(STOP_WORDS))

index << 'you'
puts index.search('you')

this returns a hit.

I am running the latest Windows build, but I’ve seen the same behavior
on Linux with the latest builds. I am happy with my solution, but it
seems odd that ‘you’ should be a standard stop word.

On 24.10.2006, at 23:28, Scott Persinger wrote:

I am seeing trouble with searches for ‘you’ not returning anything. It
appears that ‘you’ is a stop word to the standard analyzer:

I assumed from the docs that StandardAnalyzer was using stop words
as defined by:

Ferret::Analysis::ENGLISH_STOP_WORDS

I don’t see ‘you’ in there.

StandardAnalyzer actually uses
Ferret::Analysis::FULL_ENGLISH_STOP_WORDS by default. (Note the ‘FULL_’)
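
To make the effect concrete, here is a rough plain-Ruby sketch of what a stop-word filter does to a document (and to a query) before anything reaches the index. It is only an illustration, not Ferret's actual code: the analyze method and both constants are made up for this example, SHORT_LIST is the list printed earlier in the thread, and LONGER_LIST is a hypothetical stand-in for a longer list that, like FULL_ENGLISH_STOP_WORDS, contains ‘you’.

```ruby
# Toy stand-in for an analyzer with a stop filter. This is not Ferret's code.
# SHORT_LIST is the ENGLISH_STOP_WORDS output quoted earlier in the thread.
SHORT_LIST = %w[a an and are as at be but by for if in into is it no not of on
                or s such t that the their then there these they this to was
                will with].freeze

# Assumption: a hypothetical longer list that also contains pronouns such as
# "you". The real FULL_ENGLISH_STOP_WORDS list is much longer than this.
LONGER_LIST = (SHORT_LIST + %w[i me my we our you your he she]).freeze

# Lowercase the text, split it into word tokens, and drop every stop word.
def analyze(text, stop_words)
  text.downcase.scan(/[a-z0-9']+/).reject { |token| stop_words.include?(token) }
end

p analyze('you', SHORT_LIST)   # "you" survives, so it can be indexed and found
p analyze('you', LONGER_LIST)  # no tokens survive, so there is nothing to match
```

With the longer list, both the indexed text and the query are reduced to nothing, which would explain a search for ‘you’ coming back empty.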

Supplying my own stop words seems to fix the problem:

Standard stop words are just a one-size-fits-all reasonable default.
For maximum control you should always supply your own list of stop
words.

I am running the latest Windows build, but I’ve seen the same behavior
on Linux with the latest builds. I am happy with my solution, but it
seems odd that ‘you’ should be a standard stop word.

Depends on how you look at it. ‘You’ is definitely not the least
adequate candidate for a stop word. Then again, it’s not included in
Ferret::Analysis::ENGLISH_STOP_WORDS.

Cheers,
Andy

On 10/24/06, Andreas K. wrote:

I don’t see ‘you’ in there.

StandardAnalyzer actually uses
Ferret::Analysis::FULL_ENGLISH_STOP_WORDS by default. (Note the ‘FULL_’)

My apologies. This was fixed in the documentation a while ago. I
just haven’t updated the docs on the Ferret homepage for a while.

Thanks Andy. Actually the reason for the two English stop-word lists
is that they come from two different sources. ENGLISH_STOP_WORDS is
the list taken from Lucene. FULL_ENGLISH_STOP_WORDS is taken from
Martin Porter’s website[1]. I hope that clears things up a little. You
are quite right in saying you should probably use your own list of
stop words for best results.

Cheers,
Dave

[1] http://snowball.tartarus.org/