Tokenizers?

Hi everyone. First, a quick word: I am relatively new to Ruby and Ruby
on Rails, but I love learning about it and using it. Currently I am
working on extending Boxroom (a file repository RoR app) for the CARE
Indonesia intranet, where I work as an intern. I am using ferret, and
it’s working great.

I noticed that if a file contains something like
“applications/entries”, it is indexed as one word, so a query for
“applications” yields nothing; you have to search for
“applications*”. This isn’t entirely logical, since characters like
“.” and “,” presumably are not included in words. I am quite new to
search engines and not sure about the terminology: does this have
something to do with a tokenizer? Where do I change the settings for
this? Right now my code is very simple, just a few lines for inserting
the new files and for searching them (I love the automatic markup as
well!), and I don’t want to make it complex by using lower-level
functions. Is there a way I could easily configure the “tokenizing”
behaviour (let me know if my terminology is wrong) to split, for
example, “applications/entries” into two words that are searchable by
themselves?

Thank you very much!

Stian

Hi!

On Wed, Jan 17, 2007 at 01:05:14PM +0700, Stian H. wrote:
[…]

but is there a way I could easily configure the
“tokenizing” behaviour (let me know if my terminology is wrong) to
split for example “applications/entries” into two words, searchable by
themselves?

Your terminology is correct: the tokenizer is responsible for splitting
document content into single terms.
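
To make the effect concrete, here is a minimal sketch in plain Ruby. It does not use Ferret's API: the two helper methods, whitespace_tokens and letter_tokens, are hypothetical names invented for this illustration, and Ferret's own tokenizers may well apply different rules. It only shows why the choice of splitting rule decides whether “applications” can match on its own.

```ruby
# Illustration only: plain Ruby, NOT Ferret's tokenizer classes.
# Both helpers are hypothetical and exist just for this example.

# Splits on whitespace only, so "applications/entries" stays one term.
def whitespace_tokens(text)
  text.downcase.split(/\s+/).reject(&:empty?)
end

# Treats every run of letters as a term, so "/" acts as a separator.
def letter_tokens(text)
  text.downcase.scan(/[[:alpha:]]+/)
end

text = "See applications/entries for details"

coarse = whitespace_tokens(text)
fine   = letter_tokens(text)

puts coarse.inspect                   # terms with the slash kept inside
puts fine.inspect                     # terms with the slash as a separator
puts coarse.include?("applications")  # exact-term lookup fails here
puts fine.include?("applications")    # exact-term lookup succeeds here
```

With the first rule the only indexed term is “applications/entries”, which is why a prefix query such as “applications*” is needed; with the second rule both halves become terms of their own.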

You can get an idea of how this works at
http://ferret.davebalmain.com/api/classes/Ferret/Analysis.html

If you want to use a custom tokenizer you’ll have to write your own
analyzer which then makes use of this tokenizer. Don’t be afraid, this
is really easy:

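class MyAnalyzer < Ferret::Analysis::Analyzer
  # Lets StemFilter, LowerCaseFilter and StandardTokenizer resolve
  # without the Ferret::Analysis:: prefix.
  include Ferret::Analysis

  # Builds the token stream for a field: tokenize, lowercase, then stem.
  def token_stream(field, str)
    StemFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)))
  end
end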
(from
http://ferret.davebalmain.com/api/classes/Ferret/Analysis/Analyzer.html)

Hope this gets you started.

Cheers,
Jens

