Forum: Ferret Tokenizers?

Stian H. (Guest)
on 2007-01-17 08:07
(Received via mailing list)
Hi everyone. First a quick word - I am relatively new to Ruby and
Ruby on Rails, but I love learning about it and using it. Currently
I am working on extending Boxroom (a file repository RoR app) for
the CARE Indonesia intranet, where I work as an intern. I am using
Ferret, and it's working great.

I noticed that if a file contains something like
"applications/entries", this is parsed as one word, so a query for
"applications" will not yield anything; you have to search for
"applications*"... This isn't entirely logical, since punctuation
like "." and "," presumably isn't included in terms anyway. I am
quite new to search engines and not sure about the terminology -
does this have something to do with a tokenizer? Where do I change
the settings for this? Right now my code is very simple, just a few
lines for inserting the new files and for searching them (I love
the automatic markup as well!), and I don't want to make it complex
by using lower-level functions. But is there a way I could easily
configure the "tokenizing" behaviour (let me know if my terminology
is wrong) to split, for example, "applications/entries" into two
words, each searchable on its own?
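
In case it helps, here is a minimal script (an untested sketch using
an in-memory index with Ferret's defaults; the :content field name
is just for illustration) that reproduces what I am seeing:

require 'ferret'

# In-memory index using Ferret's default analyzer.
index = Ferret::Index::Index.new
index << { :content => "see applications/entries for details" }

# "applications/entries" is indexed as a single term, so this
# query finds nothing:
index.search_each('content:applications') { |id, score| puts "hit #{id}" }

# ...while the wildcard (prefix) query matches:
index.search_each('content:applications*') { |id, score| puts "hit #{id}" }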

Thank you very much!

Stian
Jens K. (Guest)
on 2007-01-17 11:41
(Received via mailing list)
Hi!

On Wed, Jan 17, 2007 at 01:05:14PM +0700, Stian H. wrote:
[..]
> but is there a way I could easily configure the
> "tokenizing" behaviour (let me know if my terminology is wrong) to
> split for example "applications/entries" into two words, searchable by
> themselves?

Your terminology is correct: the tokenizer is responsible for
splitting document content into single terms.
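
If you want to see what a tokenizer produces, you can feed it a
string directly and walk the resulting token stream. A quick sketch
(how exactly StandardTokenizer treats "applications/entries" may
depend on your Ferret version):

require 'ferret'
include Ferret::Analysis

# Walk the token stream the way an analyzer would during indexing.
stream = StandardTokenizer.new("search applications/entries here")
while token = stream.next
  puts token.text
end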

You can get an idea of how this works at
http://ferret.davebalmain.com/api/classes/Ferret/A...

If you want to use a custom tokenizer you'll have to write your own
analyzer which then makes use of this tokenizer. Don't be afraid, this
is really easy:

class MyAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis

  # Build the token stream for a field: tokenize, then lowercase,
  # then stem each token.
  def token_stream(field, str)
    StemFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)))
  end
end
(from
http://ferret.davebalmain.com/api/classes/Ferret/A...)
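
For your concrete example you may not even need a custom class -
one of the built-in analyzers might already do what you want. Here
is a sketch (untested, the :content field name is just for
illustration) using LetterAnalyzer, which starts a new token at
every non-letter character:

require 'ferret'

# LetterAnalyzer breaks tokens at every non-letter character, so
# "applications/entries" is indexed as the two terms "applications"
# and "entries" (lowercased by default).
index = Ferret::Index::Index.new(
  :analyzer => Ferret::Analysis::LetterAnalyzer.new
)
index << { :content => "see applications/entries for details" }

index.search_each('content:applications') { |id, score| puts "hit #{id}" }

Note that LetterAnalyzer also breaks at digits and drops them, so a
term like "mp3" becomes "mp"; if you need finer control over the
token pattern, have a look at RegExpAnalyzer.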

Hope this gets you started.

Cheers,
Jens


--
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer 
removed_email_address@domain.invalid
Schnorrstraße 76                         Tel +49 351 46766  0
D-01069 Dresden                          Fax +49 351 46766 66