Uninitialized constant UNTOKENIZED

casper_the_ghost · August 29, 2006, 9:39pm

I’m getting “uninitialized constant UNTOKENIZED” when I try to do
something like the following:

    class Url < ActiveRecord::Base
      acts_as_ferret :fields => {'name' => {},
                                 'description' => {},
                                 'url' => {:index => Ferret::Document::Field::Index::UNTOKENIZED},
                                 'url_type' => {}}
    end

I am running Ferret 0.10.1 and the “bleeding edge” version of
acts_as_ferret, which I got from
svn://projects.jkraemer.net/acts_as_ferret/trunk/plugin/acts_as_ferret.

Any ideas?

casper_the_ghost · August 29, 2006, 9:52pm

Caleb wrote:

I am running Ferret 0.10.1 and the “bleeding edge” version of
acts_as_ferret, which I got from
svn://projects.jkraemer.net/acts_as_ferret/trunk/plugin/acts_as_ferret.

I don’t know much about the acts_as_ferret plugin, but the constant
you’re talking about was part of the Ferret 0.9 source tree and has been
gone since Ferret 0.10.

Ben

casper_the_ghost · August 29, 2006, 10:02pm

Caleb wrote:

Thanks. Then, I guess my question is, “How do I set something to not be
tokenized?” To further explain, I would like to be able to type in
only a part of a URL (e.g. microsoft) and get back any URL that contains
that query (e.g. www.microsoft.com, www.foo.com/microsoft,
microsoft.foobar.com, etc.).

Again, I can only speak about Ferret, not acts_as_ferret. You now pass a
hash describing all fields, a so-called fieldinfo hash, to the index.
See
http://ferret.davebalmain.com/api/classes/Ferret/Index/FieldInfos.html
for more information. I guess this will be fixed within the next few
days, and you will get a message from Jens as soon as he gets back to
his computer.

Ben

casper_the_ghost · August 30, 2006, 12:10am

Hi!

On Tue, Aug 29, 2006 at 10:01:11PM +0200, Benjamin K. wrote:

http://ferret.davebalmain.com/api/classes/Ferret/Index/FieldInfos.html
for more information. I guess this will be fixed within the next few
days, and you will get a message from Jens as soon as he gets back to
his computer.

right

Caleb, not tokenizing the url field won’t help you much with your
problem. An untokenized field’s content is indexed ‘as is’, so indexing
‘www.gnu.org’ will leave you with that exact term in the index, and a
search for ‘gnu’ won’t find it. Even searching for ‘gnu*’ won’t find
it, since the term starts with ‘www.’, and a wildcard at the beginning
of the query term (like ‘*gnu*’) is not allowed due to the way the index
works.
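
To make the point above concrete, here is a minimal sketch in plain Ruby. It is a toy model only: the "index" is just an array of whole-URL terms, and the helper names exact_match and prefix_match are made up for illustration, not Ferret API.

```ruby
# Toy model only: an untokenized field stores each URL as one single term.
UNTOKENIZED_TERMS = ['www.gnu.org', 'www.foo.com/microsoft'].freeze

# Exact term lookup, roughly what a plain term query does.
def exact_match(terms, query)
  terms.select { |term| term == query }
end

# Prefix lookup, roughly what a trailing-wildcard query such as 'gnu*' does.
def prefix_match(terms, prefix)
  terms.select { |term| term.start_with?(prefix) }
end

exact_hits  = exact_match(UNTOKENIZED_TERMS, 'gnu')       # no term equals 'gnu'
prefix_hits = prefix_match(UNTOKENIZED_TERMS, 'gnu')      # no term starts with 'gnu'
www_hits    = prefix_match(UNTOKENIZED_TERMS, 'www.gnu')  # only this prefix matches
```

Only a query that starts with the stored term's own beginning ('www.gnu') finds anything, which is why the untokenized field does not help with partial-URL searches.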

It would be better to use a custom tokenizer that splits the contents of
this field at ‘.’ and ‘/’ (and maybe strips out the ‘www’, as that will
be shared by most URLs and won’t help much when it comes to searching),
so that ‘gnu org’ would be indexed. A search for ‘gnu’ will then find
what you want.
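
The splitting rule described above can be sketched in plain Ruby. This is not a Ferret tokenizer class, just the string handling such a tokenizer would perform; the names url_tokens and COMMON_URL_TERMS are hypothetical.

```ruby
# Hypothetical helper, not part of Ferret or acts_as_ferret.
COMMON_URL_TERMS = ['www'].freeze

def url_tokens(url)
  # Split on '.' and '/', drop empty pieces, then remove very common terms.
  url.downcase.split(/[\/.]/).reject(&:empty?) - COMMON_URL_TERMS
end

gnu_tokens = url_tokens('www.gnu.org') # => ["gnu", "org"]

# The three URLs from the original question all contain the token 'microsoft'.
urls    = ['www.microsoft.com', 'www.foo.com/microsoft', 'microsoft.foobar.com']
matches = urls.select { |url| url_tokens(url).include?('microsoft') }
```

With the URL broken into separate terms, a plain search for 'microsoft' matches all three URLs without needing a leading wildcard.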

If you aren’t keen on implementing your own tokenizer, define a method
that pre-processes the URL and splits it as described, and index the
return value of this method. So you’d use:

    class Url < ActiveRecord::Base
      acts_as_ferret :fields => {:name => {},
                                 :description => {},
                                 :url_parts => { :index => :untokenized },
                                 :url_type => {}}

      def url_parts
        # split url and remove common terms
        self.url.split(/[\/.]/) - [ 'www', 'html' ]
      end
    end

Concerning the new 0.10 FieldInfo properties: you can use all the
properties and values described at
http://ferret.davebalmain.com/api/classes/Ferret/Index/FieldInfo.html
in your call to acts_as_ferret; they will be passed straight through to
Ferret upon index creation.
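
Applied to the declaration from the first post, this presumably means replacing the 0.9 constant with the symbol value used in the example earlier in this thread. The sketch below keeps the options in a plain hash so it runs without Rails or Ferret; the constant name URL_FIELDS is made up for illustration.

```ruby
# Field options for acts_as_ferret, written in the Ferret 0.10 style.
# :untokenized is the value used earlier in this thread; fields with an
# empty hash keep whatever defaults the plugin applies.
URL_FIELDS = {
  :name        => {},
  :description => {},
  :url         => { :index => :untokenized },
  :url_type    => {}
}.freeze

# In the model this would presumably be used as:
#   acts_as_ferret :fields => URL_FIELDS
```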

Jens

--
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

casper_the_ghost · August 30, 2006, 1:26am

Jens K. wrote:

so you’d use

    class Url < ActiveRecord::Base
      acts_as_ferret :fields => {:name => {},
                                 :description => {},
                                 :url_parts => { :index => :untokenized },
                                 :url_type => {}}

      def url_parts
        # split url and remove common terms
        self.url.split(/[\/.]/) - [ 'www', 'html' ]
      end
    end

Thanks! That gives me the information I need. One question remains,
however. I definitely want to do custom tokenizing for URLs. However,
when I use the code quoted above, the url_parts method is never executed
(i.e. if I put a breakpoint there, the code never hits it). How can I
get Ferret to reference ‘url_parts’ and call that method?

casper_the_ghost · August 29, 2006, 9:55pm

Benjamin K. wrote:

Caleb wrote:

I don’t know much about the acts_as_ferret plugin, but the constant
you’re talking about was part of the Ferret 0.9 source tree and has been
gone since Ferret 0.10.

Ben

Thanks. Then, I guess my question is, “How do I set something to not be
tokenized?” To further explain, I would like to be able to type in
only a part of a URL (e.g. microsoft) and get back any URL that contains
that query (e.g. www.microsoft.com, www.foo.com/microsoft,
microsoft.foobar.com, etc.).

casper_the_ghost · August 30, 2006, 8:56am

On Wed, Aug 30, 2006 at 01:26:17AM +0200, Caleb wrote:

        self.url.split(/[\/.]/) - [ 'www', 'html' ]
      end
    end

Thanks! That gives me the information I need. One question remains,
however. I definitely want to do custom tokenizing for URLs. However,
when I use the code quoted above, the url_parts method is never executed
(i.e. if I put a breakpoint there, the code never hits it). How can I
get Ferret to reference ‘url_parts’ and call that method?

It should get called whenever acts_as_ferret indexes a record, since it
is referenced in the :fields hash. What does acts_as_ferret log when you
create a new Url record?

Jens


casper_the_ghost · September 6, 2006, 7:10pm

Jens K. wrote:

It should get called whenever acts_as_ferret indexes a record, since it
is referenced in the :fields hash. What does acts_as_ferret log when you
create a new Url record?

Sorry for the delayed response. I find it annoying when threads are
started that could be helpful to others and the author doesn’t take the
time to indicate what ultimately solved the problem. So, I won’t do that
here.

You’re right, url_parts IS being called when a Url is CREATED. I was
thinking that it would be called upon SEARCHING. I guess that wouldn’t
make sense unless you wanted to re-index everything on every search (not
a good idea).

So, the url_parts method works as expected.
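
For anyone hitting the same confusion, the difference between indexing time and search time can be sketched with a toy inverted index in plain Ruby. This is an illustration only, not how acts_as_ferret is implemented; the class ToyUrlIndex and its counter are made up.

```ruby
# Toy model: terms are computed once when a record is added, never on search.
class ToyUrlIndex
  attr_reader :tokenize_calls

  def initialize
    @postings = Hash.new { |hash, term| hash[term] = [] }
    @tokenize_calls = 0
  end

  # Runs once per record, at creation (indexing) time.
  def add(url)
    url_parts(url).each { |term| @postings[term] << url }
  end

  # Searching only looks up stored terms; it never re-splits any URL.
  def search(term)
    @postings.fetch(term, [])
  end

  private

  # Same splitting rule as the url_parts method earlier in this thread.
  def url_parts(url)
    @tokenize_calls += 1
    url.split(/[\/.]/) - ['www', 'html']
  end
end

index = ToyUrlIndex.new
index.add('www.gnu.org')
index.add('www.foo.com/microsoft')
calls_after_indexing = index.tokenize_calls

hits = index.search('gnu')
calls_after_search = index.tokenize_calls
```

The call counter does not change between indexing and searching, which matches the observation that a breakpoint in url_parts is only hit when a record is created.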