Dynamic fields and AAF

davidstokar · September 18, 2006, 5:28pm

Hi,

I have a model which has properties. These are your standard name/value
pairs, but they also have attributes that affect how I want to store them
in Ferret. I was using 0.9.5 with 0.2 of aaf, which seemed fine: I just
copied and pasted (yes, I know, ick) the to_doc method and added code to
iterate through the properties that the model had, and add relevant
fields to the document.

It seems that this will be a bit harder now with the FieldInfos. Has
anyone else done this, and is there a recognised way of doing it?

David

davidstokar · September 18, 2006, 7:58pm

On Mon, Sep 18, 2006 at 05:28:45PM +0200, David S. wrote:

Hi,

I have a model which has properties, these are your standard name/value
pairs, but also have attributes that affect how I want to store them in
ferret. I was using 0.9.5 with 0.2 of aaf, which seemed fine, I just
copied and pasted (yes, I know, ick) the to_doc method and added code to
iterate through the properties that the model had, and add relevant
fields to the document.

Instead of copy and paste you could just call super:

def to_doc
  doc = super
  # custom code here
  doc
end

It seems that this will be a bit harder now with the FieldInfos. Has
anyone else done this, and is there a recognised way of doing it?

imho adding arbitrary fields should work, you just can’t specify any
special per-field storage/indexing options, since the defaults
determined at index creation will be used.

With aaf this means
:store => :no,
:index => :tokenize

Changing the characteristics of a field for a particular document doesn’t
seem to be possible any more. Was that what you did until now, i.e.
tokenize or store a field’s value sometimes, and sometimes not?

Jens

–
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

davidstokar · September 19, 2006, 8:50am

Jens K. wrote:

Instead of copy and paste you could just call super:

def to_doc
  doc = super
  # custom code here
  doc
end

Ah, I had missed out on that, I don’t really understand how super works
in ruby. I had been trying to rename the method and create a new one
aliased to it which didn’t work. I’m still a bit confused as to_doc is
created by the mixin as an instance method, is there still a superclass
version? Anyway thanks for that tip, I’ll try it.

changing the characteristics of a field for a special document doesn’t
seem to be possible any more. Was that what you did until now, i.e.
tokenize or store a field’s value sometimes, and sometimes not ?

Yes. Some are strings (tokenize), some are integers (don’t tokenize,
ideally use a different analyser), and some are choices from lists
(either untokenized String or treat as integer index of choice). Dates
are treated as integers, and we may want to include some strings in the
DB so they can be displayed in the search results.

David

davidstokar · September 19, 2006, 11:22am

On Tue, Sep 19, 2006 at 08:50:29AM +0200, David S. wrote:

in ruby. I had been trying to rename the method and create a new one
aliased to it which didn’t work. I’m still a bit confused as to_doc is
created by the mixin as an instance method, is there still a superclass
version? Anyway thanks for that tip, I’ll try it.

ah, good point. But this should still work if you do the override after
calling acts_as_ferret.
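
For anyone else puzzled by the same thing, here is a minimal sketch of why super reaches a mixed-in method. It assumes to_doc arrives through an included module; FerretLikeMethods and Person are made-up stand-ins, not the plugin's real code, and a plain Hash plays the document.

```ruby
# Stand-in for the module that acts_as_ferret mixes into the model
# (hypothetical name; the real plugin's internals may differ).
module FerretLikeMethods
  def to_doc
    { :id => id }   # the "standard" document
  end
end

class Person
  include FerretLikeMethods   # roughly what calling acts_as_ferret does

  attr_reader :id, :properties

  def initialize(id, properties)
    @id = id
    @properties = properties
  end

  # Defined in the class itself, so Ruby finds it before the module's
  # version; super then calls the module's to_doc.
  def to_doc
    doc = super
    properties.each { |name, value| doc[name.to_sym] = value }
    doc
  end
end

doc = Person.new(1, "name" => "Bob", "matriculation" => 1978).to_doc
```

An included module sits behind the class in Person.ancestors, which is why the class's own to_doc wins and super still has something to call.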

changing the characteristics of a field for a special document doesn’t
seem to be possible any more. Was that what you did until now, i.e.
tokenize or store a field’s value sometimes, and sometimes not ?

Yes. Some are strings (tokenize), some are integers (dont tokenize,
ideally use a different analyser), and some are choices from lists
(either untokenized String or treat as integer index of choice). Dates
are treated as integers, and we may want to include some strings in the
DB so they can be displayed in the search results.

Difficult. You could declare one field per type of data (in terms of
indexed/stored) you might run into, and then decide in your to_doc
which data has to go into which field. That doesn’t sound really nice to
me, but it might work. For searching you would then always have to search
all these fields, of course.
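
A rough sketch of that routing idea, using a plain Hash as the document. The field names (:prop_text, :prop_number, :prop_choice) and the property layout are invented for illustration; each field would have to be declared up front with its own options.

```ruby
# Hypothetical catch-all fields, one per storage/indexing profile.
TYPED_FIELDS = {
  :string  => :prop_text,     # would be declared tokenized
  :integer => :prop_number,   # would be declared untokenized
  :choice  => :prop_choice    # would be declared untokenized
}

# Put each property value into the catch-all field for its type.
# Multiple values for one field are collected in an array.
def add_properties(doc, properties)
  properties.each do |prop|
    field = TYPED_FIELDS.fetch(prop[:type])
    (doc[field] ||= []) << prop[:value].to_s
  end
  doc
end

doc = add_properties({}, [
  { :name => "name",          :type => :string,  :value => "Bob" },
  { :name => "matriculation", :type => :integer, :value => 1978 }
])
```

Note that the property name is lost this way, so a query like matriculation:1978 is no longer possible. That is probably part of why it doesn't feel nice.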

Jens


davidstokar · September 19, 2006, 12:59pm

David B. wrote:

Is there any reason you need them all to be in the same field? Or am I
misunderstanding you? You do realize that different fields can have
different properties right?

Yes, I want them all in different fields, named after the property, that
way you could search for someone’s name by ‘name:Bob’ or their year of
matriculation with ‘matriculation:1978’. The problem is that on creation
of the index I do not know what properties will be associated with users
so cannot define their field infos. Previously I was able to just
specify the properties when adding that field to the document.

David

davidstokar · September 19, 2006, 3:26pm

without reading the whole thread:

You know that users have properties, right? These properties are like
key-value pairs. One user could have a property like hobby: ‘cars’,
another might have a property like place-of-birth: ‘Hamburg, Germany’.
Users might build their key-value pairs dynamically, so you don’t know
which user chooses to tell you about which property.

Couldn’t you use Ruby’s reflection (or whatever feature fits) to iterate
over the properties that a user has many of, and then put the key-value
pairs into the index?

This would mean that the field list of the index might grow to a great
number; I don’t know how this would affect Ferret. It further means that
you need to know which fields one is able to search. You would need to
build something like an extended search form with all of these fields, or
tell the user which fields he can usefully put in his queries. He should
also be told that the mere existence of a field does not mean every user
has provided this information. Maybe only one user told you about his
place-of-birth.

cheers,
Jan

davidstokar · September 19, 2006, 3:28pm

imho the described problem of a growing field list is one of the reasons
for the popularity of tags. Simply let the user choose how to tag
himself, his question, comment, whatever, and don’t care about the field.
It’s fulltext search for a reason. imho you’ve got two sides in things
like this: 1. predefine a field list that would be filled in by most
users and is therefore valuable information for your search, 2. choose
tags for the stuff where users should be able to decide freely about the
categorization.

cheers,
Jan

davidstokar · September 19, 2006, 11:24am

On 9/19/06, David S. wrote:

David

Hi David,

Is there any reason you need them all to be in the same field? Or am I
misunderstanding you? You do realize that different fields can have
different properties right?

Cheers,
Dave

davidstokar · September 19, 2006, 3:28pm

On 9/19/06, David S. wrote:

so cannot define their field infos. Previously I was able to just
specify the properties when adding that field to the document.

David

I’m assuming the matriculation field is always going to be a number.
It won’t change at a later date. So you can just set up the field
whenever you use it for the first time.

require 'rubygems'
require 'ferret'
i = Ferret::I.new
puts i.field_infos
if not i.field_infos[:matriculation]
  i.field_infos.add_field(:matriculation,
                          :index => :untokenized)
end
puts i.field_infos
i << {:matriculation => 1978}

Of course you only need to do this for fields which vary from the
norm. Whatever properties you instantiated the FieldInfos with will be
used for fields added with the FieldInfos#add_field method unless
otherwise specified. So if most of your fields are number or date
fields you’d create the FieldInfos like this:

fis = FieldInfos.new(:index => :untokenized_omit_norms,
                     :term_vector => :no)

Now when you add a text field you’ll need to explicitly set it to
tokenized and store term vectors:

if not i.field_infos[:content]
  i.field_infos.add_field(:content,
                          :term_vector => :with_positions_offsets,
                          :index => :yes)
end

Let me know if this helps or not.

Cheers,
Dave

davidstokar · September 19, 2006, 5:52pm

David B. wrote:

On 9/19/06, David S. wrote:

so cannot define their field infos. Previously I was able to just
specify the properties when adding that field to the document.

David

I’m assuming the matriculation field is always going to be a number.
It won’t change at a later date. So you can just set up the field
whenever you use it for the first time.

I’ve considered this. I use aaf, and this requires the model that
describes what fields are allowed on objects to have access to the
indexed model’s indexer, which isn’t too bad. The only problem is when
the index is created by something like rebuild_index, which needs to be
extended to create all the extra fields.

I don’t want to add the fields to fields_for_ferret, as that would mean
calling #{fieldname}_for_ferret for each possible property, rather than
taking the properties defined on that user, and adding them.

Would the fields_for_ferret solution be the correct way, somehow
populating this out of the database and then overriding the
foo_for_ferret methods to look in a cache?

This was really easy with the old API. It seems a shame that it is so
hard now.

David

davidstokar · September 19, 2006, 6:04pm

David B. wrote:

I’m assuming the matriculation field is always going to be a number.
It won’t change at a later date. So you can just set up the field
whenever you use it for the first time.
require 'rubygems'
require 'ferret'
i = Ferret::I.new
puts i.field_infos
if not i.field_infos[:matriculation]
  i.field_infos.add_field(:matriculation,
                          :index => :untokenized)
end
puts i.field_infos
i << {:matriculation => 1978}

Oh, I didn’t really read this last time.

It looks like this might be handy,

http://ferret.davebalmain.com/api/classes/Ferret/Index/Index.html only
lists the IndexReader as having the field_infos.

How much overhead would it be to write an “add_value” method that is
called, say 10 times per doc, which will lookup the field we’re going to
add in the index, and add it if it isn’t already there?

Is this what the old code did anyway?

David

davidstokar · September 20, 2006, 12:42pm

On 9/20/06, David S. wrote:

if not i.field_infos[:matriculation]
http://ferret.davebalmain.com/api/classes/Ferret/Index/Index.html only
lists the IndexReader as having the field_infos.

How much overhead would it be to write an “add_value” method that is
called, say 10 times per doc, which will lookup the field we’re going to
add in the index, and add it if it isn’t already there?

Not a lot. It’s a hash lookup so it’s fast and it should be rare
(after a while at least) that new fields are added. ie, it’s probably
not going to happen for every document.
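
A minimal sketch of such an add_value helper. It relies only on the two FieldInfos calls shown earlier in this thread ([] and add_field); FakeFieldInfos is a made-up stand-in so the sketch runs without the gem, and the helper's name and signature are just assumptions.

```ruby
# Stand-in for Ferret's FieldInfos; it only mimics [] and add_field.
class FakeFieldInfos
  def initialize
    @fields = {}
  end

  def [](name)
    @fields[name]
  end

  def add_field(name, options = {})
    @fields[name] = options
  end
end

# Hypothetical helper: register the field the first time it is seen
# (one hash lookup per call), then set the value on the document.
def add_value(field_infos, doc, name, value, options = {})
  name = name.to_sym
  field_infos.add_field(name, options) unless field_infos[name]
  doc[name] = value
  doc
end

field_infos = FakeFieldInfos.new
doc = {}
add_value(field_infos, doc, "matriculation", 1978, :index => :untokenized)
add_value(field_infos, doc, "name", "Bob")
```

With a real index, field_infos would presumably be index.field_infos as in the earlier snippet.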

Is this what the old code did anyway?

David

The old code created a completely new FieldInfos object for every
document you add to the index. It then merges the field_infos objects
when the documents are merged. In other words it was a lot more
complex. This is one of the reasons for the API change. Even after
adding the add_value method, I’d guess that the newer version of
Ferret will still index a lot faster.

Cheers,
Dave

davidstokar · September 20, 2006, 11:22am

David S. wrote:

How much overhead would it be to write an “add_value” method that is
called, say 10 times per doc, which will lookup the field we’re going to
add in the index, and add it if it isn’t already there?

Ok, I’ve done this. But it was causing problems when called from
rebuild_index, as there isn’t an index at that point, and I was calling
ferret_index on my model, which was creating a new index which couldn’t
get a write lock for my new fields.

I have solved this by giving to_doc an optional index parameter that is
passed in when rebuild is running, but if it is nil, it will call
Model.ferret_index.
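
Roughly, the workaround looks like the sketch below. FakeIndex and User are stand-ins made up for illustration; only the optional-parameter fallback is the point, and the returned Hash just makes the chosen index visible.

```ruby
# Stand-in index object; only its label matters for this sketch.
FakeIndex = Struct.new(:label)

class User
  # Stand-in for the class-level index accessor that aaf provides.
  def self.ferret_index
    @ferret_index ||= FakeIndex.new("shared")
  end

  # index is only passed in while a rebuild is running; otherwise
  # fall back to the model's shared index.
  def to_doc(index = nil)
    index ||= self.class.ferret_index
    { :indexed_with => index.label }
  end
end

normal  = User.new.to_doc
rebuild = User.new.to_doc(FakeIndex.new("rebuild"))
```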

It seems like an incorrect separation for the index to be passed in to
the to_doc method. Have you any suggestions on how to make this nicer?

David

davidstokar · September 20, 2006, 3:56pm

Jens K. wrote:

I could change the way rebuild_index works so that it uses and
initializes the Ferret index instance returned by ferret_index. So you
could access the index instance in to_doc when being called by
rebuild_index, too.

That sounds good.

The other thing I noticed was that if you want a field that is created
by rebuild_index but isn’t actually filled in by the standard to_doc,
you can specify the field along with :ignore => true, for example
{ :index => :untokenized, :ignore => true }. I want to do this as there
is a field that I want to include many times on a document, and
returning an array from foo_for_ferret didn’t add a field for each.

David, are you supposed to be able to set several values for a field in
the document?

Thanks for all your support, guys.

David

davidstokar · September 20, 2006, 10:03pm

On 9/20/06, David S. wrote:

David, are you supposed to be able to set several values for a field in
the document?

I think I know what you are asking here but I’m not sure. You can do
this in Ferret:

index << {:content => "yada yada yada",
          :tags => ["ruby", "rails", "ferret"]}

So :tags has multiple values. But you can’t do this:

doc = Ferret::Document.new
doc[:tag] = "ruby"
doc[:tag] = "rails"
doc[:tag] = "ferret"

You should do this:

doc[:tag] = ["ruby", "rails", "ferret"]

Or this:

doc[:tag] = ["ruby"]
doc[:tag] << "rails"
doc[:tag] << "ferret"

After all, Ferret::Document is just a Hash with a boost field.
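
To make that concrete, here is a small sketch with a plain Hash standing in for Ferret::Document. The append_value helper is hypothetical, just one way to collect several values under one field name.

```ruby
# A plain Hash stands in for Ferret::Document here.
doc = {}

doc[:tag] = "ruby"
doc[:tag] = "rails"          # replaces "ruby" instead of adding a value
after_overwrite = doc[:tag]

# Hypothetical helper: append a value, turning the field into an
# array the first time it is needed.
def append_value(doc, name, value)
  doc[name] = Array(doc[name]) << value
end

doc = {}
append_value(doc, :tag, "ruby")
append_value(doc, :tag, "rails")
append_value(doc, :tag, "ferret")
```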

Perhaps I have just misunderstood you completely so please let me know
if I did.

Cheers,
Dave

davidstokar · September 20, 2006, 3:38pm

Hi!

On Wed, Sep 20, 2006 at 11:22:52AM +0200, David S. wrote:

I have solved this by giving to_doc an optional index parameter that is
passed in when rebuild is running, but if it is nil, it will call
Model.ferret_index.

It seems like an incorrect separation for the index to be passed in to
the to_doc method. Have you any suggestions on how to make this nicer?

I could change the way rebuild_index works so that it uses and
initializes the Ferret index instance returned by ferret_index. So you
could access the index instance in to_doc when being called by
rebuild_index, too.

Jens


davidstokar · September 21, 2006, 10:31am

David B. wrote:

So :tags has multiple values. But you can’t do this:
doc = Ferret::Document.new
doc[:tag] = "ruby"
doc[:tag] = "rails"
doc[:tag] = "ferret"
You should do this:
doc[:tag] = ["ruby", "rails", "ferret"]

That is exactly what I mean. And it looks like that is another way I can
simplify my code with the new API. I can return an array from
foo_for_ferret and have all the individual values counted.

Previously I did basically
networks.each { |net| doc << Field.new('network', net.name) }

Thanks.

David