Hey guys,
Now that the Lucy[1] project has Apache approval and is about to
begin, the onus is no longer on Ferret to strive for Lucene
compatibility (we'll be doing that in Lucy). So I'm starting to think
about ways to improve Ferret's API. The first part that needs
improving is the Document API. It's annoying having to type out all
the attributes to initialize a field just to change the boost. So:
field = Field.new(:name, "data...", Field::Store::YES,
                  Field::Index::TOKENIZED, Field::TermVector::NO, false, 5.0)
would become:
field = Field.new(:name, "data...", :index => Field::Index::TOKENIZED,
                  :boost => 5.0)
It'd also be nice to replace the Parameter objects with symbols:
field = Field.new(:name, "data...", :index => :tokenized, :boost => 5.0)
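One way to keep both styles working during a transition would be a
lookup table that maps the option symbols onto the existing Parameter
constants. A minimal sketch, with stand-in constants replacing
Ferret's real Field::Index parameters (the module and method names
here are my invention, just to illustrate the idea):

```ruby
# Stand-in constants for Ferret's Field::Index Parameter objects; the
# real ones live inside Ferret, this only illustrates the lookup.
module Index
  NO          = :param_no
  TOKENIZED   = :param_tokenized
  UNTOKENIZED = :param_untokenized
end

SYMBOL_TO_INDEX = {
  :no          => Index::NO,
  :tokenized   => Index::TOKENIZED,
  :untokenized => Index::UNTOKENIZED
}

# accept either a bare symbol or an already-resolved Parameter constant
def resolve_index_option(opt)
  SYMBOL_TO_INDEX.fetch(opt, opt)
end
```

That way old code passing Field::Index::TOKENIZED and new code
passing :tokenized would both resolve to the same thing.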
Of course, this raises the question: why do we need to specify that
field :name is tokenized every time we create a :name field? Isn't it
always going to be the same? And what if we use a different value the
next time we add a :name field? The answer to this last question is a
specific set of rules:
1. Once you choose to index a field, that field is always indexed
from that point forward.
2. Once you store term vectors, always store term vectors
3. Once you store positions, always store positions
4. Once you store offsets, always store offsets
5. Once you store norms, always store norms
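As a toy model (not Ferret's actual implementation), the sticky
behaviour of these rules could be expressed like this, where a
property once switched on for a field can never be switched back off:

```ruby
# properties that, once enabled for a field, stay enabled (rules 1-5)
STICKY_PROPS = [:index, :term_vector, :positions, :offsets, :norms]

# merge the properties of a newly added field instance with the
# properties that field has already accumulated in the index
def merge_field_props(existing, incoming)
  merged = incoming.dup
  STICKY_PROPS.each do |prop|
    # rule: once a field has the property, it keeps it from then on
    merged[prop] = true if existing[prop]
  end
  merged
end
```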
So currently if you add a field like this (I'll use the newer notation
as it's easier to type):
doc << Field.new(:field, "data...", :index => :yes,
                 :term_vector => :with_positions_offsets)
and later add a field like this:
doc << Field.new(:field, "diff...", :index => :no, :term_vector => :no)
This second field will be indexed and its term vectors will be stored
regardless. This is good because if you are using TermVectors in a
particular field then you probably expect them to be there for all
instances of that field. The problem is that earlier documents will
have been added without storing term vectors. Now, I don't know the
exact thinking behind these rules, but it seems to me that it would be
better to just keep whatever settings you used when you first added
the field. If you want to add term vectors later, then re-index.
So here's my radical API change proposal: you set a field's properties
when you create the index, and Document becomes (almost) a simple Hash
object. Actually, you may not have realized this, but you can almost
do this in Ferret already. Once you add the first instance of a
field, that field's properties are set. From then on you can just add
documents as Hash objects and each field will have the same properties
as in the first document that was added. (This isn't true of the
Store or boost properties; these are set on a per-document basis.)
So here is one possible way I'd implement this:
# the following might even look better in a YAML file
field_props = {
  :default => {:store => :no, :index => :tokenized, :term_vector => :no},
  :fields => {
    :id    => {:store => :yes, :index => :no},
    :title => {:store => :yes, :term_vector => :with_positions_offsets},
    [:created_on, :updated_on] => {:store => :yes, :index => :untokenized}
  }
}
index = Index.new(:field_properties => field_props)
# ...
# And if later, you want to add a new field
index.add_field(:image, {:store => :compressed, :index => :no})
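On the YAML idea, here's a hypothetical sketch of what loading those
properties from YAML could look like. The file layout and keys are my
invention, not an existing Ferret format, and note the quoting: bare
yes/no are booleans in YAML, so the option names have to be quoted
strings:

```ruby
require 'yaml'

# hypothetical field-properties file, inlined here for illustration
yaml = <<-YAML
default:
  store: "no"
  index: "tokenized"
  term_vector: "no"
fields:
  id:    {store: "yes", index: "no"}
  title: {store: "yes", term_vector: "with_positions_offsets"}
YAML

props = YAML.load(yaml)
```

The loaded string values would still need converting to symbols before
being handed to the index, but that's a one-liner.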
Now you would just create Hashes instead of Documents. The only
exception would be if you needed to set the boost for a particular
field or document. So you would have this:
index << {:title => "title", :data => "data..."}
# boost a field
index << {:title => Field.new("important title", 50.0),
          :data => "normal data"}
# boost a document
index << Document.new({:title => "important doc", :data => "data"}, 100.0)
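A toy sketch (not real Ferret code) of how that << dispatch could
work: a plain Hash value gets boost 1.0, a Field value carries a field
boost, and a Document wrapper carries a document boost. I'm assuming
here, Lucene-style, that the document boost simply multiplies into
each field's boost at index time; the struct names are stand-ins for
the Field and Document wrappers above:

```ruby
# stand-ins for the Field and Document wrappers in the examples above
FieldValue = Struct.new(:data, :boost)
DocValue   = Struct.new(:fields, :boost)

# flatten whatever was shifted onto the index into {name => data, boost}
def normalize(doc)
  fields, doc_boost =
    doc.is_a?(DocValue) ? [doc.fields, doc.boost] : [doc, 1.0]
  fields.each_with_object({}) do |(name, value), out|
    field_boost = value.is_a?(FieldValue) ? value.boost : 1.0
    data        = value.is_a?(FieldValue) ? value.data  : value
    # document boost multiplies into every field's boost
    out[name] = {:data => data, :boost => field_boost * doc_boost}
  end
end
```

So the common case stays a bare Hash, and boosts only appear when you
actually need them.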
So what do you all think? These are just ideas at the moment and it’d
be a while before I could actually implement them. And don’t worry,
I’ll do my best to keep backwards compatibility. Please give me your
feedback.
Cheers,
Dave