Forum: Ferret -- Proposal of some radical changes to API

David B. (Guest)
on 2006-06-04 07:43
(Received via mailing list)
Hey guys,

Now that the Lucy[1] project has Apache approval and is about to
begin, the onus is no longer on Ferret to strive for Lucene
compatibility. (We'll be doing that in Lucy.) So I'm starting to think
about ways to improve Ferret's API. The first part that needs to be
improved is the Document API. It's annoying having to type all the
attributes to initialize a field just to change the boost. So;

    field = Field.new(:name, "data...", Field::Store::YES,
                      Field::Index::TOKENIZED, Field::TermVector::NO,
                      false, 5.0)

would become;

    field = Field.new(:name, "data...",
                      :index => Field::Index::TOKENIZED, :boost => 5.0)

It'd also be nice to replace the Parameter objects with symbols;

    field = Field.new(:name, "data...", :index => :tokenized,
                      :boost => 5.0)

Of course, this raises the question, why do we need to specify that
field :name is tokenized every time we create a :name field? Isn't it
always going to be the same? What if we use a different value the next
time we add a :name field? Well, the answer to this last question is a
specific set of rules;

    1. Once you choose to index a field, that field is always indexed
       from that point forward.
    2. Once you store term vectors, always store term vectors.
    3. Once you store positions, always store positions.
    4. Once you store offsets, always store offsets.
    5. Once you store norms, always store norms.

So currently if you add a field like this (I'll use the newer notation
as it's easier to type);

    doc << Field.new(:field, "data...", :index => :yes,
                     :term_vector => :with_positions_offsets)

And later add a field like this;

    doc << Field.new(:field, "diff...", :index => :no,
                     :term_vector => :no)

This field will be indexed and its term vectors will be stored
regardless. This is good because if you are using TermVectors in a
particular field then you probably expect them to be there for all
instances of that field. The problem is that earlier documents will
have been added without storing term vectors. Now I don't know the
exact thinking behind these rules but it seems to me that it would be
better to just keep whatever rule you used when you first added the
document. If you want to add term vectors later, then re-index.

So here's my radical API change proposal. You set a field's properties
when you create the index and Document becomes (almost) a simple Hash
object. Actually, you may not have realized this, but you can almost
do this currently in Ferret. Once you add the first instance of a
field, that field's properties are set. From then on you can just add
documents as Hash objects and each field will have the same properties
as in that first document that was added. (This isn't true of the
Store or boost properties. These are set on a per-document basis.)

So here is an example of the way I'd implement this;

    # the following might even look better in a YAML file.
    field_props = {
        :default => {:store => :no, :index => :tokenized,
                     :term_vector => :no},
        :fields => {
            :id => {:store => :yes, :index => :no},
            :title => {:store => :yes,
                       :term_vector => :with_positions_offsets},
            [:created_on, :updated_on] => {:store => :yes,
                                           :index => :untokenized}
        }
    }
    index = Index.new(:field_properties => field_props)

    # ...
    # And if later, you want to add a new field
    index.add_field(:image, {:store => :compressed, :index => :no})

Now you would just create Hashes instead of Documents. The only
exception would be if you needed to set the boost for a particular
field or document. So you would have this;

    index << {:title => "title", :data => "data..."}
    # boost a field
    index << {:title => Field.new("important title", 50.0),
              :data => "normal data"}
    # boost a document
    index << Document.new({:title => "important doc", :data => "data"},
                          100.0)

So what do you all think? These are just ideas at the moment and it'd
be a while before I could actually implement them. And don't worry,
I'll do my best to keep backwards compatibility. Please give me your
feedback.

Cheers,
Dave

[1] - http://wiki.apache.org/jakarta-lucene/LucyProposal
Marvin H. (Guest)
on 2006-06-05 08:37
(Received via mailing list)
On Jun 3, 2006, at 8:42 PM, David B. wrote:

> Now that the Lucy[1] project has Apache approval and is about to
> begin, the onus is no longer on Ferret to strive for Lucene
> compatability. (We'll be doing that in Lucy).

We'll take this up more aggressively once some under-appreciated
volunteers at Apache create mailing lists and other infrastructure
for Lucy, but I doubt we'll want to have Lucy's API mirror Lucene's
100%.  Do we really want separate Hits and HitIterator classes, for
instance? Other, more substantial issues are on the table too as far
as I'm concerned, such as whether deletions should be handled by the
IndexReader rather than the IndexWriter.  APIs are really hard to
change once defined, and not taking hard-won lessons from Lucene into
account would be a crime.

Lucy will definitely need a define-fields-once interface, so although
you're proposing stuff specifically for Ferret here, I'm studying it
with an eye towards using it with Lucy.  My inclination is to start
with define-fields-once, then add dynamic field definitions later if
we have to.

> Of course, this raises the question, why do we need to specify that
> field :name is tokenized every time we create a :name field? Isn't it
> always going to be the same?

The primary argument I've seen for allowing dynamic field definitions
is not that the definition might change, but that each document
might contain previously undefined fields which are unknowable in
advance.  The CNET/Solr folks, Yonik and Hoss, really, really care
about that.

I think the idea of dynamic field definitions is weird. (A database
that allows you to change the table definition with each INSERT?
Huh?)  I'm sure that CNet could have been done another way if dynamic
field definitions hadn't been available, but they're committed now.  :(

>     5. Once you store norms, always store norms
It's actually messier than that, isn't it?  Just because you've
started marking a field as indexed doesn't mean that Lucene goes back
to all the documents that you've already processed and indexes that
field.  Same deal with TermVectors, etc.

At least in SQL, when you add a field to a table it goes and adds a
default value for every row.

> The problem is that earlier documents will
> have been added without storing term vectors. Now I don't know the
> exact thinking behind these rules but it seems to me that it would be
> better to just keep whatever rule you used when you first added the
> document. If you want to add term vectors later, then re-index.

'Zactly!

> So here's my radical API change proposal. You set a field's properties
> when you create the index and Document becomes (almost) a simple Hash
> object.

KinoSearch thinks of documents like hashes, too.  Lucene, however,
thinks of documents like arrays.

> Actually, you may not have realized this, but you can almost
> do this currently in Ferret. Once you add the first instance of a
> field, that field's properties are set. From then on you can just add
> documents as Hash objects and each field will have the same properties
> as in that first document that was added. (This isn't true of the
> Store or boost properties. These are set on a per-document basis.)

Why not set Store once and for all per-field?  And heck, why not
start with a default boost, but allow it to be overridden?

> So here is an example of the way I'd implement this;
>
>     # the following might even look better in a YAML file.

Ooo, nifty idea!  How about a class whose sole purpose is to define
fields and generate the YAML file?  Or, if we're thinking future
Lucene 2.1 file format, some Lucene-readable index definition file?

>     }
>     index = Index.new(:field_properties => field_props)

This is nice and dense, but maybe a tad complicated.

KinoSearch's take on doing field defs has some problems too.  It was
a mistake to make spec_field() a method of InvIndexer (KinoSearch's
index writer/modifier class).  The index writer and reader classes
suffer from serious bloat no matter what, so anything that can be
shunted somewhere else should be.

Marvin H.
Rectangular Research
http://www.rectangular.com/
David B. (Guest)
on 2006-06-05 09:47
(Received via mailing list)
On 6/5/06, Marvin H. <removed_email_address@domain.invalid> wrote:
> 100%.  Do we really want separate Hits and HitIterator classes, for
> instance? Other, more substantial issues are on the table too as far
> as I'm concerned, such as whether deletions should be handled by the
> IndexReader rather than the IndexWriter.  APIs are really hard to
> change once defined, and not taking hard-won lessons from Lucene into
> account would be a crime.

Thanks for pointing that out. I couldn't agree with you more. What I
meant was that Lucy would be striving to maintain "index file format"
compatibility (which I believe was the plan). I didn't make this very
clear, though: I was talking about changes to the API, but as I was
writing this I was thinking about what changes to the index file
format would allow.

> Lucy will definitely need a define-fields-once interface, so although
> you're proposing stuff specifically for Ferret here, I'm studying it
> with an eye towards using it with Lucy.  My inclination is to start
> with define-fields-once, then add dynamic field definitions later if
> we have to.

This sounds good to me.

> I think the idea of dynamic field definitions is weird. (A database
> that allows you to change the table definition with each INSERT?
> Huh?)  I'm sure that CNet could have been done another way if dynamic
> field definitions hadn't been available, but they're committed now.  :(

Actually, I fall into the category of people who like dynamic field
definitions. I agree that they are not necessary, but they certainly
make some things easy. For instance, in a Rails application you can
add models to an index and you get to specify within the model itself
which of its fields will be added to the index. The index itself
doesn't need to know which models will be indexed or how they will be
indexed; it just needs to know to store the id field and the model
name field and index everything else. It's all about keeping it DRY.
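As a sketch of that pattern (the Searchable module and method names
here are invented for illustration, not a real Ferret or Rails API):

    module Searchable
      def self.included(base)
        base.extend(ClassMethods)
      end

      module ClassMethods
        # Each model declares its own indexed fields.
        def searchable_fields(*fields)
          @searchable_fields = fields
        end

        def searchable_field_list
          @searchable_fields || []
        end
      end

      # Build the document hash this model would feed to the index;
      # the index itself never needs to know about the model class.
      def to_index_doc
        doc = {:id => id, :model => self.class.name}
        self.class.searchable_field_list.each do |f|
          doc[f] = send(f)
        end
        doc
      end
    end

    class Product
      include Searchable
      searchable_fields :name, :description
      attr_accessor :id, :name, :description
    end

    shoe = Product.new
    shoe.id, shoe.name = 1, "red shoe"
    shoe.to_index_doc
    # => {:id => 1, :model => "Product", :name => "red shoe",
    #     :description => nil}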

The part I don't like about Lucene is *sometimes* being able to change
a field's properties.

> Why not set Store once and for all per-field?  And heck, why not
> start with a default boost, but allow it to be overridden?

My plan exactly. In my experimental version of Ferret I have a fields
file along with the segments file. The fields file stores all the
field metadata such as store, index, term-vector and field boosts.
That way there is no need to maintain a separate FieldInfos file per
segment. (This will make merging a lot more difficult but I'm still
thinking about that one.)

> > So here is an example of the way I'd implement this;
> >
> >     # the following might even look better in a YAML file.
>
> Ooo, nifty idea!  How about a class whose sole purpose is to define
> fields and generate the YAML file?  Or, if we're thinking future
> Lucene 2.1 file format, some Lucene-readable index definition file?

Now this idea I like. Perhaps even a simple question/answer app to
generate the index definition file. I'd guess that Lucene will
probably end up going with XML rather than YAML.

Cheers,
Dave
Marvin H. (Guest)
on 2006-06-05 20:51
(Received via mailing list)
On Jun 4, 2006, at 10:46 PM, David B. wrote:

> What I
> meant was that Lucy would be striving to maintain "index file format"
> compatibility (which I believe was the plan).

It's funny that we haven't actually settled that.  I used to think
index compatibility was really important, but I don't so much any more.

Index compatibility is DOA unless Lucene adopts bytecounts as string
headers, because it would be insanity for Lucy to deal with the
current format.  So we're talking compatibility no sooner than Lucene
2.1, and adapting Lucene will be a challenge.  I think the only way
to make up the lost speed is to bring in the KinoSearch merge model.  I
strongly suspect that that will prove to be a marked improvement over
not just the patched version, but the current release.

However... It's a lot of work, and I think I'm the only obvious
candidate with both the expertise and (maybe) the desire to do it,
unless you want to take it on.  Two stages out of four are complete.
The bytecounts patch was stage 1, and last night I supplied stage 2:
a Java port of KinoSearch's external sorting module.  Stage 3 is
adapting Lucene's indexing apparatus to write indexes by the segment
rather than the document -- porting KinoSearch's SegWriter module and
eliminating DocumentWriter and SegmentMerger would be a start.  The
last stage is adapting everything to be backwards compatible with
char-counts as string headers.

I'm not sure that I want to dedicate that much of my time to Lucene,
at least not right now.  The changes outlined above are pretty
major.  It's likely that some bugs will get introduced simply because
of the volume of code change, so that's an argument against making
any change at all unless there's a real benefit.  There would be --
the KinoSearch merge model is faster -- but politically speaking,
selling the whole package to the Lucene community would be a PITA.
Not only do I have to argue that the tangible benefits justify the
disruption, I have to make the argument that it's not OK for
compatibility to begin and end with Java[1][2], plus deal with
outright hostility and abuse from extreme Java partisans[3].

I'd rather spend my time and energy contributing to Lucy.  Besides, I
think that ultimately, trying to be compatible with other ports would
be as much of a drag on Lucy as Lucene, and I think it's advisable
for both projects to declare their file formats private.  The Lucene
file format is just too complex and difficult to serve as a good
interchange medium.

The only major reason for Lucy to be file-format-compatible with
Lucene is Luke.  IMO, if we want Luke's benefits, we should be
hacking Luke.

Marvin H.
Rectangular Research
http://www.rectangular.com/

[1] http://xrl.us/m2o3 (Link to mail-archives.apache.org)
[2] http://xrl.us/m2o7 (Link to mail-archives.apache.org)
[3] http://xrl.us/m2kp (Link to mail-archives.apache.org)
Marvin H. (Guest)
on 2006-06-05 21:32
(Received via mailing list)
On Jun 4, 2006, at 10:46 PM, David B. wrote:

> In my experimental version of Ferret I have a fields
> file along with the segments file. The fields file stores all the
> field metadata such as store, index, term-vector and field boosts.
> That way there is no need to maintain a separate FieldInfos file per
> segment. (This will make merging a lot more difficult but I'm still
> thinking about that one.)

Robert Kirchgessner made a similar proposal:

http://xrl.us/m2qq (Link to mail-archives.apache.org)

Robert addresses the merging issue in a subsequent email, and I think
his arguments are compelling.

IMO, field defs should be immutable and consistent over the entire
index.

> probably end up going with XML rather than YAML.
I think it would be a binary file, using Lucene's standard
writeString, writeVInt, etc. methods.
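For reference, the VInt encoding behind writeVInt stores seven bits
per byte, low-order bits first, with the high bit set on every byte
except the last. A rough sketch in Ruby, to keep with the rest of the
thread (not Lucene's actual code):

    require 'stringio'

    def write_vint(io, n)
      while n >= 0x80
        io.putc((n & 0x7f) | 0x80)  # low seven bits, continuation bit set
        n >>= 7
      end
      io.putc(n)                    # final byte, high bit clear
    end

    buf = StringIO.new
    write_vint(buf, 300)
    buf.string.unpack('C*')  # => [172, 2]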

A question/answer app could easily be built based around a module.

How does "IndexCreator" sound?  Take away the ability of the
IndexWriter module to create or redefine indexes, and encapsulate
that functionality within one module.  Using Java as our lingua
franca...

   IndexCreator creator = new IndexCreator(filePath);
   FieldDefinition titleDef = new FieldDefinition("title",
     Field.Store.YES, Field.Index.TOKENIZED);
   FieldDefinition bodyDef = new FieldDefinition("body",
     Field.Store.YES, Field.Index.TOKENIZED,
     Field.TermVector.YES);
   creator.addFieldDefinition(titleDef);
   creator.addFieldDefinition(bodyDef);
   creator.createIndex();

Marvin H.
Rectangular Research
http://www.rectangular.com/
Lee M. (Guest)
on 2006-06-06 21:21
(Received via mailing list)
Do you mean that all fields would have to be known at index creation
time, or just that once a field is defined its properties are the same
across all documents?  Right now I'm indexing documents that create
new fields as needed based on user-defined properties, so we don't
know all the fields initially.
Marvin H. (Guest)
on 2006-06-06 22:22
(Received via mailing list)
On Jun 6, 2006, at 10:11 AM, Lee M. wrote:

> Do you mean that all fields would have to be known at index creation
> time, or just that once a field is defined its properties are the same
> across all documents?  Right now I'm indexing documents that create
> new fields as needed based on user-defined properties, so we don't
> know all the fields initially.

How would you handle this if you were using an SQL database rather
than Ferret?  Your app wouldn't be able to modify the table on the
fly in that case, unless you did something insane like run a remote
"ALTER TABLE" command.

Marvin H.
Rectangular Research
http://www.rectangular.com/
Jan P. (Guest)
on 2006-06-06 22:38
(Received via mailing list)
Hi Marvin,

this statement tempted me to jump in, even though I'm not using
anything like dynamic field creation myself __right now__. But I have
been badly in need of dynamic fields, especially on CMS-like projects.

That something isn't common in SQL doesn't mean that there is no need
for this "something". This limitation of SQL is the reason for things
like storing XML in relational DBs, as well as the reason for people
using object DBs. I don't know if you have had a look at Dabble DB,
but imagine something like that built on a relational DBMS. Not fun!
Because of this they haven't even thought about using SQL for Dabble
DB. So maybe it's just me, but the argument "you can't do this in SQL
either" doesn't sound too convincing...

Cheers,
Jan
Marvin H. (Guest)
on 2006-06-07 01:08
(Received via mailing list)
On Jun 6, 2006, at 11:37 AM, Jan P. wrote:

> using SQL for Dabble DB. So maybe it's just me, but the argument
> "you can't do this in SQL either" doesn't sound too convincing...

Jan, I don't understand the requirement, and I'm not familiar with
either Dabble DB or Rails, so neither that example nor the
"models" example Dave cited earlier has spoken to me.  I asked the
question because I honestly wanted to see a concrete example of an
application that couldn't be handled within the constraint of pre-
defined fields.

Behind the scenes in Lucene is an elaborate, expensive apparatus for
dealing with dynamic fields.  Each document gets turned into its own
miniature inverted index, complete with its own FieldInfos,
FieldsWriter, DocumentWriter, TermInfosWriter, and so on.  When these
mini-indexes get merged, field definitions have to be reconciled.
This merge stage is one of the bottlenecks which slow down
interpreted-language ports of Lucene so severely, because there's a
lot of object creation and destruction and a lot of method calls.

KinoSearch uses a fixed-field-definition model.  Before you add any
documents to an index, you have to tell the index writer about all
the possible fields you might use.  When you add the first document,
it creates the FieldInfos, FieldsWriter, etc, which persist
throughout the life of the index writer.  Instead of reconciling
field definitions each time a document gets added, the field defs are
defined as invariant for that indexing session.  This is much faster,
because there is far less object creation and destruction, and far
less disk shuffling as well -- no segment merging, therefore no
movement of stored fields, term vectors, etc.

There are several possible ways to add dynamic fields back into the
fixed-field-def model.  My main priority in doing so, if it proves to
be necessary, is to keep table-alteration logic separate from
insertion operations.  Having the two conflated introduces needless
complexity and computational expense at the back end.  It's also just
plain confusing -- if you accidentally forget to set OMIT_NORMS just
once, all of a sudden that field is going to have norms for ever and
ever amen.  I think the user ought to have absolute control over
field definitions.  Inserting a field with a conflicting definition
ought to be an error.
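A minimal sketch of that rule, in Ruby for consistency with the rest
of the thread (FieldDefinitionError and the registry API are invented
names, not KinoSearch or Ferret code):

    class FieldDefinitionError < StandardError; end

    class FieldRegistry
      def initialize
        @defs = {}
      end

      # Define a field once; re-defining with identical properties is
      # a no-op, but any mismatch raises instead of silently
      # "upgrading" the field the way Lucene's rules do.
      def define(name, props)
        if (existing = @defs[name]) && existing != props
          raise FieldDefinitionError,
                "#{name.inspect} already defined as #{existing.inspect}"
        end
        @defs[name] = props
      end
    end

    registry = FieldRegistry.new
    registry.define(:title, :index => :tokenized, :store => :yes)
    registry.define(:title, :index => :tokenized, :store => :yes) # ok
    registry.define(:title, :index => :no, :store => :yes) # raises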

Lucy is going to start with the KinoSearch merge model.  I will do a
better job of adding dynamic capabilities to it if you or someone
else can articulate some specific examples of situations where static
definitions would not suffice.  I can think of a few tasks which
would be slightly more convenient if new fields could be added on the
fly, but maybe you can go one better and illustrate why dynamic field
defs are essential.

Marvin H.
Rectangular Research
http://www.rectangular.com/
David B. (Guest)
on 2006-06-07 03:45
(Received via mailing list)
On 6/7/06, Lee M. <removed_email_address@domain.invalid> wrote:
> Do you mean that all fields would have to be known at index creation
> time, or just that once a field is defined its properties are the same
> across all documents?  Right now I'm indexing documents that create
> new fields as needed based on user-defined properties, so we don't
> know all the fields initially.

Hi Lee,

Dynamic fields will definitely remain in Ferret. But, as you said,
once a field is defined its properties are set for all documents. So
in your case, you would set the default properties for a field to
match those that you use for your user-defined fields. Otherwise you
could use Index#add_field(<field properties>) to add a field with
whatever properties you need.
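For example, under the proposed (and still hypothetical) API from the
start of this thread, that might look like:

    # Unknown user-defined property fields pick up the default.
    index = Index.new(:field_properties => {
        :default => {:store => :yes, :index => :untokenized},
        :fields => {
            :id => {:store => :yes, :index => :no}
        }
    })

    # A brand-new field just uses the default properties...
    index << {:id => "42", :color => "red"}

    # ...unless you register different properties for it first.
    index.add_field(:notes, :store => :compressed, :index => :tokenized)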

This functionality is going to exist in Ferret but not necessarily in
Lucy. Could you describe in more detail what kind of user-defined
properties you are indexing, to help convince Marvin that dynamic
fields are a good thing?

Cheers,
Dave
David B. (Guest)
on 2006-06-07 05:10
(Received via mailing list)
On 6/7/06, Marvin H. <removed_email_address@domain.invalid> wrote:
> Behind the scenes in Lucene is an elaborate, expensive apparatus for
> dealing with dynamic fields.  Each document gets turned into its own
> miniature inverted index, complete with its own FieldInfos,
> FieldsWriter, DocumentWriter, TermInfosWriter, and so on.  When these
> mini-indexes get merged, field definitions have to be reconciled.
> This merge stage is one of the bottlenecks which slow down
> interpreted-language ports of Lucene so severely, because there's a
> lot of object creation and destruction and a lot of method calls.

The way I'm dealing with this now is by having all the field
definitions in a single file. When a field is defined it gets assigned
a field number which is set for the life of the index. Hence, dynamic
fields without the expense.
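Something like this, roughly (the class and layout are a guess at the
idea, not Ferret's actual implementation):

    # One global table maps each field name to a permanent number, so
    # new fields can appear at any time without per-segment FieldInfos
    # needing to be reconciled at merge time.
    class GlobalFieldTable
      def initialize
        @numbers = {}
      end

      # Assign the next number on first sight; stable ever after.
      def number_for(name)
        @numbers[name] ||= @numbers.size
      end
    end

    table = GlobalFieldTable.new
    table.number_for(:title)  # => 0
    table.number_for(:body)   # => 1
    table.number_for(:title)  # => 0, for the life of the index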

> KinoSearch uses a fixed-field-definition model.  Before you add any
> documents to an index, you have to tell the index writer about all
> the possible fields you might use.  When you add the first document,
> it creates the FieldInfos, FieldsWriter, etc, which persist
> throughout the life of the index writer.  Instead of reconciling
> field definitions each time a document gets added, the field defs are
> defined as invariant for that indexing session.  This is much faster,
> because there is far less object creation and destruction, and far
> less disk shuffling as well -- no segment merging, therefore no
> movement of stored fields, term vectors, etc.

What happens when there are deletes? Which files should I look in to
see how this works? I really need to get my head around the KinoSearch
merge model.

> There are several possible ways to add dynamic fields back in to the
> fixed-field-def model.  My main priority in doing so, if it proves to
> be necessary, is to keep table-alteration logic separate from
> insertion operations.  Having the two conflated introduces needless
> complexity and computational expense at the back end.  It's also just
> plain confusing -- if you accidentally forget to set OMIT_NORMS just
> once, all of a sudden that field is going to have norms for ever and
> ever amen.  I think the user ought to have absolute control over
> field definitions.  Inserting a field with a conflicting definition
> ought to be an error.

I mostly agree but I don't think it is too expensive (computationally
or with regard to complexity) to dynamically add unknown fields with
default properties.

> Lucy is going to start with the KinoSearch merge model.  I will do a
> better job of adding dynamic capabilities to it if you or someone
> else can articulate some specific examples of situations where static
> definitions would not suffice.  I can think of a few tasks which
> would be slightly more convenient if new fields could be added on the
> fly, but maybe you can go one better and illustrate why dynamic field
> defs are essential.

Hopefully Lee will be able to describe his needs in a little more
detail. I must admit that in most cases dynamic fields just make
things a little easier, but you could do without them. Having said
that, I don't think Ferret would be a very Ruby-like search library if
it didn't allow dynamic fields. Ruby allows me to add methods not only
to the core classes but also to already instantiated objects. Coming
from a language that doesn't allow things like this, you'd probably
think this feature is totally unnecessary. Earlier I said I'd be using
Hashes as documents. Here is an example of how I could add lazy
loading to documents in Ferret:

    # Return a Hash whose fields are lazily loaded from the index.
    def get_doc(doc_num)
        doc = {}
        # Only this one Hash's singleton class gets the accessors and
        # the lazy []; every other Hash is unaffected.
        class <<doc
            attr_accessor :ferret_index, :ferret_doc_num
            def [](key)
                if val = super
                    return val
                else
                    # Load the field on first access and cache it.
                    return self[key] =
                        @ferret_index.get_doc_field(@ferret_doc_num, key)
                end
            end
        end
        doc.ferret_index = self
        doc.ferret_doc_num = doc_num
        return doc
    end

This example may be difficult to understand if you're coming from
Perl. Basically what it does is return an empty Hash object when
get_doc is called. Now, whenever you reference a field in that Hash
object, for example doc[:title], it lazily loads that field from the
index. All other Hash objects are unaffected. Perhaps you can do this
sort of thing in Perl also, but I suspect it's a lot more common in
Ruby. A language like this definitely deserves a search library with
dynamic fields. Not necessarily because they solve an otherwise
impossible problem but because they make other problems much easier to
solve.
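In use (assuming get_doc is defined on the index class):

    doc = index.get_doc(42)
    doc[:title]  # hits the index the first time...
    doc[:title]  # ...and comes straight from the Hash afterwards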
Lee M. (Guest)
on 2006-06-07 07:52
(Received via mailing list)
We index properties for products that vary from product to product.
For instance, a shoe could have a color field with values of red, blue
and green.  It would also have a size field with 3,4,5,6,7,8,9,10 for
values.  Another product could be a car with a transmission field with
values automatic and manual.  I index all the properties into their
own field as well as dump them into another generic field for
searching.
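As hash documents under Dave's proposal, that might look something
like this (field names illustrative):

    index << {:product => "shoe", :color => "red", :size => "8",
              :all_properties => "red 8"}
    index << {:product => "car", :transmission => "manual",
              :all_properties => "manual"}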

In the database we have a property_types table where size, color, and
transmission go.  Then there is a many-to-many table from that to the
products table that holds the actual values of those properties (e.g.
automatic, manual, red, green, 8, 9, etc.)

I hope that helps explain it.

-Lee
Marvin H. (Guest)
on 2006-06-07 08:20
(Received via mailing list)
On Jun 6, 2006, at 6:08 PM, David B. wrote:

> What happens when there are deletes? Which files should I look in to
> see how this works? I really need to get my head around the KinoSearch
> merge model.

Let's say we're indexing a book.  It has three pages.

    page 1 => "peas porridge hot"
    page 2 => "peas porridge cold"
    page 3 => "peas porridge in the pot, nine days old"

Here's what Lucene does:

First, create a mini-inverted index for each page...

    hot      => 1
    peas     => 1
    porridge => 1

    cold     => 2
    peas     => 2
    porridge => 2

    days     => 3
    in       => 3
    nine     => 3
    old      => 3
    peas     => 3
    porridge => 3
    pot      => 3
    the      => 3

Then combine the indexes...

    cold     => 2
    days     => 3
    hot      => 1
    in       => 3
    nine     => 3
    old      => 3
    peas     => 1, 2, 3
    porridge => 1, 2, 3
    pot      => 3
    the      => 3

... and here's what KinoSearch does:

First, dump everything into one giant pool...

    peas     => 1
    porridge => 1
    hot      => 1
    peas     => 2
    porridge => 2
    cold     => 2
    peas     => 3
    porridge => 3
    in       => 3
    pot      => 3
    the      => 3
    nine     => 3
    days     => 3
    old      => 3

...then sort the whole thing in one go.

Make sense?
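The contrast fits in a few lines of Ruby (a toy sketch, ignoring
positions, stored fields, and everything else):

    pool = []
    pages = {1 => "peas porridge hot",
             2 => "peas porridge cold",
             3 => "peas porridge in the pot, nine days old"}

    # KinoSearch-style: dump every (term, page) pair into one pool...
    pages.each do |num, text|
      text.scan(/\w+/).each { |term| pool << [term, num] }
    end

    # ...then sort the whole thing in one go and read off postings.
    postings = {}
    pool.sort.each { |term, num| (postings[term] ||= []) << num }
    postings["peas"]  # => [1, 2, 3]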

The big problem with the KinoSearch method is that you can't just
keep dumping stuff into an array indefinitely -- you'll run out of
memory, duh!  So what you need is an object that looks like an array
that you can keep dumping stuff into forever.  Then you "sort" that
"array".

That's where the external sort algorithm comes in.  The sortex object
is basically a PriorityQueue of unlimited size, but which never
occupies more than 20 or 30 megs of RAM because it periodically sorts
and flushes its payload to disk. It recovers that stuff from disk
later -- in sorted order -- when it's in fetching mode.
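A toy version of that idea (nothing like KinoSearch's actual sortex
code, which streams a proper k-way merge; this sketch just reloads
the sorted runs for brevity):

    require 'tempfile'

    class ExternalSorter
      def initialize(max_buffer = 1000)
        @max_buffer, @buffer, @runs = max_buffer, [], []
      end

      def add(item)
        @buffer << item
        flush if @buffer.size >= @max_buffer
      end

      # Sort the in-memory buffer and flush it to disk as one run.
      def flush
        run = Tempfile.new('sortex')
        @buffer.sort.each { |item| run.puts(item) }
        run.rewind
        @runs << run
        @buffer = []
      end

      # Fetching mode: hand everything back in sorted order.
      def each_sorted
        flush unless @buffer.empty?
        items = @runs.map { |r| r.readlines.map { |l| l.chomp } }
        items.flatten.sort.each { |item| yield item }
      end
    end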

If you want to spelunk KinoSearch to see how this happens, start with
InvIndexer::add_doc().  After some minor fiddling, it feeds
SegWriter::add_doc().  SegWriter goes through each field, having
TokenBatch invert the field's contents, feeding the inverted and
serialized but unordered postings into PostingsWriter (which is where
the external sort object lives), and writing the norms byte.  Last,
SegWriter hands the Doc object to FieldsWriter so that it can write
the stored fields.

The most important part of the previous chain is the step that never
happened: nobody ever invoked SegmentMerger by calling the equivalent
of Lucene's maybeMergeSegments().  There IS no SegmentMerger in
KinoSearch.

The rest of the process takes place when InvIndexer::finish() gets
called.  This time, InvIndexer has a lot to do.

First, InvIndexer has to decide which segments need to be merged, if
any, which it does using an algorithm based on the Fibonacci series.
If there are segments that need mergin', InvIndexer feeds each one of
them to SegWriter::add_segment().  SegWriter has DelDocs generate a
doc map which maps around deleted documents (just like Lucene).  Next
it has FieldInfos reconcile the field defs and create a field number
map, which maps field numbers from the segment that's about to get
merged away to field numbers for the current segment.  SegWriter
merges the norms itself.  Then it calls FieldsWriter::add_segment(),
which reads fields off disk (without decompressing compressed fields,
or creating document objects, or doing anything important except
mapping to new field numbers) and writes them into their new home in
the current segment.  Last, SegWriter arranges for
PostingsWriter::add_segment to dump all the postings from the old
segment into the current sort pool -- which *still* hasn't been
sorted -- mapping to new field and document numbers as it goes.

(Think of add_segment as add_doc on steroids.)

Now that all documents and all merge-worthy segments have been
processed, it's finally time to deal with the sort pool.  InvIndexer
calls SegWriter::finish(), which calls PostingsWriter::finish().
PostingsWriter::finish() does a little bit in Perl, then hands off to
a heavy-duty C routine that goes through the sort pool one posting at
a time, writing the .frq and .prx files itself, and feeding
TermInfosWriter so that it can write the .tis and .tii files.

SegWriter::finish() also invokes closing routines for the
FieldsWriter, the norms filehandles, and so on.  Last, it writes the
compound file. (For simplicity's sake, and because there isn't much
benefit to using the non-compound format under the KinoSearch merge
model, KinoSearch always uses the compound format).

Now that all the writing is complete, InvIndexer has to commit the
changes by rewriting the 'segments' file.  One interesting aspect of
the KinoSearch merge model is that no matter how many documents you
add or segments you merge, if the process gets interrupted at any
time up till that single commit, the index remains unchanged.  In
KinoSearch, InvIndexer handles deletions too (IndexReader isn't even
a public class), and deletions -- at least those deletions which
affect segments that haven't been optimized away -- are committed at
the same moment.

Deletable files are deleted if possible, the write lock is
released... TADA! We're done.
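The single-commit property boils down to something like this sketch
(paths and names illustrative; the real InvIndexer does much more
bookkeeping):

    require 'fileutils'

    # Write the new segments file under a temporary name; the final
    # rename is the only step that makes any of the session's work
    # visible. Interrupt anywhere earlier and readers still see the
    # old index, untouched.
    def commit(index_dir, new_segments_data)
      tmp = File.join(index_dir, 'segments.new')
      File.open(tmp, 'wb') { |f| f.write(new_segments_data) }
      FileUtils.mv(tmp, File.join(index_dir, 'segments'))
    end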

... and since I spent so much time writing this up, I don't have time
to respond to the other points.  Check y'all later...

Marvin H.
Rectangular Research
http://www.rectangular.com/