Aligning Ferret's IndexSearcher.search API with Lucene's


#1

Recently I’ve been revisiting some of my search code. With a greater
understanding of how Java Lucene implements its search methods, I
realized that one level of abstraction is not present in the Ferret
classes/methods. Here are the relevant method signatures:

Ferret’s search methods:

in Ferret::Index::Index:
search(query, options = {}) -> returns a TopDocs
search_each(query, options = {}) {|doc, score| …} -> yields to
context w/ doc and score for each hit

in Ferret::Search::IndexSearcher:
search(query, options = {}) -> returns a TopDocs
search_each(query, filter = nil) {|doc, score| …} -> yields to
context w/ doc and score for each hit

Lucene’s search methods:

in the interface Searchable:
public void search(Query query, Filter filter, HitCollector results)
public TopDocs search(Query query, Filter filter, int n)
public TopFieldDocs search(Query query, Filter filter, int n, Sort sort)

in org.apache.lucene.search.Searcher (which implements Searchable):
public final Hits search(Query query)
public Hits search(Query query, Filter filter)
public Hits search(Query query, Sort sort)
public Hits search(Query query, Filter filter, Sort sort)

I was wondering if there were plans to implement the Hits class in
Ferret. (Or if someone were to write a patch implementing them, would
David integrate it into the source?) It seems like it is a useful
abstraction since TopDocs does not allow you to access its hits by
index, only by the .each() method call.

Some questions:

  • Will changing these methods break people’s existing code?
  • Where is the proper place to put these methods? Move the methods
    that return TopDocs to a module, which is more or less the same as a
    Java interface, and implement the methods that return Hits directly in
    the class? What is a good way to do this that feels Rubyish and takes
    advantage of its strengths and idioms?
  • The options to limit the search (first_doc and num_doc) in
    Search::IndexSearcher and the code that implements them should
    probably be moved out of Search::IndexSearcher into Index::Index
  • Are there lower level issues I am not aware of that would make any
    of this a bad idea?

Am I missing something here? Are there reasons not to have Ferret’s
implementation of these methods and classes follow Java Lucene’s as
closely as possible? I’d appreciate hearing your thoughts.

-F


#2

On 1/3/06, Finn S. wrote:

I was wondering if there were plans to implement the Hits class in
Ferret. (Or if someone were to write a patch implementing them, would
David integrate it into the source?)

I’d be happy to integrate it if someone sends me a patch. Having said
that…

It seems like it is a useful
abstraction since TopDocs does not allow you to access its hits by
index, only by the .each() method call.

Actually you can access the hits by index like this:

hit_three = topdocs.score_docs[2]

The reason I didn’t bother implementing the Hits class is that I can’t
see that it adds anything useful. Really it all just seems a matter of
notation: what is easiest for people to understand and remember.
Adding the Hits class might just make everything a little more
complicated. Please refer to Martin F.'s discussion of the Humane
Interface:

http://www.martinfowler.com/bliki/HumaneInterface.html

While Java likes to have multiple different implementations of simple
interfaces and a separate class for each data structure, in Ruby you
can use an array for many different jobs: stack, list, queue, etc. I
feel it would be better to do the same thing with TopDocs. Rather than
adding the Hits class I feel it would be better to add the desired
functionality to TopDocs. I’m happy to listen to other points of view.
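
As a rough sketch of what "adding the desired functionality to TopDocs"
could look like: the TopDocs and ScoreDoc below are stand-ins written for
illustration, not Ferret's actual classes. Only score_docs comes from the
example above; the [] and each methods are assumptions about what one
might add.

```ruby
# Stand-ins for illustration only; not Ferret's real classes.
ScoreDoc = Struct.new(:doc, :score)

class TopDocs
  include Enumerable

  attr_reader :total_hits, :score_docs

  def initialize(total_hits, score_docs)
    @total_hits = total_hits
    @score_docs = score_docs
  end

  # Index access, so callers can write top_docs[2]
  def [](index)
    @score_docs[index]
  end

  # Number of hits actually held (may be fewer than total_hits)
  def size
    @score_docs.size
  end

  # Yields doc id and score for each hit, in the style of search_each
  def each
    return enum_for(:each) unless block_given?
    @score_docs.each { |sd| yield sd.doc, sd.score }
    self
  end
end

top_docs = TopDocs.new(40, [ScoreDoc.new(7, 0.9),
                            ScoreDoc.new(3, 0.5),
                            ScoreDoc.new(12, 0.2)])
hit_three = top_docs[2]
doc_ids   = top_docs.map { |doc, _score| doc }
```

Including Enumerable means map, select and friends come for free once
each is defined, which is the "one class, many jobs" idea.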

Some questions:

  • Will changing these methods break people’s existing code?

Perhaps. Depends what we change. Ferret is still beta though so I
think it’s open to non-backwards compatible changes if necessary,
although we should avoid this if possible.

  • Where is the proper place to put these methods? Move the methods
    that return TopDocs to a module, which is more or less the same as a
    Java interface, and implement the methods that return Hits directly in
    the class? What is a good way to do this that feels Rubyish and takes
    advantage of its strengths and idioms?

I think I answered this already. I’d like to keep TopDocs as a class
and add the desired functionality to it.

  • The options to limit the search (first_doc and num_doc) in
    Search::IndexSearcher and the code that implements them should
    probably be moved out of Search::IndexSearcher into Index::Index

I think this needs to stay in IndexSearcher as it limits the amount of
memory used by a search. Even the Java version allows you to specify
nDocs.
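
To illustrate the memory point, here is a minimal sketch of a collector
that never holds more than first_doc + num_docs hits, however many
documents match. BoundedCollector and its methods are hypothetical names
made up for this example; it is not how Ferret implements the limit.

```ruby
# Hypothetical helper, not part of Ferret.
ScoreDoc = Struct.new(:doc, :score)

class BoundedCollector
  attr_reader :total_hits

  def initialize(first_doc, num_docs)
    @first_doc  = first_doc
    @limit      = first_doc + num_docs
    @best       = []   # sorted by descending score, never longer than @limit
    @total_hits = 0
  end

  # Called once per matching document during the search
  def collect(doc, score)
    @total_hits += 1
    if @best.size >= @limit
      # Full already: skip anything that cannot beat the current worst hit
      return if @best.empty? || score <= @best.last.score
    end
    pos = @best.index { |sd| sd.score < score } || @best.size
    @best.insert(pos, ScoreDoc.new(doc, score))
    @best.pop if @best.size > @limit
  end

  # The requested window of hits, skipping the first first_doc results
  def page
    @best.drop(@first_doc)
  end
end

collector = BoundedCollector.new(2, 3)
100.times { |doc| collector.collect(doc, doc / 100.0) }
page = collector.page
```

If the limit were applied only in Index::Index, the searcher would have
to hand back every matching hit first, which is the cost being avoided.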

Hope this helps. Feedback is welcome.

Cheers,
Dave


#3

On Jan 11, 2006, at 8:43 PM, David B. wrote:

Actually you can access the hits by index like this;

hit_three = topdocs.score_docs[2]

The reason I didn’t bother implementing the hits class is that I can’t
see that it adds anything useful. Really it all just seems a matter of
notation.

It’s more than just notation. Hits performs some caching of Document
objects as well as providing a means to iterate through the hits
without having to manually re-search as it does it under the covers.
Sure, it’s perhaps a mere convenience, but a handy abstraction
nonetheless.
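
A minimal Ruby sketch of those two behaviours, caching loaded documents
and re-searching when the caller walks past what has been fetched. It is
an illustration of the idea only, not Lucene's or Ferret's code:
ToySearcher, search_top, the initial batch of 2 and the doubling are all
assumptions.

```ruby
# Illustrative stand-ins only.
ScoreDoc = Struct.new(:doc, :score)
TopDocs  = Struct.new(:total_hits, :score_docs)

# Toy searcher: a document matches when its text includes the query.
class ToySearcher
  attr_reader :searches, :loads

  def initialize(texts)
    @texts    = texts
    @searches = 0   # how many times search_top ran
    @loads    = 0   # how many times a document was loaded
  end

  # Returns at most n hits wrapped in a TopDocs
  def search_top(query, n)
    @searches += 1
    ids = @texts.each_index.select { |i| @texts[i].include?(query) }
    TopDocs.new(ids.size, ids.first(n).map { |i| ScoreDoc.new(i, 1.0) })
  end

  def doc(id)
    @loads += 1
    @texts[id]
  end
end

class Hits
  attr_reader :length

  def initialize(searcher, query, initial = 2)
    @searcher = searcher
    @query    = query
    @docs     = {}   # doc id => already loaded document
    fetch(initial)
  end

  # Document for the i-th hit; re-searches when i is beyond what was fetched
  def doc(i)
    raise IndexError, "no hit #{i}" unless (0...@length).cover?(i)
    fetch((i + 1) * 2) if i >= @score_docs.size
    id = @score_docs[i].doc
    @docs[id] ||= @searcher.doc(id)
  end

  private

  def fetch(n)
    top = @searcher.search_top(@query, n)
    @length     = top.total_hits
    @score_docs = top.score_docs
  end
end

searcher = ToySearcher.new(%w[ruby java ruby ruby perl ruby])
hits = Hits.new(searcher, "ruby")  # one search, first 2 hits fetched
hits.doc(3)                        # past the fetched hits: searches again
hits.doc(3)                        # same document again: served from cache
```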

feel it would be better to do the same thing with TopDocs. Rather than
adding the Hits class I feel it would be better to add the desired
functionality to TopDocs. I’m happy to listen to other points of view.

I think not having Hits makes it more complicated for those coming
from Java Lucene at least, but it is also a conceptual abstraction.
One thinks of getting “hits” back from a search, not “top docs”. So
in that sense, the semantics of having Hits is powerful. Part of
Fowler’s argument is to have redundancy, aliases, and conveniences
for the humane interface, and I think Hits would qualify in that regard.

Erik

#4

On 1/12/06, Erik H. wrote:

One thinks of getting “hits” back from a search, not “top docs”. So
in that sense, the semantics of having Hits is powerful. Part of
Fowler’s argument is to have redundancy, aliases, and conveniences
for the humane interface, and I think Hits would qualify in that regard.

    Erik

I’m not arguing that TopDocs is a better name than Hits. Rather that
having search methods return two different classes is unnecessary and
not “The Ruby Way”. My goal is to make Ferret easy for Ruby
programmers to use, not Java programmers. So what I’d like to hear is
an argument as to why having two separate classes - TopDocs and Hits -
is superior to combining the functionality of both into one class. My
personal feeling is that this is where the difference lies between
Java and Ruby but I could easily be swayed.

Dave


#5

On Jan 12, 2006, at 6:52 PM, David B. wrote:

My personal feeling is that this is where the difference lies between
Java and Ruby but I could easily be swayed.

It seems an injustice to Java in this regard. Surely Hits and
TopDocs could have their functionality blended together into a single
class. There was an intentional separation, not some constraint that
Java the language imposed.

I’m being a bit defensive of the Lucene API here and don’t want to
see Ferret diverge too much from it for no real benefit. What’s one
more class in Ruby in this situation to maintain consistency across
languages for the finest search engine available? Seems a small
sacrifice of Ruby “purity” to make for the noble cause :) Just my
$0.02.

Practically no one in Java Lucene uses TopDocs - you’ll notice that
all of those search methods are marked as “Expert”. Hits is the most
common way to access search results, allowing them to automatically
be paged through and have a bit of caching along with it.

Erik

#6

David B. wrote:

My personal feeling is that this is where the difference lies between
Java and Ruby but I could easily be swayed.

Hi Dave & Erik,

I don’t intend to hurt anybody’s opinions, but let me speak up on
something: I did some 2 years of Java programming, and was never really
comfortable with its verbosity though I liked it for other things. I
felt Ferret’s API is already a bit un-Rubyish, if you know what I mean.
It almost feels like I’m back to using Java libraries again.

Learning Rails took a complete break from the way I was doing J2EE
programming. But I totally love it, for its refreshing simplicity, and
thereby came to love Ruby too. So what I mean to say is, don’t be afraid
to break compatibility if it makes a programmer’s life easier. I agree
with Dave’s sentiments that Ferret should be better than Lucene. I’m
guilty of not understanding what it really takes, but I think we should
put ‘making a developer’s job easy’ before anything else. If it means
breaking compatibility with Lucene, I don’t really mind.

Of course, I speak from a very selfish point of view - I don’t have any
running Lucene apps to run or port. Just my 2 cents.

Regards,
Vamsee.


#7

Vamsee,

No feelings hurt here, and I completely understand your sentiment.

There are folks using Java Lucene that have expressed similar
sentiment about its API, which is more C-like than Java-like in many
ways. But, let’s focus on the heart of this thing… a high-powered
full-text search engine. The goal is speed and efficient use of
resources. An elegant API is desirable but of a secondary nature.
Hits is a pretty elegant way to navigate search results. I hope this
simple class can find its way into Ferret and that the IndexSearcher
API be made reasonably similar to Java Lucene.

Dave has done a great job with the Index class in Ferret, which has
features that Java Lucene does not - that of being able to flush and
see changes right away (which is harder to manage in Java Lucene) and
also of having keys to documents and managing an “update”. In Java
Lucene there is no concept of an update - there is only a remove and
an add.

I’m all for a slick Ruby API, but I would very much like to see it
built on top of a Lucene compatible index format for
interoperability. That interoperability is important to me and will
very likely be important to others. Consider Nutch for example. It
is an incredibly scalable web crawler and indexer. With index
compatibility you could use Nutch to crawl the web and use Ferret for
searching. Further, the HTML parsers I’ve used in Ruby are lousy
compared to what is available in Java. Indexing in Java makes a lot
of sense in many circumstances, but fronting an application with
Rails and Ferret also makes a lot of sense.

Dave is, of course, the creator and driver of Ferret. I encourage
him to consider keeping index file compatibility, to keep the basic API
intact for the classes that are direct ports of Lucene, and to innovate
on top of them rather than change them. He certainly may choose to
do otherwise, but doing so would likely drive me (and perhaps
others?) to other solutions.

Erik

#8

On Jan 13, 2006, at 5:13 AM, David B. wrote:

Hits is the most common way to access search results, allowing them
to automatically be paged through and have a bit of caching along
with it.

If this is the case then maybe I should just return a Hits object,
roll the TopDocs functionality into it and be done with it. If
practically no one is using TopDocs then practically no one will miss
it. ;-)

My argument is mainly on why there is an issue with one more internal
class. You’ve ported quite faithfully the underlying Lucene class
structure and API. Why is this one more item a big deal? TopDocs is
useful, don’t get me wrong. It is just not used by the general
Lucene consuming public, but many expert level folks do use it.

I hope that Hits and TopDocs can stay, and whether it makes sense for
them to be separate classes or not is really immaterial, but at least
keep the most public and useful part of Lucene’s API, IndexSearcher,
as compatible as possible.

What I was really looking for (and still hope to see) was an argument
discussing the pros and cons of having the two separate classes (and
“That’s what they did in Java” doesn’t count :-).

Keeping a consistent IndexSearcher API between Java Lucene and Ferret
is definitely an argument that counts for me personally. Innovating
“Ruby Way” features alongside that is also greatly desirable for sure!

Nevertheless, I’ve
revisited the Hits class in Lucene and I’ve thought more about the
issue at hand and Hits will be coming in the next release of Ferret.

Yay!!!

I
haven’t decided exactly how I’m going to do it yet. There will
probably still be some differences from the Lucene API. For example,
search_each() is here to stay. I’ll probably bring it up for
discussion again when I come to it. I still have a fair bit of work in
cFerret before I get to that stage.

Adding conveniences with block iteration and such make me extremely
happy! PyLucene did the same thing.

Erik

p.s. whispering GCJ… SWIG… :)


#9

On 1/11/06, David B. wrote:

  • The options to limit the search (first_doc and num_doc) in
    Search::IndexSearcher and the code that implements them should
    probably be moved out of Search::IndexSearcher into Index::Index

I think this needs to stay in IndexSearcher as it limits the amount of
memory used by a search. Even the java version allows you to specify
nDocs.

Reviewing the code again, and taking another look at the Java code I
think you’re right about this. If there is a more general search
method exposed that returns Hits I’ll be happy.

-F


#10

On 1/13/06, David B. wrote:

What I was really looking for (and still hope to see) was an argument
discussing the pros and cons of having the two separate classes (and
“That’s what they did in Java” doesn’t count :-). Nevertheless, I’ve
revisited the Hits class in Lucene and I’ve thought more about the
issue at hand and Hits will be coming in the next release of Ferret. I
haven’t decided exactly how I’m going to do it yet. There will
probably still be some differences from the Lucene API. For example,
search_each() is here to stay. I’ll probably bring it up for
discussion again when I come to it. I still have a fair bit of work in
cFerret before I get to that stage.

I was curious how this problem was addressed in other languages that
are not as strongly typed as Java so I took a look at the Plucene
implementation.

In Plucene there is an abstract base class Searcher which
IndexSearcher inherits from. Searcher has the method search which
instantiates a Hits object and passes “self” in as the searcher
argument before returning the newly created Hits object. The abstract
method search_top is implemented in IndexSearcher and returns TopDocs.
The search_top method is used internally by Hits when retrieving
results.

This follows the Java implementation pretty closely while still having
some of the advantages of more dynamic languages. A method isn’t
defined for each possible combination of arguments. Rather, methods
are identified by their functionality as reflected in their name. This
is in contrast to Java where a bunch of methods with the same name
(“search”) are identified by the method signature consisting of return
type, name and arguments.
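
Transliterated into Ruby, the layout described above might look roughly
like this. It is a sketch from my reading of the structure, not Plucene's
code and not a proposal for Ferret's internals; the in-memory matching
and the default of 10 results are made up to keep it self-contained.

```ruby
# Illustrative stand-ins only.
ScoreDoc = Struct.new(:doc, :score)
TopDocs  = Struct.new(:total_hits, :score_docs)

# Hits asks the searcher it was given for the underlying TopDocs.
class Hits
  include Enumerable

  def initialize(searcher, query, n = 10)
    @top_docs = searcher.search_top(query, n)
  end

  def length
    @top_docs.total_hits
  end

  def each(&block)
    @top_docs.score_docs.each(&block)
  end
end

# Abstract base class: callers use #search, which always returns Hits.
class Searcher
  def search(query)
    Hits.new(self, query)   # passes "self" in as the searcher
  end

  # Subclasses supply the low-level search that returns TopDocs.
  def search_top(_query, _n)
    raise NotImplementedError, "#{self.class} must implement search_top"
  end
end

class IndexSearcher < Searcher
  def initialize(docs)
    @docs = docs
  end

  def search_top(query, n)
    ids = @docs.each_index.select { |i| @docs[i].include?(query) }
    TopDocs.new(ids.size, ids.first(n).map { |i| ScoreDoc.new(i, 1.0) })
  end
end

searcher = IndexSearcher.new(["ferret search", "lucene search", "rails"])
hits = searcher.search("search")
hits.each { |score_doc| puts "doc #{score_doc.doc} scored #{score_doc.score}" }
```

The two entry points are told apart by name (search versus search_top)
rather than by argument types, so no overloading is needed.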

I don’t know if it will be any help, but it might be worth glancing
through the Plucene code for another perspective on how to organize
the various objects and their interactions.

-F


#11

On 1/14/06, Finn S. wrote:

I don’t know if it will be any help, but it might be worth glancing
through the Plucene code for another perspective on how to organize
the various objects and their interactions.

-F

Thanks Finn. I have downloaded PyLucene, Plucene, Lupy etc. and I have
been using all of them to solve various problems. I will certainly
study all of their search APIs.

Cheers,
Dave


#12

On 1/13/06, Erik H. wrote:

So what I’d like to hear is
an argument as to why having two separate classes - TopDocs and Hits -
is superior to combining the functionality of both into one class. My
personal feeling is that this is where the difference lies between
Java and Ruby but I could easily be swayed.

It seems an injustice to Java in this regard. Surely Hits and
TopDocs could have their functionality blended together into a single
class. There was an intentional separation, not some constraint that
Java the language imposed.

I was never implying there was some constraint imposed by the language
itself. I’m talking about the way things are done in Ruby versus the
way things are done in Java. There was an intentional separation of
ArrayList, Vector, Stack, etc. in Java too but it doesn’t mean we have
to do the same thing in Ruby. I’m not saying one way is better than
the other. But Ferret is a Ruby library so I’d like to do it the Ruby
way where possible.

Hits is the most common way to access search results, allowing them
to automatically be paged through and have a bit of caching along
with it.

If this is the case then maybe I should just return a Hits object,
roll the TopDocs functionality into it and be done with it. If
practically no one is using TopDocs then practically no one will miss
it. ;-)

What I was really looking for (and still hope to see) was an argument
discussing the pros and cons of having the two separate classes (and
“That’s what they did in Java” doesn’t count :-). Nevertheless, I’ve
revisited the Hits class in Lucene and I’ve thought more about the
issue at hand and Hits will be coming in the next release of Ferret. I
haven’t decided exactly how I’m going to do it yet. There will
probably still be some differences from the Lucene API. For example,
search_each() is here to stay. I’ll probably bring it up for
discussion again when I come to it. I still have a fair bit of work in
cFerret before I get to that stage.

Cheers,
Dave