Ferret 0.2.1 (port of Apache Lucene to pure ruby)


#1

Hi Folks,

I’ve just released version 0.2.1. Since my last announcement there
have been quite a few changes, mostly to the Index::Index interface. We
also have a great new logo thanks to Jan P… You can check it all
out here;

http://ferret.davebalmain.com/trac/

Dave Balmain

== Description

Ferret is a full port of the Java Lucene searching and indexing
library. It’s available as a gem so try it out! To get started quickly
read the quick start at the project homepage;

http://ferret.davebalmain.com/api
http://ferret.davebalmain.com/api/files/TUTORIAL.html

== Changes

=== Multifield searches

You can now do multi field searches using the query parser.

 # search the title and content fields for ruby
index.search_each("title|content:ruby") {|doc, score| puts "#{doc}:#{score}"}

 # search all fields for ruby
index.search_each("*:ruby") {|doc, score| puts "#{doc}:#{score}"}

=== Compound file support and Apache Lucene index reading

You can now store your index in compound files, which reduces the
number of files used by the index. This is useful as your index gets
bigger, since it helps prevent a "too many open files" error. It is
also handy for reading Apache Lucene indexes, as Apache Lucene uses the
compound file format by default.

=== Merging indexes

You can now merge two or more existing indexes into one. This is useful
if you want to have indexers working in parallel to create your index
and then merge all the indexes together to create one final index.

# add indexes 1 to 10 to the final index
index.add_indexes([index1, index2, ... , index10])
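To give a feel for what a merge has to do, here is a rough sketch using a toy in-memory inverted index. ToyIndex and all of its methods are made up for this illustration and are not part of Ferret; the real merge works on index segments and is more involved. The point is only that the document numbers of each added index get shifted past the documents already in the target.

```ruby
# Toy inverted index: term => list of document numbers.
# Hypothetical illustration only; this is not Ferret's implementation.
class ToyIndex
  attr_reader :postings, :doc_count

  def initialize
    @postings = Hash.new { |hash, term| hash[term] = [] }
    @doc_count = 0
  end

  # Index one piece of text under the next free document number.
  def <<(text)
    doc_id = @doc_count
    text.downcase.scan(/\w+/).uniq.each { |term| @postings[term] << doc_id }
    @doc_count += 1
    self
  end

  # Append other indexes, shifting their document numbers past our own.
  def add_indexes(others)
    others.each do |other|
      offset = @doc_count
      other.postings.each do |term, ids|
        @postings[term].concat(ids.map { |id| id + offset })
      end
      @doc_count += other.doc_count
    end
    self
  end

  # Return the document numbers that contain the term.
  def search(term)
    @postings.fetch(term.downcase, [])
  end
end

index1 = ToyIndex.new
index1 << "Ruby is fun"
index1 << "Java is verbose"

index2 = ToyIndex.new
index2 << "Ruby search library"

final_index = ToyIndex.new
final_index.add_indexes([index1, index2])
p final_index.search("ruby")
```

The document from index2 ends up with a new number in the merged index, which is why you should not rely on document numbers staying the same across a merge in this sketch.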

=== Persisting an in-memory index

You can gain a little in performance by using an in-memory index for
your indexing and then persisting it to your file system when you are
finished.

index = Index::Index.new()

# do all your indexing

index.persist("/path/to/your/index/directory")

=== Thread safety

Ferret is now thread-safe, so feel free to use it in a multithreaded
environment. Check out the thread tests in the test/functional
directory in the latest distribution.
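As a general illustration of what thread safety means for a shared store, here is a minimal sketch that guards a plain array with a Mutex. SynchronizedStore is a hypothetical class written for this example; it does not show how Ferret does its locking internally.

```ruby
require "thread"

# Hypothetical stand-in for a shared index; not Ferret code.
class SynchronizedStore
  def initialize
    @lock = Mutex.new
    @docs = []
  end

  # Only one thread at a time may append a document.
  def <<(doc)
    @lock.synchronize { @docs << doc }
    self
  end

  def size
    @lock.synchronize { @docs.size }
  end
end

store = SynchronizedStore.new

# Four writer threads, each adding 100 documents to the same store.
threads = (1..4).map do |n|
  Thread.new do
    100.times { |i| store << { :id => "#{n}-#{i}" } }
  end
end
threads.each { |t| t.join }

puts store.size
```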

=== Easy update and delete

You can now use a query to do a delete;

index.query_delete("content:java or content:perl")

And you can now easily update documents;

index.update(34, doc)
index.query_update('author:"David B."', {:author => "Dave Balmain"})

=== Primary Key

The latest addition is a primary key to the index. Note that this only
works through the Index::Index class and should only be used if you
know what you are doing.

index = Index::Index.new(:key => ["id", "table"])
index << {:id => 1123, :table => "product", :product => "Jacket"}
# ...
# The following will replace the Jacket product with a t-shirt
index << {:id => 1123, :table => "product", :product => "T-Shirt"}
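The replace-on-add behaviour can be pictured with a plain Hash keyed on the key fields. The sketch below is only a model of those semantics; KeyedStore is a hypothetical class and says nothing about how Ferret implements the :key option.

```ruby
# Toy model of a keyed index; hypothetical, not Ferret's implementation.
class KeyedStore
  def initialize(key_fields)
    @key_fields = key_fields
    @docs = {} # maps key values, e.g. [1123, "product"], to the document
  end

  # Adding a document whose key fields match an existing one replaces it.
  def <<(doc)
    @docs[@key_fields.map { |field| doc[field] }] = doc
    self
  end

  def size
    @docs.size
  end

  def each(&block)
    @docs.each_value(&block)
  end
end

store = KeyedStore.new([:id, :table])
store << { :id => 1123, :table => "product", :product => "Jacket" }
store << { :id => 1123, :table => "product", :product => "T-Shirt" }
store << { :id => 1123, :table => "customer", :name => "Dave" }

store.each { |doc| puts doc.inspect }
```

Note that the customer document is kept alongside the product because the key is the combination of both fields, not the id alone.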

Have fun and let me know what you think.


#2

David B. wrote:

Have fun and let me know what you think.

Thank you for this awesome library. I just wanted to tell you that your
work is much appreciated. I don’t actually use it right now, but I most
certainly will in the future. Having such a nice and powerful search
engine is really beneficial for Ruby, too, I think.

Sascha E.


#3

Does it support indexing PDFs, Docs and PPT files? If I remember
correctly this feature is provided in Java Lucene via a project called
Jakarta POI. It is not a big deal since you already started the ball
rolling and someone might add these features in time. Kudos to your
efforts.


#4

Hi Kris,

If you want to index these you’ll need to write (or acquire) specific
analyzers for the document type. That’s how it works in Lucene too.
One solution may be to index the documents with Lucene and use Ferret
to search the indexes.

Cheers,
Dave


#5

Are there any examples of (web) search scripts (not necessarily Ruby)
that will work with the index?

Thanks!



#6

I’m really excited about this library. However, after testing it out
I’m a little puzzled by the behavior. To test it out I added about 20
documents, each containing the same 5 fields (with different field
values in each doc). When I then try to query, the results I get seem
random - that is, they don’t always return documents that I’d expect
should be matching. Example:

doc = Document.new

doc << Field.new("name", "foobar", Field::Store::NO,
                 Field::Index::UNTOKENIZED)

index << doc

Now when I call search_each("foobar"), I don’t see a result (with some I
do, with others I don’t). However, if I call search_each("foobar~"), then
it seems to reliably return the expected matches. Any tips?

-jay
(running ruby 1.8.2 on OS X 10.3.8)


#7

Hi Jay,

You’ve got me puzzled too. Would it be possible for you to send me a
full example of this strange behaviour? It’s possible that it only
happens on OS X. :( I really need to get my hands on a Mac for a day
because there seem to be a few problems with that environment.
Hopefully we’ll have this all sorted out soon.

Thanks,
Dave