Ferret not able to read a Lucene Index?

Hi all,

I'm having problems trying to get Ferret to read an index generated by
Lucene.

Am I right in thinking Ferret should be able to read a Lucene generated
index no problem?

Using the code snippets detailed in
http://www.ruby-forum.com/topic/64099#new

Any advice gratefully received.
Many Thanks,
Steven

On May 15, 2006, at 12:08 PM, steven shingler wrote:

Am I right in thinking Ferret should be able to read a Lucene
generated
index no problem?

That would be nice, but it is not currently the case because of
Java’s wacky “modified” UTF-8 serialization. I’ve seen that plain
ol’ ASCII text indexes will be compatible, but once you put in some
higher order characters things go askew.

Erik

Hi Erik, Thanks for getting back to me.

Ahh yes, I see what you mean - if I “Lucene-Index” only plain text
files, Ferret can search that index fine (it seems).

However, what I’m trying to do is index pdfs, using PDFBox to create the
Lucene documents - but Ferret isn’t at all pleased when I try to search:

NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.name
c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/term_buffer.rb:31:in `read'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/segment_term_enum.rb:90:in `next?'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/segment_term_enum.rb:118:in `scan_to'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/term_infos_io.rb:285:in `scan_for_term_info'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/term_infos_io.rb:163:in `get_term_info'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/segment_reader.rb:176:in `doc_freq'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/multi_reader.rb:169:in `doc_freq'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/multi_reader.rb:169:in `each'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/multi_reader.rb:169:in `doc_freq'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/index_searcher.rb:47:in `doc_freq'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/term_query.rb:13:in `initialize'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/term_query.rb:99:in `new'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/term_query.rb:99:in `create_weight'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/boolean_query.rb:113:in `initialize'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/boolean_query.rb:112:in `each'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/boolean_query.rb:112:in `initialize'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/boolean_query.rb:209:in `new'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/boolean_query.rb:209:in `create_weight'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/query.rb:51:in `weight'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/search/index_searcher.rb:107:in `search'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/index.rb:660:in `do_search'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/index.rb:331:in `search_each'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/index.rb:330:in `synchronize'
c:/ruby/lib/ruby/site_ruby/1.8/ferret/index/index.rb:330:in `search_each'
./lib/ferret_client.rb:34:in `search_index'
test/functional/ferret_client_test.rb:12:in `test_search_index'

This is a shame, as I thought I was onto a winner with the Lucene/Ferret
combo - especially with PDFBox able to create Lucene Docs so easily.

This may not actually relate to your point of higher order chars…?

Does anyone have any experience of indexing pdfs in Lucene (using
PDFBox) and searching with Ferret? Or of course creating Ferret Index
Docs from pdf files in ruby?

Any ideas or advice gratefully received.
Thanks,
Steven

Hi, steven,

first of all: would you mind providing a little more info on the
environment you are on: OS, version of Ferret, version of Ruby et al.

second: you might be interested in the FerretFinder utility as well as RDig.
Links to both of them you'll find at the bottom of the howto section on
the ferret trac: http://ferret.davebalmain.com/trac/wiki/HowTos . Both of
these tools seem to use pdftotext to extract content from PDFs but might
be of help to you anyway.

Regards
Jan P.

Hi Jan,

Right - sorry.

I’m on Windows XP(pro); ferret 0.9.1 (pure ruby); ruby 1.8.2

I’ll look into those links now.
Many Thanks
Steven

hey steven,

have you got a Linux box at your disposal too? It would be of interest
to see if the problem persists with Ferret 0.9.3. If you've got any
scripts and test data for your PDFs I might as well check this out for
you on Linux, Ferret 0.9.3 and Ruby 1.8.4.

regards
Jan

Hi Jan,

Yes, I’ve got an Ubuntu box I can try it on - just updated to ferret
0.9.3 and ruby 1.8.4 on it.

Will have a look now and report back.

Many Thanks for your help.
S~

p.s. the ferret_helper finder utils look v interesting

On 5/16/06, Erik H. [email protected] wrote:

On May 15, 2006, at 12:08 PM, steven shingler wrote:

Am I right in thinking Ferret should be able to read a Lucene
generated
index no problem?

That would be nice, but it is not currently the case because of
Java’s wacky “modified” UTF-8 serialization. I’ve seen that plain
ol’ ASCII text indexes will be compatible, but once you put in some
higher order characters things go askew.

Hey guys,

What Erik said is exactly correct. Marvin H. (author of
KinoSearch, a Perl port of Lucene) has submitted a patch to Lucene so
that non-Java ports of Lucene will be able to read Lucene indexes. It
currently slows Lucene down by about 25% (I think??) so
I’m going to be working with him to improve the performance of the
patch so that it can one day be included in Lucene. Don’t hold your
breath though. It’s going to take us a while to get it in there. For
now, I’d recommend using pdftotext as Jan already mentioned. I’m not
sure what is available on Windows but I’m sure it would be trivial to
write your own pdftotext using Java’s PDFBox and then call it from
Ruby.
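
Dave's pdftotext route can be sketched in a few lines of Ruby. This is a hypothetical illustration, not code from the thread: the helper names, field names and file paths are made up, and it assumes only that a pdftotext binary is on the PATH.

```ruby
# Hypothetical sketch: skip Lucene index sharing entirely, extract the
# PDF text with pdftotext, and hand the result to Ferret as a plain
# Ruby hash document.

# Build the extraction command; the trailing "-" tells pdftotext to
# write the extracted text to stdout.
def pdftotext_command(path)
  ['pdftotext', path, '-']
end

def extract_pdf_text(path)
  IO.popen(pdftotext_command(path), &:read)
end

# With Ferret (not run here):
#   require 'ferret'
#   index = Ferret::Index::Index.new(:path => '/tmp/pdf_index')
#   index << { :file => 'manual.pdf', :content => extract_pdf_text('manual.pdf') }
```

The nice part of this approach is that the index is written by Ferret itself, so the Modified UTF-8 incompatibility never comes into play.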

Cheers,
Dave

On May 16, 2006, at 7:53 AM, David B. wrote:

higher order characters things go askew.

Hey guys,

What Erik said is exactly correct. Marvin H. (author of
KinoSearch, a Perl port of Lucene) has submitted a patch to Lucene so
that non-Java ports of Lucene will be able to read Lucene indexes. It
currently slows Lucene down by about 25% (I think??)

Around 20% for indexing according to my benchmarker. I don’t have a
benchmark for searching.

Modified UTF-8 is not so much the problem for performance of my
patch, nor is it actually causing the index incompatibility in this
case. Modified UTF-8 is problematic for a couple other reasons.

When text contains either null bytes or Unicode code points above the
Basic Multilingual Plane (values 2^16 and up, such as U+1D160
“MUSICAL SYMBOL EIGHTH NOTE”), KinoSearch and Ferret, if they write
legal UTF-8, would write indexes which would cause Lucene to crash
from time to time with a baffling “read past EOF” error. Therefore,
to be Lucene-compatible they’d have to pre-scan all text to detect
those conditions, which would impose a performance burden and require
some crufty auxiliary code to turn the legal UTF-8 into Modified UTF-8.
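
The two problem cases can be made concrete with a short Ruby sketch. The modified_utf8 helper below is hypothetical and hand-rolled for just these cases (NUL and supplementary-plane characters); it is not a complete Modified UTF-8 encoder.

```ruby
# Java's "modified" UTF-8 differs from standard UTF-8 in exactly two
# cases: U+0000 gets an overlong two-byte form, and characters above
# the BMP are encoded as a UTF-16 surrogate pair, each surrogate
# written as a three-byte sequence (6 bytes total instead of 4).
def modified_utf8(codepoint)
  if codepoint == 0
    [0xC0, 0x80]                  # NUL: overlong two-byte form
  elsif codepoint > 0xFFFF
    v  = codepoint - 0x10000
    hi = 0xD800 + (v >> 10)       # high surrogate
    lo = 0xDC00 + (v & 0x3FF)     # low surrogate
    [hi, lo].flat_map do |s|
      [0xE0 | (s >> 12), 0x80 | ((s >> 6) & 0x3F), 0x80 | (s & 0x3F)]
    end
  else
    [codepoint].pack('U').bytes   # BMP characters match standard UTF-8
  end
end

note = 0x1D160                  # MUSICAL SYMBOL EIGHTH NOTE
[note].pack('U').bytes          # => [0xF0, 0x9D, 0x85, 0xA0]  (legal UTF-8, 4 bytes)
modified_utf8(note)             # => [0xED, 0xA0, 0xB4, 0xED, 0xB5, 0xA0]  (6 bytes)
```

Note that the two byte sequences differ in length as well as content, which is exactly why a reader expecting one form loses its place in the stream.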

Also, non-shortest-form UTF-8 presents a theoretical security risk,
and Perl is set up to issue a warning whenever a scalar which is
marked as UTF-8 isn’t shortest-form. That condition would occur
whenever Modified UTF-8 containing null bytes or code points above
the BMP was read in – thus requiring that all incoming text be pre-
scanned as well.

Those are rare conditions, but it isn’t realistic to just say
“KinoSearch|Ferret doesn’t support null bytes or characters above the
BMP”, because a lot of times the source text that goes into an index
isn’t under the full control of the indexing/search app’s author.

To be fair to Java and Lucene, they are paying a price for early
commitment to the Unicode standard. Lucene’s UTF-8 encoding/decoding
hasn’t been touched since Doug Cutting wrote it in 1998, when non-
shortest-form UTF-8 was still legal and Unicode was still 16-bit.
You could argue that the Unicode consortium pulled the rug out from
under its early champions by changing the spec so that existing
implementations were no longer compliant.

The performance problems of my patch and the crashing are actually
tied to the Lucene File Format’s definition of a String. A String in
Lucene is the length of the string in Java chars, followed by the
character data translated to Modified UTF-8. A String in KinoSearch,
and if I am not mistaken in Ferret as well, is the length of the
character data in bytes, followed by the character data.

Those two definitions of String result in identical indexes so long
as your text is pure ASCII, but as Erik noted, when you add higher
order characters to the mix, problems arise. You end up reading
either too few bytes or too many, the stream gets out of sync, and
whammo: ‘Read past EOF’.
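
The divergence between the two String definitions is easy to demonstrate; here is a small illustrative sketch (the helper names are made up):

```ruby
# Lucene prefixes a String with its length in Java chars (UTF-16 code
# units); Ferret and KinoSearch prefix it with its length in UTF-8 bytes.
def java_char_count(str)
  # one UTF-16 code unit per BMP character, two per supplementary character
  str.each_codepoint.sum { |cp| cp > 0xFFFF ? 2 : 1 }
end

def byte_count(str)
  str.encode('UTF-8').bytesize
end

java_char_count('hello')  # => 5
byte_count('hello')       # => 5  (identical, so pure-ASCII indexes interoperate)

java_char_count('café')   # => 4
byte_count('café')        # => 5  (a reader expecting one count misreads the other)
```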

My patch modifies Lucene to use bytecounts as the prefix to its
Strings. Unfortunately, there are encoding/decoding inefficiencies
associated with the new way of doing things. Under Lucene’s current
definition of a string you allocate an array of Java char then read
characters into it one by one. With the new patch, you don’t know
how many chars you need, so you might have to re-allocate several
times. There are ways to address that inefficiency, but they’d take
a while to explain.

Don’t hold your
breath though. It’s going to take us a while to get it in there.

Yeah. Modifying Lucene so that it can read both the old index format
and the new without suffering a performance degradation in either
case is going to be non-trivial. I’m sympathetic to the notion that
it may not be worth it and that Lucene should declare its file format
private. There are a lot of issues in play.

No KinoSearch user has yet complained about Lucene/KinoSearch file-
format compatibility. The only thing I miss is Luke – which is
significant, because Luke is really handy.

How many users here care about Lucene compatibility, and why?

Marvin H.
Rectangular Research
http://www.rectangular.com/

On May 16, 2006, at 12:51 PM, Marvin H. wrote:

How many users here care about Lucene compatibility, and why?

Personally I’m putting my eggs into the Solr basket - http://

Solr has a ton of benefits over using raw Lucene with its caching and
configurable handling of putting new searchers online, etc. It's got
plenty of room for improvement, and those improvements are in
progress. I am integrating Solr into a Ruby on Rails front-end as we
speak, though doing so crudely through a rough HTTP API; abstracting
that communication layer behind a nice Rubyish DSL would be quite cool.

I used to really really want Lucene index compatibility at the file
format layer along with a really fast Ruby implementation. At this
point I’ve changed my mind and Solr is my recommended basis for
search integration into non-Java (and even Java perhaps) applications.

I just wanted to toss out my thoughts since I've been mostly silent
on the Ferret/KinoSearch issues. I still daydream of GCJ'd Java
Lucene being the basis for cross-language integration, using PyLucene
as a great example. They achieve 100% index compatibility with Java
Lucene because it is Java Lucene. I'm still extremely pleased to
see folks like Dave and Marvin digging deep into Ruby and Perl
integration and starting to work together. Very promising no matter
how this ends up. I'm optimistic we'll have Lucene in Ruby one of
these days, in a compatible and incredibly performant way!

Erik

On May 16, 2006, at 3:30 PM, Nick S. wrote:

Solr looks like a promising project; the only problem I have with it is
that you need Tomcat and a JVM. This adds two more variables to your
configuration you have to control. Great if you know Java, but I'm
programming in Ruby so I don't have to program in Java or .NET, or
whatever. So I prefer a Ruby-only environment for its simplicity.

A fair and expected critique of using Solr in a Ruby environment.
Every language enjoys a bit of lock-in, and programmers obviously
would prefer to work with native APIs.

It is true you need a JVM to run Solr, but it doesn't have to be
Tomcat. I use Jetty. To fire up Solr in my Rails environment only
required that I customize its schema.xml and solrconfig.xml files and
run "java -jar start.jar". And voila, it's up and running. So while it
does add an entirely new moving piece, I view it as something akin to
adding a database. As long as there is a good way to communicate
with it natively (a Ruby/Solr API would be well received, methinks)
then Solr isn't any more overhead, actually less, to a project's
deployment than adding a database server.
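
Erik's "crude HTTP API" approach can be sketched in a few lines of Ruby. The host, port and /solr/select handler path below are assumptions based on Solr's example Jetty setup; adjust them to your own solrconfig.xml.

```ruby
# Minimal sketch of querying Solr from Ruby over plain HTTP.
require 'net/http'
require 'uri'

# Build the select URL for a query string.
def solr_select_uri(query, host = 'localhost', port = 8983)
  uri = URI("http://#{host}:#{port}/solr/select")
  uri.query = URI.encode_www_form(:q => query, :wt => 'json')
  uri
end

# Fire the request (not run here; needs a live Solr instance):
#   response = Net::HTTP.get(solr_select_uri('title:ferret'))
```

A thin wrapper like this is all the "communication layer" amounts to; the Rubyish DSL Erik mentions would just be sugar on top of it.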

Erik

I don’t care about the fact that Ferret isn’t able to read a Lucene
index. The only problem is that when the Ferret index isn’t compatible
with Lucene, as is the case right now (damn EOF errors), you are not able
to use Luke to take a quick peek inside the index. So a port of Luke to
access Ferret would be great.

Ferret should be fast, have the power of Lucene searches and be easy to
access from Ruby, as it is right now. If you are going to use Lucene, go
all the way and stick to Java. The only problem with Ferret is that the C
version isn’t available on Windows (for testing purposes) yet, but that
is being worked on. GCJ and SWIG sound great, but setting them up is a
real pain in the ass: great for techies, but horrible for all the
others.

Solr looks like a promising project; the only problem I have with it is
that you need Tomcat and a JVM. This adds two more variables to your
configuration you have to control. Great if you know Java, but I’m
programming in Ruby so I don’t have to program in Java or .NET, or
whatever. So I prefer a Ruby-only environment for its simplicity.

So Luke is a definite plus as a debugging tool.

Kind regards,

Nick

On 5/17/06, Marvin H. [email protected] wrote:

How many users here care about Lucene compatibility, and why?

Great question. Who does care, and why? Performance used to be a very
good reason but that doesn’t apply anymore. Is it Java’s libraries?
Java does have PDFBox for example. Unfortunately Ruby doesn’t yet have
an equivalent but there are ways around this. The only good reason I
can think of is the lack of a Luke port. Anyone care to enlighten us?

Cheers,
Dave

On May 16, 2006, at 12:30 PM, Nick S. wrote:

I don’t care about the fact that Ferret isn’t able to read a Lucene
index. The only problem is that when the Ferret index isn’t compatible
with Lucene, as is the case right now (damn EOF errors), you are not
able to use Luke to take a quick peek inside the index. So a port of
Luke to access Ferret would be great.

You know what… I think using Luke powered by a version of Lucene
with my patch applied would allow it to read Ferret indexes.

I don’t have time to check this out right now. And ironically, I’ve
made further mods to KinoSearch’s file format, so it wouldn’t make
Luke available to KinoSearch users unless I change it back. hahaha. :o

The patch was prepared against subversion, but it might work against
1.9.1. If it doesn’t, it would be trivial to finish it and package it
up. Maybe we can convince the Lucene folks to distribute it through
their channels… or I can put it up at my site. Maybe Luke’s author
would be amenable to distributing it from his site, but I dunno about
that - people might blame him rather than me or Balmain when stuff
fails to work.

Marvin H.
Rectangular Research
http://www.rectangular.com/

hey Marvin,

is there a link in this thread already? I’ve found
http://issues.apache.org/jira/browse/LUCENE-510?page=comments#action_12378519
as well as the links at the bottom of
http://www.archivum.info/[email protected]/2005-09/msg00025.html
with Google. Is there anything else? I’ll definitely try this out but
wanted to make sure this is the latest development…

Regards
Jan

I agree with Jan’s ‘real-world’ scenario - it is the reason I started
this thread in the first place… :)

…not so much because of management pressures, but I see merit in being
able to create indexes in either Java or Ruby, then use Rails to present
a query interface.

It keeps one’s options open - particularly with PDFBox and POI in the
Java space, although I’m looking into both routes of the
pdftotext/ferret_helper tools, and applying Marvin’s patch - so perhaps
both paths can remain open.

Thanks to all though, for contributing to this very interesting thread!
:)
Cheers
Steven

Hi Dave,

IMHO there are two things:

  1. these little marketing and management issues that often have no valid
    reason but make a big difference:

Programmer/Freelancer: let’s use Ruby; we’ll even be able to build a
superfast search interface to all your great marketing docs with Ferret,
Rails and Ruby.
Manager: I think we’ve got this; it’s implemented by something called
bluezeneeee.

P/F: yes, we might even use the indexes of this and perform searches with
the old system while we are changing…
M: changing what?

P/F: the system, to Ruby, Ferret…
M: WTF?

For these conversations it would help to stay in the background with the
changes as much as possible…

  2. Tools around Lucene

I think people will now give Marvin’s patch and Luke a try, but Luke is
not the only thing. Thanks to Erik for putting up Solr. I think it’s a
little bit of the old Java 90%/10% thingy. For 90% of webapps all the
Java, Spring, Hibernate stuff is damn complex and you’ll be faster with
Ruby. But the 10 or fewer percent, often the big-money stuff of Fortune
companies, of banks etc., made their management decision for either J2EE
or .NET. And for these projects the programming teams often need
distributed and high-volume things, see CNET and Solr.

I’ve heard about Solr in this thread for the first time and wonder a
little how it works together with Nutch/Hadoop for the distributed
things, but will do some googling on this myself. I think there is
definitely a need, also in the Ruby world, for search engines and
crawlers. And Nutch has some nifty features over RDig. Discussions about
the interchangeability between Nutch and Ferret are showing that people
are interested in using Lucene tools but fronting them with Ruby, Rails
and Ferret. I’ve for example tried to work with Ferret on a Nutch index,
and luckily Ferret didn’t choke on the index because there were no UTF-8
chars in there. So I could extract url, segment and docno, but then
there came this nfs / hadoop thing to extract content and summaries as
well, and I gave up.

There also seems to be interest and need in distributed search
architectures, as the p2p efforts of Hyper Estraier as well as
nfs / hadoop and Solr (rsync?) are showing…

Regards
Jan