So, this search thing

I am using Ferret right now, and it works great for all my regular text
documents/information. My problem arises when I want to index/search all
of our assets (mostly PDF files). Currently, there is no way to read PDFs
from Ruby, so I have to resort to using Java to read the PDFs and then
Lucene to index them. My problem here is a couple of things.
One, to index an asset I have to either fire up a complete new JVM for
each asset, or have the index rebuilt each night at a set time. Each way
has its own advantages/downfalls, but the biggest is that Ferret doesn’t
like to talk to Lucene-created indexes :frowning: doh!
So, on to number two. Now I can come at this from a couple of angles. I
could create a Java web service to do the indexing and the searching and
then return the results. Or I could simply write a small utility program
(with Groovy, perhaps?) that uses Java just to get the content of the PDF
files, and use Ferret for everything else. Or some combination of the two,
or something completely different.
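For what it’s worth, the small-utility route can stay mostly in Ruby: shell out to whatever Java/Groovy extractor dumps the PDF text to stdout, then hand the string to Ferret. A minimal sketch of that idea — the `java -jar pdf2text.jar` command is a hypothetical placeholder for whatever utility actually gets built:

```ruby
# Minimal sketch: extract a PDF's text by shelling out to an external
# command that prints the text to stdout. "java -jar pdf2text.jar" is a
# hypothetical extractor; substitute the real utility here.
def extract_text(path, command: "java -jar pdf2text.jar")
  text = IO.popen("#{command} #{path}", &:read)
  raise "extraction failed for #{path}" unless $?.success?
  text
end
```

The extracted string can then go straight into a Ferret document, so Java never needs to touch the index itself.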

I’m interested in what you folks out there have to say about this. I
would really, really like to avoid creating a whole web service just for
searching, but if that’s the most viable way then I may go that route.

-Nick “searching for a clue” S

On Tuesday, February 07, 2006, at 3:27 PM, Nick S. wrote:

asset, or have the index rebuilt each night at a set time. Each way has
its own advantages/downfalls, but the biggest is that Ferret doesn’t like
to talk to Lucene-created indexes :frowning: doh!

I could swear I had read that Ferret would use Lucene’s indices and vice
versa… I just can’t dig up the link now. It’s a true port, right?

Out of curiosity, what are you using to read the pdfs in Java?

Thanks!
John

Also, is this a Linux or Unix box? Have you considered running the PDFs
through pdf2txt and indexing the results?

It does read Lucene indexes to a point, and in fact it worked for my
initial testing. However, once I started to really put content into the
indexes, it ran into a problem with some encoding issues. David knows
about this and said it was a non-trivial issue with how Lucene produces
its indexes, or how Ferret reads them (can’t remember exactly).

For reading the PDFs I’m using PDFBox, and it’s worked great so far.

-Nick

On 7 Feb 2006 20:38:29 -0000, John W. [email protected] wrote:

This is unfortunately deployed on a Windows server. I haven’t looked to
see if there is a tool like pdf2txt for Windows, but I would like to
keep this as platform-agnostic as possible.

-Nick

On 7 Feb 2006 20:40:51 -0000, John W. wrote:

Another option is to create a Java-based index (and even search)
server. There is one in incubation at Apache (Solr), which is distilled
from the system that drives CNET’s amazing faceted search system. This
option would allow you to use PDFBox (or the TextMining wrapper) to deal
with PDFs quite reliably.

I currently use a simple custom Java XML-RPC search server with my
Rails front-end, and it works quite well and more than acceptably
fast. I index with a Java process as well.

And yeah, unfortunately there is a mismatch between Ferret and Lucene
indexes when certain characters get used. This has been discussed in
great detail on the java-dev@lucene e-mail list:
http://marc.theaimsgroup.com/?l=lucene-user&m=112529400725721&w=2 - it is
an unfortunate situation that seems to come down to either making Java
Lucene somewhat slower in order to be more standard, or making the ports
deal with a Java oddity in UTF-8 handling. The low-level specifics have
thus far gone over my head, but they have been discussed by some very
sharp folks in that thread and others. PyLucene does not suffer this
issue because it is truly Java Lucene underneath (via GCJ and SWIG).

Erik

On 2/7/06, Erik H. [email protected] wrote:

Another option is to create a Java-based index (and even search)
server. There is one in incubation at Apache (Solr), which is distilled
from the
system that drives CNET’s amazing faceted search system. This option
would allow you to use PDFBox (or the wrapper TextMining) to deal
with PDF’s quite reliably.

I’ll have to take a look into that; I hadn’t noticed it before.

I currently use a simple custom Java XML-RPC search server with my

Rails front-end, and it works quite well and more than acceptably
fast. I index with a Java process as well.

Yeah, I think that’s going to have to be the route I take. I have a
couple of questions, though, if you don’t mind sharing.
What RPC server do you use? I am currently thinking about using Axis,
but am not sure what else is out there.
And for building the indexes, do you have a task that just runs at a
scheduled time to rebuild them? (Do you build them from scratch or add
onto them?)

And yeah, unfortunately there is a mismatch between Ferret and Lucene
indexes when certain characters get used. This has been discussed in
great detail on the java-dev@lucene e-mail list:
http://marc.theaimsgroup.com/?l=lucene-user&m=112529400725721&w=2 - it is
an unfortunate situation that seems to come down to either making Java
Lucene somewhat slower in order to be more standard, or making the ports
deal with a Java oddity in UTF-8 handling. The low-level specifics have
thus far gone over my head, but they have been discussed by some very
sharp folks in that thread and others. PyLucene does not suffer this
issue because it is truly Java Lucene underneath (via GCJ and SWIG).

Yeah, that was what I found. The detail of the problem is not really
something I’m interested in, or can even currently understand. It’s
unfortunate too, because everything about Ferret has been working great.
The only reason I need Java at all at this point is to get stuff out of
PDFs.

    Erik

Thanks for the info, Erik and all; looks like I have some research to do
here.

-Nick

On Feb 7, 2006, at 8:09 PM, Nick S. wrote:

I currently use a simple custom Java XML-RPC search server with my
Rails front-end, and it works quite well and more than acceptably
fast. I index with a Java process as well.

Yeah, I think that’s going to have to be the route I take. I have a
couple of questions, though, if you don’t mind sharing.
What RPC server do you use? I am currently thinking about using Axis,
but am not sure what else is out there.

I use Apache’s XML-RPC implementation (Apache ws-xmlrpc).

My code is essentially this:

public class SearchServer {
    private IndexReader reader;
    private IndexSearcher searcher;

    public SearchServer(Directory directory) throws IOException {
        reader = IndexReader.open(directory);
        searcher = new IndexSearcher(reader);
    }

    public Hashtable search(Vector constraints, int start, int max)
            throws IOException, ParseException {
        // …
    }

    public static void main(String[] args) {
        String indexPath = args[0];
        int port = 8076;
        if (args.length > 1) {
            port = Integer.valueOf(args[1]).intValue();
        }

        try {
            WebServer server = new WebServer(port);
            server.addHandler("$default",
                new SearchServer(FSDirectory.getDirectory(indexPath, false)));
            server.start();
        } catch (Exception e) {
            System.err.println("SearchServer: " + e.toString());
            e.printStackTrace(System.err);
        }
    }
}
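On the Rails side, the call to a server like this is just an XML-RPC `methodCall` POSTed over HTTP. Ruby’s XML-RPC client handles it, but the payload is simple enough to sketch by hand; the `search` method name matches the handler above, and the parameter values are illustrative:

```ruby
# Build an XML-RPC methodCall body by hand, just to show what goes over
# the wire to the Java SearchServer. Only integers and strings are
# handled here; a real client would cover the full XML-RPC type set.
def xmlrpc_payload(method, *params)
  values = params.map do |p|
    type = p.is_a?(Integer) ? "int" : "string"
    "<param><value><#{type}>#{p}</#{type}></value></param>"
  end.join
  %(<?xml version="1.0"?><methodCall><methodName>#{method}</methodName>) +
    "<params>#{values}</params></methodCall>"
end
```

POSTing that body to the server’s port with `Net::HTTP` returns a `methodResponse` to parse on the Rails side.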

And for building the indexes, do you have a task that just runs at a
scheduled time to rebuild them? (Do you build them from scratch or add
onto them?)

My project deals with static data. I build the index from scratch
once and that is it, so there is no updating of it on the fly. When
the data changes, the entire index is rebuilt.
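One way to make that rebuild-from-scratch approach safe while searches are still running is to build into a scratch directory and then swap it into place. A sketch of the pattern — the directory handling is real, but the block stands in for the actual Ferret/Lucene indexing step:

```ruby
require "fileutils"
require "tmpdir"

# Rebuild-from-scratch pattern: index into a scratch directory, then move
# it over the live one, so readers never open a half-built index. The
# block is where the real indexing of every document would happen.
def rebuild_index(live_dir)
  scratch = Dir.mktmpdir("index-build")
  yield scratch                        # write the fresh index here
  FileUtils.rm_rf(live_dir)            # drop the old index
  FileUtils.mv(scratch, live_dir)      # swap the new one into place
end
```

A nightly cron or rake task can call this; the window where no index exists is only the two `FileUtils` calls, not the whole indexing run.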

Yeah, that was what I found. The detail of the problem is not really
something I’m interested in, or can even currently understand. It’s
unfortunate too, because everything about Ferret has been working
great. The only reason I need Java at all at this point is to get
stuff out of PDFs.

With your use of Ferret, I’m curious what size of data you’re dealing
with — how many documents? What does your index size end up being?

Thanks,
Erik

http://arton.no-ip.info/collabo/backyard/?RubyJavaBridge

Works like a charm on modern Linuxes. On Windows you have to (or at
least I had to) fix (undo) the automatic iconv conversion that’s going
on. Just make the check for whether or not iconv has been installed
always return false.

On 2/8/06, Erik H. [email protected] wrote:

I currently use a simple custom Java XML-RPC search server with my
Rails front-end, and it works quite well and more than acceptably
fast. I index with a Java process as well.

That was the kind of thing I was looking for. I knew Tomcat or any other
container was a bit overkill for this.

I use Apache’s XML-RPC implementation (Apache ws-xmlrpc).

My code is essentially this:

public class SearchServer {

}

Looks to be about the amount of code I want to write for this. I
already have the indexing part built so hopefully I’ll just be able
to add a few more rpc calls to add/delete/modify the indexes as
needed.

stuff out of PDFs.

With your use of Ferret, I’m curious what size of data, such as how
many documents, you’re dealing with? What does your index size end
up being?

Currently I’m only indexing our page content, and I’m not storing
anything in the index besides an id field, so I know the page that the
content is linked to. With that, at about 100 pages the index size is
only 252 KB, pretty damn small.

That said, I know we have a couple hundred PDFs that are going to be
indexed as well, and again, I won’t actually store anything besides an
id field so I can load the document up when needed. I expect this index
won’t be that sizable either.

Thanks for the info Erik!

I’ll have to look into that Colin, thanks for pointing it out!

On Tuesday, February 07, 2006, at 4:48 PM, Erik H. wrote:

PyLucene does not suffer this
issue because it is truly Java Lucene underneath (via GCJ and SWIG).

Just out of curiosity, is there anything technically preventing Ruby
from going the same route (i.e., binding to a GCJ-compiled version
through SWIG)?

On Feb 8, 2006, at 10:47 AM, John W. wrote:

On Tuesday, February 07, 2006, at 4:48 PM, Erik H. wrote:

PyLucene does not suffer this
issue because it is truly Java Lucene underneath (via GCJ and SWIG).

Just out of curiosity, is there anything technically preventing Ruby
from going the same route (i.e., binding to a GCJ-compiled version
through SWIG)?

Nothing whatsoever, and it’s a project I keep dreaming of but never
making the time to do myself. I have, however, been tinkering with
PyLucene’s build process recently in a half-hearted attempt at it. I
would definitely open-source it under the ASL and host it at
lucene.apache.org. PyLucene has expressed interest in migrating to be
a sibling of Java Lucene as well.

The primary reason I think the GCJ/SWIG approach is the right way to
the most amazing search engine available, even considering commercial
products, is that Doug Cutting and the other information-retrieval
experts are constantly spending their time improving the Java codebase.
I personally want the latest and greatest text-search features to be
available in exactly the same way from both Java and Ruby, with of
course the added semantic goodies (syntactic sugar, you may call it)
that Ruby offers.

Dave - go full steam with Ferret! It rocks! I am as much a
cheerleader of your efforts as anyone possibly could be.

Erik

On Feb 8, 2006, at 12:53 PM, Nick S. wrote:

I use Apache’s XML-RPC implementation (Apache ws-xmlrpc).
to add a few more rpc calls to add/delete/modify the indexes as
needed.

It’s one of those things that’s pragmatically simple enough to solve
the problem at hand. Of course I’d love to stay in Ruby all the time,
but there is absolutely nothing wrong with using the best of all the
computing tools at our disposal, even if it is that evil four-letter
word Java :slight_smile:

I’m as simple as they come. I gravitate away from complexity, and
Apache’s XML-RPC was the simplest and most performant way I could
glue the oh-so-incredibly-powerful Lucene into a Rails front-end.
Nothing wrong with being multi-lingual, I say. 对不对? (“Right?”)

Currently I’m only indexing our page content, and I’m not storing
anything in the index besides an id field, so I know the page that the
content is linked to. With that, at about 100 pages the index size is
only 252 KB, pretty damn small.

That said, I know we have a couple hundred PDFs that are going to be
indexed as well, and again, I won’t actually store anything besides an
id field so I can load the document up when needed. I expect this index
won’t be that sizable either.

Ok, so we’re still talking < 100 documents in the Ferret index total,
which I think is quite suitable for the pure-Ruby Ferret to handle.
I’m curious how it would fare with my 30k document set - ashamedly,
I’ve yet to actually try it. I sorta put Ferret on the back burner
due to time constraints and the need to move my project forward using
the Java-based indexing code I already had, and sadly hit the wall when
my Java-built index did not jibe with Ferret. No offense to Dave
whatsoever; in fact, I’m still in awe of the cleanliness and internal
elegance of Ferret and how it matches the Java code. Truly a work of
art.

More on this thread in a sec…

Erik

Hey Erik, just wanted to say thanks! I got the search stuff up and
running following your suggestions. It works like a champ, and it is
pretty quick too. The longest part is re-indexing the assets
themselves; the Ruby XML-RPC client actually times out on that, so I
still have to change that, but the indexing and search work great.
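If anyone hits the same timeout: one way around it is to raise the read timeout on the HTTP connection underneath the client before firing the long re-index call. A sketch with `Net::HTTP` directly — the host, port, and ten-minute value are illustrative:

```ruby
require "net/http"

# Give a long-running RPC (like a full re-index) room to finish by
# raising the read timeout, i.e. how long we wait for the server's reply.
def search_server_http(host, port, read_timeout: 600)
  http = Net::HTTP.new(host, port)   # no connection is opened yet
  http.read_timeout = read_timeout   # seconds; the default is far shorter
  http
end
```

The same `read_timeout` knob exists on most Ruby HTTP-based clients, so the fix carries over to whatever XML-RPC client is in use.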

For anyone interested I have two postings on my site with a little
more detail on what I went through.

http://blog.nicholasstuart.com/

-Nick

On 2/8/06, Erik H. [email protected] wrote:

That was the kind of thing I’m looking for. I knew tomcat or any other

I’m as simple as they come. I gravitate away from complexity, and
Apache’s XML-RPC was the simplest and most performant way I could
glue the oh-so-incredibly-powerful Lucene into a Rails front-end.
Nothing wrong with being multi-lingual, I say. 对不对? (“Right?”)

No, nothing wrong with it at all. I still work on another internal
project that is Java/Swing-based, so I get to deal with the four-letter
words on a regular basis. What I did want to avoid was having to set up
a whole container app just to run my searches. That, and everything
that goes with it, seemed to be a bit too much heavy lifting for this.

Ok, so we’re still talking < 100 documents in the Ferret index total,
which I think is quite suitable for the pure-Ruby Ferret to handle.
I’m curious how it would fare with my 30k document set - ashamedly,
I’ve yet to actually try it. I sorta put Ferret on the back burner
due to time constraints and the need to move my project forward using
the Java-based indexing code I already had, and sadly hit the wall when
my Java-built index did not jibe with Ferret. No offense to Dave
whatsoever; in fact, I’m still in awe of the cleanliness and internal
elegance of Ferret and how it matches the Java code. Truly a work of
art.

I agree with the comments on Ferret. And as I said before, if I had a
reliable way to get the text out of these silly PDF files easily and in
a pure-Ruby fashion, then I would most certainly stay with it. I was
never concerned with any of the perceived slowness because of my
relatively low doc count, and Ferret has been great to get me started
so far!
