Forum: Ruby on Rails
So, this search thing...

Nick Stuart (Guest)
on 2006-02-07 21:29
(Received via mailing list)
I am using Ferret right now, and it works great for all my regular text
documents/information. My problem arises when I want to index/search
all of our assets (mostly PDF files). Currently, there is no way to READ
PDFs from Ruby. Because of this I have to resort to using Java to read
the PDFs and then Lucene to index them. My problem here is a couple of
things.
One, to index an asset I have to either fire up a complete new JVM for
each asset, or have the index rebuilt each night at a set time. Each way
has its own advantages/downfalls, but the biggest is that Ferret doesn't
like to talk to Lucene-created indexes :(    doh!
So, on to number two. Now I can go at this from a couple of angles. I
could create a Java web service to do the indexing and the searching and
then return the results. Or I could simply write a small utility program
(with Groovy perhaps?) that uses Java just to get the content of the PDF
files and use Ferret for everything. Or some combination of the two, or
something completely different.

I'm interested in what you folks out there have to say about this. I
would really, really like to avoid creating a whole web service just for
searching, but if that's the most viable way then I may go that route.

-Nick "searching for a clue" S
John Wells (Guest)
on 2006-02-07 21:41
(Received via mailing list)
On Tuesday, February 07, 2006, at 3:27 PM, Nick Stuart wrote:
>asset, or have the index rebuilt each night at a set time. Each way has
>its own advantages/downfalls, but the biggest is that Ferret doesn't like
>to talk to Lucene created indexes :(    doh!

I could swear that I had read Ferret would use Lucene's indices and vice
versa...I just can't dig up the link now. It's a true port, right?

Out of curiosity, what are you using to read the pdfs in Java?

Thanks!
John
John Wells (Guest)
on 2006-02-07 21:41
(Received via mailing list)
Also, is this a Linux or unix box? Have you considered running the pdfs
through pdf2txt and indexing the results?
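
A minimal sketch of that approach from the Ruby side, assuming a
command-line extractor such as Xpdf's pdftotext is installed and on the
PATH. The thread does not confirm which tool is meant by pdf2txt, and
extract_pdf_text and ExtractionError are hypothetical names used only
for illustration. The idea is to run the extractor as a child process
and hand the returned text to the indexer:

```ruby
require "open3"

# Raised when the external extractor exits with a non-zero status.
class ExtractionError < StandardError; end

# Runs an external PDF-to-text command and returns the extracted text.
# The default command is an assumption: pdftotext prints to stdout when
# the output file is given as "-". Pass a different command array to use
# another extractor that follows the same "<input> -" convention.
def extract_pdf_text(pdf_path, command: ["pdftotext"])
  # Array form avoids the shell, so odd characters in file names are safe.
  stdout, stderr, status = Open3.capture3(*command, pdf_path, "-")
  unless status.success?
    raise ExtractionError, "#{command.first} failed on #{pdf_path}: #{stderr.strip}"
  end
  stdout
end
```

The cost is one short-lived child process per file, which is lighter
than a JVM per asset but ties the setup to an extractor being available
on the deployment platform.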
Nick Stuart (Guest)
on 2006-02-07 21:44
(Received via mailing list)
It does read Lucene indexes to a point, and in fact worked for my
initial testing. However, once I started to really put content into the
indexes it ran into a problem with some encoding issues. David knows
about this and stated it was a non-trivial issue with how Lucene
produces its indexes, or how Ferret reads them (can't remember exactly).

For reading the PDFs I'm using PDFBox and it's worked great so far.

-Nick

Nick Stuart (Guest)
on 2006-02-07 21:59
(Received via mailing list)
This is unfortunately deployed on a Windows server. I haven't looked to
see if there is a tool like pdf2text for Windows, but would like to
keep this as platform agnostic as possible.

-Nick

Erik Hatcher (Guest)
on 2006-02-07 22:49
(Received via mailing list)
Another option is to create a Java-based index (and even search)
server.  There is one in incubation at Apache
(http://incubator.apache.org/projects/solr.html), which is distilled
from the system that drives CNET's amazing faceted search system.  This
option would allow you to use PDFBox (or the wrapper TextMining) to deal
with PDFs quite reliably.

I currently use a simple custom Java XML-RPC search server with my
Rails front-end, and it works quite well and more than acceptably
fast.  I index with a Java process as well.

And yeah, unfortunately there is a mismatch between Ferret and Lucene
indexes when certain characters get used.  This has been discussed in
great detail on the java-dev@lucene e-mail list:
http://marc.theaimsgroup.com/?l=lucene-user&m=112529400725721&w=2 - it
is an unfortunate situation that seems to result in either making Java
Lucene somewhat slower in order to be more standard, or making the ports
deal with a Java oddity in UTF.  The low-level specifics have thus far
gone over my head, but have been discussed by some very sharp folks in
that thread and others.  PyLucene does not suffer from this issue
because it is truly Java Lucene underneath (via GCJ and SWIG).

	Erik
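
A sketch of what a "Java oddity in UTF" can look like in practice, on
the assumption (not spelled out in this thread) that it refers to
Java's "modified UTF-8". That is the encoding used by
DataOutput.writeUTF: it works on UTF-16 code units, so U+0000 becomes
two bytes and a character outside the Basic Multilingual Plane becomes
two 3-byte sequences instead of one 4-byte sequence.
java_modified_utf8 is a hypothetical helper written for illustration;
it is not part of Ferret or Lucene.

```ruby
# Encodes +str+ the way Java's DataOutput.writeUTF encodes characters
# ("modified UTF-8"), leaving out writeUTF's 2-byte length prefix.
def java_modified_utf8(str)
  bytes = []
  # Java strings are UTF-16, so walk 16-bit code units, not code points.
  str.encode("UTF-16BE").unpack("n*").each do |unit|
    if unit.between?(0x0001, 0x007F)
      bytes << unit # plain ASCII, 1 byte
    elsif unit <= 0x07FF
      # 2 bytes; this branch also catches U+0000, which becomes C0 80.
      bytes << (0xC0 | (unit >> 6)) << (0x80 | (unit & 0x3F))
    else
      # 3 bytes per code unit, so a surrogate pair costs 6 bytes in total.
      bytes << (0xE0 | (unit >> 12)) <<
        (0x80 | ((unit >> 6) & 0x3F)) <<
        (0x80 | (unit & 0x3F))
    end
  end
  bytes.pack("C*")
end

clef = "\u{1D11E}" # MUSICAL SYMBOL G CLEF, outside the BMP
puts clef.bytesize                     # standard UTF-8: 4
puts java_modified_utf8(clef).bytesize # modified UTF-8: 6
```

ASCII and most everyday text encode identically either way, so a
mismatch of this kind would only surface once such characters reach the
index.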
Nick Stuart (Guest)
on 2006-02-08 02:11
(Received via mailing list)
On 2/7/06, Erik Hatcher <erik@ehatchersolutions.com> wrote:
>
> Another option is to create a Java-based index (and even search)
> server.  There is one in incubation at Apache
> (http://incubator.apache.org/projects/solr.html), which is distilled
> from the system that drives CNET's amazing faceted search system.  This
> option would allow you to use PDFBox (or the wrapper TextMining) to
> deal with PDF's quite reliably.


Will have to take a look into that; hadn't noticed it before.

> I currently use a simple custom Java XML-RPC search server with my
> Rails front-end, and it works quite well and more than acceptably
> fast.  I index with a Java process as well.


Ya, I think this is going to have to be the route I take. I have a
couple of questions though, if you don't mind sharing.
What RPC server do you use? I am currently thinking about using Axis,
but am not sure what else is out there.
And for building the indexes, do you have a task that just runs at a
scheduled time to rebuild the indexes? (Do you build them from scratch
or add onto them?)


> And yeah, unfortunately there is a mismatch with Ferret and Lucene
> indexes when certain characters get used.  This has been discussed in
> great detail on the java-dev@lucene e-mail list:
> http://marc.theaimsgroup.com/?l=lucene-user&m=112529400725721&w=2 - it
> is an unfortunate situation that seems to result in either making Java
> Lucene somewhat slower to be more standard or make the ports have to
> deal with a Java oddity in UTF.  The low-level specifics have thus
> far gone over my head, but have been discussed by some very sharp
> folks at that thread and others.  PyLucene does not suffer this issue
> because it is truly Java Lucene underneath (via GCJ and SWIG).


Ya, that was what I found. The detail of the problem is not really
something I'm interested in, or can even currently understand. It's
unfortunate too, because everything about Ferret has been working
great. The only reason I need Java at all at this point is to get stuff
out of PDFs.

>         Erik


Thanks for the info Erik, and all. Looks like I have some research to
do here.

-Nick
Erik Hatcher (Guest)
on 2006-02-08 14:44
(Received via mailing list)
On Feb 7, 2006, at 8:09 PM, Nick Stuart wrote:

> I currently use a simple custom Java XML-RPC search server with my
> Rails front-end, and it works quite well and more than acceptably
> fast.  I index with a Java process as well.
>
> Ya, I think is going to have to be the route I take. I have a
> couple of questions though if you dont mind sharing.
> What rpc server do you use? Am currently thinking about using axis,
> but am not sure what else is out there.

I use Apache's XML-RPC implementation: http://ws.apache.org/xmlrpc/

My code is essentially this:

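// Package names below are those of the Lucene 1.x and Apache XML-RPC 2.0
// releases current when this was posted; later versions moved some classes.
import java.io.IOException;
import java.util.Hashtable;
import java.util.Vector;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.xmlrpc.WebServer;

public class SearchServer {
   private IndexReader reader;
   private IndexSearcher searcher;

   // Open the index once and share the searcher across requests.
   public SearchServer(Directory directory) throws IOException {
     reader = IndexReader.open(directory);
     searcher = new IndexSearcher(reader);
   }

   // The method that XML-RPC clients call to run a search.
   public Hashtable search(Vector constraints, int start, int max)
       throws IOException, ParseException {
     // ...
   }

   // Usage: java SearchServer <indexPath> [port]
   public static void main(String[] args) {
     String indexPath = args[0];
     int port = 8076;
     if (args.length > 1) {
       port = Integer.parseInt(args[1]);
     }

     try {
       WebServer server = new WebServer(port);
       // "false" opens the existing index instead of creating a new one.
       server.addHandler("$default",
           new SearchServer(FSDirectory.getDirectory(indexPath, false)));
       server.start();
     } catch (Exception e) {
       System.err.println("SearchServer: " + e.toString());
       e.printStackTrace(System.err);
     }
   }
}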



> And for building the indexes, do you have a task that just runs on
> a scheduled time to rebuild the indexes? (do you build them from
> scratch or add onto them?)

My project deals with static data.  I build the index from scratch
once and that is it, so there is no updating of it on the fly.  When
the data changes, the entire index is rebuilt.

>
> Ya, that was what I found. The detail of the problem is not really
> something I'm interested in, or can even currently understand. Its
> unfortunate too, because everything about ferret has been working
> great. The only reason I need Java at all at this point is to get
> stuff out of PDF's.

With your use of Ferret, I'm curious what size of data, such as how
many documents, you're dealing with?  What does your index size end
up being?

Thanks,
	Erik
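
For a sense of what a Rails front-end would send to a server like this,
here is a rough sketch of an XML-RPC request body built by hand. The
methodCall layout follows the XML-RPC specification; the helper names
and the sample constraint string are hypothetical, since the post does
not show what the constraints vector contains, and the bare method name
"search" assumes the "$default" handler accepts calls without a prefix.
In practice an XML-RPC client library would generate this.

```ruby
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Escapes the characters that must not appear raw in XML text.
def xml_escape(text)
  text.gsub(/[&<>]/, XML_ESCAPES)
end

# Wraps a Ruby value in the matching XML-RPC <value> element.
# Only the types needed for this call are handled.
def xmlrpc_value(value)
  inner =
    case value
    when Integer then "<int>#{value}</int>"
    when String  then "<string>#{xml_escape(value)}</string>"
    when Array
      "<array><data>#{value.map { |v| xmlrpc_value(v) }.join}</data></array>"
    else
      raise ArgumentError, "unsupported XML-RPC type: #{value.class}"
    end
  "<value>#{inner}</value>"
end

# Builds a complete <methodCall> document for the given method and params.
def xmlrpc_request(method_name, *params)
  param_xml = params.map { |p| "<param>#{xmlrpc_value(p)}</param>" }.join
  %(<?xml version="1.0"?><methodCall>) +
    "<methodName>#{xml_escape(method_name)}</methodName>" \
    "<params>#{param_xml}</params></methodCall>"
end

# Hypothetical call: one constraint, results 0 through 9.
puts xmlrpc_request("search", ["title:ruby"], 0, 10)
```

A Ruby Array maps to the Vector parameter and the two Integers to the
int parameters, which is why the Java signature sticks to such simple
types.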
Colin (Guest)
on 2006-02-08 16:11
http://arton.no-ip.info/collabo/backyard/?RubyJavaBridge

Works like a charm on modern Linux distributions. On Windows you have
to (or at least I had to) fix (undo) the automatic iconv conversion
that's going on. Just make the check for whether or not iconv has been
installed always return false.
John Wells (Guest)
on 2006-02-08 16:49
(Received via mailing list)
On Tuesday, February 07, 2006, at 4:48 PM, Erik Hatcher wrote:
> PyLucene does not suffer this
>issue because it is truly Java Lucene underneath (via GCJ and SWIG).

Just out of curiosity, is there anything technically preventing Ruby
from going the same route (i.e., binding to a GCJ-compiled version
through SWIG)?
Nick Stuart (Guest)
on 2006-02-08 18:54
(Received via mailing list)
On 2/8/06, Erik Hatcher <erik@ehatchersolutions.com> wrote:
>
> > I currently use a simple custom Java XML-RPC search server with my
> > Rails front-end, and it works quite well and more than acceptably
> > fast.  I index with a Java process as well.
> >

That was the kind of thing I'm looking for. I knew Tomcat or any other
container was a bit overkill for this.

> I use Apache's XML-RPC implementation: http://ws.apache.org/xmlrpc/
>
> My code is essentially this:
>
> public class SearchServer {
>  ...
> }
>

Looks to be about the amount of code I want to write for this. I
already have the indexing part built, so hopefully I'll just be able
to add a few more RPC calls to add/delete/modify the indexes as
needed.

> > stuff out of PDF's.
>
> With your use of Ferret, I'm curious what size of data, such as how
> many documents, you're dealing with?  What does your index size end
> up being?
>

Currently I'm only indexing our page content, and with this I'm not
storing anything in the index besides an id field so I know the page
that the stuff is linked to. With that, at about 100 pages the index
size is only 252 KB, pretty damn small.

With that though, I know we have a couple hundred PDFs that are going
to be indexed as well, and again, I won't actually store anything
besides an id field so I can load it up when needed. I expect though
that this index won't be that sizable either.

Thanks for the info Erik!
Nick Stuart (Guest)
on 2006-02-08 18:54
(Received via mailing list)
I'll have to look into that Colin, thanks for pointing it out!
Erik Hatcher (Guest)
on 2006-02-08 22:43
(Received via mailing list)
On Feb 8, 2006, at 12:53 PM, Nick Stuart wrote:
>> I use Apache's XML-RPC implementation: http://ws.apache.org/xmlrpc/
> to add a few more rpc calls to add/delete/modify the indexes as
> needed.

It's one of those things that's pragmatically simple enough to solve
the problem at hand.  Why, of course I'd love to stay in Ruby all the
time, but there is absolutely nothing wrong with using the best of
all the computing tools at our disposal, even if it is that evil four
letter word Java :)

I'm as simple as they come.  I gravitate away from complexity.  And
Apache's XML-RPC was the simplest and most performant way I could
glue the oh so incredibly powerful Lucene into a Rails front-end.
Nothing wrong with being multi-lingual, I say.  对不对? ("Right?")

> Currently I'm only indexing our page content and with this I'm not
> storing anything in the index besides an id field so I know the page
> that the stuff is linked to. With that, at about 100 pages the index
> size is only 252kb, pretty damn small.
>
> With that though, I know we have a couple hundred PDFs that are going
> to be indexed as well, and again, wont actually store anything besides
> a id field so I can load it up when needed. I expect though that this
> index wont be that sizable either.

Ok, so we're still talking < 100 documents in the Ferret index total,
which is quite suitable, I think, for the pure Ruby Ferret to handle.
I'm curious how it would fare with my 30k document set - ashamedly
I've yet to try it actually.  I sorta put Ferret on the backburner
due to time constraints and needing to move my project forward using
the Java-based indexing code I already had, and hit the wall when my
Java-built index did not jibe with Ferret, sadly.  No offense to Dave
whatsoever; in fact I'm still in awe of the cleanliness and internal
elegance of Ferret and how it matches Java code-wise.  A work of art
truly.

More on this thread in a sec....

	Erik
Erik Hatcher (Guest)
on 2006-02-08 22:47
(Received via mailing list)
On Feb 8, 2006, at 10:47 AM, John Wells wrote:

>
> On Tuesday, February 07, 2006, at 4:48 PM, Erik Hatcher wrote:
>> PyLucene does not suffer this
>> issue because it is truly Java Lucene underneath (via GCJ and SWIG).
>
> Just out of curiosity, is there anything technically preventing Ruby
> from going the same route (i.e., binding to a GCJ-compiled version
> through SWIG)?

Nothing whatsoever and it's a project I keep dreaming of but never
making the time to do it myself.  I have, however, been tinkering
with PyLucene's build process recently in a half-hearted attempt to
do this myself.  I would definitely open source it under the ASL and
host it at lucene.apache.org.  PyLucene has expressed interest in
migrating to be a sibling of Java Lucene also.

The primary reason I think the GCJ/SWIG approach is the right way to
the most amazing search engine available, even considering commercial
products, is that Doug Cutting and the other information retrieval
experts are spending their time improving the Java codebase
constantly.  I personally want the latest greatest text search engine
features to be available exactly the same from both Java and Ruby,
with of course the added semantic (syntactic sugar you may call it)
goodies that Ruby offers.

Dave - go full steam with Ferret!   It rocks!   I am as much a
cheerleader of your efforts as anyone possibly could be.

	Erik
Nick Stuart (Guest)
on 2006-02-09 02:14
(Received via mailing list)
On 2/8/06, Erik Hatcher <erik@ehatchersolutions.com> wrote:
> > That was the kind of thing I'm looking for. I knew tomcat or any other
> >
>
> I'm as simple as they come.  I gravitate away from complexity.  And
> Apache's XML-RPC was the simplest and most performant way I could
> glue the oh so incredibly powerful Lucene into a Rails front-end.
> Nothing wrong with being multi-lingual, I say.  对不对? ("Right?")


No, nothing wrong with it at all. I still work on another internal
project that is Java/Swing based, so I get to deal with the four letter
words on a regular basis. What I did want to avoid was having to set up
a whole container app just to run my searches. That, and everything that
goes with it, seemed to be a bit too much heavy lifting for this.


> Ok, so we're still talking < 100 documents in the Ferret index total,
> which is quite suitable, I think for the pure Ruby Ferret to handle.
> I'm curious how it would fair with my 30k document set - ashamedly
> I've yet to try it actually.  I sorta put Ferret on the backburner
> due to time constraints and needing to move my project forward using
> the Java-based indexing code I already had, and hit the wall when my
> Java built index did not jive with Ferret sadly.  No offense to Dave
> whatsoever, in fact I'm still in awe of the cleanliness and internal
> elegance to Ferret and how it matches Java code-wise.  A work of art
> truly.


I agree with the comments on Ferret. And as I said before, if I had a
reliable way to get the text out of these silly PDF files easily and in
a pure Ruby fashion, then I would most certainly stay with it. I was
never concerned with any of the perceived slowness because of my
relatively low doc count, and Ferret has been great to get me started so
far!


> More on this thread in a sec....
Nick Stuart (Guest)
on 2006-02-11 18:30
(Received via mailing list)
Hey Erik, just wanted to say thanks! Got the search stuff up and
running following your suggestions. It works like a champ, and it is
pretty quick too. The longest part being re-indexing the assets
themselves. The Ruby XMLRPC client actually times out on that, so I
have to change that still, but the indexing and search work great.

For anyone interested I have two postings on my site with a little
more detail on what I went through.

http://blog.nicholasstuart.com/

-Nick
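
On the client timeout, a hedged sketch of posting a long-running request
with a generous read timeout. Ruby's XMLRPC::Client exposes a timeout
attribute of its own; this version uses Net::HTTP directly as a stand-in
so that it depends only on the standard library. post_xmlrpc, the
600-second default and the request body passed in are assumptions, not
anything taken from the posts above.

```ruby
require "net/http"

# Posts an already-built XML-RPC request body and returns the raw
# response body. read_timeout is how long to wait for the server to
# answer, which is what matters for a slow re-index call.
def post_xmlrpc(host, port, request_body, read_timeout: 600, path: "/")
  # nil as the third argument: talk to the server directly and ignore
  # any http_proxy environment variable.
  http = Net::HTTP.new(host, port, nil)
  http.open_timeout = 10            # connecting should still be quick
  http.read_timeout = read_timeout  # seconds to wait for the reply
  response = http.post(path, request_body, "Content-Type" => "text/xml")
  unless response.is_a?(Net::HTTPSuccess)
    raise "search server returned HTTP #{response.code}"
  end
  response.body
end
```

A call might look like post_xmlrpc("localhost", 8076, body), 8076 being
the default port in the server code earlier in the thread. A longer
timeout only hides the wait; having the server start the re-index in
the background and return straight away would be another option.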