[ANN] RDig - ferret-based website crawler/indexer

krahe · March 25, 2006, 2:34pm

Hi!

RDig is a small tool to build a Ferret index for the contents of a
website or intranet. It contains a simple HTTP crawler and some support
for extracting textual content from the fetched pages.
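
To give a rough idea of what "a simple HTTP crawler plus text extraction" involves, here is a minimal Ruby sketch. It is not RDig's actual code or API: the helper names (extract_text, extract_links, crawl), the fetch callable and the max_pages limit are all made up for illustration, and the regex-based HTML handling is only a stand-in for a real parser.

```ruby
require 'set'
require 'uri'

# Hypothetical helper (not part of RDig): drop script/style blocks and tags,
# then collapse whitespace so only indexable text remains.
def extract_text(html)
  html.gsub(/<(script|style)\b.*?<\/\1>/mi, ' ')
      .gsub(/<[^>]+>/, ' ')
      .gsub(/\s+/, ' ')
      .strip
end

# Hypothetical helper: resolve every <a href="..."> against the page URL.
# Fragments are removed so "page#top" and "page" count as the same document.
def extract_links(html, base_url)
  hrefs = html.scan(/<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/i).flatten
  hrefs.map do |href|
    begin
      uri = URI.join(base_url, href)
      uri.fragment = nil
      uri.to_s
    rescue URI::Error
      nil # skip hrefs that cannot be parsed
    end
  end.compact
end

# Breadth-first crawl restricted to the host of start_url.
# `fetch` is any callable that returns the HTML for a URL, or nil on failure.
# Returns a hash of url => extracted text, ready to hand to an indexer.
def crawl(start_url, fetch, max_pages: 50)
  host  = URI(start_url).host
  seen  = Set.new([start_url])
  queue = [start_url]
  docs  = {}

  until queue.empty? || docs.size >= max_pages
    url  = queue.shift
    html = fetch.call(url)
    next if html.nil?

    docs[url] = extract_text(html)
    extract_links(html, url).each do |link|
      next unless URI(link).host == host # stay on the same site
      queue << link if seen.add?(link)   # add? returns nil for known URLs
    end
  end
  docs
end
```

For a real site, fetch could be something like ->(url) { Net::HTTP.get(URI(url)) } (after require 'net/http'). Passing it in as a callable keeps the crawl logic testable without network access.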

I built this to implement a site-wide search for a recent project
that combined a Rails application with lots of static HTML files
generated by a CMS.

Any feedback is very welcome!

Rubyforge project page: http://rubyforge.org/projects/rdig
RDocs: http://rdig.rubyforge.org/

gem install rdig should work once the gem has reached the rubyforge
mirrors.

Jens

--
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

krahe · March 25, 2006, 4:30pm

Hi, Jens,

Great stuff. I just installed it and ran a short test as described in the
README. It works as announced. Thanks for sharing this! The crawler has
problems with frames, but that is a fairly common problem; I had to
point it at the main content frame.

You probably know Nutch already, but here is a pointer anyway, in case
you are looking for some inspiration: Apache Nutch. It is a great tool
for web crawling. I’ve used it and it worked great…

Best Regards
Jan P.