[ANN] RDig - ferret-based website crawler/indexer


#1

Hi!

RDig is a small tool to build a Ferret index for the contents of a
website or intranet. It contains a simple HTTP crawler and some support
for extracting textual content from the fetched pages.
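
To give a rough idea of what "crawling plus text extraction" involves,
here is a minimal sketch in plain Ruby. It is not RDig's actual code:
the helper names extract_text and extract_links are made up for
illustration, the regex-based HTML handling is a deliberate
simplification, and the sample page and example.com URL are placeholders.

```ruby
require 'uri'

# Hypothetical helper (not part of RDig): return the visible text of a page.
def extract_text(html)
  # Drop <script>/<style> elements together with their bodies.
  without_code = html.gsub(/<(script|style)\b.*?<\/\1>/mi, ' ')
  # Replace every remaining tag with a space so words do not run together.
  without_tags = without_code.gsub(/<[^>]+>/, ' ')
  # Collapse runs of whitespace into single spaces.
  without_tags.gsub(/\s+/, ' ').strip
end

# Hypothetical helper (not part of RDig): collect absolute link targets
# that a crawler could visit next. Pure fragment links ("#top") are skipped.
def extract_links(html, base_url)
  hrefs = html.scan(/<a\b[^>]*\bhref\s*=\s*["']([^"'#]+)/i).flatten
  hrefs.map { |href| URI.join(base_url, href).to_s }.uniq
end

page = '<html><head><title>Home</title><style>p { color: red }</style></head>' \
       '<body><p>Hello <b>world</b></p><a href="/about.html">About</a></body></html>'

text  = extract_text(page)
links = extract_links(page, 'http://example.com/index.html')
```

A real crawler would fetch each page over HTTP, add the extracted text
to the index, and queue the extracted links for the next round.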

I built this to implement a site-wide search for a recent project
that combined a Rails application with lots of static HTML files
generated by a CMS.

Any feedback is very welcome!

Rubyforge project page: http://rubyforge.org/projects/rdig
RDocs: http://rdig.rubyforge.org/

gem install rdig should work once the gem has reached the RubyForge
mirrors.

Jens


webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer removed_email_address@domain.invalid
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66


#2

Hi, Jens,

Great stuff. I just installed it and ran a short test as described in
the readme. It works as announced. Thanks for sharing this! The crawler
has problems with frames, but that is quite a common problem. I had to
point it at the main content frame.

You probably know Nutch already, but here is a pointer anyway, in case
you are looking for some inspiration: http://lucene.apache.org/nutch/
Nutch is a great tool for web crawling. I’ve used it and it worked
well…

Best Regards
Jan P.