[ANN] RDig - ferret-based website crawler/indexer

Hi!

RDig is a small tool to build a Ferret index for the contents of a
website or intranet. It contains a simple HTTP crawler and some support
for extracting textual content from the fetched pages.
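
To give a rough idea of what "crawl and extract text" means per page, here is a minimal sketch using only the Ruby standard library. It is not RDig's actual code or API: the helper names extract_text and extract_links are made up for illustration, and the regex-based HTML handling is a simplification that a real crawler would replace with a proper parser.

```ruby
require 'uri'

# Hypothetical helper (not part of RDig): reduce an HTML string to its
# visible text, with runs of whitespace collapsed to single spaces.
def extract_text(html)
  html.gsub(%r{<(script|style)\b.*?</\1>}mi, ' ') # drop script/style bodies
      .gsub(/<[^>]+>/, ' ')                        # drop remaining tags
      .gsub(/\s+/, ' ')
      .strip
end

# Hypothetical helper (not part of RDig): collect the absolute URLs of
# links that stay on the same host as base_url, so a crawl stays on-site.
def extract_links(html, base_url)
  base = URI.parse(base_url)
  hrefs = html.scan(/<a\s[^>]*href=["']([^"'#]+)/i).flatten
  uris = hrefs.map do |href|
    begin
      base.merge(href)   # resolve relative links against the page URL
    rescue URI::Error
      nil                # skip hrefs that are not valid URIs
    end
  end
  uris.compact.select { |uri| uri.host == base.host }.map(&:to_s).uniq
end
```

A crawler would fetch a page, hand the result of extract_text to the indexer, and queue whatever extract_links returns for the next round.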

I built this to implement site-wide search for a recent project
that combined a Rails application with lots of static HTML files
generated by a CMS.

Any feedback is very welcome!

RubyForge project page: http://rubyforge.org/projects/rdig
RDocs: http://rdig.rubyforge.org/

gem install rdig should work once the gem has reached the RubyForge
mirrors.

Jens


webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer [email protected]
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

Hi, Jens,

Great stuff. I just installed it and ran a short test as described in the
readme, and it works as announced. Thanks for sharing this! The crawler has
trouble with frames, but that is a common problem for crawlers. I had to
point it at the main content frame instead.

You probably know Nutch already, but here is a pointer anyway, in case
you’re looking for inspiration: http://lucene.apache.org/nutch/
Nutch is a great tool for web crawling. I’ve used it and it worked
well.

Best Regards
Jan P.
