Hi!

RDig is a small tool to build a Ferret index for the contents of a website or intranet. It contains a simple HTTP crawler and some support for extracting textual content from the fetched pages.

I built this to implement a site-wide search for a recent project that combined a Rails application with lots of static html files generated by a CMS.

Any feedback is very welcome!

Rubyforge project page: http://rubyforge.org/projects/rdig
RDocs: http://rdig.rubyforge.org/

`gem install rdig` should work once the gem has reached the rubyforge mirrors.

Jens

--
webit! Gesellschaft für neue Medien mbH
www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer
firstname.lastname@example.org
Schnorrstraße 76, D-01069 Dresden
Tel +49 351 46766 0
Fax +49 351 46766 66
on 2006-03-25 15:34
on 2006-03-25 17:30
Hi Jens,

great stuff. I just installed it and ran a short test as described in the readme. It works as announced. Thanks for sharing this!

The crawler has problems with frames, but that is quite a common problem. I had to point it at the main content frame.

You probably know Nutch already, but here is a pointer anyway, in case you're looking for some inspiration: http://lucene.apache.org/nutch/

Nutch is a great tool for web crawling. I've used it and it worked great...

Best regards
Jan P.
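
On the frames problem mentioned above: a crawler that only follows <a href> links never reaches the pages a frameset points to. The following is a minimal sketch of one way to treat frame sources as links as well. It is not RDig's code, and nothing here is taken from its API: the helper name `extract_links` and the example.com URLs are made up for illustration, and the regex is a shortcut that a real HTML parser would replace.

```ruby
require 'uri'

# Hypothetical helper (not part of RDig): collect the URLs a crawler could
# follow from one HTML page, treating frame/iframe sources like links.
def extract_links(html, base_url)
  # Matches <a ... href="...">, <frame ... src="..."> and <iframe ... src="...">.
  # A regex is only an approximation of HTML parsing; good enough for a sketch.
  pattern = /<(?:a\b[^>]*?\bhref|i?frame\b[^>]*?\bsrc)\s*=\s*["']([^"']+)["']/i

  html.scan(pattern).flatten.map do |ref|
    begin
      # Resolve relative references against the URL of the page itself.
      URI.join(base_url, ref).to_s
    rescue URI::Error
      nil # skip references that cannot be parsed as URIs
    end
  end.compact.uniq
end

# Made-up frameset page, similar in shape to a framed site.
frameset = <<~HTML
  <frameset cols="20%,80%">
    <frame src="nav.html">
    <frame src="/content/main.html">
  </frameset>
  <a href="about.html">About</a>
HTML

links = extract_links(frameset, 'http://www.example.com/index.html')
puts links
```

With something like this, the frameset page itself yields the navigation and content frames as crawl targets, so the start URL would not have to be pointed at the content frame by hand.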