OT: Scraper library recommendation


#1

Hi all,

this is quite off-topic, but I’m sure a lot of people here have
experience in the area, so…

I’m writing a website scraper script that needs to download a web page,
traverse the (X)HTML tree and finally insert data and HTML pieces into
a DB. Eventually this data will be served up as RSS and/or Atom.
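
For illustration only, the traversal step described above could be sketched with REXML from Ruby's standard library. This assumes the page is well-formed XHTML, since REXML is a strict XML parser and will reject tag soup. The sample markup, the "item" class and the extract_items helper are all hypothetical, and the download and DB steps are left out.

```ruby
require 'rexml/document'

# Hypothetical sample page. A real script would download it first,
# for example with Net::HTTP.get(URI(url)) after require 'net/http'.
SAMPLE_XHTML = <<~XHTML
  <html>
    <head><title>News</title><link rel="stylesheet" href="s.css" /></head>
    <body>
      <div class="item"><h2>First</h2><p>Hello <b>world</b></p></div>
      <div class="item"><h2>Second</h2><p>Bye</p></div>
    </body>
  </html>
XHTML

# Walk the element tree and collect a title plus an HTML fragment
# for every <div class="item"> (an assumed page layout).
def extract_items(xhtml)
  doc = REXML::Document.new(xhtml)
  items = []
  doc.root.each_recursive do |el|
    next unless el.name == 'div' && el.attributes['class'] == 'item'
    title = el.elements.find { |child| child.name == 'h2' }
    body  = el.elements.find { |child| child.name == 'p' }
    # to_s serializes the element back to markup, ready to store in a DB.
    items << { title: title&.text, html: body.to_s }
  end
  items
end

extract_items(SAMPLE_XHTML).each do |item|
  puts "#{item[:title]}: #{item[:html]}"
end
```

Each hash returned by extract_items holds the plain-text title and the serialized markup of the paragraph, which is roughly the "data and HTML pieces" that would go into the DB.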

I’m currently using html/tree (htmltools); I’ve also tried Rubyful Soup;
both have their own shortcomings. What do you people suggest?

Regarding htmltools: I had to tweak it quite a bit, as it wouldn’t recognize
XHTML-style “empty” tags (for instance, it dislikes <link … />).
What’s even worse, I can’t seem to get it to dump back the HTML it read.
Something as simple as:

#!/usr/bin/env ruby

require 'html/tree'

p = HTMLTree::Parser.new(false, false)
p.feed("")
p.tree.dump

Results in:

Rubyful Soup is not perfect either, quite often spewing things like
<img img="" …; OTOH, it groks XHTML. But it’s much, much slower…

What do you think? Any pointers, suggestions, etc. are very welcome!

Bye,
Andrea


#2

On a related topic…

I’ve been thinking about writing a script that would scrape RDoc HTML
files and then insert descriptions from the code into a table.

The specific reason for this was to provide automagic population of the
privilege description fields in the ‘user_engine’.

I suspect there may be other applications for this as well.
A good HTML scraper library would really help out with this.

_Kevin