Hi all,
this is quite off-topic, but I’m sure a lot of people here have experience
in the area, so…
I’m writing a website scraper script that needs to download a web page,
traverse the (X)HTML tree and finally insert data and HTML pieces into
a DB. Eventually this data will be served up as RSS and/or Atom.
I’m currently using html/tree (htmltools); I’ve also tried Rubyful Soup;
both have their own shortcomings. What do you people suggest?
Regarding htmltools: I had to tweak it quite a bit, as it wouldn’t recognize
XHTML-style “empty” tags (for instance, it dislikes <link … />).
What’s even worse, I can’t seem to get it to dump back the HTML it read.
Something as simple as:
#!/usr/bin/env ruby
require 'html/tree'
p = HTMLTree::Parser.new(false, false)
p.feed("")
p.tree.dump
Results in:
Rubyful Soup is not perfect either: quite often it spews things like
<img img="" …; OTOH, it groks XHTML. But it’s much, much slower…
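To illustrate the empty-tag problem with htmltools described above, here is a minimal sketch of a pre-processing step that rewrites XHTML-style empty tags (<link … />) into plain HTML form before the markup is fed to a parser that rejects them. The helper name normalize_empty_tags and the VOID_ELEMENTS list are hypothetical, made up for this example; they are not part of htmltools or Rubyful Soup, and this is not the tweak mentioned earlier.

```ruby
# Hypothetical helper (not part of htmltools): rewrite XHTML-style empty
# tags such as <link ... /> into plain HTML form (<link ...>).
# VOID_ELEMENTS is an illustrative list of tags that never have content.
VOID_ELEMENTS = %w[area base br col hr img input link meta param].freeze

def normalize_empty_tags(html)
  names = VOID_ELEMENTS.join('|')
  # Match "<name", any attributes, optional whitespace, then "/>".
  # Only the listed tag names are touched; everything else is left alone.
  html.gsub(%r{<(#{names})\b([^<>]*?)\s*/>}i) { "<#{$1}#{$2}>" }
end

puts normalize_empty_tags('<link rel="stylesheet" href="a.css" /><br/>')
# prints: <link rel="stylesheet" href="a.css"><br>
```

This is a regex-based workaround, so it assumes attribute values contain no literal < or > characters; a page that breaks that assumption would need a real tokenizer instead.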
What do you think? Any pointers, suggestions, etc. are very welcome!
Bye,
Andrea