Forum: Ruby on Rails OT: Scraper library recommendation

Andrea C. (Guest)
on 2006-01-10 20:00
(Received via mailing list)
Hi all,

this is quite off-topic, but I'm sure a lot of people here have
experience in the area, so...

I'm writing a website scraper script that needs to download a web page,
traverse the (X)HTML tree and finally insert data and HTML pieces into
a DB. Eventually this data will be served up as RSS and/or Atom.
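
A minimal sketch of the traversal step described above, under loud assumptions: it uses REXML (bundled with Ruby) rather than html/tree or Rubyful Soup, it only works on well-formed XHTML without a default namespace, and the sample page, the `item` class and the `extract_items` helper are all hypothetical. Downloading (for example via Net::HTTP) and the DB insert are left out; each row is simply what would be written to the DB.

```ruby
require 'rexml/document'

# Hypothetical sample page; a real script would download this instead.
PAGE = <<~XHTML
  <html>
    <head><title>Sample</title><link rel="stylesheet" href="style.css" /></head>
    <body>
      <div class="item"><h2>First</h2><p>Hello <b>world</b></p></div>
      <div class="item"><h2>Second</h2><p>More text</p></div>
    </body>
  </html>
XHTML

# Walk the tree and build one row per <div class="item">.
def extract_items(xhtml)
  doc = REXML::Document.new(xhtml)
  rows = []
  doc.elements.each("//div[@class='item']") do |div|
    title = div.elements['h2'].text   # plain text of the heading
    html  = div.elements['p'].to_s    # keep the HTML piece as markup
    rows << { title: title, html: html }
  end
  rows
end

ITEMS = extract_items(PAGE)
ITEMS.each { |row| puts "#{row[:title]}: #{row[:html]}" }
```

Note that REXML raises a parse error on ordinary tag-soup HTML, so this only illustrates the shape of the script, not a general answer to the library question.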

I'm currently using html/tree (htmltools); I've also tried Rubyful Soup;
both have their own shortcomings. What do you people suggest?

Regarding htmltools: I had to tweak it quite a bit, as it wouldn't
recognize XHTML-style "empty" tags (for instance, it dislikes <link ... />).
What's even worse, I can't seem to get it to dump back the HTML it read.
Something as simple as:

#!/usr/bin/env ruby

require 'html/tree'

p = HTMLTree::Parser.new(false, false)
p.feed("<a href='about:blank'><img src='blah' /></a>")
p.tree.dump

Results in:

  <a href="about:blank">
    <img src="blah">
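
For comparison, a minimal sketch of the kind of round trip being asked for, where the closing tags and the empty-element form survive. It swaps in REXML (bundled with Ruby) purely as an illustration and says nothing about the htmltools API; REXML only accepts well-formed XML, and the exact quoting and spacing of its output may differ from the input.

```ruby
require 'rexml/document'

# Same snippet as in the html/tree example above.
SNIPPET = "<a href='about:blank'><img src='blah' /></a>"

# Parse the snippet, then serialize the whole tree back to a string.
doc = REXML::Document.new(SNIPPET)
ROUND_TRIP = doc.to_s

puts ROUND_TRIP
```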


Rubyful Soup is not perfect either: it quite often spews things like
<img img="" ...; OTOH, it groks XHTML, but it's much, much slower...


What do you think? Any pointers, suggestions, etc. are very welcome!

Bye,
	Andrea
Kevin O. (Guest)
on 2006-01-11 02:40
On a related topic....

I've been thinking about writing a script that would scrape RDoc HTML
files and then insert descriptions from the code into a table.

The specific reason for this was to provide automagic population of the
privilege description fields in the 'user_engine'.

I suspect there may be other applications for this as well.
A good HTML scraper library would really help out with this.


_Kevin