Ruby (X)HTML Parser?

Hi guys,

I’m starting to learn Ruby and I was thinking about a little app so I
can get things started as quickly as possible. Since I’m an avid blog
reader, the first thing that went through my mind was a small app that
would extract the RSS or Atom feed from a web page, given the URL.

My first choice was regexps, but I’m thinking that my little app may
grow a little bit more in the not-so-distant future and I might be doing
more than just extracting feeds.

I found:

but they don’t look really standard, and RAA doesn’t look like it’s
currently maintained. I’ve also heard that there’s a Rails HTML parser,
but I couldn’t find more info (and I’ll prolly ask on one of the Rails
lists).

Is there a more “standard” way to parse HTML pages in Ruby?

Thanks,

Andrei

There’s Hpricot. Haven’t used it myself though.

http://code.whytheluckystiff.net/hpricot/

Jordan E. wrote:

There’s Hpricot. Haven’t used it myself though.

http://code.whytheluckystiff.net/hpricot/

Hpricot is really nice. Also, there is the standard REXML (built-in
since 1.8). See the tutorial for some ideas on how to use it:
http://www.germane-software.com/software/rexml/docs/tutorial.html

Regards,
Jordan
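
For the original feed-extraction question, here is a rough sketch of how
REXML could pull feed links out of a page. It assumes the page is already
well-formed XHTML (REXML is an XML parser and will raise on tag soup), and
that feeds are advertised via link elements with rel="alternate" and an
RSS or Atom MIME type. The names feed_links, FEED_TYPES and SAMPLE_PAGE,
and the example.com URLs, are made up for illustration.

```ruby
require 'rexml/document'

# MIME types that usually mark a feed in a <link> element (assumption:
# the page follows the common feed autodiscovery convention).
FEED_TYPES = ['application/rss+xml', 'application/atom+xml'].freeze

# Hypothetical sample page; a real app would fetch the document first.
SAMPLE_PAGE = <<~XHTML
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Example blog</title>
      <link rel="stylesheet" type="text/css" href="/style.css" />
      <link rel="alternate" type="application/atom+xml" href="http://example.com/atom.xml" />
      <link rel="alternate" type="application/rss+xml" href="http://example.com/rss.xml" />
    </head>
    <body><p>Hello</p></body>
  </html>
XHTML

# Returns the href of every <link> that looks like a feed reference.
def feed_links(xhtml)
  doc = REXML::Document.new(xhtml)
  links = []
  # Walk every element below the root and compare local names, so the
  # XHTML default namespace does not get in the way.
  doc.root.each_recursive do |el|
    next unless el.name == 'link'
    rel  = el.attributes['rel'].to_s.downcase.split
    type = el.attributes['type'].to_s.downcase
    next unless rel.include?('alternate') && FEED_TYPES.include?(type)
    href = el.attributes['href']
    links << href if href
  end
  links
end

puts feed_links(SAMPLE_PAGE)
```

Real-world HTML is often not well-formed, so in practice the page would
probably need cleaning up first, or a more forgiving parser.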

Andrei M. wrote:

I found:

but they don’t look really standard, and RAA doesn’t look like it’s currently
maintained. I’ve also heard that there’s a Rails HTML parser, but I couldn’t
find more info (and I’ll prolly ask on one of the Rails lists).

Is there a more “standard” way to parse HTML pages in Ruby?

The closest you’ll find to a standard is REXML, which is an XML parser
that ships in the stdlib. You’ll want to throw your HTML through Tidy
first, though - but that’s an easy install.

There are a couple of alternatives: Hpricot and html-parser spring
instantly to mind.

If you’re doing feed parsing, you probably also want to check out
feedtools.

On Mon, Sep 25, 2006 at 11:01:18PM +0900, Jordan E. wrote:

There’s Hpricot. Haven’t used it myself though.

http://code.whytheluckystiff.net/hpricot/

If you decide to use Hpricot, I’d recommend the latest 0.4.52 gems:

gem install hpricot --source code.whytheluckystiff.net

There’s been a good deal of patching over the past week and a new
release is very close.

_why

Since I’m an avid blog reader,
the first thing that went through my mind was a small app that would extract
the RSS or Atom feed from a web page, given the URL.

If you’re doing feed parsing, you probably also want to check out feedtools.

Well… he probably won’t learn much from the FeedTools code, but it is
convenient for this sort of thing:

irb(main):001:0> $KCODE = 'u'
=> "u"
irb(main):002:0> require 'feed_tools'
=> true
irb(main):003:0> feed = FeedTools::Feed.open('http://intertwingly.net/')
=> #<FeedTools::Feed:0x135d8fe
URL:Sam Ruby>
irb(main):004:0> feed.title
=> "Sam Ruby"
irb(main):005:0> feed.subtitle
=> "It's just data"

Cheers,
Bob A.
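
If FeedTools isn't an option, a similar title lookup can be sketched with
the rss library that is distributed with Ruby. This is a different library
from the one in the session above, not a drop-in replacement. The sample
feed, the example.com URLs and the feed_summary helper are invented for
illustration, and validation is switched off so that slightly sloppy feeds
still parse.

```ruby
require 'rss'

# Hypothetical RSS 2.0 document; a real app would download the feed.
SAMPLE_FEED = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Example Blog</title>
      <link>http://example.com/</link>
      <description>Just a sample</description>
      <item>
        <title>First post</title>
        <link>http://example.com/first</link>
      </item>
    </channel>
  </rss>
XML

# Parses an RSS 2.0 string and returns a few fields as a hash.
# The second argument (false) disables strict validation.
def feed_summary(xml)
  feed = RSS::Parser.parse(xml, false)
  {
    title: feed.channel.title,
    description: feed.channel.description,
    items: feed.channel.items.map(&:title)
  }
end

summary = feed_summary(SAMPLE_FEED)
puts summary[:title]
puts summary[:description]
```

Atom feeds come back as a different object with different accessors, so
this sketch only covers the RSS 2.0 case.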