Ruby (X)HTML Parser?

andreim · September 25, 2006, 2:22pm

Hi guys,

I’m starting to learn Ruby and I was thinking about a little app so I can
get things started as quickly as possible. Since I’m an avid blog reader,
the first thing that went through my mind was a small app that would extract
the RSS or Atom feed from a web page, given the URL.

My first choice was regexps, but I’m thinking that my little app might grow a
little bit more in the not-so-distant future and I might be doing more than
just extracting feeds.
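
For comparison, the regexp route can be sketched roughly as below. This is only an illustration, not code from the thread: `feed_hrefs` and the sample page are made-up names, and the patterns assume quoted attribute values and no `>` inside an attribute, which is exactly the kind of assumption that breaks on real-world pages.

```ruby
# Hypothetical helper: scan raw HTML for <link> tags that advertise a feed.
# Assumes quoted attribute values and no '>' characters inside attributes.
FEED_TYPE = %r{application/(?:rss|atom)\+xml}i

def feed_hrefs(html)
  html.scan(/<link\b[^>]*>/i)                        # every <link ...> tag
      .select { |tag| tag =~ FEED_TYPE }             # keep RSS/Atom ones
      .map { |tag| tag[/href\s*=\s*["']([^"']+)["']/i, 1] }
      .compact                                       # drop tags without href
end

# Made-up sample page standing in for a fetched document.
sample_html = <<~PAGE
  <html><head>
  <LINK rel="stylesheet" type="text/css" href="/style.css">
  <link rel="alternate" type="application/rss+xml" href="/index.rss">
  <link rel='alternate' type='application/atom+xml' href='/index.atom'>
  </head><body></body></html>
PAGE

puts feed_hrefs(sample_html)
```

It handles tidy markup well enough, but comments, unquoted attributes, or scripts containing `<link` will confuse it, which is the usual argument for a real parser.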

I found:

ymHTML at http://www.yoshidam.net/Ruby.html
RAA at http://raa.ruby-lang.org/project/html-parser-2/

but they don’t look really standard, and RAA doesn’t look like it’s currently
maintained. I’ve also heard that there’s a Rails HTML parser but I couldn’t
find more info (and I’ll probably ask on one of the Rails lists).

Is there a more “standard” way to parse HTML pages in Ruby?

Thanks,

Andrei

andreim · September 25, 2006, 4:02pm

There’s Hpricot. Haven’t used it myself though.

http://code.whytheluckystiff.net/hpricot/

andreim · September 25, 2006, 4:50pm

Jordan E. wrote:

There’s Hpricot. Haven’t used it myself though.

http://code.whytheluckystiff.net/hpricot/

Hpricot is really nice. Also, there is the standard REXML (built-in
since 1.8). See the tutorial for some ideas how to use it:
http://www.germane-software.com/software/rexml/docs/tutorial.html
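
As a rough idea of what the REXML route looks like for feed discovery, here is a minimal sketch. It assumes the page is already well-formed XHTML (REXML will raise on tag soup), and `feed_links` and `SAMPLE_XHTML` are made-up names for illustration.

```ruby
require 'rexml/document'

# Made-up sample page; a real one would be fetched with Net::HTTP or open-uri.
SAMPLE_XHTML = <<~PAGE
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Example blog</title>
      <link rel="stylesheet" type="text/css" href="/style.css" />
      <link rel="alternate" type="application/atom+xml" href="/index.atom" />
      <link rel="alternate" type="application/rss+xml" href="/index.rss" />
    </head>
    <body><p>Hello</p></body>
  </html>
PAGE

FEED_TYPES = %w[application/rss+xml application/atom+xml].freeze

# Hypothetical helper: return the href of every feed <link> in the document.
def feed_links(xhtml)
  doc = REXML::Document.new(xhtml)
  links = []
  # Walk every element and compare the local name in Ruby, so the
  # XHTML default namespace does not get in the way of the match.
  REXML::XPath.each(doc, '//*') do |el|
    next unless el.name == 'link'
    next unless el.attributes['rel'] == 'alternate'
    next unless FEED_TYPES.include?(el.attributes['type'])
    links << el.attributes['href']
  end
  links
end

puts feed_links(SAMPLE_XHTML)
```

Pages that are not valid XML need cleaning up first, or a more forgiving parser.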

Regards,
Jordan

andreim · September 25, 2006, 5:14pm

Andrei M. wrote:

I found:

ymHTML at http://www.yoshidam.net/Ruby.html

RAA at http://raa.ruby-lang.org/project/html-parser-2/

but they don’t look really standard, and RAA doesn’t look like it’s currently
maintained. I’ve also heard that there’s a Rails HTML parser but I couldn’t
find more info (and I’ll probably ask on one of the Rails lists).

Is there a more “standard” way to parse HTML pages in Ruby?

The closest you’ll find to a standard is REXML, which is an XML parser
that ships in the stdlib. You’ll want to throw your HTML through Tidy
first, though - but that’s an easy install.

There are a couple of alternatives: Hpricot and html-parser spring
instantly to mind.

If you’re doing feed parsing, you probably also want to check out
feedtools.

andreim · September 25, 2006, 6:59pm

On Mon, Sep 25, 2006 at 11:01:18PM +0900, Jordan E. wrote:

There’s Hpricot. Haven’t used it myself though.

http://code.whytheluckystiff.net/hpricot/

If you decide to use Hpricot, I’d recommend the latest 0.4.52 gems:

gem install hpricot --source code.whytheluckystiff.net

There’s been a good deal of patching over the past week and a new release is
very close.

_why

andreim · September 25, 2006, 7:51pm

Since I’m an avid blog reader,
the first thing that went through my mind was a small app that would extract
the RSS or Atom feed from a web page, given the URL.

If you’re doing feed parsing, you probably also want to check out feedtools.

Well… he probably won’t learn much from the FeedTools code, but it is
convenient for this sort of thing:

irb(main):001:0> $KCODE = 'u'
=> "u"
irb(main):002:0> require 'feed_tools'
=> true
irb(main):003:0> feed = FeedTools::Feed.open('http://intertwingly.net/')
=> #<FeedTools::Feed:0x135d8fe
URL:Sam Ruby>
irb(main):004:0> feed.title
=> "Sam Ruby"
irb(main):005:0> feed.subtitle
=> "It's just data"

Cheers,
Bob A.