Hpricot 0.1 -- quick, cinchy HTML parsing

Associates worldwide, here is some unpolished software for you. You’ll
need a compiler and the `iconv` library installed to install this one.

gem install hpricot --source code.whytheluckystiff.net

Hpricot is a fast HTML parser, based on HTree. I converted the HTree
scanner to C and I’m just now reworking the parser. I’ve also started
adding a bunch of nice methods to HTree so that you won’t have any
desire to use REXML objects instead.

doc = File.open(path) { |f| Hpricot.parse(f) }

supports xpath

doc.search("//p/a").set("href", "http://google.com")

supports css selectors

doc.search("#menu .box").each { |ele| p ele }

slash is a shortcut

(doc/"#menu box").each …

symbols also imply css selectors of tag names

(doc/:p/:a).set("href", "http://google.com")
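
For anyone wondering how the slash and symbol forms could hang together, here is a rough sketch in plain Ruby. It is not Hpricot's code: `ToyElement` and `ToyElements` are made-up stand-ins that only match tag names, assuming `/` simply hands its argument to `search` and a Symbol is read as a tag name.

```ruby
# Toy stand-ins, not Hpricot's real classes: just enough to show how `/`
# can delegate to `search` and how a Symbol can stand for a tag name.
class ToyElement
  attr_reader :name, :attributes, :children

  def initialize(name, attributes = {}, children = [])
    @name = name
    @attributes = attributes
    @children = children
  end

  # Return every descendant whose tag name matches (tag names only,
  # no real CSS or XPath parsing here).
  def search(selector)
    tag = selector.to_s # a Symbol such as :p becomes "p"
    found = ToyElements.new
    children.each do |child|
      found << child if child.name == tag
      found.concat(child.search(tag))
    end
    found
  end

  # `doc / :p` is just another spelling of `doc.search(:p)`.
  def /(selector)
    search(selector)
  end
end

# A list of elements that can be searched and updated as a group.
class ToyElements < Array
  def search(selector)
    each_with_object(ToyElements.new) do |ele, found|
      found.concat(ele.search(selector))
    end
  end

  def /(selector)
    search(selector)
  end

  # Set one attribute on every element in the list.
  def set(key, value)
    each { |ele| ele.attributes[key] = value }
    self
  end
end

link = ToyElement.new("a", { "href" => "http://example.com" })
para = ToyElement.new("p", {}, [link])
doc  = ToyElement.new("html", {}, [para])

(doc/:p/:a).set("href", "http://google.com")
puts link.attributes["href"] # prints http://google.com
```

Because each `/` returns a searchable list, the calls chain left to right, which is what lets `(doc/:p/:a)` read like a path.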

The Hpricot scanner uses Ragel (the same state machine compiler used by
Mongrel) and is able to whip through hundreds of HTML documents in a
second. (I’m benchmarking against the sizeable Boing Boing home page,
Slashdot, and others.) However, this release still includes some of
HTree’s existing code, which slows things down quite a bit and will be
phased out over the next few releases.

Anyway, I have high hopes for this little guy. Please don’t forget to
say the name right. It’s H-pricot. Like: AYYCHH-pricot.

Subversion is here: http://code.whytheluckystiff.net/svn/hpricot/trunk.

Gracias, mi rubistos!

_why

On Jul 3, 2006, at 10:28 PM, why the lucky stiff wrote:

Very nice. Thanks, _why.

-Ezra

On Tue, Jul 04, 2006 at 02:28:45PM +0900, why the lucky stiff wrote:

gem install hpricot --source code.whytheluckystiff.net

Okay, 0.2 is out. The above is a lifetime prescription.

If you’d rather not install Hpricot, but want to play with it, try out
the balloon. You can review it at http://balloon.hobix.com/hpricot,
then run it with:

ruby -ropen-uri -e 'eval(open("http://balloon.hobix.com/hpricot").read)'

_why

On Sat, Jul 08, 2006 at 10:46:00PM +0900, Ron M wrote:

Hpricot chokes on this when I try it.

require 'rubygems'
require 'hpricot'
require 'open-uri'
Hpricot(open('http://www.pcmag.com/article2/0,1759,1765785,00.asp')).to_html

Wonderful, thank you. This is fixed in trunk now. Have a good time.

_why

why the lucky stiff wrote:

On Tue, Jul 04, 2006 at 02:28:45PM +0900, why the lucky stiff wrote:

gem install hpricot --source code.whytheluckystiff.net

Okay, 0.2 is out…

Hpricot chokes on this when I try it.

require 'rubygems'
require 'hpricot'
require 'open-uri'
Hpricot(open('http://www.pcmag.com/article2/0,1759,1765785,00.asp')).to_html

If I understand correctly, some methods (to_html, at least) in the
version of Hpricot I have (whatever was there yesterday morning) don’t
like ugly “html” like “” where apparently attribute values end up
being nil.

I can work around it by putting
aval ||= ''
in STag in tag.rb
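
To show what that one-line guard buys, here is a small sketch in plain Ruby. It is not the actual STag code from tag.rb: `render_start_tag` is a hypothetical helper, and the assumption is only that the serializer calls a String method on each attribute value, which blows up when the value is nil.

```ruby
# Hypothetical helper, not Hpricot's actual STag code: renders a start tag
# from a tag name and an attribute hash whose values may be nil.
def render_start_tag(name, attributes)
  rendered = attributes.map do |key, aval|
    aval ||= '' # the guard: treat a missing value as an empty string
    # Without the guard, nil.gsub would raise NoMethodError here.
    %( #{key}="#{aval.gsub('"', '&quot;')}")
  end
  "<#{name}#{rendered.join}>"
end

puts render_start_tag('input', { 'disabled' => nil, 'name' => 'q' })
# prints <input disabled="" name="q">
```

The guard leaves real values alone, since `||=` only assigns when the left side is nil or false.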