Associates worldwide, here is some unpolished software for you. To install
this one, you'll need a compiler and the iconv library already installed.
gem install hpricot --source code.whytheluckystiff.net
Hpricot is a fast HTML parser, based on HTree. I converted the HTree scanner
to C and I’m just now reworking the parser. I’ve also started adding a bunch
of nice methods to HTree so that you won’t have any desire to use REXML
objects instead.
doc = File.open(path) { |f| Hpricot.parse(f) }
# supports xpath
doc.search("//p/a").set("href", "http://google.com")
# supports css selectors
doc.search("#menu .box").each { |ele| p ele }
# slash is a shortcut
(doc/"#menu box").each …
# symbols also imply css selectors of tag names
(doc/:p/:a).set("href", "http://google.com")
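For anyone curious how a slash shortcut like this can hang together in plain Ruby, here is a toy sketch. It is not Hpricot's code: ToyNode, ToyElements and the tag-name-only matching are hypothetical stand-ins, and a real selector engine does far more. It only shows that `/` can be an ordinary method that forwards to `search`, and that a symbol can be turned into a tag name with `to_s`.

```ruby
# Toy illustration only; ToyNode and ToyElements are hypothetical names,
# not part of Hpricot.
ToyNode = Struct.new(:tag, :attrs, :children)

class ToyElements
  include Enumerable

  def initialize(nodes)
    @nodes = nodes
  end

  def each(&block)
    @nodes.each(&block)
  end

  # Collect every descendant whose tag equals the selector.
  # Only bare tag names are understood here (assumption for the sketch).
  def search(selector)
    name = selector.to_s # a symbol like :p becomes the tag name "p"
    found = []
    walk = lambda do |node|
      node.children.each do |child|
        found << child if child.tag == name
        walk.call(child)
      end
    end
    @nodes.each { |node| walk.call(node) }
    ToyElements.new(found)
  end

  # Slash is just another spelling of search, so calls chain left to right.
  def /(selector)
    search(selector)
  end

  # Set an attribute on every matched node; return self so calls can chain.
  def set(name, value)
    each { |node| node.attrs[name] = value }
    self
  end
end

inner_link = ToyNode.new("a", { "href" => "/old" }, [])
outer_link = ToyNode.new("a", {}, [])
para       = ToyNode.new("p", {}, [inner_link])
root       = ToyNode.new("html", {}, [para, outer_link])
doc        = ToyElements.new([root])

links = (doc / :p / :a).set("href", "http://google.com")
```

Only the link inside the paragraph is touched, because the second `/` searches within the results of the first.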
The Hpricot scanner uses Ragel (the same state machine compiler used by
Mongrel) and is able to whip through hundreds of HTML documents in a second.
(I’m benchmarking against the sizeable Boing Boing home page, Slashdot, and
others.) However, this release still includes some of HTree’s existing code,
which slows things down quite a bit and will be phased out over the next few
releases.
Anyway, I have high hopes for this little guy. Please don’t forget to say
the name right. It’s H-pricot. Like: AYYCHH-pricot.
Subversion is here: http://code.whytheluckystiff.net/svn/hpricot/trunk.
Gracias, mi rubistos!
_why
On Jul 3, 2006, at 10:28 PM, why the lucky stiff wrote:
Very nice. Thanks _why
-Ezra
On Tue, Jul 04, 2006 at 02:28:45PM +0900, why the lucky stiff wrote:
gem install hpricot --source code.whytheluckystiff.net
Okay, 0.2 is out. The above is a lifetime prescription.
If you’d rather not install Hpricot, but want to play with it, try out the
balloon. You can review it at http://balloon.hobix.com/hpricot, then run it
with:
ruby -ropen-uri -e 'eval(open("http://balloon.hobix.com/hpricot").read)'
_why
On Sat, Jul 08, 2006 at 10:46:00PM +0900, Ron M wrote:
Hpricot chokes on this when I try it.
require 'rubygems'
require 'hpricot'
require 'open-uri'
Hpricot(open('http://www.pcmag.com/article2/0,1759,1765785,00.asp')).to_html
Wonderful, thank you. This is fixed in trunk now. Have a good time.
_why
why the lucky stiff wrote:
On Tue, Jul 04, 2006 at 02:28:45PM +0900, why the lucky stiff wrote:
gem install hpricot --source code.whytheluckystiff.net
Okay, 0.2 is out…
Hpricot chokes on this when I try it.
require 'rubygems'
require 'hpricot'
require 'open-uri'
Hpricot(open('http://www.pcmag.com/article2/0,1759,1765785,00.asp')).to_html
If I understand correctly, some methods (to_html, at least) in the version
of Hpricot I have (whatever was there yesterday morning) don’t
like ugly “html” like “” where apparently
attribute values end up being null.
I can work around it by putting
aval ||= ''
in STag in tag.rb
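To show the idea behind that workaround, here is a minimal sketch. It is not the actual STag code from tag.rb: render_start_tag and escape_attr are hypothetical helpers, and it assumes the failure comes from calling a String method on a nil attribute value while serializing.

```ruby
# Hypothetical helpers for illustration; not taken from Hpricot's tag.rb.

# Escapes an attribute value. Calling this with nil would raise
# NoMethodError, because nil has no gsub method.
def escape_attr(value)
  value.gsub('&', '&amp;').gsub('"', '&quot;')
end

# Builds a start tag from a name and a list of [attribute, value] pairs.
def render_start_tag(name, attributes)
  parts = attributes.map do |aname, aval|
    aval ||= '' # the guard: a missing value is treated as an empty string
    " #{aname}=\"#{escape_attr(aval)}\""
  end
  "<#{name}#{parts.join}>"
end

tag = render_start_tag("option", [["selected", nil], ["value", 'a"b']])
puts tag
```

Without the `aval ||= ''` line, the nil value for `selected` would reach `escape_attr` and raise instead of producing output.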