Hpricot or nokogiri?

goodieboy · January 9, 2009, 8:17pm

OK, I was completely sold on Hpricot and now I’m having my doubts. I
can’t seem to get to any of the docs (the site is down). Is it still
being developed? Who are the developers? I love the API and really am
hoping to use it…

So then I tried out Nokogiri and it works well. The bug that Hpricot
had (renaming a node only renames the opening tag) is not present in
Nokogiri. Great! But it’s built on libxml, which I don’t know much
about. It seems a little more heavyweight than Hpricot. I also heard
that the main developer for libxml doesn’t have much time to devote to
the project.

Any advice for me, folks?

Matt

goodieboy · January 9, 2009, 8:36pm

Any advice for me, folks?

I use Hpricot extensively for various data mining tasks.
In my repeated experience, the more difficult task is
devising a harvesting strategy, which depends on the
structure of the target web page. I rarely have to devise a
workaround because Hpricot does not support a selector or
some other feature. In practice, when you start parsing a
lot of web pages for information, things like invalid HTML,
character entities, whitespace and comments, CSS interactions,
etc. become more of an issue than the features of your parser.
Others may have a different experience, but I find determining
a harvesting strategy more difficult than working with
a particular gem such as Hpricot.

goodieboy · January 9, 2009, 11:54pm

On Jan 9, 2009, at 11:16 , goodieboy wrote:

the project.
hpricot drops the ball in a lot of ways and is much more heavyweight
than nokogiri. Parsing an 8 MB iTunes XML file takes over a gig of
memory in hpricot (according to my students), while nokogiri zipped
right through it.

The libxml developer doesn’t need to devote much time to the project
(assuming you mean libxml itself, not nokogiri). It is a very mature
library. On the other hand, hpricot has had a lot of open bugs for a
long time and they’ve not been touched one way or another. I find
Aaron P. very responsive to my bug reports (but I’m biased,
he’s just down the street; look at the bug tracker on rubyforge for
less biased data).

goodieboy · January 11, 2009, 8:55pm

On Jan 9, 6:01 pm, Aaron P. [email protected] wrote:

Nokogiri. Great! But it’s built on libxml, which I don’t know much
If you find bugs, we have a
Aaron P. http://tenderlovemaking.com/
This is great, thank you. Definitely helps clear things up a bit. So
it’s not just me… Hpricot has a few bugs that have been around for a
while. That’s too bad.

OK, for a quick Nokogiri question… is it possible to ask a node if
it responds to a certain xpath? Something like:

matching = nodes.select { |n| n.is_findable_by('[@class=plant]') }

Thanks,
Matt

goodieboy · January 11, 2009, 9:24pm

On Mon, Jan 12, 2009 at 04:54:26AM +0900, matt mitchell wrote:

had (re-naming a node only names the open-tag) is not present in

--
Aaron P. http://tenderlovemaking.com/

I can’t think of a good xpathy way to do that from the current node.
You could do something like this:

matching = nodes.select { |n|
  n.parent.xpath('./*[@class="plant"]').include?(n)
}

That might get kind of slow though. If you know that “class” is the
attribute you’re looking for, you could just do something like this:

matching = nodes.select { |n| n['class'] == "plant" }

Hope that helps.

goodieboy · January 10, 2009, 12:01am

Hi Matt,

On Sat, Jan 10, 2009 at 04:16:22AM +0900, goodieboy wrote:

the project.
Yes, Nokogiri is built on top of the libxml2 project from Gnome.
libxml2 is actively developed and well supported since it is the XML
parser used by the Gnome project:

http://xmlsoft.org/

If you find bugs, we have:

mailing list: http://rubyforge.org/mailman/listinfo/nokogiri-talk
IRC channel on freenode: #nokogiri
ticketing system: Lighthouse
RDoc: http://nokogiri.rubyforge.org/nokogiri/

I’ve switched my projects from Hpricot to Nokogiri, and I’m quite happy.

goodieboy · February 12, 2009, 3:24am

I’ve been going through a similar situation with my current project. I
was initially using Hpricot, and was very frustrated by the lack of
documentation and some of the lingering bugs. I’ve now switched to
nokogiri and have been very impressed with it.

I’m now running into some of the robustness issues that you face when
you process data from the open web, like Dan alluded to. I’m using
nokogiri’s sax implementation, and I’ve run into some problems with
handling html entities, whether they are preserved or decoded into utf-8.
In both cases, nokogiri will quit calling my start and end element
handlers, but continue to call my character handler after an entity is
seen. Specifically, I’ve noticed this behavior when it sees and
…. Has anyone else experienced this and have any advice to share?
I appreciate it!
-lance

(here’s my code)

require 'nokogiri'
require 'tidy'

# Reopens Nokogiri's base SAX handler to build a simplified HTML string.
class Nokogiri::XML::SAX::Document
  attr_accessor :rhtml

  def initialize
    @rhtml = ""
    @keep_text = true
    @keep_elements = %w{ br p img ul ol title li div table head body
                         meta base blockquote }
  end

  def start_element name, attrs = []
    puts "start element called: " + name
    if @keep_elements.include?(name)
      puts "keeping: #{name}"
      @rhtml << "<#{name}>\n"
    end
    # Stop collecting text while inside script and style elements.
    if ['script', 'style'].include? name
      @keep_text = false
    end
  end

  def characters text
    #@rhtml << @coder.decode( text ) if @keep_text
    @rhtml << text if @keep_text
    puts text
  end

  def end_element name
    puts "end element called: " + name
    if @keep_elements.include?(name)
      @rhtml << "</#{name}>\n"
    end
    # Resume collecting text once the script or style element closes.
    if ['script', 'style'].include? name
      @keep_text = true
    end
  end

end

# Read the whole input file named on the command line.
html = File.read(ARGV[0])

#coder = HTMLEntities.new
#html = coder.decode(html)

# Clean the HTML with Tidy and emit XML, keeping entities as written.
Tidy.path = '/usr/lib/libtidy-0.99.so.0'
xml = Tidy.open(:show_warnings=>true) do |tidy|
  tidy.options.output_xml = true
  #tidy.options.char_encoding = 'utf8'
  tidy.options.preserve_entities = true
  xml = tidy.clean(html)
end

doc = Nokogiri::XML::SAX::Document.new
parser = Nokogiri::XML::SAX::Parser.new(doc)

parser.parse(xml)

puts "doc:"
puts doc.rhtml.gsub(/\n+/, "\n")

goodieboy · February 12, 2009, 4:53am

On Feb 11, 9:22 pm, Lance B. [email protected] wrote:

Note that there are also the libxml ruby bindings.

http://libxml.rubyforge.org

T.