Forum: Ruby hpricot or nokogiri?

M.W. Mitchell (goodieboy)
on 2009-01-09 20:17
(Received via mailing list)
OK, I was completely sold on Hpricot, and now I'm having my doubts. I
can't seem to get to any of the docs (the site is down). Is it still
being developed? Who are the developers? I love the API and am really
hoping to use it...

So then I tried out Nokogiri and it works well. The bug that Hpricot
had (renaming a node only renames the open tag) is not present in
Nokogiri. Great! But it's built on libxml, which I don't know much
about. It seems a little more heavyweight than Hpricot. I also heard
that the main developer for libxml doesn't have much time to devote to
the project.

Any advice for me, folks?

Matt
Dan Diebolt (dandiebolt)
on 2009-01-09 20:36
(Received via mailing list)
>Any advice for me, folks?
I use Hpricot extensively for various data mining tasks.
In my repeated experience, the more difficult task is
devising a harvesting strategy, which depends on the
structure of the target web page. I rarely have to devise a
workaround because Hpricot does not support a selector or
some other feature. In practice, when you start parsing a
lot of web pages for information, things like invalid HTML,
character entities, whitespace and comments, CSS interactions,
etc. become more of an issue than the features of your parser.
Others may have different experiences, but I find determining
a harvesting strategy more difficult than manipulating
a particular gem such as Hpricot.
Ryan Davis (Guest)
on 2009-01-09 23:54
(Received via mailing list)
On Jan 9, 2009, at 11:16 , goodieboy wrote:

> the project.
hpricot drops the ball in a lot of ways and is much more heavyweight
than nokogiri. Parsing an 8 meg itunes xml file takes over a gig in
hpricot (according to my students) and nokogiri zipped right through it.

The libxml developer doesn't need to devote much time to the project
(assuming you mean libxml itself, not nokogiri). It is a very mature
library. On the other hand, hpricot has had a lot of open bugs for a
long time and they've not been touched one way or another. I find
Aaron Patterson very responsive to my bug reports (but I'm biased,
he's just down the street--look at the bug tracker on rubyforge for
less biased data).
Aaron Patterson (Guest)
on 2009-01-10 00:01
(Received via mailing list)
Hi Matt,

On Sat, Jan 10, 2009 at 04:16:22AM +0900, goodieboy wrote:
> the project.
Yes, Nokogiri is built on top of the libxml2 project from Gnome.
libxml2 is actively developed and well supported since it is the XML
parser used by the Gnome project:

  http://xmlsoft.org/

If you find bugs, we have a

* mailing list: http://rubyforge.org/mailman/listinfo/nokogiri-talk
* IRC Channel on freenode: #nokogiri
* Ticketing system:
  http://nokogiri.lighthouseapp.com/projects/19607-n...
* RDoc: http://nokogiri.rubyforge.org/nokogiri/

I've switched my projects from Hpricot to Nokogiri, and I'm quite happy.
M.W. Mitchell (goodieboy)
on 2009-01-11 20:55
(Received via mailing list)
On Jan 9, 6:01 pm, Aaron Patterson <aa...@tenderlovemaking.com> wrote:
> > Nokogiri. Great! But it's built on libxml, which I don't know much
> If you find bugs, we have a
> Aaron Patterson http://tenderlovemaking.com/
This is great, thank you. Definitely helps clear things up a bit. So
it's not just me... Hpricot has a few bugs that have been around for a
while. That's too bad :(

OK, for a quick Nokogiri question... is it possible to ask a node if
it responds to a certain xpath? Something like:

matching = nodes.select{|n| n.is_findable_by('[@class=plant]') }

Thanks,
Matt
Aaron Patterson (Guest)
on 2009-01-11 21:24
(Received via mailing list)
On Mon, Jan 12, 2009 at 04:54:26AM +0900, matt mitchell wrote:
> > > had (re-naming a node only names the open-tag) is not present in
> >
> > --
> > Aaron Patterson http://tenderlovemaking.com/
>
> This is great thank you. Definitely helps clear things up a bit. So
> it's not just me... Hpricot has a few bugs that have been around for a
> while. That's too bad :(
>
> OK, for a quick Nokogiri question... is it possible to ask a node if
> it responds to a certain xpath? Something like:
>
> matching = nodes.select{|n| n.is_findable_by('[@class=plant]') }

I can't think of a good xpathy way to do that from the current node.
You could do something like this:

  matching = nodes.select { |n|
    n.parent.xpath('./*[@class="plant"]').include?(n)
  }

That might get kind of slow though.  If you know that "class" is the
attribute you're looking for, you could just do something like this:

  matching = nodes.select { |n| n['class'] == "plant" }

Hope that helps.
Lance Bradley (lancepantz)
on 2009-02-12 03:24
I've been going through a similar situation with my current project. I
was initially using Hpricot, and was very frustrated by the lack of
documentation and some of the lingering bugs. I've now switched to
Nokogiri and have been very impressed with it.

I'm now running into some of the robustness issues you face when
you process data from the open web, as Dan alluded to. I'm using
Nokogiri's SAX implementation, and I've run into some problems with
handling HTML entities, whether they are preserved or decoded into UTF-8.
In both cases, once an entity is seen, Nokogiri stops calling my start
and end element handlers but continues to call my character handler.
Specifically, I've noticed this behavior when it sees &nbsp; and
&#8230;. Has anyone else experienced this, and do you have any advice to share?
I appreciate it!
-lance

(here's my code)

require 'nokogiri'
require 'tidy'

class Nokogiri::XML::SAX::Document
  attr_accessor :rhtml
  def initialize
    @rhtml = ""
    @keep_text = true
    @keep_elements = %w{ br p img ul ol title li div table head body
meta base blockquote }
  end

  def start_element name, attrs = []
    puts "start element called: " + name
    if @keep_elements.include?(name)
      puts "keeping: #{name}"
      @rhtml << "<#{name}>\n"
    end
    if ['script', 'style'].include? name
      @keep_text = false
    end
  end

  def characters text
    #@rhtml << @coder.decode( text ) if @keep_text
    @rhtml << text if @keep_text
    puts text
  end

  def end_element name
    puts "end element called: " + name
    if @keep_elements.include?(name)
      @rhtml << "</#{name}>\n"
    end
    if ['script', 'style'].include? name
      @keep_text = true
    end
  end

end

# Read the whole input file named on the command line.
html = File.read(ARGV[0])

#coder = HTMLEntities.new
#html = coder.decode(html)

Tidy.path = '/usr/lib/libtidy-0.99.so.0'
xml = Tidy.open(:show_warnings=>true) do |tidy|
  tidy.options.output_xml = true
  #tidy.options.char_encoding = 'utf8'
  tidy.options.preserve_entities  = true
  xml = tidy.clean(html)
end

doc = Nokogiri::XML::SAX::Document.new
parser = Nokogiri::XML::SAX::Parser.new(doc)

parser.parse(xml)

puts "doc:"
puts doc.rhtml.gsub(/\n+/, "\n")
Thomas Sawyer (7rans)
on 2009-02-12 04:53
(Received via mailing list)
On Feb 11, 9:22 pm, Lance Bradley <l...@ncebradley.org> wrote:
> handlers, but continue to call my character handler after an entity is
>     @rhtml = ""
>     end
>
> end
>   tidy.options.preserve_entities  = true
>   xml = tidy.clean(html)
> end
>
> doc = Nokogiri::XML::SAX::Document.new
> parser = Nokogiri::XML::SAX::Parser.new(doc)
>
> parser.parse(xml)
>
> puts "doc:"
> puts doc.rhtml.gsub(/\n+/, "\n")

Note that there are also the libxml ruby bindings.

  http://libxml.rubyforge.org

T.