Forum: Ruby General Nokogiri problem

Srijayanth S. (Guest)
on 2009-05-07 10:46
(Received via mailing list)
Hello,

On several sites (probably malformed HTML/JavaScript/XML/general parsing
hell) I have the following problem.

For example:

moonwolf@trantor:~/ruby$ irb
irb(main):001:0> ['rubygems','nokogiri','hpricot','open-uri'].each { |r| require r }
=> ["rubygems", "nokogiri", "hpricot", "open-uri"]
irb(main):002:0> doc=Nokogiri(open("http://maps.google.com/"))
=> <?xml version="1.0"?>
<!DOCTYPE html>
<html/>

irb(main):003:0> doc/"a"
=>

Same with Nokogiri.Hpricot:

irb(main):004:0> doc=Nokogiri.Hpricot(open("http://maps.google.com/"))
=> <?xml version="1.0"?>
<!DOCTYPE html>
<html/>

However with regular Hpricot:

irb(main):009:0> (Hpricot(open("http://maps.google.com/"))/"a").size
=> 53
(The full result is of course too long to post, so I just showed something simpler.)


Hpricot by itself of course works. I tried looking and there's not much by
way of documentation or blogs on something like this.

Any suggestions/explanations will be welcome as I like Nokogiri's speed very
much.

I am using:

moonwolf@trantor:~/ruby$ gem list --local | grep -i nokogiri
nokogiri (1.2.3)
moonwolf@trantor:~/ruby$ ruby --version
ruby 1.8.6 (2008-03-03 patchlevel 114) [i686-linux]


Jayanth
Aaron P. (Guest)
on 2009-05-07 11:03
(Received via mailing list)
On Thu, May 07, 2009 at 03:45:28PM +0900, Srijayanth S. wrote:
> => ["rubygems", "nokogiri", "hpricot", "open-uri"]
> irb(main):004:0> doc=Nokogiri.Hpricot(open("http://maps.google.com/"))
>
> Hpricot by itself of course works. I tried looking and there's not much by
> way of documentation or blogs on something like this.
>
> Any suggestions/explanations will be welcome as I like Nokogiri's speed very
> much.

Nokogiri detects the XML header and parses it as XML.  If you force it
to use the HTML parser, you may be more successful:

  >> (Nokogiri::HTML(open("http://maps.google.com/"))/'a').length
  => 53
  >>
Srijayanth S. (Guest)
on 2009-05-07 11:07
(Received via mailing list)
Thanks Aaron.

Jayanth

Srijayanth S. (Guest)
on 2009-05-07 11:09
(Received via mailing list)
Whoops,

irb(main):015:0> (Nokogiri::HTML(open("http://maps.google.com/"))/'a').length
=> 0

Not sure what the deal is.

Jayanth