Forum: Ruby Nokogiri not getting html body sometimes

Jarmo Pertman (juuser)
on 2009-05-20 19:44
I'm using Mechanize to fetch an IMDb page and then Nokogiri's Node#search
method to extract some info from it, but I've stumbled onto one
special case where #search doesn't work properly. All the other pages
I've tried so far work as expected.

It seems that some special characters are causing trouble for
Nokogiri, because when I tried to print the document itself, it output
only half of the <head> tag and no body tags at all!

Anyway, here is a code snippet that I'd expect to output "false" four
times. Instead, it outputs false, false, true, false. With any other
IMDb URL I've tried, it works fine.

require 'mechanize'

mech = WWW::Mechanize.new {|agent| agent.user_agent_alias = 'Windows Mozilla'}
mech.get("http://www.imdb.com/title/tt1092016/") do |page|
  puts page.search("/html").empty?
  puts page.search("/html/head").empty?
  puts page.search("/html/body").empty?
  puts page.body.empty?
end

What could be the problem?

I'm using ruby 1.8.6 (2007-09-24 patchlevel 111) [i386-mswin32]
Lui Kore (night_stalker)
on 2009-05-21 16:28
I think you'd better set the encoding first.

mech.get("http://www.imdb.com/title/tt1092016/") do |page|
  page.encoding = 'ISO-8859-1'
  # ... the rest of your code
end
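
For anyone wondering why the encoding might matter here: one plausible
reading (the thread does not confirm it) is that the page contains
ISO-8859-1 bytes that are not valid in the encoding the parser assumed, so
parsing gave up partway through <head>. The sketch below only illustrates
that mismatch with plain Ruby strings. It needs Ruby 1.9 or later (the
1.8.6 mentioned above has no String#force_encoding), and the helper names
valid_as? and latin1_to_utf8 as well as the sample bytes are made up for
illustration; they are not part of Mechanize or Nokogiri.

```ruby
# Hypothetical helper: are these raw bytes well-formed in the given encoding?
def valid_as?(bytes, encoding_name)
  bytes.dup.force_encoding(encoding_name).valid_encoding?
end

# Hypothetical helper: label the bytes as ISO-8859-1, then transcode to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# Sample data (assumed, not taken from the IMDb page):
# "Amélie" as ISO-8859-1 bytes, where é is the single byte 0xE9.
raw = "Am\xE9lie".b

puts valid_as?(raw, 'UTF-8')       # → false (a lone 0xE9 is not valid UTF-8)
puts valid_as?(raw, 'ISO-8859-1')  # → true
puts latin1_to_utf8(raw)           # → Amélie
```

Setting page.encoding = 'ISO-8859-1' as suggested above tells the parser up
front which of these interpretations to use.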
Jarmo Pertman (juuser)
on 2009-05-21 18:32
Thank you! It did the trick.

Best regards,
Jarmo
