Mechanize: How to scrape accented chars that weren't entered as entities?

Hi,
I’m trying to scrape links using Mechanize. Sometimes accented characters
(on French pages) are corrupted by the time Ruby gets them. To see what I
mean, try this:

require 'mechanize'
a = WWW::Mechanize.new
page = a.get('http://www.agr.gc.ca/cb/index_f.php?s1=n&s2=index&page=2009_07')
page.links.each do |a_link|
  puts a_link
end

Of course, it’s only the accents that are entered in plain text (i.e.,
without entities) that have this problem. But in an imperfect world, I can’t
always count on accents being entered properly.

Is there anything I can do about this? I’ve tried using Iconv to convert the
strings to UTF-8, but that just resulted in a different (but still wrong)
character in place of the broken ones.

Thanks for any help,

Patrick

On Tue, Jul 07, 2009 at 10:18:50PM +0900, Patrick Lajeunesse wrote:

> puts a_link
> end
>
> Of course, it’s only the accents that are entered in plain text (i.e.,
> without entities) that have this problem. But in an imperfect world, I can’t
> always count on accents being entered properly.
>
> Is there anything I can do about this? I’ve tried using Iconv to convert the
> strings to UTF-8, but that just resulted in a different (but still wrong)
> character in place of the broken ones.

What version of nokogiri / mechanize do you have installed? I ran your
code and was able to see the accents:

http://skitch.com/aaron.patterson/bs4qt/terminal-bash-80x24

Most of the time, these encoding issues are due to the server
incorrectly identifying the encoding of the content. Is this content
supposed to be ISO-8859-1?
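
To illustrate the mislabeled-encoding case, here is a minimal sketch. It assumes Ruby 1.9 or later (it uses the String encoding API rather than Iconv), and both the sample string and the helper name `to_utf8` are made up for the example; neither comes from Mechanize or Nokogiri. It shows that bytes written as ISO-8859-1 are not valid UTF-8 as they stand, and that decoding them with their real encoding recovers the accent.

```ruby
# Hypothetical helper (not part of Mechanize): relabel raw bytes with the
# encoding they were really written in, then transcode them to UTF-8.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

# Made-up sample: "Québec" as ISO-8859-1 bytes, where é is the single byte 0xE9.
raw = "Qu\xE9bec".b

# Trusting a wrong "UTF-8" label leaves a string that is not valid UTF-8.
mislabeled = raw.dup.force_encoding(Encoding::UTF_8)
puts mislabeled.valid_encoding?

# Decoding with the real encoding recovers the accent.
fixed = to_utf8(raw, Encoding::ISO_8859_1)
puts fixed
```

If the source encoding passed in is itself a wrong guess, the transcoding still succeeds but produces a different wrong character, which may be what the Iconv attempt ran into.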

Thanks Aaron - I thought I was up to date, but I guess I wasn’t. I did a
gem update and got 0.9.3, and then it worked fine.
Thanks again,

Patrick

On Tue, Jul 7, 2009 at 12:13 PM, Aaron P.
<[email protected]
