Hi,
I’m trying to scrape links using Mechanize. Sometimes accented characters
(on French pages) are corrupted by the time Ruby gets them. To see what I
mean, run this:
require 'mechanize'

a = WWW::Mechanize.new
page = a.get('http://www.agr.gc.ca/cb/index_f.php?s1=n&s2=index&page=2009_07')

# Print the text of every link on the page
page.links.each do |a_link|
  puts a_link
end
Of course, it’s only the accents that are entered in plain text (i.e.,
without entities) that have this problem. But in an imperfect world, I can’t
always count on accents being entered properly.
Is there anything I can do about this? I’ve tried using Iconv to convert the
strings to UTF-8, but that just resulted in a different (but still wrong)
character in place of the broken ones.
Thanks for any help,
Patrick
On Tue, Jul 07, 2009 at 10:18:50PM +0900, Patrick Lajeunesse wrote:
> puts a_link
> end
> Of course, it’s only the accents that are entered in plain text (i.e.,
> without entities) that have this problem. But in an imperfect world, I can’t
> always count on accents being entered properly.
> Is there anything I can do about this? I’ve tried using Iconv to convert the
> strings to UTF-8, but that just resulted in a different (but still wrong)
> character in place of the broken ones.
What version of nokogiri / mechanize do you have installed? I ran your
code and was able to see the accents.
Most of the time, these encoding issues are due to the server
incorrectly identifying the encoding of the content. Is this content
supposed to be ISO-8859-1?
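To illustrate the mislabelled-encoding case, here is a minimal sketch. It assumes the page bytes really are ISO-8859-1 and a Ruby version that has String#force_encoding (1.9 or later). The sample bytes and the to_utf8 helper are hypothetical and are not part of Mechanize or Nokogiri; the sketch only shows why bytes read under the wrong label look broken, and how relabelling them before transcoding recovers the text.

```ruby
# Bytes for "Énoncé" as an ISO-8859-1 server would send them
# (hypothetical sample data, not taken from the page above).
raw = "\xC9nonc\xE9".b

# Hypothetical helper: label the bytes with their real encoding,
# then transcode them to UTF-8.
def to_utf8(bytes, source_encoding = Encoding::ISO_8859_1)
  bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

# Treating the same bytes as UTF-8 yields an invalid string,
# which is what shows up as corrupted accents.
mislabelled = raw.dup.force_encoding(Encoding::UTF_8)
puts mislabelled.valid_encoding?

# Relabelling first and converting second gives valid UTF-8.
fixed = to_utf8(raw)
puts fixed
puts fixed.valid_encoding?
```

If the source encoding is guessed wrong (for example, the bytes were already UTF-8), this conversion produces a different but still wrong character, which matches the symptom described for the Iconv attempt.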
Thanks, Aaron. I thought I was up to date, but I guess I wasn't. I did a
gem update and got 0.9.3, and then it worked fine.
Thanks again,
Patrick
On Tue, Jul 7, 2009 at 12:13 PM, Aaron P.
<[email protected]