Encoding issues when parsing HTML in 1.9

Hi, I’m having some encoding problems while parsing HTML with Nokogiri
in 1.9.

I was first getting errors on non-breaking space characters (code
160), but managed to resolve this by setting the encoding at the top
of my script file ('# coding: utf-8').
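For reference, that directive has to be on the very first line of the
file (or the second, right after a shebang); it only tells Ruby how to
interpret literal bytes in the source file itself:

#!/usr/bin/env ruby
# coding: utf-8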

However now I’m trying to do simple string substitution with gsub()
and am getting the error:

invalid byte sequence in UTF-8

An example of where this is bombing is the word "PROT\xC9G" as parsed
by Nokogiri. Removing the encoding setting from my script causes the
original problems, so I seem to be stuck.
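A minimal way to reproduce this in IRB (the pattern passed to gsub is
just an example; \xC9 is É in the Latin-1 family, which is not a valid
UTF-8 sequence on its own):

s = "PROT\xC9G"
s.valid_encoding?
=> false
s.gsub(/\s+/, " ")
ArgumentError: invalid byte sequence in UTF-8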

Has anybody worked through these issues successfully? Google turns up
a number of discussions without many solutions.

On Tue, Mar 29, 2011 at 9:45 PM, ctdev [email protected] wrote:

However now I’m trying to do simple string substitution with gsub()
and am getting the error:

invalid byte sequence in UTF-8

An example of where this is bombing is the word "PROT\xC9G" as parsed
by Nokogiri.

What is the encoding of your input HTML file?

Hello,

What is the encoding of your input HTML file?

Opening one of the files in IRB and checking external_encoding.name
returns “UTF-8”.
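That is, roughly this (file name is just a placeholder):

f = File.open("saved_page.html")
f.external_encoding.name
=> "UTF-8"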

This is from a group of pages I scraped with Hpricot (before switching
to Nokogiri) and saved locally.

The site itself comes from a Microsoft environment and there seems to
be much weirdness in the files. I’ll need to anticipate and
accommodate that in my code.

I wonder if I might have better luck building the scraping portion of
my app in a different language (though I’d rather stick with Ruby).

On Wed, Mar 30, 2011 at 1:35 PM, ctdev [email protected] wrote:

What is the encoding of your input HTML file?

Opening one of the files in IRB and checking external_encoding.name
returns “UTF-8”.

That was not the question. He wanted to know the encoding of the
file. You should be able to identify this from the HTTP response.
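For example, something along these lines (URL is a placeholder) would
show what the server declares:

require 'net/http'

response = Net::HTTP.get_response(URI("http://example.com/page.html"))
puts response["Content-Type"]   # e.g. "text/html; charset=windows-1252"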

This is from a group of pages I scraped with Hpricot (before switching
to Nokogiri) and saved locally.

The site itself comes from a Microsoft environment and there seems to
be much weirdness in the files. I’ll need to anticipate and
accommodate that in my code.

Weirdness with regard to encodings or other weirdness?

I wonder if I might have better luck building the scraping portion of
my app in a different language (though I’d rather stick with Ruby).

IMHO it is usually simpler to stay in one ecosystem. If the server
sends the correct encoding I would expect both Hpricot and Nokogiri to
treat the file properly. If you fetched the files with a pre-1.9
version of Ruby, you may have to refetch them.

Cheers

robert

I also tried the following on a test string:

s.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")

But it doesn’t seem to replace the invalid character(s), the very
one(s) it’s complaining about!

So I’m stuck because I’m getting the “invalid byte sequence” error,
yet the above function won’t replace the invalid bytes.

TFM says:

“:invalid : If the value is :replace, encode replaces invalid byte
sequences in str with the replacement character”

That’s exactly what I’m trying to do but it isn’t working. It isn’t
replacing the invalid byte sequence it’s complaining about with the
replacement character.
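In IRB a minimal repro looks like this; the call just hands the string
back with the invalid byte intact:

s = "PROT\xC9G"
=> "PROT\xC9G"
s.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
=> "PROT\xC9G"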

On Wed, Mar 30, 2011 at 7:35 AM, ctdev [email protected] wrote:

What is the encoding of your input HTML file?

Opening one of the files in IRB and checking external_encoding.name
returns “UTF-8”.

That doesn’t detect the true file encoding; indeed, the file is either
in a different encoding or corrupt, hence your invalid byte sequence.

http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings

ruby -v -e 'puts File.open("/etc/passwd").external_encoding'
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
US-ASCII

LC_CTYPE=ja_JP.sjis ruby -v -e 'puts File.open("/etc/passwd").external_encoding'
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
Shift_JIS

I wonder if I might have better luck building the scraping portion of
my app in a different language (though I’d rather stick with Ruby).

Well, another language might ignore the invalid characters so it would
look like it worked fine, but your output could actually be invalid.

Could that be an optimization in encode: since the string is already
thought to be UTF-8, just return it?

Not sure, it isn’t obvious (to me) looking at encode()'s source.

There’s no charset specified in the response headers from IIS. The
Content-Type meta tag specifies "text/html; charset=UTF-8", though I’m
not sure if Firefox respects that.

file -I on one of the downloaded files displays "text/html;
charset=unknown-8bit".

Firefox is choosing UTF-8 but the special characters aren’t displayed
properly. Switching from within the browser to one of the Western
encodings displays the characters correctly (as mentioned this is all
MS stuff and I assume people just copy and paste from MS Office).

On Wed, Mar 30, 2011 at 8:25 AM, ctdev [email protected] wrote:

I also tried the following on a test string:

s.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")

But it doesn’t seem to replace the invalid character(s)

Could that be an optimization in encode: since the string is already
thought to be UTF-8, just return it?

s = "PROT\xC9G\u00C9"
=> "PROT\xC9G\u00C9"
s.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
=> "PROT\xC9G\u00C9"

s.
  encode("ISO8859-9", :invalid => :replace, :undef => :replace, :replace => "#").
  encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
=> "PROT#G\u00C9"

Transcoding to a different encoding forces every byte to actually be
converted, so the invalid \xC9 gets replaced on the way out; converting
back to UTF-8 then yields a valid string (the legitimate \u00C9 survives
the round trip because ISO8859-9 can represent it).

On Wed, Mar 30, 2011 at 10:00 AM, ctdev [email protected] wrote:

Firefox is choosing UTF-8 but the special characters aren’t displayed
properly. Switching from within the browser to one of the Western
encodings displays the characters correctly (as mentioned this is all
MS stuff and I assume people just copy and paste from MS Office).

Okay, try specifying that encoding when you parse it with Nokogiri?
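For instance (file name and encoding name are placeholders; the third
argument to Nokogiri::HTML is the document encoding):

require 'nokogiri'

html = File.open("page.html", "rb") { |f| f.read }   # raw bytes, no transcoding
doc = Nokogiri::HTML(html, nil, "Windows-1252")      # parse with an explicit encoding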

Okay, try specifying that encoding when you parse it with Nokogiri?

I resolved this problem by opening and rewriting the original files
with a specified mode as described in Overbryd’s answer:

So:

old = File.open("old", "r:windows-1252:utf-8")           # read as Windows-1252, transcode to UTF-8
File.open("new", "w+:utf-8") { |f| f.write(old.read) }   # block form closes "new" automatically
old.close

Everything works now. The characters were all converted and I was able
to remove the encoding directive and non-breaking space literals from
my script by using '\u00A0' in the regex I’m passing to the split
function.
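That is, something like this (line stands in for the scraped text):

fields = line.split(/[\s\u00A0]+/)   # split on runs of whitespace or non-breaking spaces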

Thanks for the help. :)

On Wed, Mar 30, 2011 at 12:15 PM, ctdev [email protected] wrote:

I resolved this problem by opening and rewriting the original files
with a specified mode as described in Overbryd’s answer:

So:

old = File.open("old", "r:windows-1252:utf-8")           # read as Windows-1252, transcode to UTF-8
File.open("new", "w+:utf-8") { |f| f.write(old.read) }   # block form closes "new" automatically
old.close

Cool; thanks. :)

Okay, try specifying that encoding when you parse it with Nokogiri?

And you’re right, I’ll have to see if/how that translates to Nokogiri
for future downloads.