Nokogiri SAX parser encoding problem

micheldogger · August 24, 2010, 11:26am

According to Nokogiri's documentation, it works internally in UTF-8.
Running this:

# encoding: utf-8

require 'nokogiri'

class MyDoc < Nokogiri::XML::SAX::Document
  def characters(string)
    puts string.encoding
    puts string
  end
end

puts RUBY_VERSION
puts Encoding.default_external

parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new, 'UTF-8')
parser.parse('épée')

gives :

1.9.2
UTF-8
UTF-8
Ã©pÃ©e

Why ?
_md

micheldogger · August 24, 2010, 11:45am

On Aug 24, 2010, at 2:26 AM, Michel D. wrote:

    puts string
  end
end

puts RUBY_VERSION
puts Encoding.default_external

parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new, 'UTF-8')
parser.parse('épée')

What does a plain puts with this string give you?

What if you redirect nokogiri’s output to a file and view it in whatever
you entered the above string in?

Chances are it is your terminal, not ruby.
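
To illustrate the terminal hypothesis, here is a minimal sketch using only the Ruby standard library (no Nokogiri). It assumes the viewer decodes bytes as ISO-8859-1; the actual terminal encoding in the original report is not known. It shows that the string is valid UTF-8 and that misreading its bytes produces exactly the garbled output above.

```ruby
# encoding: utf-8

original = "épée"

# The string itself is valid UTF-8: each "é" is stored as the two bytes C3 A9.
puts original.encoding
puts original.valid_encoding?
hex = original.unpack("H*").first
puts hex

# Simulate a viewer that decodes those same bytes as ISO-8859-1
# (an assumption; the real terminal encoding is unknown).
misread = original.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts misread
```

If the hex dump of the string coming out of the parser matches the UTF-8 bytes, the data is intact and only the display is wrong.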

micheldogger · August 24, 2010, 3:13pm

Ryan D. wrote:

What if you redirect nokogiri’s output to a file and view it in whatever
you entered the above string in?

Chances are it is your terminal, not ruby.

Yes, Ryan, you are right : writing to a utf-8 file gives the good
answer.

Actually, in my project, I use the SAX parser to build complex ruby
objects, which are marshaled to a file, and then used by a Shoes app.
This app gets the wrong answer. The culprit may therefore be Marshal.
I’ll shift to YAML and report.
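
Before switching serializers, the Marshal hypothesis can be checked in isolation. The following is a rough sketch, assuming the objects are plain hashes of strings (the real object graph is not shown in this thread) and that the file is opened in binary mode. The hash contents and temp-file name are made up for illustration.

```ruby
# encoding: utf-8
require 'tempfile'

# Hypothetical stand-in for the objects built by the SAX handler.
original = { "word" => "deuxième" }
restored = nil

Tempfile.create("marshal_check") do |file|
  file.binmode                      # Marshal data is binary, not text
  file.write(Marshal.dump(original))
  file.rewind
  restored = Marshal.load(file.read)
end

puts restored["word"].encoding
puts restored == original
```

If the round trip preserves both the content and the encoding, the problem lies upstream of Marshal.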

_md

micheldogger · August 24, 2010, 3:35pm

Michel D. wrote:

Ryan D. wrote:

What if you redirect nokogiri’s output to a file and view it in whatever
you entered the above string in?

Chances are it is your terminal, not ruby.

Yes, Ryan, you are right : writing to a utf-8 file gives the good
answer.

Alas, no !

This is strange : when writing to a file :

by luck, for the example I gave (“épée”), I get back “épée”
correctly,
but when parsing “deuxième”, I get “ème” (this was the
initial bug I discovered in my app).

This is not the first time I see the “grave accented e” giving trouble
when scanning or parsing in ruby, whatever tool is used…

_md

micheldogger · August 24, 2010, 8:35pm

On Aug 24, 2010, at 06:49 , Michel D. wrote:

called twice, the first call giving “deuxi”, the second one “ème”.
Strange feature, still a bug (?), but one can do with…

Yeah. that last part sounds like a bug. Unfortunately, Aaron P.
is on an airplane for the next 12ish hours as he flies to rubykaigi.
Mike may be able to help out here… otherwise I suggest you email the
nokogiri mailing list with a minimal reproduction of the bug.

micheldogger · August 25, 2010, 1:57am

Hi,

On 2010-08-24, at 9:49 AM, Michel D. wrote:

called twice, the first call giving “deuxi”, the second one “ème”.
Strange feature, still a bug (?), but one can do with…

Actually this is allowed by the XML spec, annoying as it is. Many
parsers do this when encountering an entity (e.g. &apos;) in the input
stream (you get three strings, before, entity character, after). Some
XML parsers have a parameter that tells it to join adjacent strings
together before reporting a single string. I don’t know if Nokogiri
provides this functionality, but it might be worth a quick peek.
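
One way to cope with split text, whether or not the parser offers a joining option, is to buffer the chunks in the handler and flush them at element boundaries. Below is a minimal sketch of that idea. `JoinedCharacters` and `DemoHandler` are hypothetical names, not part of Nokogiri. The callback names `characters`, `start_element` and `end_element` match those on `Nokogiri::XML::SAX::Document`, but the demo drives them by hand so it runs without Nokogiri.

```ruby
# encoding: utf-8

# Hypothetical mixin (not part of Nokogiri): buffers every chunk passed to
# `characters` and records one joined string per run of text.
module JoinedCharacters
  # Joined text runs collected so far.
  def texts
    @texts ||= []
  end

  # May be called several times for one contiguous run of text.
  def characters(string)
    (@buffer ||= "".dup) << string
  end

  # An element boundary ends the current run of text.
  def start_element(name, attrs = [])
    flush_text
  end

  def end_element(name)
    flush_text
  end

  private

  # Store the buffered text, if any, and reset the buffer.
  def flush_text
    return if @buffer.nil? || @buffer.empty?
    texts << @buffer
    @buffer = nil
  end
end

# Stand-in handler so the sketch runs without Nokogiri installed. With
# Nokogiri, the module would instead be included in a subclass of
# Nokogiri::XML::SAX::Document.
class DemoHandler
  include JoinedCharacters
end

handler = DemoHandler.new
handler.start_element("word")
handler.characters("deuxi")  # first chunk, as reported in this thread
handler.characters("ème")    # second chunk
handler.end_element("word")

p handler.texts
```

The design choice is to treat `characters` as append-only and do the real work when an element starts or ends, so the handler no longer depends on how the parser happens to split the text.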

Cheers,
Bob

_md


Bob H.
Recursive Design Inc.
http://www.recursive.ca/

micheldogger · August 25, 2010, 8:29am

Bob H. wrote:

Actually this is allowed by the XML spec, annoying as it is. Many
parsers do this when encountering an entity (e.g. &apos;) in the input
stream (you get three strings, before, entity character, after). Some
XML parsers have a parameter that tells it to join adjacent strings
together before reporting a single string. I don’t know if Nokogiri
provides this functionality, but it might be worth a quick peek.

@Bob : Yes, it is allowed.

From the nokogiri doc for the ‘characters’ method :

“This method might be called multiple times given one contiguous string
of characters.”

@Ryan : strange as it is, it’s a feature. So, IMHO, no bug report.

Actually, it is very strange. Parsing ‘deuxième’, you get two calls
‘deuxi’ + ‘ème’, but parsing the more complex ‘épée deuxième’, you get
only one …

Thanks to both of you.
_md

micheldogger · August 24, 2010, 3:49pm

Michel D. wrote:

Michel D. wrote:

but when parsing “deuxième”, I get “ème” (this was the
initial bug I discovered in my app).

This is not the first time I see the “grave accented e” giving trouble
when scanning or parsing in ruby, whatever tool is used…

Sorry for posting again. Actually, in this last example, ‘characters’ is
called twice, the first call giving “deuxi”, the second one “ème”.
Strange feature, still a bug (?), but one can do with…

_md