Forum: Ruby libxml's SaxParser and UTF-8 problem

Peter Higgins (retardo)
on 2007-03-02 07:00
I've written a small script to parse an xml doc with SaxParser and
everything goes well until the parser encounters a Unicode character.
For example, in the following snippet:

<key>Name</key><string>90's Music</string>

In case it doesn't come through correctly, the "'" character above is an
apostrophe, represented as <E2><80><99> when I view the xml with less.

When the on_characters method is called for the string "90's Music", the
buffer only contains "90", and no error or warning is raised. After
that, parsing continues normally; I first noticed the bug when some of
my strings came out truncated. Is there some libxml or Ruby setting
that I've overlooked that would cause this behavior?
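
For reference, the three bytes that less displays can be checked in plain Ruby without any parser. This is only a small sketch; the variable names are made up for illustration. It shows that <E2><80><99> is one UTF-8 character (U+2019, the right single quotation mark) and not the ASCII apostrophe:

```ruby
# The three bytes reported by less, taken from the post above.
bytes = [0xE2, 0x80, 0x99]

# Pack them into a binary string, then label the result as UTF-8.
char = bytes.pack("C*").force_encoding(Encoding::UTF_8)

puts char.valid_encoding?        # is it well-formed UTF-8?
puts format("U+%04X", char.ord)  # the code point of the single character
puts char.bytesize               # number of bytes
puts char.length                 # number of characters
```

So the text node holds "90", then one three-byte character, then "s Music".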
Peter Higgins (retardo)
on 2007-03-05 05:14
As part of researching the problem, I wrote a small test script with
REXML that looks for that particular string, and it returned the
correct, full string: "90’s Music". So it looks like this is a bug in
libxml, and I'll post on their mailing list.
Jenda Krynicky (jendaperl)
on 2007-03-07 15:06
Any chance the rest of the text, starting at the quote, is passed in
another call to on_characters? I believe SAX does not always deliver
all the content of a tag in one call to the handler; it sometimes calls
the handler several times, and you have to put it all together yourself.
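
A minimal sketch of what "putting it together" could look like. The class name StringCollector is made up, and on_start_element / on_end_element are assumed here as the usual SAX-style companions to on_characters. The handler is driven by hand instead of by a real parser, so the split after "90" is simulated, not something libxml is confirmed to do:

```ruby
# Hypothetical handler: collects the text of every <string> element,
# joining however many on_characters calls arrive for it.
class StringCollector
  attr_reader :strings

  def initialize
    @strings = []
    @buffer = nil # non-nil only while inside a <string> element
  end

  def on_start_element(name, _attributes = {})
    @buffer = +"" if name == "string"
  end

  def on_characters(chars)
    # Append instead of overwrite: one text node may arrive in pieces.
    @buffer << chars if @buffer
  end

  def on_end_element(name)
    return unless name == "string" && @buffer

    @strings << @buffer
    @buffer = nil
  end
end

# Simulate a parser that splits the text right before the non-ASCII
# character (an assumption for illustration only).
handler = StringCollector.new
handler.on_start_element("string")
handler.on_characters("90")
handler.on_characters("\u2019s Music")
handler.on_end_element("string")

puts handler.strings.inspect
```

If the handler only keeps the argument of the first on_characters call, it would see just "90", which matches the symptom described.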

Of course it could be Ruby being unable to handle the UTF-8.

Jenda