Libxml's SaxParser and UTF-8 problem


#1

I’ve written a small script to parse an XML document with SaxParser, and
everything goes well until the parser encounters a non-ASCII character.
For example, in the following snippet:

Name90’s Music

In case it doesn’t come through correctly, the “’” character above is an
apostrophe, represented as <80><99> when I view the xml with less.
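
For reference, a quick way to see which bytes are involved, assuming the
character is U+2019 (RIGHT SINGLE QUOTATION MARK) and the file is encoded
as UTF-8. That is only a guess based on the <80><99> shown by less, which
would then be the second and third bytes of a three-byte sequence:

```ruby
# Assumption: the "apostrophe" is U+2019 and the document is UTF-8.
apostrophe = "\u2019"

# List the UTF-8 bytes of the character as two-digit hex values.
bytes = apostrophe.encode("UTF-8").bytes.map { |b| format("%02X", b) }

puts bytes.join(" ")
```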

When the on_characters method is called for the string “90’s Music”, the
buffer contains only “90”, and no error or warning is raised. After
that, parsing continues normally; I first noticed the bug when I saw
that some of my strings were truncated. Is there some setting of libxml
or Ruby that I’ve overlooked that could cause this behavior?


#2

Peter Higgins wrote:

As part of researching the problem, I wrote a small test script with
REXML looking for that particular string, and it returned the correct,
full quote: “90’s Music”. It looks like this is a bug with libxml then,
so I’ll post on their mailing list.
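
A cross-check along those lines might look roughly like the sketch
below. The sample document and its element names (dict, key, string) are
made up for illustration and are not the original file; only the REXML
calls themselves are real:

```ruby
require "rexml/document"

# Hypothetical sample input; the real file's structure is not shown in
# this thread, so these element names are assumptions.
xml = "<dict><key>Name</key><string>90\u2019s Music</string></dict>"

doc = REXML::Document.new(xml)

# Fetch the text of the first <string> child of the root element.
name = doc.root.elements["string"].text

puts name
```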


#3

Any chance the quote is passed to another call to on_characters? I do
believe SAX does not always return all the content of a tag in one call
to the handler, but sometimes calls the handler several times and you
have to put it all together yourself.
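
A minimal sketch of that idea, assuming a handler whose callbacks are
named on_start_element, on_characters and on_end_element (only
on_characters is confirmed by this thread; the other two names, the
TextCollector class and the "string" element are hypothetical). The
parser is simulated by calling the handler directly, so the snippet does
not depend on libxml at all:

```ruby
# Hypothetical handler: buffers character data per element instead of
# assuming on_characters delivers the whole text in a single call.
class TextCollector
  attr_reader :texts

  def initialize
    @texts = []   # completed text of each element, in document order
    @buffer = nil # nil while we are outside any element
  end

  def on_start_element(_name, _attributes)
    @buffer = +"" # start a fresh, mutable buffer
  end

  def on_characters(chars)
    @buffer << chars if @buffer # append each chunk as it arrives
  end

  def on_end_element(_name)
    @texts << @buffer if @buffer # the text is complete only here
    @buffer = nil
  end
end

# Simulate a parser that splits the text at the non-ASCII character.
collector = TextCollector.new
collector.on_start_element("string", {})
collector.on_characters("90")
collector.on_characters("\u2019s Music")
collector.on_end_element("string")

puts collector.texts.first
```

This version ignores nesting: a child element would reset the buffer, so
real code would need a stack of buffers or per-element bookkeeping.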

Of course, it could also be Ruby being unable to handle the UTF-8.

Jenda