Change/ignore XML encoding?

Hey guys,

I think I am missing something very basic here. I have an XML request,
using the following code as an example:

require "rubygems"
require "xml/libxml"

movie = "sin+city"
search_url = "http://www.movie-xml.com/interfaces/getmovie.php?moviename="
url = search_url + movie
doc = XML::Document.file(url)

Now, with most of the XML results I get from movie-xml.com, the default
UTF-8 is fine since there are no bytes that are invalid as UTF-8. When
searching Sin City as an example, there are. Here’s the response I get:

Input is not proper UTF-8, indicate encoding !

The source XML has an encoding declared as such:

<?xml version="1.0" encoding="ISO-8859-1"?>

So I should probably just decode as ISO-8859-1 as well. How the hell do
I do that? I have Googled the crap out of this and just can’t seem to
find what I need here…

Travis B. wrote:

http://www.movie-xml.com/interfaces/getmovie.php?moviename=

<?xml version="1.0" encoding="ISO-8859-1"?>

So I should probably just decode as ISO-8859-1 as well. How the hell do
I do that? I have Googled the crap out of this and just can’t seem to
find what I need here…

Could this just be a bug in Libxml? REXML seems to do the right thing…
m.

On Aug 22, 12:32 pm, matt neuburg wrote:

So I should probably just decode as ISO-8859-1 as well. How the hell do
I do that? I have Googled the crap out of this and just can’t seem to
find what I need here…

Could this just be a bug in Libxml? REXML seems to do the right thing…

Clearly libxml is expecting UTF-8, even though the XML file specifies
that it’s encoded in ISO-8859-1. So that’s a bug.

However, it appears that libxml is “correctly” rejecting data that is
not proper UTF-8 (independent of what it claims to be). Twice in the
XML data the word “verg?enza” appears where the “?” has hex code 0xFC,
which encodes a lowercase “u” with umlaut in ISO-8859-1. Per RFC 3629,
0xFC can never appear in UTF-8 data.

Pure-ASCII ISO-8859-1 data parses fine as UTF-8, which is why most
documents work. Trouble starts with bytes of 0x80 and above, and 13
byte values (0xC0, 0xC1, 0xF5…0xFF) can never appear in valid UTF-8
at all.
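
To see the byte-level problem for yourself, here is a minimal sketch. It assumes a modern Ruby (2.0 or later) with String encodings, which postdates this thread; the variable names are only for illustration.

```ruby
# Bytes for "vergüenza" as ISO-8859-1: the ü is the single byte 0xFC.
latin1_bytes = "verg\xFCenza".b

# Tagged as UTF-8, the same bytes are not a valid sequence.
utf8_valid = latin1_bytes.dup.force_encoding("UTF-8").valid_encoding?

# Tagged as ISO-8859-1 and transcoded, they become valid UTF-8
# (the ü turns into the two bytes 0xC3 0xBC).
converted = latin1_bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts utf8_valid
puts converted.valid_encoding?
```

The first check is the same complaint libxml raises: 0xFC is not a legal UTF-8 byte, so the data has to be transcoded before it can be treated as UTF-8.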

Eric

Eric I. wrote:

Clearly libxml is expecting UTF-8, even though the XML file specifies
that it’s encoded in ISO-8859-1. So that’s a bug.

libxml should work with ISO-8859-1 data much of the time, as long as
it doesn’t contain 13 specific bytes (0xC0, 0xC1, 0xF5…0xFF).

Heh, so is there a way around this aside from using REXML? Are we
concluding this is a bug in libxml?

Travis B. wrote:

Eric I. wrote:

Clearly libxml is expecting UTF-8, even though the XML file specifies
that it’s encoded in ISO-8859-1. So that’s a bug.

libxml should work with ISO-8859-1 data much of the time, as long as
it doesn’t contain 13 specific bytes (0xC0, 0xC1, 0xF5…0xFF).

Heh, so is there a way around this aside from using REXML?

Well, if you really want to, I suppose you could parse the encoding info
yourself, convert the entire text to UTF-8, change the encoding
declaration to UTF-8 to match, and then open the result with libxml.
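
A rough sketch of that workaround, assuming a modern Ruby (1.9 or later) with String#encode. The helper name xml_to_utf8 and the sample document are hypothetical, and the final libxml call is left as a comment because the exact entry point depends on your libxml-ruby version.

```ruby
# Hypothetical helper: transcode an XML byte string to UTF-8 and
# rewrite its declaration so the two agree.
def xml_to_utf8(raw)
  bytes = raw.dup.force_encoding("BINARY")
  # Read the encoding named in the XML declaration; default to UTF-8.
  declared = bytes[/\A<\?xml[^>]*?encoding=["']([A-Za-z0-9._-]+)["']/n, 1] || "UTF-8"
  # Reinterpret the bytes in the declared encoding, then transcode.
  text = bytes.force_encoding(declared).encode("UTF-8")
  # Update the declaration to name the new encoding.
  text.sub(/\A(<\?xml[^>]*?encoding=["'])[^"']+(["'])/) { "#{$1}UTF-8#{$2}" }
end

# Stand-in for the HTTP response body (for example from Net::HTTP.get).
sample = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<title>verg\xFCenza</title>".b
utf8_xml = xml_to_utf8(sample)
puts utf8_xml

# utf8_xml can now go to a string-based parser entry point. With
# libxml-ruby that is assumed to be along the lines of
# XML::Parser.string(utf8_xml).parse; check your installed version.
```

This only handles a declaration at the very start of the document and trusts whatever encoding it names, so treat it as a stopgap rather than a fix.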

Are we
concluding this is a bug in libxml?

Not sure. Couldn’t hurt to report it, though. It has its own Google
group and its own bug reporting page… m.

matt neuburg wrote:

Well, if you really want to, I suppose you could parse the encoding info
yourself, convert the encoding of the entire text and change the
encoding info to utf8, and then open with libxml.

Are we
concluding this is a bug in libxml?

Not sure. Couldn’t hurt to report it, though. It has its own google
group and its own bug reporting page… m.

Right on. For now I just switched to rexml and without any special
change everything parses properly. Good for anyone else to know for
future reference.
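
For reference, a minimal REXML sketch of the same kind of document. The sample XML is made up for illustration (it is not the real movie-xml.com response), and it assumes the rexml library is available, which on recent Rubies means the bundled gem.

```ruby
require "rexml/document"

# Made-up document declared as ISO-8859-1; \xFC is the Latin-1 byte for "ü".
raw = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<title>verg\xFCenza</title>".b

# REXML reads the declared encoding and converts the content itself.
doc = REXML::Document.new(raw)
title = doc.root.text

puts title
puts title.encoding
```

No manual transcoding is needed here because REXML honors the encoding named in the XML declaration.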

Travis B. wrote:

Right on. For now I just switched to rexml and without any special
change everything parses properly. Good for anyone else to know for
future reference.

Okay, but that helps no one since you didn’t submit the bug. So I
submitted it for you. m.

matt neuburg wrote:

Travis B. wrote:

Right on. For now I just switched to rexml and without any special
change everything parses properly. Good for anyone else to know for
future reference.

Okay, but that helps no one since you didn’t submit the bug. So I
submitted it for you. m.

http://rubyforge.org/tracker/?func=detail&atid=1971&aid=21658&group_id=494

Here’s the link so the concerned can follow its status.

-Erik