XMLRPC (REXML) incorrectly handles UTF-8 data

Hi,
I’m running Ruby 1.9.2-p0 on CentOS 5.5 x86_64 along with Rails 2.3.8.

I have an XMLRPC server on another Windows machine (Ruby 1.9.1) and an
XMLRPC client on the CentOS machine. I need to return UTF-8 encoded data
from the server to the client, and this is where I’m stuck.

The server seems to be sending correct UTF-8 encoded data, but the client
is unable to parse the XML. If the XML contains ASCII-only strings,
everything’s OK, but once there is any multi-byte UTF-8 character, Ruby
bails out and outputs this:


REXML::ParseException (#<Encoding::CompatibilityError: incompatible
encoding regexp match (UTF-8 regexp with ASCII-8BIT string)>
/usr/local/lib/ruby/1.9.1/rexml/source.rb:212:in `match'
/usr/local/lib/ruby/1.9.1/rexml/source.rb:212:in `match'
/usr/local/lib/ruby/1.9.1/rexml/parsers/baseparser.rb:425:in `pull'
/usr/local/lib/ruby/1.9.1/rexml/parsers/streamparser.rb:16:in `parse'
/usr/local/lib/ruby/1.9.1/rexml/document.rb:204:in `parse_stream'
/usr/local/lib/ruby/1.9.1/xmlrpc/parser.rb:717:in `parse'
/usr/local/lib/ruby/1.9.1/xmlrpc/parser.rb:460:in `parseMethodResponse'
/usr/local/lib/ruby/1.9.1/xmlrpc/client.rb:421:in `call2'
/usr/local/lib/ruby/1.9.1/xmlrpc/client.rb:410:in `call'

There seems to be something wrong with REXML’s parsing of non-ASCII data,
or maybe with its encoding detection. I’ve tracked it down to the “match”
method in the IOSource wrapper class in rexml/source.rb. The problem seems
to be that @buffer, which the method matches against, is sometimes an
ASCII-8BIT string. Strangely, the error occurs only when @buffer contains
some non-ASCII data. If there are only ASCII characters in @buffer, the
match happily proceeds as UTF-8.
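
The behaviour described above can be reproduced outside REXML. This is only a minimal sketch using core Ruby; the names UTF8_PATTERN and match_outcome are made up for illustration and are not part of REXML, and the pattern is merely assumed to behave like the UTF-8 regexp named in the error message:

```ruby
# A regexp with the /u flag has a fixed UTF-8 encoding.
UTF8_PATTERN = /./u

# Hypothetical helper: reports what happens when str is matched.
def match_outcome(str)
  if str =~ UTF8_PATTERN
    :matched
  else
    :no_match
  end
rescue Encoding::CompatibilityError
  :incompatible
end

# "\u...." escapes keep this file ASCII-only; the text is Cyrillic.
ascii_binary    = "abc".dup.force_encoding("ASCII-8BIT")
cyrillic_binary = "\u0432\u0438\u0440\u0443\u0441".dup.force_encoding("ASCII-8BIT")
cyrillic_utf8   = cyrillic_binary.dup.force_encoding("UTF-8")

p match_outcome(ascii_binary)    # => :matched (ASCII-only bytes are compatible)
p match_outcome(cyrillic_binary) # => :incompatible
p match_outcome(cyrillic_utf8)   # => :matched
```

The bytes are identical in the last two cases; only the encoding tag on the string differs, which matches the observation that ASCII-only buffers never trigger the error.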

BTW, my client script looks like this:

module SubmitFilesHelper

  @rpc_server_url = 'http://172.16.1.2:3000'

  def self.sendToServer(filename, language)
    require 'xmlrpc/client'
    server = XMLRPC::Client.new2(@rpc_server_url)
    result = server.call('check', filename, language)
  end
end

The CentOS machine has its locale set to en_US.UTF-8.

Is there anything I’m doing wrong, or is it a Ruby bug?

Thanks,
Petr

Hm, it’s possible to encode the offending string to Base64 before
handing it to XMLRPC, effectively bypassing Ruby 1.9’s encoding
awareness. Not exactly what I would like to see…

Anyway, is there a correct solution to my problem? Base64 encoding is
a working solution, but not a correct one, as I’m manually bypassing a
language feature worth having.
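
For readers who want to see the Base64 detour spelled out, here is a rough sketch. It uses core String#unpack and Array#pack rather than any XMLRPC API, and the variable names and sample text are illustrative assumptions, not the code actually used on either machine:

```ruby
# Server side (sketch): encode the UTF-8 text so only ASCII goes over the wire.
original = "\u0432\u0438\u0440\u0443\u0441 EICAR_Test"
wire     = [original].pack("m0")   # Base64 without line breaks

# Client side (sketch): decode, then re-tag the raw bytes as UTF-8.
# unpack("m0") returns a binary (ASCII-8BIT) string.
decoded = wire.unpack("m0").first.force_encoding("UTF-8")

p wire.ascii_only?     # => true
p decoded == original  # => true
```

Because the payload is ASCII-only while it passes through the XML parser, the encoding check never fires; the cost is that both sides must agree on the extra encode/decode step.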

Cheers,
Petr

On Tue, Nov 16, 2010 at 10:37 PM, Petr K. wrote:

REXML::ParseException (#<Encoding::CompatibilityError: incompatible
encoding regexp match (UTF-8 regexp with ASCII-8BIT string)>

try,
Encoding.default_internal = Encoding.default_external = "UTF-8"

best regards -botp
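
A small sketch of how that suggestion might be applied and checked. Where the lines go (for example at the very top of the client script, before any I/O) is an assumption; a later post in this thread reports that it did not help under Passenger 3.0.0:

```ruby
# Set both defaults early, before any I/O or library code relies on them.
# (Ruby may print a warning about this when run with -w.)
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8

# Confirm what the process is now using.
puts Encoding.default_external.name
puts Encoding.default_internal.name
```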

Hi,

In “XMLRPC (REXML) incorrectly handles UTF-8 data” on Tue, 16 Nov 2010
23:37:48 +0900,
Petr K. wrote:

I have an XMLRPC server on another Windows machine (Ruby 1.9.1) and an
XMLRPC client on the CentOS machine. I need to return UTF-8 encoded data
from the server to the client, and this is where I’m stuck.

The server seems to be sending correct UTF-8 encoded data, but the client
is unable to parse the XML. If the XML contains ASCII-only strings,
everything’s OK, but once there is any multi-byte UTF-8 character, Ruby
bails out and outputs this:

Could you show us a reproducible example? We need at least
the HTTP response header and the XML response from your
XML-RPC server.

Thanks,

botp wrote in post #961846:

try,
Encoding.default_internal = Encoding.default_external = "UTF-8"

Damn, I have seen this before and I would swear I tried it and it didn’t
help (I was using 1.9.1 at the time). Hm, it must have slipped through
my fingers. Thanks a lot, it works now :)

Hi,

In “Re: XMLRPC (REXML) incorrectly handles UTF-8 data” on Thu, 18 Nov
2010 17:21:45 +0900,
Petr K. wrote:

Hi,
here is the reply from XMLRPC server:

HTTP header:

XML response (should be one line):

As you can see, there’s a correct UTF-8 string in Cyrillic in the middle
of the XML.

Thanks. I can reproduce it.
This has been fixed in trunk.

This is a problem in REXML, but the following code may work
around it. (I haven’t tried it. Sorry.)

module SubmitFilesHelper
  module XMLRPCWorkAround
    # Re-tag the raw response as UTF-8 before it reaches the parser.
    def do_rpc(request, async=false)
      data = super
      data.force_encoding("UTF-8")
      data
    end
  end

  @rpc_server_url = 'http://172.16.1.2:3000'

  def self.sendToServer(filename, language)
    require 'xmlrpc/client'
    server = XMLRPC::Client.new2(@rpc_server_url)
    server.extend(XMLRPCWorkAround)
    result = server.call('check', filename, language)
  end
end

Thanks,
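
The extend-and-super pattern used in that workaround can be tried without a server. The sketch below is only an illustration: FakeClient is a hypothetical stand-in invented here (it is not XMLRPC::Client), and it simply assumes the response body arrives tagged as ASCII-8BIT:

```ruby
# Hypothetical stand-in for the real client: returns a body tagged as binary.
class FakeClient
  def do_rpc(request, async = false)
    "<value>\u0432\u0438\u0440\u0443\u0441</value>".dup.force_encoding("ASCII-8BIT")
  end
end

# Same shape as the workaround module in the post above.
module ForceUtf8WorkAround
  def do_rpc(request, async = false)
    data = super                   # calls FakeClient#do_rpc with the same args
    data.force_encoding("UTF-8")   # re-tags in place; the bytes are unchanged
    data
  end
end

plain   = FakeClient.new
patched = FakeClient.new
patched.extend(ForceUtf8WorkAround)   # affects this one object only

plain_body   = plain.do_rpc("<request/>")
patched_body = patched.do_rpc("<request/>")

p plain_body.encoding    # binary (ASCII-8BIT)
p patched_body.encoding  # UTF-8
```

Extending a single object keeps the change local to that client instance instead of patching the class for the whole process.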

Hi,
here is the reply from XMLRPC server:

HTTP header:

HTTP/1.1 200: OK
Content-Length: 921
Content-Type: text/xml; charset=utf-8
Server: WEBrick/1.3.1 (Ruby/1.9.1/2010-01-10)
Date: Thu, 18 Nov 2010 07:57:17 GMT
Connection: Keep-Alive

XML response (should be one line):

<?xml version="1.0" ?>resultok

program_ver10.0.1153engine_ver10.0.424virus_db_ver42
4/3263
2010-11-1threat_descОпределен
вирус EICAR_Test </s
tring>infections_found1pup
s_found0infections_healed0
pups_healed0warnings</name

0


As you can see, there’s a correct UTF-8 string in Cyrillic in the middle
of the XML.

BTW, botp’s suggested solution (Encoding.default_internal =
Encoding.default_external = "UTF-8") doesn’t work under the Apache
module Passenger 3.0.0.