REXML: parsing a string with unescaped ampersand entities

Hi,

REXML seems to SOMETIMES choke on parsing ampersands within entities,
e.g.

string = '<?xml version="1.0" encoding="UTF-8"?>hello&world'
doc = Document.new(string)
puts "#{doc}"

works fine (output below):

<?xml version='1.0' encoding='UTF-8'?>hello&world

BUT:

string = '<?xml version="1.0" encoding="UTF-8"?>hello& world'
doc = Document.new(string)
puts "#{doc}"

crashes out with:

REXML::ParseException: #<RuntimeError: Illegal character '&' in raw string "hello& world">
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/text.rb:91:in `initialize'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:43:in `new'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:43:in `parse'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:190:in `build'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:45:in `initialize'
/Users/frankreiff/Live Developments/ruby/analyze/xml_parser.rb:102:in `new'
/Users/frankreiff/Live Developments/ruby/analyze/xml_parser.rb:102 …
Illegal character '&' in raw string "hello& world"
Line:
Position:
Last 80 unconsumed characters:

The difference is the space after the &

What is going on, and how can I fix this?
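One possible explanation, sketched under assumptions: if the check behind the "Illegal character" error only looks for a name-like token after the & and does not insist on the closing semicolon, then "&world" slips through while "& world" is rejected. The two patterns below are hypothetical and written only for illustration; they are not copied from REXML's source.

```ruby
# Hypothetical patterns for illustration; not taken from REXML.
NAME = /[A-Za-z_:][\w.:-]*/

# Lenient: '&' is accepted whenever a name-like token follows, even without ';'.
LENIENT_ILLEGAL = /&(?!#{NAME})/
# Strict: '&' must start a complete reference such as &amp;, &#38; or &#x26;.
STRICT_ILLEGAL = /&(?!(?:#{NAME}|#\d+|#x\h+);)/

# Returns true when the pattern finds an ampersand it considers illegal.
def illegal_ampersand?(text, pattern)
  !!(text =~ pattern)
end

['hello&world', 'hello& world', 'hello&amp;world'].each do |sample|
  lenient = illegal_ampersand?(sample, LENIENT_ILLEGAL)
  strict = illegal_ampersand?(sample, STRICT_ILLEGAL)
  puts "#{sample.inspect}: lenient=#{lenient} strict=#{strict}"
end
```

Under the lenient pattern only "hello& world" is flagged, which matches the behaviour described above; under the strict pattern both unescaped samples are flagged.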

Best regards,

Frank

Hi,

On 7-Dec-07, at 1:12 PM, Frank R. wrote:

works fine (output below):

<?xml version='1.0' encoding='UTF-8'?>hello&world

BUT:

string = '<?xml version="1.0" encoding="UTF-8"?>hello& world'
doc = Document.new(string)
puts "#{doc}"

[snip]

What is going on? and how can I fix this?

Neither is legal XML, both should fail. You can either escape the
content or use a CDATA block.
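A minimal sketch of both options, assuming the text lives inside a root element (the `msg` element name is made up for this example; the strings above have no root element at all):

```ruby
require 'rexml/document'

# Option 1: let REXML escape the text when the element is built.
escaped = REXML::Element.new('msg')
escaped.text = 'hello & world'
escaped_xml = escaped.to_s

# Option 2: wrap the raw text in a CDATA section by hand.
cdata_xml = '<msg><![CDATA[hello & world]]></msg>'

# Both forms parse, and both give back the original text.
[escaped_xml, cdata_xml].each do |xml|
  doc = REXML::Document.new(xml)
  puts "#{xml} => #{doc.root.text}"
end
```

Either way the parser sees well-formed input and hands back the text with a plain ampersand.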

Cheers,
Bob



Bob H. – tumblelog at
http://www.recursive.ca/so/
Recursive Design Inc. – weblog at
http://www.recursive.ca/hutch
http://www.recursive.ca/ – works on
http://www.raconteur.info/cms-for-static-content/home/

I think I might be on to something there…

Ok, it was in fact precisely that. When I call

cgi.params.to_s

I get the correctly formatted XML message, but when there is an & in the
message, the same call produces erratic output.

This is of course because:

The method params() returns a hash of all parameters in the request as
name/value-list pairs, where the value-list is an Array of one or more
values. The CGI object itself also behaves as a hash of parameter names
to values, but only returns a single value (as a String) for each
parameter name.

The output is therefore a fluke that’s solely based on the fact that
there is only one parameter.

Now my FINAL question to all the Ruby gurus:

  • How do I get the POST-ed message body without any clever splitting
    into key/value pairs!?
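One approach that should work in a plain CGI script is to read the body straight from standard input, using the CONTENT_LENGTH environment variable, before any CGI object is created (as far as I know, CGI.new consumes the POST body itself). This is a sketch under those assumptions; the helper name `read_raw_post_body` is made up, and the request is simulated with a StringIO so the snippet runs outside a web server.

```ruby
require 'stringio'

# Reads the request body exactly as sent, without any form decoding.
# env and input default to the process environment and standard input,
# which is where a CGI script receives the request.
def read_raw_post_body(env = ENV, input = $stdin)
  length = env['CONTENT_LENGTH'].to_i
  return '' if length <= 0

  input.read(length) || ''
end

# Simulated request for illustration only.
body = '<msg>cheese & coffee</msg>'
fake_env = { 'CONTENT_LENGTH' => body.bytesize.to_s }
puts read_raw_post_body(fake_env, StringIO.new(body))
```

In a real script you would call `read_raw_post_body` with no arguments at the very top, and pass the resulting string to the XML parser yourself.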

Neither is legal XML, both should fail. You can either escape the
content or use a CDATA block.

You’re of course right. Both are illegal.

Somebody suggested to me that the original problem might be caused by
incorrectly encoded entities (&amp; &quot;), and reading through the W3C
spec (always a bad idea) confused me to the extent of believing that
you only had to encode character entities in attribute values, which
isn’t the case. It can’t in fact be the case, otherwise the parser couldn’t
differentiate between a “normal” ampersand and the beginning of a
character entity.

Which brings me back to my original problem of receiving truncated XML
in an HTTP POST (see my previous question). This ONLY HAPPENS when there
is an ampersand somewhere in the message.

Could it be that CGI.params behaves differently when there is an
ampersand in the request, e.g. it tries to parse the request into
key/value pairs and returns a hash rather than a simple string in that
case!?
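To see why an ampersand would matter, here is a deliberately simplified model of form decoding: split the body on &, then split each piece on the first =. The helper `naive_form_split` is hypothetical and is not CGI's actual implementation; it only illustrates the general idea.

```ruby
# Simplified model of form decoding, for illustration only.
# Each '&'-separated piece becomes a key; the part after '=' becomes a value.
def naive_form_split(body)
  body.split('&').each_with_object(Hash.new { |h, k| h[k] = [] }) do |pair, params|
    key, value = pair.split('=', 2)
    params[key] << value.to_s
  end
end

xml = '<msg>cheese & coffee</msg>'
naive_form_split(xml).each do |key, values|
  puts "#{key.inspect} => #{values.inspect}"
end
```

With no ampersand in the body there is a single key holding the entire message, which is why it looks like a plain string; with one ampersand the message is cut into two keys.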

I think I might be on to something there…

Did anyone actually find a solution to this issue?

I have an ActiveResource object which returns back some xml content (as
a result of a web service call) with unescaped "& " and this causes the
same issue.

Wondering if we can customize the method call to
REXML::Text.initialize() and set raw=>false.

Tried to change the setting in the REXML library that comes with the
Ruby distribution, but that did not help.

Is there a way to overcome this issue without patching any Ruby or
Rails code? I cannot change the content on the web service server.

Thanks,
Maruthy.


On Aug 24, 2009, at 15:18, Engine Y. wrote:

Did anyone actually find out a solution to this issue.

I have an ActiveResource object which returns back some xml content (as
a result of a web service call) with unescaped "& " and this causes the
same issue.

These two statements are contradictory. It can’t both be XML and have
unescaped &.

Wondering if we can customize the method call to
REXML::Text.initialize() and set raw=>false.

Tried to change the setting in the REXML library that comes with the
Ruby distribution, but that did not help.

Is there a way to overcome this issue without patching any of Ruby or
Rails code. I cannot change the content on the web service server.

Use a parser that handles errors:

$ ruby -rubygems -e 'require "nokogiri"; d = Nokogiri::XML "&<bar/

"; p d.errors; p d'
[#<Nokogiri::XML::SyntaxError: xmlParseEntityRef: no name>]

<?xml version="1.0"?>

$

PS: The name on your email account is odd.

On Aug 24, 2009, at 15:57, Engine Y. wrote:

Sorry, maybe I was unclear.

I understood.

What happens is: the xml that is sent back contains something like
“cheese & coffee”

Right. This is not XML because & is not escaped.

so this breaks REXML, and since the call to REXML’s method is made
from the ActiveResource object itself we are unable to customize how
REXML’s initialize method is called.

When I modified the line “cheese & coffee” to “cheese &amp; coffee”
everything works fine

Yep, this is escaped and valid for XML. You should file a bug with
the website you’re consuming telling them they have broken XML output.

That doesn’t help you solve your problem though! :)

#<RuntimeError: Illegal character '&' in raw string "cheese & coffee">
[…]

Yes. ActiveResource and REXML are behaving correctly. They’re not
really sure what to do with text you’ve given them as it’s not XML.

Since you have to deal with reality, you’ll want a forgiving XML
parser that can handle some invalid XML when you really need to.

How can we change what parser is being used by Rails. I would definitely
like to try out Nokogiri but am unsure about how to make it work with
Rails (2.3.2) and in specific with ActiveResource though.

I’m not sure either. You could either use a different tool than
ActiveResource or try contacting the ActiveResource maintainers for
help in adding the option of being forgiving of invalid XML.
(Nokogiri is good at correcting invalid XML, so I suggested it.)
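If switching parsers is not practical, a different workaround is to repair the text before a strict parser sees it. The sketch below swaps in a plain regex pre-pass (not Nokogiri's recovery) that escapes any & which does not already start an entity or character reference. The names `BARE_AMPERSAND` and `escape_bare_ampersands` are made up for this example, and where such a step would hook into ActiveResource is left open.

```ruby
require 'rexml/document'

# Matches '&' unless it begins a reference such as &amp;, &#38; or &#x26;.
BARE_AMPERSAND = /&(?!(?:[A-Za-z_:][\w.:-]*|#\d+|#x\h+);)/

# Escapes bare ampersands and leaves existing references untouched.
def escape_bare_ampersands(xml)
  xml.gsub(BARE_AMPERSAND, '&amp;')
end

broken = '<item>cheese & coffee, tea &amp; scones</item>'
fixed = escape_bare_ampersands(broken)

doc = REXML::Document.new(fixed)
puts fixed
puts doc.root.text
```

This is a blunt instrument: it would also rewrite ampersands inside CDATA sections, where they are legal, so it only suits feeds that are known not to use CDATA.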

Sorry, maybe I was unclear. What happens is: the xml that is sent back
contains something like “cheese & coffee”, so this breaks REXML, and
since the call to REXML’s method is made from the ActiveResource object
itself, we are unable to customize how REXML’s initialize method is
called.

When I modified the line “cheese & coffee” to “cheese &amp; coffee”
everything works fine. So I am pretty sure this issue is being caused
by the unescaped ampersand that is contained in the xml. But I
cannot modify the xml content as it comes to me from a third party
source over which we have no control.

Here is the stack trace that might help clearing up things:

--- !ruby/exception:REXML::ParseException
message: |-
  #
  /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/text.rb:91:in `initialize'
  /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:43:in `new'
  /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:43:in `parse'
  /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:227:in `build'
  /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:43:in `initialize'
  /Users/mxx/.gem/ruby/1.8/gems/activesupport-2.3.2/lib/active_support/xml_mini/rexml.rb:17:in `new'
  /Users/mxx/.gem/ruby/1.8/gems/activesupport-2.3.2/lib/active_support/xml_mini/rexml.rb:17:in `parse'
  (__DELEGATION__):2:in `__send__'
  (__DELEGATION__):2:in `parse'
  /Users/mxx/.gem/ruby/1.8/gems/activesupport-2.3.2/lib/active_support/core_ext/hash/conversions.rb:154:in `from_xml'
  /Library/Ruby/Gems/1.8/gems/activeresource-2.3.2/lib/active_resource/formats/xml_format.rb:19:in `decode'
  /Library/Ruby/Gems/1.8/gems/activeresource-2.3.2/lib/active_resource/connection.rb:116:in `get'
  /Library/Ruby/Gems/1.8/gems/activeresource-2.3.2/lib/active_resource/base.rb:587:in `find_one'
  /Library/Ruby/Gems/1.8/gems/activeresource-2.3.2/lib/active_resource/base.rb:522:in `find'

How can we change which parser is used by Rails? I would definitely
like to try out Nokogiri but am unsure how to make it work with
Rails (2.3.2), and specifically with ActiveResource.
