Hi all,
I'm working with some UTF-8 data, and if I run this:
require 'rexml/document'
data = "\302"
doc = REXML::Document.new(data)
I get an error that says I did not close the tag:
REXML::ParseException: #<REXML::ParseException: No close tag for ["name"]>
/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:26:in `parse'
/usr/lib/ruby/1.8/rexml/document.rb:190:in `build'
/usr/lib/ruby/1.8/rexml/document.rb:45:in `initialize'
(irb):48:in `new'
(irb):48:in `irb_binding'
/usr/lib/ruby/1.8/irb/workspace.rb:52:in `irb_binding'
/usr/lib/ruby/1.8/irb/workspace.rb:52
…
No close tag for ["name"]
Line:
Position:
Last 80 unconsumed characters:
from /usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:89:in `parse'
from /usr/lib/ruby/1.8/rexml/document.rb:190:in `build'
from /usr/lib/ruby/1.8/rexml/document.rb:45:in `initialize'
from (irb):48:in `new'
from (irb):48
The code only works if I use single quotes instead,
i.e.
doc = REXML::Document.new('\302')
But since data is a variable, I can't simply declare it with single
quotes.
Any ideas why REXML::Document doesn't parse this properly? Or is
there a way around it? Maybe I can convert to some other character
encoding to avoid the problem…
Best regards,
Jesse
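A note on the single-quote observation above: the two literals are not the same data. The short sketch below (assuming a modern Ruby, 2.0 or later, where String#bytes returns an array) shows that double quotes interpret the octal escape, while single quotes keep the four characters as typed, which is plain ASCII and therefore parses without complaint.

```ruby
# Double quotes interpret the octal escape: a single byte, 0xC2.
double_quoted = "\302"
# Single quotes keep the text literally: backslash, '3', '0', '2'.
single_quoted = '\302'

p double_quoted.bytes  # => [194]
p single_quoted.bytes  # => [92, 51, 48, 50]
```

So the single-quoted version "works" only because it no longer contains the problematic byte.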
Hi,
In message “Re: REXML::Document could not parse UTF-8
“\302””
on Sat, 5 Jan 2008 02:40:00 +0900, “Jesse P.” [email protected]
writes:
|I'm working with some UTF-8 data, and if I run this:
|
|require 'rexml/document'
|data = "\302"
|doc = REXML::Document.new(data)
"\302" is not a valid UTF-8 byte sequence. The rest is
up to you, once you recognize that you are working with non-UTF-8 data.
matz.
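To make the point above concrete: 0xC2 is the lead byte of a two-byte UTF-8 sequence, so on its own it is incomplete. A minimal check, assuming Ruby 1.9 or later (String#valid_encoding? does not exist in the 1.8 interpreter shown in the backtrace), could look like this:

```ruby
# Tag the string explicitly as UTF-8 so the check does not depend on
# the encoding of the source file.
data = "\302".dup.force_encoding(Encoding::UTF_8)
puts data.valid_encoding?      # => false (0xC2 lacks its continuation byte)

# 0xC2 followed by 0xA9 is a complete sequence: U+00A9, the copyright sign.
complete = "\302\251".dup.force_encoding(Encoding::UTF_8)
puts complete.valid_encoding?  # => true
```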
Hi Matz,
Thanks for your help. So I guess my problem is this:
- I get an XML document that is declared to be valid UTF-8, but
- when I process some of the values, as you pointed out, some of them
are not valid UTF-8, and
- this causes a lot of problems when parsed by REXML.
For a string of characters (e.g. some XML file), is there any way I can
detect just the non-UTF-8 characters and convert them to UTF-8?
This way I can make sure that what REXML processes is valid UTF-8
without unnecessarily processing characters in the string that are
already valid UTF-8.
Best regards,
Jesse
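One possible way to do what is asked above, sketched for a modern Ruby (String#scrub needs 2.1 or later, so this would not run on 1.8). The helper name repair_utf8 is hypothetical, and the idea that the stray bytes are really ISO-8859-1 is an assumption; the true source encoding has to be known or guessed. Valid UTF-8 sequences pass through unchanged, and only the invalid byte runs reach the block.

```ruby
# Hypothetical helper: keeps valid UTF-8 as-is and reinterprets each
# invalid byte run as `fallback` (ISO-8859-1 here, which is an assumption
# about where the broken bytes came from).
def repair_utf8(str, fallback = Encoding::ISO_8859_1)
  str.dup.force_encoding(Encoding::UTF_8).scrub do |bad_bytes|
    bad_bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
  end
end

# 0xE9 is a Latin-1 "e with acute" (invalid on its own in UTF-8);
# 0xC2 0xA9 is already a valid UTF-8 copyright sign.
broken = "<name>caf\351 \302\251</name>"
fixed  = repair_utf8(broken)

puts fixed.valid_encoding?  # => true
```

A caveat: a stray byte that happens to be followed by bytes forming a valid UTF-8 sequence will not be detected, so this is a heuristic rather than a guarantee.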
Hi,
In message “Re: REXML::Document could not parse UTF-8
“\302””
on Sun, 6 Jan 2008 03:00:04 +0900, “Jesse P.” [email protected]
writes:
|Thanks for your help. So I guess my problem is this:
|1. I get an XML document that is declared to be valid UTF-8, but
|2. when I process some of the values, as you pointed out, some of them
|are not valid UTF-8, and
|3. this causes a lot of problems when parsed by REXML.
|
|For a string of characters (e.g. some XML file), is there any way I can
|detect just the non-UTF-8 characters and convert them to UTF-8?
I guess you first have to define what you want to do with this broken
UTF-8 data. As long as you treat the data as UTF-8, it is impossible
to process it correctly. You can
- fix the data before reading it via REXML
- parse data as Latin-1 or some other single byte encoding
- replace the broken data with some valid UTF-8 sequence
But YMMV.
matz.
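The second and third options above can be sketched as follows, again assuming a modern Ruby (2.1 or later) rather than the 1.8 used in the thread. The <name> wrapper is a guess based on the error message, not the poster's actual input. Either resulting string is valid UTF-8 and could then be handed to REXML::Document.new.

```ruby
# Hypothetical input: a lone 0xC2 byte inside an element.
raw = "<name>\302</name>".dup.force_encoding(Encoding::UTF_8)

# Option: treat every byte as ISO-8859-1 and transcode to UTF-8.
# Every byte maps to some character, but any genuine multi-byte UTF-8
# text in the input would be mangled by this reinterpretation.
as_latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

# Option: keep treating the data as UTF-8 and replace each broken
# sequence with U+FFFD (REPLACEMENT CHARACTER).
replaced = raw.scrub("\uFFFD")

puts as_latin1.valid_encoding?  # => true
puts replaced.valid_encoding?   # => true
```

Which one is appropriate depends on whether the stray bytes carry meaning (transcode) or are simply damage (replace).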