How to clean XML files of non-UTF-8 chars?

Hi,

I have a problem. I’m trying to parse with Ruby some test results from
JMeter that are stored in XML files. Unfortunately, while they should be
UTF-8, some of them aren’t, probably because some DB data isn’t. In any
case, this makes other tools break down, like XSLT transformation and
anything else that relies on the XML files being UTF-8.

Does anyone know how to get rid of such characters? When opened in an
editor like Kate, they show as a white question mark in a black square.
I don’t really care much about the data; if it’s missing some chars,
nobody will care. The point is not to destroy the XML structure, so that
the other tools keep working. Any help will be greatly appreciated.

Cheers,
Chris

If you really don’t care about the content:
str.gsub(/[\x80-\xff]/, '?')

On Sep 17, 2008, at 5:15 AM, Brian C. wrote:

If you really don’t care about the content:
str.gsub(/[\x80-\xff]/, '?')

You can have bytes in that range inside a well-formed UTF-8 byte
sequence; they just can’t stand alone as a single-byte character. It’s
just not that simple.

-Rob

Rob B. http://agileconsultingllc.com
[email protected]

On Sep 17, 5:07 am, Krzysieq [email protected] wrote:


Hi,

I have a problem. I’m trying to parse with Ruby some test results from
JMeter that are stored in XML files. Unfortunately, while they should be
UTF-8, some of them aren’t, probably because some DB data isn’t. In any
case, this makes other tools break down, like XSLT transformation and
anything else that relies on the XML files being UTF-8.

Look at “Vectoring Ruby On Rails: Encoding problems”, particularly the
“iconvert” method, which attempts conversion to UTF-8 but, where the
string cannot be converted to UTF-8 (e.g. double-byte chars), replaces
the chars with “?”.

– Mark.

On Sep 17, 2008, at 4:07 AM, Krzysieq wrote:

I have a problem. I’m trying to parse with Ruby some test results from
JMeter that are stored in XML files. Unfortunately, while they should
be UTF-8, some of them aren’t, probably because some DB data isn’t. In
any case, this makes other tools break down, like XSLT transformation
and anything else that relies on the XML files being UTF-8.

Does anyone know how to get rid of such characters?

If you can figure out the encoding they are actually in, I recommend
using Iconv’s transliterate mode:

require "iconv"
Iconv.conv("UTF-8//TRANSLIT", old_encoding_name, data)

James Edward G. II

Hey,

Thanks for the input. So do you have another suggestion?

Cheers,
Chris

2008/9/17 Rob B. [email protected]

Rob B. wrote:

On Sep 17, 2008, at 5:15 AM, Brian C. wrote:

If you really don’t care about the content:
str.gsub(/[\x80-\xff]/, '?')

You can have bytes in that range inside a well-formed UTF-8 byte
sequence; they just can’t stand alone as a single-byte character. It’s
just not that simple.

That’s why I said “if you really don’t care” … it strips all valid
non-ASCII UTF-8 as well as the invalid bytes.

There is a nice table in the Wikipedia article on UTF-8 which would
let you build something more accurate. Ruby quiz, perhaps? :)
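For readers on modern Ruby (2.1+, long after this thread): the language eventually shipped exactly that more accurate version as String#scrub, which replaces only the byte sequences that are invalid UTF-8 and leaves well-formed multi-byte characters alone. A minimal sketch with made-up bytes:

```ruby
# \xC3\xA9 is a valid UTF-8 "é"; the lone \xFF is an invalid byte.
raw = "ok \xC3\xA9 bad \xFF end".b.force_encoding("UTF-8")
puts raw.valid_encoding?    # false
clean = raw.scrub("?")      # replaces only the invalid sequence
puts clean                  # "ok é bad ? end"
puts clean.valid_encoding?  # true
```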

On Wed, Sep 17, 2008 at 12:47 PM, Jeremy H.
[email protected] wrote:

module UTF8
module Cleanable
#
# Converts the string representation of this class to a utf8 clean
# string. This assumes that #to_s on the object will result in a utf8
# string. All chars that are not valid utf8 char sequences will be
# silently dropped.

To silently drop chars with IConv, you’d want to do:

Iconv.conv("UTF-8//IGNORE", old_encoding_name, data)

TRANSLIT just works a little harder and tries to convert your
characters into a series of UTF-8 chars if possible.
I’m not sure if it drops chars that can’t be transliterated…

-greg

On Wed, Sep 17, 2008 at 09:44:23PM +0900, James G. wrote:

If you can figure out the encoding they are actually in, I recommend using
Iconv’s transliterate mode:

require "iconv"
Iconv.conv("UTF-8//TRANSLIT", old_encoding_name, data)

This is the approach we have taken in some of our code; basically we
wanted to replicate the ‘iconv -c’ behavior. Does TRANSLIT do this?
I’ve never used that mode before.

require 'iconv'
require 'stringio'

module UTF8
  module Cleanable
    #
    # Converts the string representation of this class to a utf8-clean
    # string. This assumes that #to_s on the object will result in a
    # utf8 string. All chars that are not valid utf8 char sequences
    # will be silently dropped.
    #
    def utf8_clean
      Iconv.open( "UTF-8", "UTF-8" ) do |iconv|
        output = StringIO.new
        working = self.to_s
        loop do
          begin
            output.print iconv.iconv( working )
            break
          rescue Iconv::IllegalSequence => is
            # Keep what converted so far, skip the offending byte,
            # and retry from the next byte.
            output.print is.success
            working = is.failed[1..-1]
          end
        end
        return output.string
      end
    end
  end
end

class String
  include UTF8::Cleanable
end

class String
include UTF8::Cleanable
end

enjoy,

-jeremy

On Sep 17, 2008, at 11:47 AM, Jeremy H. wrote:

This is the approach we have taken in some of our code; basically we
wanted to replicate the ‘iconv -c’ behavior. Does TRANSLIT do this?
I’ve never used that mode before.

//TRANSLIT is better than that: it tries to translate the characters.
Thus a UTF-8 ellipsis would become three periods if converted to
ISO-8859-1 with //TRANSLIT.

You can mimic -c though, just use //IGNORE instead of //TRANSLIT. You
can even do //TRANSLIT//IGNORE which transliterates what it can and
discards the rest.

James Edward G. II
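Iconv was later removed from Ruby’s stdlib (as of 2.0), but String#encode exposes similar knobs on modern Ruby; a hedged sketch of the //TRANSLIT and //IGNORE roles (the ellipsis mapping is supplied by hand here, it is not built in):

```ruby
# :fallback supplies transliterations by hand, like //TRANSLIT would;
# undef: :replace with replace: "" simply drops, like //IGNORE.
latin = "a…b".encode("ISO-8859-1", fallback: { "…" => "..." })
puts latin                                        # "a...b"
dropped = "a…b".encode("ISO-8859-1", undef: :replace, replace: "")
puts dropped                                      # "ab"
```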

On Sep 18, 9:25 am, Krzysieq [email protected] wrote:


Unfortunately, there’s no way of telling the original encoding. I would
rather go for some method of removing/substituting the chars that don’t
belong there, but the method first suggested by Brian doesn’t seem to
work for some reason. Does anyone have another option?

Try the iconv solutions with Latin-1 (ISO-8859-1) as the source
encoding. That’s as close as you can get to a one-byte, anything-goes
encoding.

-Mark.
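Mark’s trick works because all 256 byte values are defined characters in ISO-8859-1, so the conversion can never fail; on modern Ruby the same move looks like this (sample bytes made up):

```ruby
# Reinterpret arbitrary bytes as Latin-1, then convert to UTF-8.
# This never raises, but bytes that were already UTF-8 come out mangled.
raw  = "caf\xE9".b                    # "café" in Latin-1 (0xE9 = é)
utf8 = raw.force_encoding("ISO-8859-1").encode("UTF-8")
puts utf8                             # "café", now valid UTF-8
```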

Unfortunately, there’s no way of telling the original encoding. I would
rather go for some method of removing/substituting the chars that don’t
belong there, but the method first suggested by Brian doesn’t seem to
work for some reason. Does anyone have another option? I’m investigating
the reasons for the failure; I will write more when I know something.
Thanks for all the help anyway :)

Cheers,
Chris

2008/9/17 James G. [email protected]

On Thu, Sep 18, 2008 at 9:25 AM, Krzysieq [email protected] wrote:

Unfortunately, there’s no way of telling the original encoding. I would
rather go for some method of removing/substituting the chars that don’t
belong there, but the method first suggested by Brian doesn’t seem to
work for some reason. Does anyone have another option? I’m investigating
the reasons for the failure; I will write more when I know something.
Thanks for all the help anyway :)

If there is no way of telling the original encoding, the input data
may not have valid unicode in it at all, right?

-greg
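On Ruby 1.9+ you can test Greg’s hypothesis directly by tagging the bytes as UTF-8 and asking the string whether it holds up (sample bytes made up):

```ruby
good = "ok \xC3\xA9".b.force_encoding("UTF-8")  # valid two-byte é
bad  = "ok \xFF".b.force_encoding("UTF-8")      # stray byte
puts good.valid_encoding?   # true
puts bad.valid_encoding?    # false
```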

How is the XML file created? If you know in advance which parts of the
XML come from the database, wrap those sections in CDATA blocks and
your XML will remain valid.
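A sketch of that wrapping (the helper name is made up). One caveat: CDATA protects markup characters such as < and &, but it does not make invalid byte sequences legal, so it only helps if the database text is valid UTF-8 that merely looks like markup. Any literal "]]>" in the data must be split, since it would otherwise end the section early:

```ruby
# Wrap text in a CDATA section, splitting embedded "]]>" terminators.
def cdata(text)
  "<![CDATA[" + text.gsub("]]>", "]]]]><![CDATA[>") + "]]>"
end

puts cdata("if a < b && c > d")   # <![CDATA[if a < b && c > d]]>
```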

OK, I tried all the previous suggestions and none worked (the gsub
idea, TRANSLIT, IGNORE, or the one from the link posted by Mark T.). In
fact, the last two don’t seem to have done anything, while gsub seems
to do too much: it has damaged the XML structure in some way, which
seems very strange to me. I don’t really care about the data inside,
but I need the XML to remain valid.

@Gregory - that’s true, it may not. However, the places where I found
the funny characters are text nodes inside XML documents, and there
aren’t that many of them. Sure, even one is enough to break the whole
thing, but typically there are very few, and it seems more like
corrupted database data. I think they store some newspaper articles
there, or pieces of news. I learned from the team who maintain that
database in their app that typically it should all be ISO-8859-1, but
for some reason that’s not always the case. Hence the corrupted-data
theory seems quite likely.

Thanks for any help you can provide me with :)
Cheers,
Chris

2008/9/18 Mark T. [email protected]

Silly answer, but what is $KCODE?? I’m relatively new to Ruby, so this
tells me nothing… And as you might have guessed, no, I haven’t set it.
What’s it do? :)

Cheers,
Chris

2008/9/19 Gregory B. [email protected]

On Fri, Sep 19, 2008 at 9:00 AM, Krzysieq [email protected] wrote:

Silly answer, but what is $KCODE?? I’m relatively new to Ruby, so this
tells me nothing… And as you might have guessed, no, I haven’t set it.
What’s it do? :)

It tells Ruby that you are working with UTF-8 ;)

-greg

On Sep 19, 2008, at 8:00 AM, Krzysieq wrote:

Silly answer, but what is $KCODE??

It’s a global variable that affects how Ruby 1.8 handles characters.

And as you might have guessed, no, I haven’t set it.

Does your code run inside of a recent version of Rails? I’m just
asking because it sets $KCODE for you.

James Edward G. II

On Fri, Sep 19, 2008 at 7:07 AM, Krzysieq [email protected] wrote:

Thanks for any help you can provide me with :)

Silly question, but did you set $KCODE = "U" while processing your data?

-greg