Iconv problem - not handling \r correctly


#1

I have an XML file that I need to process. I’m working in the Windows
environment. Here is the head of the file:

0000000000 377 376 < \0 A \0 u \0 t \0 o \0 S \0 t \0
0000000020 a \0 t \0 \0 J \0 a \0 v \0 a \0 C \0
0000000040 l \0 a \0 s \0 s \0 = \0 " \0 c \0 o \0
0000000060 m \0 . \0 a \0 u \0 t \0 o \0 s \0 i \0
0000000100 m \0 . \0 a \0 s \0 t \0 . \0 a \0 u \0
0000000120 t \0 o \0 m \0 o \0 d \0 . \0 A \0 M \0
0000000140 H \0 e \0 a \0 d \0 e \0 r \0 " \0 > \0
0000000160 \r \0 \n \0 \t \0 < \0 B \0 a \0 s \0 e \0
0000000200 A \0 u \0 t \0 o \0 S \0 t \0 a \0 t \0
0000000220 \0 J \0 a \0 v \0 a \0 C \0 l \0 a \0

Notice the character sequence \r \0 \n \0.

I need to edit some of the text elements in this file. I have used both
REXML and Hpricot to edit the file successfully, after converting to
UTF-8. Here is the head of the UTF-8 file:

0000000000 < A u t o S t a t J a v a C l
0000000020 a s s = ' c o m . a u t o s i m
0000000040 . a s t . a u t o m o d . A M H
0000000060 e a d e r ' > \r \n \t < B a s e A
0000000100 u t o S t a t J a v a C l a s
0000000120 s = ' c o m . a u t o s i m . a
0000000140 s t . a u t o m o d . A M H e a
0000000160 d e r ' S a v e F i l e V e r
0000000200 s i o n = ' 1 . 3 ' > \r \n \t \t <
0000000220 P r o p e r t i e s J a v a C

Notice that \r \n shows up in the next to last line.

Now in order for the edited XML file to work with my original
application, I need to convert back to UTF-16. Here is the code that I
use:

file = File.read("sta_utf8.xml")
conv = Iconv.new("UTF-16LE", "UTF-8")
result = conv.iconv(file)
result = 0xFF.chr << 0xFE.chr << result
file = File.new("sta_utf16.xml", "w")
file.write(result)
file.close

The resulting file (sta_utf16.xml) looks like:

0000000000 377 376 < \0 A \0 u \0 t \0 o \0 S \0 t \0
0000000020 a \0 t \0 \0 J \0 a \0 v \0 a \0 C \0
0000000040 l \0 a \0 s \0 s \0 = \0 ' \0 c \0 o \0
0000000060 m \0 . \0 a \0 u \0 t \0 o \0 s \0 i \0
0000000100 m \0 . \0 a \0 s \0 t \0 . \0 a \0 u \0
0000000120 t \0 o \0 m \0 o \0 d \0 . \0 A \0 M \0
0000000140 H \0 e \0 a \0 d \0 e \0 r \0 ' \0 > \0
0000000160 \r \n \0 \t \0 < \0 B \0 a \0 s \0 e \0 A
0000000200 \0 u \0 t \0 o \0 S \0 t \0 a \0 t \0
0000000220 \0 J \0 a \0 v \0 a \0 C \0 l \0 a \0 s

Notice that the \r does not have a \0 following it. This means that
every other line in my sta_utf16.xml file is in the wrong byte order and
I get garbled results:

à¨à¤€ã°€äˆ€æ„€çŒ€æ”€ä„€ç”€ç€æ¼€åŒ€ç€æ„€ç€â€€ä¨€æ„€ç˜€æ„€äŒ€æ°€æ„€çŒ€çŒ€ã´€âœ€æŒ€æ¼€æ´€â¸€æ„€ç”€ç€æ¼€çŒ€æ¤€æ´€â¸€æ„€çŒ€ç€â¸€æ„€ç”€ç€æ¼€æ´€æ¼€æ€â¸€ä„€ä´€ä €æ”€æ„€æ€æ”€çˆ€âœ€â€€åŒ€æ„€ç˜€æ”€ä˜€æ¤€æ°€æ”€å˜€æ”€çˆ€çŒ€æ¤€æ¼€æ¸€ã´€âœ€ã„€â¸€ãŒ€âœ€ã¸€à´€

Is this a defect in Iconv?

Thanks,
LG


#2

2008/10/27 Louise R. removed_email_address@domain.invalid:

Is this a defect in Iconv?

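No, this isn't a defect in Iconv. It comes from how Windows handles files opened in text mode: reading in text mode turns \r\n into \n, and writing in text mode puts a single \r byte in front of every \n byte. That single byte is the \r without a \0 that you see in your output.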
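
The effect can be reproduced without touching any files. The following is only a sketch: it assumes Ruby 2.0 or later, uses String#encode in place of Iconv, and the helper names (hex, simulate_text_mode_write) are made up for illustration. It imitates the write-side translation on a short UTF-16LE string.

```ruby
# Formats a string's bytes as space-separated hex, for easy comparison.
def hex(str)
  str.bytes.map { |b| format("%02x", b) }.join(" ")
end

# Imitates a Windows text-mode write: a CR byte (0x0D) is placed in front
# of every LF byte (0x0A), with no knowledge of the text encoding.
def simulate_text_mode_write(bytes)
  bytes.b.gsub("\n".b, "\r\n".b)
end

# The text-mode read had already reduced "\r\n" to "\n",
# so the converter only ever saw a bare "\n".
utf16   = ">\n\t<".encode("UTF-16LE")
written = simulate_text_mode_write(utf16)

puts hex(utf16)    # 3e 00 0a 00 09 00 3c 00
puts hex(written)  # 3e 00 0d 0a 00 09 00 3c 00
```

The extra 0d makes the byte count odd, so every 16-bit unit after it is read one byte out of step until the next inserted byte shifts it back.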

Use binary flag for file handling like this:

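file = File.open("sta_utf8.xml", "rb").read   # "rb": read raw bytes, \r\n is left alone
conv = Iconv.new("UTF-16LE", "UTF-8")
result = conv.iconv(file)
result = 0xFF.chr << 0xFE.chr << result       # prepend the UTF-16LE byte order mark
file = File.new("sta_utf16.xml", "wb")        # "wb": write raw bytes, no \r is inserted
file.write(result)
file.close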
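
On Ruby versions where Iconv is no longer bundled, the same binary-mode approach could look roughly like this. It is a sketch under assumptions: Ruby 2.0 or later, String#encode standing in for Iconv, and a hypothetical helper name (utf8_to_utf16le_file) and constant that are not part of any library.

```ruby
# Byte order mark for little-endian UTF-16, as raw bytes.
BOM_UTF16LE = "\xFF\xFE".b

# Reads a UTF-8 file as raw bytes, converts it to UTF-16LE and writes it
# back out as raw bytes, so no newline translation happens on either side.
def utf8_to_utf16le_file(src_path, dst_path)
  utf8  = File.binread(src_path).force_encoding("UTF-8")
  utf16 = utf8.encode("UTF-16LE")
  File.binwrite(dst_path, BOM_UTF16LE + utf16.b)
end
```

With the file names from this thread, the call would be utf8_to_utf16le_file("sta_utf8.xml", "sta_utf16.xml").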

Regards,

Park H.


#3

It looks like IO#binmode does the same thing:

file = File.new("sta_utf16.xml", "wb")
file.binmode
file.write(result)
file.close
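
To check that the output really is intact, the file can be read back byte for byte. The sketch below assumes Ruby 2.0 or later; write_binary and looks_like_utf16le? are hypothetical helper names. It opens the file with plain "w", switches to binary with IO#binmode, and then tests for the FF FE mark and an even byte count.

```ruby
# Opens the file in ordinary "w" mode, then switches the handle to binary
# with IO#binmode before writing, so the bytes go out unchanged.
def write_binary(path, bytes)
  File.open(path, "w") do |f|
    f.binmode
    f.write(bytes)
  end
end

# Reads the file back as raw bytes and checks two properties of
# BOM-prefixed UTF-16LE: it starts with FF FE and has an even byte count.
def looks_like_utf16le?(path)
  data = File.binread(path)
  data.bytes.first(2) == [0xFF, 0xFE] && data.bytesize.even?
end
```

A stray \r from text-mode translation would usually make the byte count odd, so this catches the problem described above, although an even number of inserted bytes would slip through.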

Thanks!
