Forum: Ruby Iconv problem - not handling \r correctly

Louise Rains (rainyglade)
on 2008-10-26 22:48
I have an XML file that I need to process.  I'm working in the Windows
environment.  Here is the head of the file:

0000000000         ■   <  \0   A  \0   u  \0   t  \0   o  \0   S  \0   t  \0
0000000020     a  \0   t  \0      \0   J  \0   a  \0   v  \0   a  \0   C  \0
0000000040     l  \0   a  \0   s  \0   s  \0   =  \0   "  \0   c  \0   o  \0
0000000060     m  \0   .  \0   a  \0   u  \0   t  \0   o  \0   s  \0   i  \0
0000000100     m  \0   .  \0   a  \0   s  \0   t  \0   .  \0   a  \0   u  \0
0000000120     t  \0   o  \0   m  \0   o  \0   d  \0   .  \0   A  \0   M  \0
0000000140     H  \0   e  \0   a  \0   d  \0   e  \0   r  \0   "  \0   >  \0
0000000160    \r  \0  \n  \0  \t  \0   <  \0   B  \0   a  \0   s  \0   e  \0
0000000200     A  \0   u  \0   t  \0   o  \0   S  \0   t  \0   a  \0   t  \0
0000000220        \0   J  \0   a  \0   v  \0   a  \0   C  \0   l  \0   a  \0

Notice the character sequence \r \0 \n \0.

I need to edit some of the text elements in this file.  I have used both
REXML and Hpricot to edit the file successfully, after converting to
UTF-8.  Here is the head of the UTF-8 file:

0000000000     <   A   u   t   o   S   t   a   t       J   a   v   a   C   l
0000000020     a   s   s   =   '   c   o   m   .   a   u   t   o   s   i   m
0000000040     .   a   s   t   .   a   u   t   o   m   o   d   .   A   M   H
0000000060     e   a   d   e   r   '   >  \r  \n  \t   <   B   a   s   e   A
0000000100     u   t   o   S   t   a   t       J   a   v   a   C   l   a   s
0000000120     s   =   '   c   o   m   .   a   u   t   o   s   i   m   .   a
0000000140     s   t   .   a   u   t   o   m   o   d   .   A   M   H   e   a
0000000160     d   e   r   '       S   a   v   e   F   i   l   e   V   e   r
0000000200     s   i   o   n   =   '   1   .   3   '   >  \r  \n  \t  \t   <
0000000220     P   r   o   p   e   r   t   i   e   s       J   a   v   a   C

Notice that \r \n shows up in the next to last line.

Now in order for the edited XML file to work with my original
application, I need to convert back to UTF-16.  Here is the code that I
use:

require 'iconv'

file = File.read("sta_utf8.xml")
conv = Iconv.new("UTF-16LE", "UTF-8")
result = conv.iconv(file)
result = 0xFF.chr << 0xFE.chr << result
file = File.new("sta_utf16.xml", "w")
file.write(result)
file.close

The resulting file (sta_utf16.xml) looks like:

0000000000         ■   <  \0   A  \0   u  \0   t  \0   o  \0   S  \0   t  \0
0000000020     a  \0   t  \0      \0   J  \0   a  \0   v  \0   a  \0   C  \0
0000000040     l  \0   a  \0   s  \0   s  \0   =  \0   '  \0   c  \0   o  \0
0000000060     m  \0   .  \0   a  \0   u  \0   t  \0   o  \0   s  \0   i  \0
0000000100     m  \0   .  \0   a  \0   s  \0   t  \0   .  \0   a  \0   u  \0
0000000120     t  \0   o  \0   m  \0   o  \0   d  \0   .  \0   A  \0   M  \0
0000000140     H  \0   e  \0   a  \0   d  \0   e  \0   r  \0   '  \0   >  \0
0000000160    \r  \n  \0  \t  \0   <  \0   B  \0   a  \0   s  \0   e  \0   A
0000000200    \0   u  \0   t  \0   o  \0   S  \0   t  \0   a  \0   t  \0
0000000220    \0   J  \0   a  \0   v  \0   a  \0   C  \0   l  \0   a  \0   s

Notice that the \r does not have a \0 following it.  This means that
every other line in my sta_utf16.xml file is in the wrong byte order and
I get garbled results:

<AutoStat
JavaClass='com.autosim.ast.automod.AMHeader'>਍ऀ㰀䈀愀猀攀䄀甀琀漀匀琀愀琀 䨀愀瘀愀䌀氀愀猀猀㴀✀挀漀洀⸀愀甀琀漀猀椀洀⸀愀猀琀⸀愀甀琀漀洀漀搀⸀䄀䴀䠀攀愀搀攀爀✀ 匀愀瘀攀䘀椀氀攀嘀攀爀猀椀漀渀㴀✀㄀⸀㌀✀㸀ഀ

Is this a defect in Iconv?

Thanks,
LG
Heesob Park (phasis)
on 2008-10-27 05:57
(Received via mailing list)
2008/10/27 Louise Rains <rainyglade@comcast.net>:
>
> Notice that the \r does not have a \0 following it.  This means that
> every other line in my sta_utf16.xml file is in the wrong byte order and
> I get garbled results:
>
> <AutoStat
> JavaClass='com.autosim.ast.automod.AMHeader'>਍ऀ㰀䈀愀猀攀䄀甀琀漀匀琀愀琀 䨀愀瘀愀䌀氀愀猀猀㴀✀挀漀洀⸀愀甀琀漀猀椀洀⸀愀猀琀⸀愀甀琀漀洀漀搀⸀䄀䴀䠀攀愀搀攀爀✀ 匀愀瘀攀䘀椀氀攀嘀攀爀猀椀漀渀㴀✀㄀⸀㌀✀㸀ഀ
>
> Is this a defect in Iconv?
>
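No, the defect is not in Iconv but in Windows. Files opened in text mode
get their line endings translated: "\r\n" is read as "\n", and a "\r" byte
is inserted before every "\n" byte on write, which breaks the two-byte
alignment of UTF-16.

Use the binary flag when opening both files, like this: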

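require 'iconv'

# Read the UTF-8 input in binary mode so "\r\n" is not collapsed to "\n".
file = File.open("sta_utf8.xml", "rb") { |f| f.read }
conv = Iconv.new("UTF-16LE", "UTF-8")   # Iconv.new(to, from)
result = conv.iconv(file)
# Prepend the UTF-16LE byte order mark (FF FE).
result = 0xFF.chr << 0xFE.chr << result
# Write in binary mode so no "\r" is inserted before "\n" bytes.
File.open("sta_utf16.xml", "wb") { |f| f.write(result) }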
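
To see why text mode garbles the output, here is a rough byte-level simulation that runs on any platform. It is only a sketch: the helper names are made up for illustration, the text-mode behaviour is imitated with gsub rather than real Windows I/O, and String#encode (Ruby 1.9+) stands in for Iconv.

```ruby
# Byte-level simulation of Windows text-mode I/O (an assumption made so the
# sketch runs anywhere; no real file is opened).

# Text-mode read: each "\r\n" pair arrives as a single "\n".
def text_mode_read(bytes)
  bytes.b.gsub("\r\n".b, "\n".b)
end

# Text-mode write: a single "\r" byte is inserted before every 0x0A byte.
def text_mode_write(bytes)
  bytes.b.gsub("\n".b, "\r\n".b)
end

# String#encode stands in for Iconv.new("UTF-16LE", "UTF-8").iconv here.
def to_utf16le(utf8_bytes)
  utf8_bytes.dup.force_encoding("UTF-8").encode("UTF-16LE").b
end

utf8_file = "<a>\r\n\t<b>".b

broken = text_mode_write(to_utf16le(text_mode_read(utf8_file)))
intact = to_utf16le(utf8_file)

puts broken.bytes.map { |b| format("%02x", b) }.join(" ")
puts intact.bytes.map { |b| format("%02x", b) }.join(" ")
puts "broken: #{broken.bytesize} bytes, intact: #{intact.bytesize} bytes"
```

In the simulated text-mode result the line break comes out as 0d 0a 00, the same \r \n \0 pattern as in the dump of sta_utf16.xml, so everything after it is shifted by one byte. The binary-mode result keeps 0d 00 0a 00.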


Regards,

Park Heesob
Louise Rains (rainyglade)
on 2008-10-27 12:12
It looks like IO#binmode does the same thing:

file = File.new("sta_utf16.xml", "w")
file.binmode   # switch to binary mode before writing anything
file.write(result)
file.close
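
A quick sanity check on the result is to read it back in binary mode: a UTF-16 file should have an even number of bytes, and here it should start with the FF FE byte order mark. The sketch below assumes Ruby 1.9+ and uses String#encode in place of Iconv. write_utf16le and plausible_utf16le? are hypothetical helper names, and the sample text is made up.

```ruby
require "tmpdir"

BOM_UTF16LE = "\xFF\xFE".b

# Hypothetical helper: write UTF-8 text as UTF-16LE with a BOM, in binary mode.
def write_utf16le(path, utf8_text)
  bytes = BOM_UTF16LE + utf8_text.encode("UTF-16LE").b
  File.open(path, "w") do |f|
    f.binmode        # call this before writing any data
    f.write(bytes)
  end
end

# Hypothetical check: BOM present and an even number of bytes.
def plausible_utf16le?(path)
  data = File.binread(path)
  data.bytesize.even? && data.start_with?(BOM_UTF16LE)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "sta_utf16.xml")
  write_utf16le(path, "<a>\r\n\t<b>")
  puts plausible_utf16le?(path)
end
```

The check cannot prove the file is correct, but an odd byte count is a cheap hint that a stray "\r" was inserted somewhere. Off Windows the binmode call changes nothing about line endings, so the difference only shows up on Windows.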

Thanks!
