Iconv transfer code

luofeiyu · April 17, 2010, 7:19am

On my computer (ubuntu 9.1 + ruby 1.9):
pt@pt-laptop:~$ irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> str = Iconv.iconv('GBK', 'UTF-8', '我说').to_s
=> "["ï¿½ï¿½Ëµ"]"

On my friend’s (ubuntu 9.1 + ruby 1.9):
$ irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> str = Iconv.iconv('GBK', 'UTF-8', '我说').to_s
=> "\316\322\313\265"
irb(main):003:0> puts Iconv.iconv('UTF-8', 'GBK', str).to_s
我说
=> nil

What’s wrong with my system?

luofeiyu · April 18, 2010, 11:22am

Pen T. wrote:

On my computer (ubuntu 9.1 + ruby 1.9):
pt@pt-laptop:~$ irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> str = Iconv.iconv('GBK', 'UTF-8', '我说').to_s
=> "["ï¿½ï¿½Ëµ"]"

On my friend’s (ubuntu 9.1 + ruby 1.9):
$ irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> str = Iconv.iconv('GBK', 'UTF-8', '我说').to_s
=> "\316\322\313\265"
irb(main):003:0> puts Iconv.iconv('UTF-8', 'GBK', str).to_s
我说
=> nil

What’s wrong with my system?

One of the joys of ruby 1.9 is that the same program run on two
different machines can behave differently. That’s even if the two
machines have identical versions of ruby and OS and you are feeding in
the same input data.

My advice is to stick with ruby 1.8.x, where the behaviour is both sane
and predictable. However there are other people who will vociferously
tell you that I am doing the entire ruby community a disservice by
recommending this to you. It’s up to you whose advice to follow.

If you want to persevere with ruby 1.9, I suggest the following:

* Check you have exactly identical versions of 1.9 (check the
  RUBY_DESCRIPTION constant) on both machines. The behaviour is subtle,
  and a lot of it has changed.
* Look at str.bytes.to_a to see if the byte sequence is correct or not.
  That is, the fact that irb displays the string wrongly or rightly
  doesn’t mean anything; don’t trust what you see.
* Instead of using irb, write a .rb script, and run it from the command
  line directly.
* Check the environments are the same on both. You could try
  experimenting with setting LANG and/or LC_ALL environment variables
  before starting ruby.
* I tried to understand how this all works, and I documented what I
  found at string19/string19.rb at master · candlerb/string19 · GitHub

There are about 200 cases of encoding behaviour described there.
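
To compare two machines along the lines of the checklist above, a small script can print the relevant settings in one go. This is only a sketch: encoding_report is a made-up helper name, and the three locale variables shown are the ones mentioned in this thread, not an exhaustive list.

```ruby
# Collect the settings that commonly influence string encodings in ruby 1.9+.
# encoding_report is a hypothetical helper name, not part of ruby.
def encoding_report(env = ENV)
  internal = Encoding.default_internal
  {
    "ruby"             => RUBY_DESCRIPTION,
    "default_external" => Encoding.default_external.name,
    "default_internal" => internal && internal.name,
    "LC_ALL"           => env["LC_ALL"],
    "LC_CTYPE"         => env["LC_CTYPE"],
    "LANG"             => env["LANG"]
  }
end

# Run this as a .rb script (not in irb) on both machines and diff the output.
encoding_report.each { |key, value| puts "#{key}: #{value.inspect}" }
```

If the two reports differ in anything other than the ruby line, the environment is a likely suspect.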

Also, it’s possible to do what you’re trying to do in ruby 1.9 without
using Iconv, but instead tagging str with its correct encoding, and then
using encode! to convert it to another. Whether it appears correctly on
the terminal or not, especially within irb, is still not something to
trust. Again, use str.bytes.to_a to see if it is the expected sequence
of bytes in the new encoding.
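
A minimal sketch of that approach, assuming a ruby build that ships the GBK transcoder. The string is written with \u escapes so the example does not depend on the source file encoding; the variable names are illustrative only.

```ruby
# "\u6211\u8bf4" is the two-character string from the original post.
utf8 = "\u6211\u8bf4"

gbk = utf8.encode("GBK")     # transcode UTF-8 -> GBK
p gbk.encoding               # the result is tagged GBK
p gbk.bytes.to_a             # check the raw bytes, not how irb displays them

# If bytes arrive untagged (for example read in binary mode), tag them
# with their real encoding first, then convert.
raw = [0xCE, 0xD2, 0xCB, 0xB5].pack("C*")
round_trip = raw.force_encoding("GBK").encode("UTF-8")
p round_trip == utf8
```

The byte list printed for gbk should be the same four bytes as the "\316\322\313\265" that the friend's machine showed (octal for 0xCE 0xD2 0xCB 0xB5).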

Good luck,

Brian.

luofeiyu · April 18, 2010, 4:37pm

On Apr 18, 2010, at 4:22 AM, Brian C. wrote:

One of the joys of ruby 1.9 is that the same program run on two
different machines can behave differently. That’s even if the two
machines have identical versions of ruby and OS and you are feeding in
the same input data.

I’m pretty sure that’s true with Ruby 1.8 as well. For example, don’t
the encodings available to iconv vary depending on the platform?

James Edward G. II

luofeiyu · April 18, 2010, 4:19pm

Hi,
On 18 April 2010 11:22, Brian C. wrote:

One of the joys of ruby 1.9 is that the same program run on two
different machines can behave differently. That’s even if the two
machines have identical versions of ruby and OS and you are feeding in
the same input data.

Please don’t be so pessimistic without a real reason
(that said, show some code that gives a different result under the
conditions you described).

Maybe what you’re describing is caused by different revisions, but that
happened also in 1.8, no?

Look at str.bytes.to_a to see if the byte sequence is correct or not.

That is, the fact that irb displays the string wrongly or rightly
doesn’t mean anything; don’t trust what you see.

Yes, that’s true: irb still often gets encodings wrong.

B.D.

luofeiyu · April 18, 2010, 7:06pm

Benoit D. wrote:

Please don’t be so pessimistic without a real reason
(that said, show some code that gives a different result under the
conditions you described).

Sure. Here’s a simple one:

File.open("myfile.txt") do |f|
  line = f.gets
  line =~ /./
end

You can run this script on two machines, with the same version of OS and
ruby and the same myfile.txt but with different environment variable
settings, and get it to crash on one but not the other. (One way: if the
default external encoding on one machine is US-ASCII and myfile.txt
contains any byte with the top bit set)
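
The failure described here can be reproduced on a single machine by naming the external encoding explicitly instead of relying on the environment. This is a sketch under that assumption: "r:US-ASCII" stands in for a machine whose default external encoding is US-ASCII, and try_match is a made-up helper name.

```ruby
require "tempfile"

# Returns the match result, or the exception raised while matching.
# try_match is a hypothetical helper; mode is passed straight to File.open.
def try_match(path, mode)
  File.open(path, mode) do |f|
    line = f.gets
    line =~ /./
  end
rescue ArgumentError => e
  e
end

ascii_result = utf8_result = nil
Tempfile.create("myfile") do |tmp|
  tmp.binmode
  tmp.write("\xE6\x88\x91\n".b)   # valid UTF-8, every data byte has the top bit set
  tmp.flush

  ascii_result = try_match(tmp.path, "r:US-ASCII")  # simulates a US-ASCII default
  utf8_result  = try_match(tmp.path, "r:UTF-8")     # simulates a UTF-8 default
end

puts "US-ASCII: #{ascii_result.inspect}"
puts "UTF-8:    #{utf8_result.inspect}"
```

Reading the line succeeds in both cases; it is the regexp match on the mis-tagged string that raises.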

Maybe what you’re describing is caused by different revisions, but that
happened also in 1.8, no?

This is intentional behaviour in ruby 1.9.

luofeiyu · April 18, 2010, 7:13pm

James Edward G. II wrote:

On Apr 18, 2010, at 4:22 AM, Brian C. wrote:

One of the joys of ruby 1.9 is that the same program run on two
different machines can behave differently. That’s even if the two
machines have identical versions of ruby and OS and you are feeding in
the same input data.

I’m pretty sure that’s true with Ruby 1.8 as well. For example, don’t
the encodings available to iconv vary depending on the platform?

Perhaps, but I was talking about an identical platform, O/S, and
installation of ruby - but different configured locale (such as LANG,
LC_CTYPE or LC_ALL environment variables)

Unless you write your ruby script defensively, it will behave
differently dependent on those environment settings when everything else
is identical.
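
One possible reading of "defensively", sketched below: read the raw bytes, tag them with the encoding the file is known to use, and validate before doing anything else. The helper name read_utf8 and the choice of UTF-8 are assumptions for illustration, not something proposed in this thread.

```ruby
require "tempfile"

# Read a file as UTF-8 regardless of LANG / LC_ALL / LC_CTYPE.
# read_utf8 is a hypothetical helper name.
def read_utf8(path)
  text = File.binread(path).force_encoding(Encoding::UTF_8)
  raise ArgumentError, "#{path} is not valid UTF-8" unless text.valid_encoding?
  text
end

Tempfile.create("sample") do |tmp|
  tmp.binmode
  tmp.write("\u6211\u8bf4\n")
  tmp.flush

  text = read_utf8(tmp.path)
  puts text.encoding      # the tag no longer depends on the locale
  p text.bytes.to_a
end
```

Failing early on invalid input is a design choice; a script that must accept arbitrary bytes would keep the data tagged as binary instead.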

luofeiyu · April 18, 2010, 11:02pm

On 18 April 2010 22:17, James Edward G. II wrote:

So your main complaint is that Ruby honors the settings of your
environment?

James Edward G. II

Nicely put (I couldn’t come up with a good answer, so I waited for
somebody else to reply).

Yeah, I think it’s normal that the encoding used depends on the
environment.
And if you want something that doesn’t depend on the environment, there
are many possibilities.

The easiest with File: File.open("myfile.ext", "w:UTF-8")

luofeiyu · April 19, 2010, 9:37am

Benoit D. wrote:

The easiest with File: File.open("myfile.ext", "w:UTF-8")

This is a poor example of the point in question, although a good example
of how hard ruby 1.9 is to understand.

In fact: the default external encoding is nil for files opened for
write, and does not depend on the environment at all. That is,

File.open("myfile.ext", "w") { |f| f.puts str }

just outputs whatever bytes are in str, without meddling with them.
Whereas

File.open("myfile.ext", "w:UTF-8") { |f| f.puts str }

will attempt to re-encode str from its current encoding to UTF-8, and
may raise an exception if it cannot do so.

So if you want to write programs which don’t crash, the first is
arguably better.

The rules for reading from files are completely different, and indeed
“r:UTF-8” is the right thing to do if you are reading from a file which
contains UTF-8 text and you don’t want this to be affected by
environment variable magic.

luofeiyu · April 18, 2010, 10:17pm

On Apr 18, 2010, at 12:14 PM, Brian C. wrote:

Perhaps, but I was talking about an identical platform, O/S, and
installation of ruby - but different configured locale (such as LANG,
LC_CTYPE or LC_ALL environment variables)

So your main complaint is that Ruby honors the settings of your
environment?

James Edward G. II

luofeiyu · April 19, 2010, 10:02am

On Mon, Apr 19, 2010 at 3:42 PM, Brian C. wrote:

… File.open("myfile.ext", "w:UTF-8") { |f| f.puts str }

will attempt to re-encode str from its current encoding to UTF-8, and
may raise an exception if it cannot do so.

good

So if you want to write programs which don’t crash, the first is
arguably better.

we disagree there but what do you mean by “crash”?

best regards -botp

luofeiyu · April 19, 2010, 10:29am

botp wrote:

So if you want to write programs which don’t crash, the first is
arguably better.

we disagree there but what do you mean by “crash”?

I mean “raise an exception”. The first example I wrote will never raise
an exception. The second can.

Code to demonstrate:

str = "\xff"
File.open("out1", "w") { |f| f.puts str }
File.open("out2", "w:UTF-8") { |f| f.puts str }

Line 2 will never raise an exception, regardless of the content or the
encoding of str, and regardless of environment variable settings. It
just writes the string to the file.

Line 3 may raise an exception. It does in this particular program
because str has data tagged as ASCII-8BIT which cannot be transcoded to
UTF-8.
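
For anyone trying this on a different ruby, a sketch that makes the ASCII-8BIT tag explicit rather than relying on how the "\xff" literal gets tagged, which depends on the source encoding. The helper name write_with_mode and the temporary directory are illustrative assumptions.

```ruby
require "tmpdir"

# Returns :ok, or the class of the exception raised by the write.
# write_with_mode is a hypothetical helper name.
def write_with_mode(path, mode, str)
  File.open(path, mode) { |f| f.puts str }
  :ok
rescue EncodingError => e
  e.class
end

# String#b returns a copy tagged ASCII-8BIT, matching the case described above.
str = "\xff".b

plain_result = utf8_result = nil
Dir.mktmpdir do |dir|
  plain_result = write_with_mode(File.join(dir, "out1"), "w", str)
  utf8_result  = write_with_mode(File.join(dir, "out2"), "w:UTF-8", str)
end

puts "w:       #{plain_result.inspect}"
puts "w:UTF-8: #{utf8_result.inspect}"
```

The plain "w" write passes the bytes through untouched, while "w:UTF-8" asks ruby to transcode from ASCII-8BIT to UTF-8, which has no defined mapping for the byte 0xFF.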

luofeiyu · April 19, 2010, 10:03am

Perhaps, but I was talking about an identical platform, O/S, and
installation of ruby - but different configured locale (such as LANG,
LC_CTYPE or LC_ALL environment variables)

So your main complaint is that Ruby honors the settings of your
environment?

My complaints are listed at
string19/soapbox.rb at master · candlerb/string19 · GitHub - but I guess
the main one is what the OP saw. Same program, same data, same ruby,
different behaviour.

Normally when analysing a program you only need to look at the program
and its input, but ruby 1.9 has extra “hidden” input data in the form of
environment variables which can alter your program’s behaviour, or not,
depending on the content of the input data as well.

I wonder how many Ruby users are fully aware of which environment
variables influence POSIX locales, and which ones take precedence over
the others?
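
For the character-type category, the usual POSIX order is LC_ALL first, then LC_CTYPE, then LANG, with empty values treated as unset. A rough sketch of that lookup follows; effective_ctype_locale is a made-up helper, the "C" fallback is an assumption (POSIX leaves the default implementation-defined), and this mirrors the lookup order only, not what ruby itself does internally.

```ruby
# Report which variable decides the character-type locale, and its value.
# effective_ctype_locale is a hypothetical helper name.
def effective_ctype_locale(env = ENV)
  %w[LC_ALL LC_CTYPE LANG].each do |name|
    value = env[name]
    return [name, value] unless value.nil? || value.empty?
  end
  ["(default)", "C"]   # assumed fallback; the real default is implementation-defined
end

name, value = effective_ctype_locale
puts "character-type locale comes from #{name}: #{value}"
```

So LANG=en_US.UTF-8 together with LC_ALL=C behaves as the C locale, which is one way two otherwise identical machines end up with different default encodings.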

I also note that there is an effort underway to standardise the Ruby
language definition, and this has chosen 1.8.7 as its baseline.

luofeiyu · April 19, 2010, 5:46pm

On Apr 19, 2010, at 3:31 AM, Brian C. wrote:

Code to demonstrate:

str = "\xff"
File.open("out1", "w") { |f| f.puts str }
File.open("out2", "w:UTF-8") { |f| f.puts str }

Line 2 will never raise an exception, regardless of the content or the
encoding of str, and regardless of environment variable settings. It
just writes the string to the file.

That’s grossly inaccurate. You may not have write permission to the
file, the volume you are trying to place the file on may be out of
space, etc.

These are more examples of how you could move the same code to a new
machine and have it fail. Ignoring the environment code runs in will
not make it go away.

James Edward G. II

luofeiyu · April 19, 2010, 10:28pm

James Edward G. II wrote:

On Apr 19, 2010, at 3:31 AM, Brian C. wrote:

Code to demonstrate:

str = "\xff"
File.open("out1", "w") { |f| f.puts str }
File.open("out2", "w:UTF-8") { |f| f.puts str }

Line 2 will never raise an exception, regardless of the content or the
encoding of str, and regardless of environment variable settings. It
just writes the string to the file.

That’s grossly inaccurate. You may not have write permission to the
file, the volume you are trying to place the file on may be out of
space, etc.

Of course syscalls can fail due to insufficient resources and other
system-level problems. I’m talking about the normal flow of execution.

The point remains: Benoit said that one way to make your program immune
to influence from environment variables was to use
File.open("myfile.ext", "w:UTF-8"). I was trying to highlight that this
advice is incorrect, because the regular File.open("myfile.ext", "w") is
already immune to environment variables. Furthermore, "w:UTF-8" can crash
in the normal flow under more circumstances than "w" - and those
circumstances depend on string contents and encodings, which can be
affected by environment variables.