Invalid byte sequence in US-ASCII (ArgumentError)

I’m having some trouble migrating from 1.8 to 1.9.1. I have this line of
code:

text.gsub! "\C-m", ''

…which generates this error:

/home/luther/bin/dos2gnu:16:in `gsub!': invalid byte sequence in
US-ASCII (ArgumentError)

The purpose is to strip out any ^M characters from the string. I’ve
tried a couple of different magic comments with utf-8, but the error
message still shows the same “US-ASCII”. I also tried changing the \C-m
to 13.chr, but I still got the same error, suggesting that control
characters aren’t even allowed in strings anymore.

I’m sure this must be a common migration problem, but I can’t find a
solution no matter how hard I search the web. Any help would be greatly
appreciated.

Luther

Luther wrote:

The purpose is to strip out any ^M characters from the string. I’ve

Since Ruby is claiming the source file is US-ASCII it seems likely that
it’s not noticing the magic comment. Make sure your magic comment is the
first line in the script, or if you’re using a shebang line, the second
line. That is, either

# encoding: utf-8

or

#! /usr/local/bin/ruby

# encoding: utf-8
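
For example, a minimal sketch of the second form (the interpreter path
is just a placeholder). Note that the magic comment only governs how
literals in the source file itself are interpreted, not data read from
files at run time:

#! /usr/local/bin/ruby
# encoding: utf-8

# With the comment in place, non-ASCII literals in this file are UTF-8:
s = "ü"
puts s.encoding   # => UTF-8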

On Mon, 2009-02-16 at 09:19 +0900, Tim H. wrote:

# encoding: utf-8

I put the encoding line right after my shebang line, but it had no
effect.

In further investigation, I tried running my program on a different text
file, and it worked fine. The original text file had some very odd
characters at the beginning and the end of the file. Once I deleted that
metadata, my program worked fine.

This means the problem was with the “text” variable rather than the
arguments. This seems very wrong to me since it threw an ArgumentError.
Or maybe I don’t know anything about exceptions.

So, my problem is partially solved, but now I know my program will puke
on any text file with multibyte characters.

Luther

but now I know my program will puke
on any text file with multibyte characters.

Not necessarily.

Here is a useful summary of encodings in 1.9:
http://blog.nuclearsquid.com/writings/ruby-1-9-encodings

Basically, you have script encoding, internal encoding, and external
encoding. In your case, you should probably read the files as ASCII-8BIT
or binary, I guess.
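
Something along these lines, I mean (a rough sketch; the filename is
made up):

File.open("dosfile.txt", "r:ASCII-8BIT") do |f|
  text = f.read              # comes back as a plain byte string
  text.gsub!("\C-m", "")     # strip CR bytes, no encoding validation involved
  # ... write it back out, etc.
end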

Brian C. wrote:

Yes. IMO this is a horrendous misfeature of ruby 1.9: it asserts that
all external data is text, unless explicitly told otherwise.

And worse: the encoding chosen comes from the environment. So your
program which you developed on one system and runs correctly there may
fail totally on another.

I’m not saying that Ruby shouldn’t handle encodings and conversions;
I’m just saying you should ask for them. For example:

File.open("…", :encoding => “UTF-8”) # Use this encoding
File.open("…", :encoding => “ENV”) # Follow the environment
File.open("…") # No idea, treat as binary

I’m not going to use 1.9 without wrapper scripts to invoke Ruby with
appropriate flags to force the external encoding to a fixed value. And
that’s a pain.
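
(For what it’s worth, such a wrapper is little more than a one-liner
along these lines; the script name is made up, and -E sets the default
external encoding:)

ruby -E UTF-8 myscript.rb        # pin the external encoding to UTF-8
ruby -E ASCII-8BIT myscript.rb   # or treat all input as raw bytes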

Tom L. wrote:

but now I know my program will puke
on any text file with multibyte characters.

Not necessarily.

Here is a useful summary of encodings in 1.9:
http://blog.nuclearsquid.com/writings/ruby-1-9-encodings

Basically, you have script encoding, internal encoding, and external
encoding. In your case, you should probably read the files as ASCII-8BIT
or binary, I guess.

Yes. IMO this is a horrendous misfeature of ruby 1.9: it asserts that
all external data is text, unless explicitly told otherwise.

So if you deal with data which is not text (as I do all the time), you
need to put

File.open("…", :encoding => "BINARY")

everywhere. And even then, if you ask the open File object what its
encoding is, it will say ASCII-8BIT, even though you explicitly told it
that it’s BINARY.

This is because “BINARY” is just a synonym for “ASCII-8BIT” in ruby. Of
course, there is plenty of data out there which is not encoded using the
American Standard Code for Information Interchange. MIME distinguishes
clearly between 8BIT (text with high bit set) and BINARY (non-text). In
terms of Ruby’s processing it makes no difference, but it’s annoying for
Ruby to tell me that my data is text, when it is not.

Note: in more recent 1.9’s, I believe that

File.open("…", "rb")

has the effect of doing two things:

  1. Disabling line-ending translation under Windows
  2. Setting the encoding to ASCII-8BIT

So this may be sufficient for your needs, and it has the advantage that
the same code will run under ruby <1.9.
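
That is, if I have it right, something like this sketch (filename made
up) should hold on a recent 1.9:

f = File.open("data.bin", "rb")
f.external_encoding   # => #<Encoding:ASCII-8BIT>
data = f.read
data.encoding         # => #<Encoding:ASCII-8BIT>, i.e. "BINARY"
f.close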

Stefan L. wrote:

Ruby must choose between treating all external data as
text unless told otherwise or treating everything as binary
unless told otherwise, because there is no general way
to know if a file is binary or text.

Yes (and I wouldn’t want it to try to guess)

Given that Ruby is mostly used to work with text, it’s a
sensible decision to use text mode by default.

That’s where I disagree. There are tons of non-text applications:
images, compression, PDFs, Marshal, DRb… Furthermore, as the OP
demonstrated, there are plenty of use cases where files are presented
which are almost ASCII, but not quite. The default behaviour now is to
crash, rather than to treat these as streams of bytes.

I don’t want my programs to crash in these cases.

It has to default to some encoding.

That’s where I also disagree. It can default to stream of bytes.

File.open("…", :encoding => “ENV”) # Follow the environment

This is the default.

That’s what I don’t want. Given this default, I must either:

(1) Force all my source to have the correct encoding flag set
everywhere. If I don’t test for this, my programs will fail in
unexpected ways. Tests for this are awkward; they’d have to set the
environment to a certain locale (e.g. UTF-8), pass in data which is not
valid in that locale, and check no exception is raised.

(2) Use a wrapper script either to call Ruby with the correct
command-line flags, or to sanitise the environment.

Encoding.default_external=

I guess I can use that at the top of everything in bin/ directory. It
may be sufficient, but it’s annoying to have to remember that too.
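
Something like this at the top of each script, I suppose (the choice of
UTF-8 here is only an example):

# pin the external encoding instead of trusting the locale
Encoding.default_external = Encoding::UTF_8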

2009/2/16 Brian C. [email protected]:

That’s where I disagree. There are tons of non-text applications:
images, compression, PDFs, Marshal, DRb…

Point taken.

Furthermore, as the OP
demonstrated, there are plenty of use cases where files are presented
which are almost ASCII, but not quite. The default behaviour now is to
crash, rather than to treat these as streams of bytes.

I don’t want my programs to crash in these cases.

Let’s compare. Situation: I’m reading binary files and
forget to specify the “b” flag when opening the file(s).

Result in Ruby 1.8:

  • My stuff works fine on Linux/Unix. Somebody else runs
    the script on Windows, the script corrupts data because
    Windows does line ending conversion.

Result in Ruby 1.9:

  • On the first run on my Linux machine, I get an EncodingError.
    I fix the problem by specifying the “b” flag on open. Done.

I definitely prefer Ruby 1.9 behavior.

It has to default to some encoding.

That’s where I also disagree. It can default to stream of bytes.

File.open("…", :encoding => "ENV") # Follow the environment

This is the default.

That’s what I don’t want. Given this default, I must either:

That assumes the default is always wrong.

I guess I can use that at the top of everything in bin/ directory. It
may be sufficient, but it’s annoying to have to remember that too.

Here’s why the default is good, IMO. The cases where I really
don’t want to explicitly specify encodings are when I write one-liners
(-e) and short throwaway scripts. If the default encoding were binary,
string operations would deal incorrectly with German accents (German is
my native language). Using the locale encoding does the right thing here.

If I write a longer program, explicitly setting the default external
encoding isn’t an effort worth mentioning. Set it to ASCII_8BIT
and it behaves like 1.8.
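
That is, roughly:

Encoding.default_external = Encoding::ASCII_8BIT
# From here on, File.open with no explicit encoding hands you
# raw byte strings, much as 1.8 did.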

Stefan

Result in Ruby 1.8:

  • My stuff works fine on Linux/Unix. Somebody else runs
    the script on Windows, the script corrupts data because
    Windows does line ending conversion.

Result in Ruby 1.9:

  • On the first run on my Linux machine, I get an EncodingError.
    I fix the problem by specifying the “b” flag on open. Done.

Actually, this is a point I have never quite understood. Why does only
the Windows version convert line endings? Is it out of the question that
somebody could want to process a text file created under Windows on a
Linux box or virtual machine? Regular expressions that check only for
\n but not \r won’t work then. Now you could of course take the stance
that you simply have to check for \r too, but then why automatically
convert line separators under Windows? Or did I miss something
obvious?

This is also the reason why I think opening text files as binary isn’t
really a solution. It leads to either convoluted regexps or non-
portable code. (Unless I missed something obvious, which is quite
possible.)
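
(What I mean is that, to stay portable, you end up writing something
like this sketch, with a made-up filename:)

text = File.open("input.txt", "rb") { |f| f.read }  # raw bytes, no conversion
text.gsub!(/\r\n?/, "\n")   # normalize DOS (and old Mac) endings to \n
lines = text.split("\n")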

I personally find it somewhat confusing to have to juggle different
encodings. IMHO it would have been preferable to define a fixed internal
encoding (UTF-16 or whatever), to transcode every string that is known
to be text, as well as identifiers, to that canonical/uniform encoding,
and to deal with everything else as a sequence of bytes.

BTW I recently skimmed through the python3000 user guide. From what I
understand, they seem to distinguish between strings as (binary) data
and strings as text (Unicode).

2009/2/16 Brian C. [email protected]:

Brian C. wrote:

Yes. IMO this is a horrendous misfeature of ruby 1.9: it asserts that
all external data is text, unless explicitly told otherwise.

Ruby must choose between treating all external data as
text unless told otherwise or treating everything as binary
unless told otherwise, because there is no general way
to know if a file is binary or text.

Given that Ruby is mostly used to work with text, it’s a
sensible decision to use text mode by default.

Also, if you open a file with the “b” flag, it sets the file’s
encoding to binary. You should use that flag in 1.8, too,
otherwise Windows will do line ending conversion, corrupting
your binary data.

And worse: the encoding chosen comes from the environment. So your
program which you developed on one system and runs correctly there may
fail totally on another.

It has to default to some encoding. Your OS installation has a
default encoding. It’s a sane decision to use that, because
otherwise many scripts wouldn’t work by default on your machine.

I’m not saying that Ruby shouldn’t handle encodings and conversions;
I’m just saying you should ask for them. For example:

File.open("…", :encoding => "UTF-8") # Use this encoding

Well, you can do exactly that…

File.open("…", :encoding => "ENV") # Follow the environment

This is the default.

File.open("…") # No idea, treat as binary

Use the “b” flag, which you should do on 1.8 anyway.

I’m not going to use 1.9 without wrapper scripts to invoke Ruby with
appropriate flags to force the external encoding to a fixed value. And
that’s a pain.

You can set it with Encoding.default_external= at the top of your
script.

Stefan

2009/2/16 Tom L. [email protected]:

the Windows version convert line endings? Is it out of the question that
somebody could want to process a text file created under Windows on a
Linux box or virtual machine? Regular expressions that check only for
\n but not \r won’t work then. Now you could of course take the stance
that you simply have to check for \r too, but then why automatically
convert line separators under Windows? Or did I miss something
obvious?

It’s the underlying C API that does the line ending conversion.
Ruby inherited that behavior.

…and to deal with everything else as a sequence of bytes.

Ruby does that when you set the internal encoding with
Encoding.default_internal=
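
That is, roughly (a sketch; the encodings and the filename are only for
illustration):

Encoding.default_external = Encoding::ISO_8859_1  # what the file contains
Encoding.default_internal = Encoding::UTF_8       # what you want in memory

s = File.read("latin1.txt")
s.encoding   # => #<Encoding:UTF-8>, transcoded on the way in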

BTW I recently skimmed through the python3000 user guide. From what I
understand, they seem to distinguish between strings as (binary) data
and strings as text (encoded as utf).

There were many long discussions about the encoding API,
mostly on Ruby core. If you search the archives you can find
why the current API is how it is.

IIRC, these were important issues:

  • We don’t have a single internal string encoding (like Java and Python)
    because there are many Ruby users, especially in Asia, who still
    have to work with legacy encodings for which a lossless Unicode
    round-trip is not possible. They’d be forced to use the
    binary API.

  • Because Ruby already has a rich String API, and because it simplifies
    porting of 1.8 code, there is no separate data type for binary
    strings.

Stefan

On Feb 16, 2009, at 10:33 AM, Tom L. wrote:

Actually, this is a point I have never quite understood. Why does only
the windows version convert line endings?

Because at the operating system level Windows distinguishes between
text and binary files and Unix doesn’t.

The “b” option has been part of the standard C library for decades and
Windows is not the only operating system that distinguishes between
text and binary files.

Proper handling of line termination requires the library to know if it
is working with a binary or text file. On Unix it doesn’t matter if
you fail to give the library correct information (i.e. omit the b flag
for binary files) but your code becomes non-portable. It will fail on
systems that treat text and binary files differently.

Gary W.

It’s the underlying C API that does the line ending conversion.
Ruby inherited that behavior.

Thanks for the clarification (also thanks to Gary).

Ruby does that when you set the internal encoding with
Encoding.default_internal=

Unfortunately this isn’t entirely true – see:
http://groups.google.com/group/ruby-core-google/browse_frm/thread/9103687d4ee9f336?hl=en#

It doesn’t convert strings & identifiers in scripts. Of course, there
are good reasons for that (see the responses in the thread) if you
don’t define a canonical internal encoding.

There were many long discussions about the encoding API,
mostly on Ruby core.

I know that people who understand the issues at hand much better than
I do discussed this subject extensively. I still struggle to fully
understand their conclusions though. But this probably is only a
matter of time.

Regards,
Thomas.

Hi,

In message “Re: invalid byte sequence in US-ASCII (ArgumentError)”
on Mon, 16 Feb 2009 09:15:54 +0900, Luther [email protected]
writes:

|I’m having some trouble migrating from 1.8 to 1.9.1. I have this line of
|code:
|
|text.gsub! "\C-m", ''
|
|…which generates this error:
|
|/home/luther/bin/dos2gnu:16:in `gsub!': invalid byte sequence in
|US-ASCII (ArgumentError)
|
|The purpose is to strip out any ^M characters from the string.

I feel some smell of a bug. Could you show me the whole code and
reproducing input please?

          matz.

On Mon, 2009-02-16 at 14:20 +0900, Tom L. wrote:

you should probably read the files as ASCII-8BIT or binary, I guess.

Thank you. I’ve put 'r:binary' in the line where I open the file, and
now it seems to work fine. Although if I didn’t want to be lazy, I would
probably read it with the default encoding, catch the ArgumentError,
then reread the file.
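
Something like this, I suppose (an untested sketch; filename is the
path I’m processing):

begin
  text = File.open(filename) { |f| f.gets nil }  # default (locale) encoding
  text.gsub! "\C-m", ''                          # may raise on bad bytes
rescue ArgumentError
  text = File.open(filename, 'rb') { |f| f.gets nil }  # fall back to raw bytes
  text.gsub! "\C-m", ''
end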

Thanks again…
Luther

On Tue, 2009-02-17 at 10:16 +0900, Yukihiro M. wrote:

I feel some smell of a bug. Could you show me the whole code and
reproducing input please?

Sure, here you go…

#!/usr/bin/ruby -w

# Copyright 2007-2009 Luther T… This file is distributed under
# the GNU General Public License, version 3 or any later version.

# Strips out ^M characters from the files given as arguments. This
# helps ensure that emacs can display the text properly.

# Warning: This program will overwrite the original file(s), so pray
# there aren't any serious bugs.

ARGV.each do |filename|

  puts "Removing ^M characters from #{filename}"

  text = String.new

  File.open filename do |f|
    text = f.gets nil
  end

  text.gsub! "\C-m", ''

  File.open filename, 'w' do |f|
    f.puts text
  end

end

__END__

Using Ubuntu 8.10 with ruby installed from source and linked over
to /usr/bin.

$ /usr/bin/ruby -v
ruby 1.9.1p0 (2009-01-30 revision 21907) [x86_64-linux]

The attached file contains all of the apparently binary data that was in
the original text file that I downloaded.

Luther

I feel some smell of a bug. Could you show me the whole code and
reproducing input please?

Sure, here you go…

When I recently stumbled over not so different problems (one of which
is described here [1]) it was because the external encoding (see
Encoding.default_external) defaulted to US-ASCII on cygwin because
ruby191RC0 ignored the windows locale and the value of the LANG
variable – the part with the windows locale was fixed in the
meantime. AFAIK if ruby 191 cannot determine the environment’s locale,
it defaults to US-ASCII which causes the described problem if a
character is > 127.

[1]
http://groups.google.com/group/ruby-talk-google/browse_frm/thread/865fc72d8fb808ba/
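
(You can check what encoding a given environment hands to Ruby with a
one-liner; the UTF-8 line assumes that locale is installed:)

$ LC_ALL=C ruby -e 'p Encoding.default_external'
#<Encoding:US-ASCII>
$ LC_ALL=en_US.UTF-8 ruby -e 'p Encoding.default_external'
#<Encoding:UTF-8>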

Hi,

In message “Re: invalid byte sequence in US-ASCII (ArgumentError)”
on Tue, 17 Feb 2009 22:55:23 +0900, Luther T.
[email protected] writes:

|Actually, I always set my LANG to C. Since my original post, I found
|that I had forgotten to set my LC_CTYPE to en_US.UTF-8, which is
|Ubuntu’s default. After fixing that, I still got the same error, but
|with “UTF-8” instead of “US-ASCII”.

Ruby 1.9 distinguishes between text files and binary files, so please
specify “rb” instead of plain “r” (or omitting the mode) for binary
files. This restriction may be loosened in the future for simple
substitutions like this one.

          matz.

The magic encoding comment didn’t cut it for me. I found the answer in
my case by adding the following to my environment.rb (I run a mixed 1.8
and 1.9 environment):

if RUBY_VERSION =~ /1\.9/
  Encoding.default_external = Encoding::UTF_8
  Encoding.default_internal = Encoding::UTF_8
end

Tom L. wrote:

When I recently stumbled over not so different problems (one of which
is described here [1]) it was because the external encoding (see
Encoding.default_external) defaulted to US-ASCII on cygwin because
ruby191RC0 ignored the Windows locale and the value of the LANG
variable – the part with the Windows locale was fixed in the
meantime. AFAIK if ruby 1.9.1 cannot determine the environment’s locale,
it defaults to US-ASCII which causes the described problem if a
character is > 127.

Actually, I always set my LANG to C. Since my original post, I found
that I had forgotten to set my LC_CTYPE to en_US.UTF-8, which is
Ubuntu’s default. After fixing that, I still got the same error, but
with “UTF-8” instead of “US-ASCII”.

I believe the metadata in that text file must be binary data that was
put there by some word processor, because I remember seeing “Helvetica”
somewhere in there.

Luther