Forum: Ruby invalid byte sequence in US-ASCII (ArgumentError)

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- and Ruby-related community platforms.
Luther (Guest)
on 2009-02-16 01:18
(Received via mailing list)
I'm having some trouble migrating from 1.8 to 1.9.1. I have this line of
code:

text.gsub! "\C-m", ''

...which generates this error:

/home/luther/bin/dos2gnu:16:in `gsub!': invalid byte sequence in
US-ASCII (ArgumentError)

The purpose is to strip out any ^M characters from the string. I've
tried a couple of different magic comments with utf-8, but the error
message still shows the same "US-ASCII". I also tried changing the \C-m
to 13.chr, but I still got the same error, suggesting that control
characters aren't even allowed in strings anymore.

I'm sure this must be a common migration problem, but I can't find a
solution no matter how hard I search the web. Any help would be greatly
appreciated.

Luther
Tim Hunter (Guest)
on 2009-02-16 01:26
(Received via mailing list)
Luther wrote:
> The purpose is to strip out any ^M characters from the string.

Since Ruby claims the source file is US-ASCII, it seems likely that
it's not noticing the magic comment. Make sure the magic comment is the
first line in the script, or, if you're using a shebang line, the second
line. That is, either

# encoding: utf-8

or

#! /usr/local/bin/ruby
# encoding: utf-8
Luther (Guest)
on 2009-02-16 03:22
(Received via mailing list)
On Mon, 2009-02-16 at 09:19 +0900, Tim Hunter wrote:
> # encoding: utf-8

I put the encoding line right after my shebang line, but it had no
effect.

In further investigation, I tried running my program on a different text
file, and it worked fine. The original text file had some very odd
characters at the beginning and the end of the file. Once I deleted that
metadata, my program worked fine.

This means the problem was with the "text" variable rather than the
arguments. This seems very wrong to me since it threw an ArgumentError.
Or maybe I don't know anything about exceptions.

So, my problem is partially solved, but now I know my program will puke
on any text file with multibyte characters.

Luther
Tom Link (Guest)
on 2009-02-16 06:27
(Received via mailing list)
> but now I know my program will puke
> on any text file with multibyte characters.

Not necessarily.

Here is a useful summary of encodings in 1.9:
http://blog.nuclearsquid.com/writings/ruby-1-9-encodings

Basically, you have script encoding, internal encoding, and external
encoding. In your case, you should probably read the files as ASCII-8BIT
or binary, I guess.
Brian Candler (candlerb)
on 2009-02-16 10:44
Tom Link wrote:
>> but now I know my program will puke
>> on any text file with multibyte characters.
>
> Not necessarily.
>
> Here is a useful summary of encodings in 1.9:
> http://blog.nuclearsquid.com/writings/ruby-1-9-encodings
>
> Basically, you have script encoding, internal encoding, and external
> encoding. In your case, you should probably read the files as ASCII-8BIT
> or binary, I guess.

Yes. IMO this is a horrendous misfeature of ruby 1.9: it asserts that
all external data is text, unless explicitly told otherwise.

So if you deal with data which is not text (as I do all the time), you
need to put

  File.open("....", :encoding => "BINARY")

everywhere. And even then, if you ask the open File object what its
encoding is, it will say ASCII-8BIT, even though you explicitly told it
that it's BINARY.

This is because "BINARY" is just a synonym for "ASCII-8BIT" in Ruby. Of
course, there is plenty of data out there which is not encoded using the
American Standard Code for Information Interchange. MIME distinguishes
clearly between 8BIT (text with the high bit set) and BINARY (non-text).
In terms of Ruby's processing it makes no difference, but it's annoying
for Ruby to tell me that my data is text when it is not.

Note: in more recent 1.9's, I believe that

  File.open("....", "rb")

has the effect of doing two things:
1. Disabling line-ending translation under Windows
2. Setting the encoding to ASCII-8BIT

So this may be sufficient for your needs, and it has the advantage that
the same code will run under ruby <1.9.
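[A quick way to check that behaviour (a sketch; the empty Tempfile is only there to have something to open):]

```ruby
require 'tempfile'

f = Tempfile.new('blob')
f.close

# "rb" disables newline translation on Windows and, in 1.9+, also pins
# the stream's external encoding to ASCII-8BIT (aka BINARY).
enc = File.open(f.path, 'rb') { |io| io.external_encoding }
enc  # => #<Encoding:ASCII-8BIT>
```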
Brian Candler (candlerb)
on 2009-02-16 10:50
Brian Candler wrote:
> Yes. IMO this is a horrendous misfeature of ruby 1.9: it asserts that
> all external data is text, unless explicitly told otherwise.

And worse: the encoding chosen comes from the environment. So your
program which you developed on one system and runs correctly there may
fail totally on another.

I'm not saying that Ruby shouldn't handle encodings and conversions;
I'm just saying you should have to ask for them. For example:

   File.open("....", :encoding => "UTF-8")   # Use this encoding
   File.open("....", :encoding => "ENV")     # Follow the environment
   File.open("....")                         # No idea, treat as binary

I'm not going to use 1.9 without wrapper scripts to invoke Ruby with
appropriate flags to force the external encoding to a fixed value. And
that's a pain.
Stefan Lang (Guest)
on 2009-02-16 12:46
(Received via mailing list)
2009/2/16 Brian Candler <b.candler@pobox.com>:
> Brian Candler wrote:
>> Yes. IMO this is a horrendous misfeature of ruby 1.9: it asserts that
>> all external data is text, unless explicitly told otherwise.

Ruby must choose between treating all external data as text unless
told otherwise and treating everything as binary unless told otherwise,
because there is no general way to know whether a file is binary or
text.

Given that Ruby is mostly used to work with text, it's a
sensible decision to use text mode by default.

Also, if you open a file with the "b" flag, it sets the file's
encoding to binary. You should use that flag in 1.8 too, otherwise
Windows will do line-ending conversion, corrupting your binary data.

> And worse: the encoding chosen comes from the environment. So your
> program which you developed on one system and runs correctly there may
> fail totally on another.

It has to default to some encoding. Your OS installation has a
default encoding. It's a sane decision to use that, because
otherwise many scripts wouldn't work by default on your machine.

> I'm not saying that Ruby shouldn't handle encodings and conversions;
> I'm just saying you should have to ask for them. For example:
>
>   File.open("....", :encoding => "UTF-8")   # Use this encoding

Well, you can do exactly that...

>   File.open("....", :encoding => "ENV")     # Follow the environment

This is the default.

>   File.open("....")                         # No idea, treat as binary

Use "b" flag, which you should do on 1.8 anyway.

> I'm not going to use 1.9 without wrapper scripts to invoke Ruby with
> appropriate flags to force the external encoding to a fixed value. And
> that's a pain.

You can set it with Encoding.default_external= at the top of your
script.
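[That suggestion can be sketched like this; the save-and-restore is only for the demo, a real script would just set it once at the top:]

```ruby
# Pin the process-wide default external encoding, overriding whatever
# the locale says.
saved = Encoding.default_external
Encoding.default_external = Encoding::UTF_8

# Streams opened without an explicit encoding now report UTF-8,
# regardless of LANG/LC_CTYPE.
enc = File.open(__FILE__) { |f| f.external_encoding }
enc  # => #<Encoding:UTF-8>

Encoding.default_external = saved   # restore, just for this demo
```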

Stefan
Brian Candler (candlerb)
on 2009-02-16 13:22
Stefan Lang wrote:
> Ruby must choose between treating all external data as
> text unless told otherwise or treat everything as binary
> unless told otherwise, because there is no general way
> to know if a file is binary or text.

Yes (and I wouldn't want it to try to guess)

> Given that Ruby is mostly used to work with text, it's a
> sensible decision to use text mode by default.

That's where I disagree. There are tons of non-text applications:
images, compression, PDFs, Marshal, DRb... Furthermore, as the OP
demonstrated, there are plenty of use cases where the files presented
are almost ASCII, but not quite. The default behaviour now is to
crash, rather than to treat these as streams of bytes.

I don't want my programs to crash in these cases.

> It has to default to some encoding.

That's where I also disagree. It can default to stream of bytes.

>>   File.open("....", :encoding => "ENV")     # Follow the environment
>
> This is the default.

That's what I don't want. Given this default, I must either:

(1) Force all my source to have the correct encoding flag set
everywhere. If I don't test for this, my programs will fail in
unexpected ways. Tests for this are awkward; they'd have to set the
environment to a certain locale (e.g. UTF-8), pass in data which is not
valid in that locale, and check no exception is raised.

(2) Use a wrapper script either to call Ruby with the correct
command-line flags, or to sanitise the environment.

> Encoding.default_external=

I guess I can use that at the top of everything in bin/ directory. It
may be sufficient, but it's annoying to have to remember that too.
Stefan Lang (Guest)
on 2009-02-16 15:12
(Received via mailing list)
2009/2/16 Brian Candler <b.candler@pobox.com>:
>
> That's where I disagree. There are tons of non-text applications:
> images, compression, PDFs, Marshal, DRb...

Point taken.

> Furthermore, as the OP
> demonstrated, there are plenty of use cases where the files presented
> are almost ASCII, but not quite. The default behaviour now is to
> crash, rather than to treat these as streams of bytes.
>
> I don't want my programs to crash in these cases.

Let's compare. Situation: I'm reading binary files and
forget to specify the "b" flag when opening the file(s).

Result in Ruby 1.8:
* My stuff works fine on Linux/Unix. Somebody else runs
  the script on Windows, the script corrupts data because
  Windows does line ending conversion.

Result in Ruby 1.9:
* On the first run on my Linux machine, I get an EncodingError.
   I fix the problem by specifying the "b" flag on open. Done.

I definitely prefer Ruby 1.9 behavior.

>> It has to default to some encoding.
>
> That's where I also disagree. It can default to stream of bytes.
>
>>>   File.open("....", :encoding => "ENV")     # Follow the environment
>>
>> This is the default.
>
> That's what I don't want. Given this default, I must either:

That assumes the default is always wrong.

>
> I guess I can use that at the top of everything in bin/ directory. It
> may be sufficient, but it's annoying to have to remember that too.

Here's why the default is good, IMO. The cases where I really don't
want to explicitly specify encodings are one-liners (-e) and short
throwaway scripts. If the default encoding were binary, string
operations would deal incorrectly with accented characters in German
(my native language). Using the locale encoding does the right thing
here.

If I write a longer program, explicitly setting the default external
encoding isn't an effort worth mentioning. Set it to ASCII_8BIT
and it behaves like 1.8.

Stefan
Tom Link (Guest)
on 2009-02-16 16:33
(Received via mailing list)
> Result in Ruby 1.8:
> * My stuff works fine on Linux/Unix. Somebody else runs
>   the script on Windows, the script corrupts data because
>   Windows does line ending conversion.
>
> Result in Ruby 1.9:
> * On the first run on my Linux machine, I get an EncodingError.
>    I fix the problem by specifying the "b" flag on open. Done.

Actually, this is a point I have never quite understood. Why does only
the Windows version convert line endings? Is it out of the question that
somebody could want to process a text file created under Windows on a
Linux box or virtual machine? Regular expressions that check only for
\n but not \r won't work then. Now you could of course take the stance
that you simply have to check for \r too, but then why automatically
convert line separators under Windows? Or did I miss something
obvious?

This is also the reason why I think opening text files as binary isn't
really a solution. It leads to either convoluted regexps or non-portable
code. (Unless I missed something obvious, which is quite possible.)

I personally find it somewhat confusing to have to juggle different
encodings. IMHO it would have been preferable to define a fixed internal
encoding (UTF-16 or whatever), to transcode every string that is known
to be text (and identifiers) to that canonical/uniform encoding, and to
deal with everything else as a sequence of bytes.

BTW I recently skimmed through the Python 3000 user guide. From what I
understand, they distinguish between strings as (binary) data and
strings as text (encoded as Unicode).
Gary Wright (Guest)
on 2009-02-16 16:47
(Received via mailing list)
On Feb 16, 2009, at 10:33 AM, Tom Link wrote:
> Actually, this is a point I have never quite understood. Why does only
> the windows version convert line endings?

Because at the operating system level Windows distinguishes between
text and binary files and Unix doesn't.

The "b" option has been part of the standard C library for decades and
Windows is not the only operating system that distinguishes between
text and binary files.

Proper handling of line termination requires the library to know if it
is working with a binary or text file.  On Unix it doesn't matter if
you fail to give the library correct information (i.e. omit the b flag
for binary files) but your code becomes non-portable.  It will fail on
systems that treat text and binary files differently.

Gary Wright
Stefan Lang (Guest)
on 2009-02-16 16:59
(Received via mailing list)
2009/2/16 Tom Link <micathom@gmail.com>:
> Why does only
> the Windows version convert line endings? Is it out of the question that
> somebody could want to process a text file created under Windows on a
> Linux box or virtual machine? Regular expressions that check only for
> \n but not \r won't work then. Now you could of course take the stance
> that you simply have to check for \r too, but then why automatically
> convert line separators under Windows? Or did I miss something
> obvious?

It's the underlying C API that does the line ending conversion.
Ruby inherited that behavior.

> ...and to deal with everything else as a sequence of bytes.
Ruby does that when you set the internal encoding with
Encoding.default_internal=
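[The per-stream equivalent looks like this (a sketch using ISO-8859-1 sample data):]

```ruby
require 'tempfile'

# "ü" and "ß" are the single bytes 0xFC and 0xDF in ISO-8859-1.
f = Tempfile.new('latin1')
f.binmode
f.write("gr\xFC\xDFe\n")
f.close

# "r:ISO-8859-1:UTF-8" names the external and internal encodings: read
# Latin-1 bytes and transcode them to UTF-8 on the way in -- the
# per-stream version of setting Encoding.default_internal.
text = File.open(f.path, 'r:ISO-8859-1:UTF-8') { |io| io.read }
text           # => "grüße\n"
text.encoding  # => #<Encoding:UTF-8>
```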

> BTW I recently skimmed through the python3000 user guide. From what I
> understand, they seem to distinguish between strings as (binary) data
> and strings as text (encoded as utf).

There were many and long discussions about the encoding API,
mostly on Ruby core. If you search the archives you can find
why the current API is how it is.

IIRC, these were important issues:

* We don't have a single internal string encoding (like Java and
  Python) because there are many Ruby users, especially in Asia, who
  still have to work with legacy encodings for which a lossless Unicode
  round-trip is not possible. They'd be forced to use the binary API.

* Because Ruby already has a rich String API, and because it simplifies
  porting of 1.8 code, there is no separate data type for binary
  strings.

Stefan
Tom Link (Guest)
on 2009-02-16 17:34
(Received via mailing list)
> It's the underlying C API that does the line ending conversion.
> Ruby inherited that behavior.

Thanks for the clarification (also thanks to Gary).

> Ruby does that when you set the internal encoding with
> Encoding.default_internal=

Unfortunately this isn't entirely true -- see:
http://groups.google.com/group/ruby-core-google/br...

It doesn't convert strings & identifiers in scripts. Of course, there
are good reasons for that (see the responses in the thread) if you
don't define a canonical internal encoding.

> There were many and long discussions about the encoding API,
> mostly on Ruby core.

I know that people who understand the issues at hand much better than
I do have discussed this subject extensively. I still struggle to fully
understand their conclusions, though. But that's probably only a matter
of time.

Regards,
Thomas.
Yukihiro Matsumoto (Guest)
on 2009-02-17 02:17
(Received via mailing list)
Hi,

In message "Re: invalid byte sequence in US-ASCII (ArgumentError)"
    on Mon, 16 Feb 2009 09:15:54 +0900, Luther <lutheroto@gmail.com>
writes:

|I'm having some trouble migrating from 1.8 to 1.9.1. I have this line of
|code:
|
|text.gsub! "\C-m", ''
|
|...which generates this error:
|
|/home/luther/bin/dos2gnu:16:in `gsub!': invalid byte sequence in
|US-ASCII (ArgumentError)
|
|The purpose is to strip out any ^M characters from the string.

I feel some smell of a bug.  Could you show me the whole code and
reproducing input please?

              matz.
Luther (Guest)
on 2009-02-17 02:34
(Received via mailing list)
On Mon, 2009-02-16 at 14:20 +0900, Tom Link wrote:
> In your case, you should probably read the files as ASCII-8BIT
> or binary, I guess.

Thank you. I've put 'r:binary' in the line where I open the file, and
now it seems to work fine. Although if I didn't want to be lazy, I would
probably read it with the default encoding, catch the ArgumentError,
then reread the file.
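[That fallback could look something like this (a sketch; the sample file is generated with Tempfile so the snippet runs on its own):]

```ruby
require 'tempfile'

sample = Tempfile.new('mixed')
sample.binmode
sample.write("ok\r\nbad\xFF\r\n")   # 0xFF is invalid in UTF-8
sample.close

# Try an ordinary text read first; fall back to raw bytes only when the
# data turns out not to be valid in the expected encoding.
text = File.read(sample.path, encoding: 'UTF-8')
begin
  text.gsub!("\r", '')
rescue ArgumentError                # invalid byte sequence in UTF-8
  text = File.read(sample.path, mode: 'rb')
  text.gsub!("\r", '')
end
text  # => "ok\nbad\xFF\n" (as ASCII-8BIT bytes)
```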

Thanks again...
Luther
Luther (Guest)
on 2009-02-17 05:27
(Received via mailing list)
Attachment: eva_pt11.txt (3 KB)
On Tue, 2009-02-17 at 10:16 +0900, Yukihiro Matsumoto wrote:
> I feel some smell of a bug.  Could you show me the whole code and
> reproducing input please?

Sure, here you go...

#!/usr/bin/ruby -w

# Copyright 2007-2009 Luther Thompson. This file is distributed under
# the GNU General Public License, version 3 or any later version.

# Strips out ^M characters from the files given as arguments. This
# helps ensure that emacs can display the text properly.
# Warning: This program will overwrite the original file(s), so pray
# there aren't any serious bugs.

ARGV.each do |filename|

  puts "Removing ^M characters from #{filename}"

  text = String.new

  File.open filename do |f|
    text = f.gets nil
  end

  text.gsub! "\C-m", ''

  File.open filename, 'w' do |f|
    f.puts text
  end

end

__END__

Using Ubuntu 8.10 with ruby installed from source and linked over
to /usr/bin.

$ /usr/bin/ruby -v
ruby 1.9.1p0 (2009-01-30 revision 21907) [x86_64-linux]

The attached file contains all of the apparently binary data that was in
the original text file that I downloaded.

Luther
Tom Link (Guest)
on 2009-02-17 08:26
(Received via mailing list)
> > I feel some smell of a bug.  Could you show me the whole code and
> > reproducing input please?
>
> Sure, here you go...

When I recently stumbled over some not-so-different problems (one of
which is described here [1]), it was because the external encoding (see
Encoding.default_external) defaulted to US-ASCII on Cygwin: ruby 1.9.1
RC0 ignored the Windows locale and the value of the LANG variable -- the
part about the Windows locale was fixed in the meantime. AFAIK, if ruby
1.9.1 cannot determine the environment's locale, it defaults to
US-ASCII, which causes the described problem whenever a character is > 127.


[1]
http://groups.google.com/group/ruby-talk-google/br...
Luther Thompson (luther)
on 2009-02-17 14:55
Tom Link wrote:
> When I recently stumbled over not so different problems (one of which
> is described here [1]) it was because the external encoding (see
> Encoding.default_external) defaulted to US-ASCII on cygwin because
> ruby191RC0 ignored the windows locale and the value of the LANG
> variable -- the part with the windows locale was fixed in the
> meantime. AFAIK if ruby 191 cannot determine the environment's locale,
> it defaults to US-ASCII which causes the described problem if a
> character is > 127.

Actually, I always set my LANG to C. Since my original post, I found
that I had forgotten to set my LC_CTYPE to en_US.UTF-8, which is
Ubuntu's default. After fixing that, I still got the same error, but
with "UTF-8" instead of "US-ASCII".

I believe the metadata in that text file must be binary code that was
put there by some word processor, because I remember seeing "Helvetica"
somewhere in there.

Luther
Yukihiro Matsumoto (Guest)
on 2009-03-13 11:52
(Received via mailing list)
Hi,

In message "Re: invalid byte sequence in US-ASCII (ArgumentError)"
    on Tue, 17 Feb 2009 22:55:23 +0900, Luther Thompson
<lutheroto@gmail.com> writes:

|Actually, I always set my LANG to C. Since my original post, I found
|that I had forgotten to set my LC_CTYPE to en_US.UTF-8, which is
|Ubuntu's default. After fixing that, I still got the same error, but
|with "UTF-8" instead of "US-ASCII".

Ruby 1.9 distinguishes between text files and binary files, so please
specify "rb" instead of plain "r" (or omitting the mode) for binary
files. This restriction may be loosened in the future for simple
substitutions like this one.

              matz.
Jason O. (jason_o)
on 2010-11-10 19:39
The magic encoding comment didn't cut it for me. I found the answer in
my case by adding the following to my environment.rb (I run a mixed 1.8
and 1.9 environment):

if RUBY_VERSION =~ /1\.9/
  Encoding.default_external = Encoding::UTF_8
  Encoding.default_internal = Encoding::UTF_8
end