Forum: Ruby ruby 1.9.1: Encoding trouble: broken US-ASCII String

Tom L. (Guest)
on 2008-12-14 19:10
(Received via mailing list)
Hi,

Right now, I'm not exactly thrilled by the way ruby 1.9 handles
encodings. Could somebody please explain things or point me to some
reference material:

I have the following files:

testEncoding.rb:
#!/usr/bin/env ruby
# encoding: ISO-8859-1

p __ENCODING__

text = File.read("text.txt")
text.each_line do |line|
    p line =~ /foo/
end


text.txt:
Foo äöü bar.

I use: ruby 1.9.1 (2008-12-01 revision 20438) [i386-cygwin]

If I run: ruby19 testEncoding.rb, I get:
#<Encoding:ISO-8859-1>
testEncoding.rb:8:in `block in <main>': broken US-ASCII string
(ArgumentError)

Ruby detects the encoding line but suspects the text file to be 7bit
ascii nevertheless. The source file encoding is only respected if I
add the command line option -E ISO-8859-1. I could also set the
encoding explicitly for each string but ...

I found some hints that the default charset for external sources is
deduced from the locale. So I set LANG to de_AT, de_AT.ISO-8859-1 and
some more variants, to no avail.

How exactly is this supposed to work? What other options do I have to
make ASCII-8BIT or Latin-1 the default encoding without having to
supply an extra command-line option and without having to rely on an
environment variable? Why isn't ASCII-8BIT the default in the first
place? Why isn't __ENCODING__ a global variable I can assign a value
to?

Thanks,
Thomas.
Brian C. (Guest)
on 2008-12-15 14:17
Tom L. wrote:
> Right now, I'm not exactly thrilled by the way ruby 1.9 handles
> encodings. Could somebody please explain things or point me to some
> reference material:

I asked the same over at ruby-core recently. There were some useful
replies:

http://www.ruby-forum.com/topic/173179#759661

But the upshot is that this is all pretty much undocumented so far.
(Well it might be documented in the 3rd ed Pickaxe, but I'm not buying
that yet)

> text = File.read("text.txt")

This should work:

text = File.read("text.txt", :encoding=>"ISO-8859-1")

I still don't know how the default is worked out though.

Regards,

Brian.
Tom L. (Guest)
on 2008-12-15 15:01
(Received via mailing list)
> text = File.read("text.txt", :encoding=>"ISO-8859-1")

Unfortunately, this isn't compatible with ruby 1.8. A script that uses
such a construct runs only with ruby 1.9. Sigh.
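One way to keep a script running on both versions is to feature-detect the 1.9 String#encoding API instead of the option hash; a minimal sketch under that assumption (the Latin-1 sample file is created on the fly here purely so the snippet is self-contained):

```ruby
require 'tempfile'

# Feature-detect the 1.9 encoding API rather than parsing RUBY_VERSION;
# on 1.8, strings have no #encoding method, so the option is skipped.
def read_latin1(path)
  if "".respond_to?(:encoding)
    File.read(path, :encoding => "ISO-8859-1")   # Ruby 1.9+
  else
    File.read(path)                              # Ruby 1.8: raw bytes
  end
end

Tempfile.create("text") do |f|
  f.binmode
  f.write("Foo \xE4\xF6\xFC bar.")   # "Foo äöü bar." as Latin-1 bytes
  f.flush
  text = read_latin1(f.path)
  p text.encoding                    # on 1.9+: #<Encoding:ISO-8859-1>
  p text =~ /Foo/                    # => 0
end
```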

Many thanks for the pointer to the other thread over at ruby core.

Regards,
Thomas.
Brian C. (Guest)
on 2008-12-15 15:40
Tom L. wrote:
>> text = File.read("text.txt", :encoding=>"ISO-8859-1")
>
> Unfortunately, this isn't compatible with ruby 1.8. A script that uses
> such a construct runs only with ruby 1.9. Sigh.

If all else fails, read the source.

I see that the encoding falls back to rb_default_external_encoding(),
which returns default_external, setting it if necessary from
rb_enc_from_index(default_external_index)

This in turn is set from rb_enc_set_default_external

This in turn is set from cmdline_options.ext.enc.name

And this in turn is set from the -E flag (or certain legacy settings on
-K). So:

$ ruby19 -E ISO-8859-1 -e 'puts File.open("/etc/passwd").gets.encoding'
ISO-8859-1

Yay. However, if it is possible to set the default external encoding
programmatically (i.e. not via the command-line options), I couldn't
see how.
Brian C. (Guest)
on 2008-12-15 15:48
Brian C. wrote:
> $ ruby19 -E ISO-8859-1 -e 'puts File.open("/etc/passwd").gets.encoding'
> ISO-8859-1

D'oh. I see from original post that you knew this already.

It seems that Ruby keeps state for:
- default external encoding (e.g. for files being read in)
- default internal encoding (not sure what this is, you can set using -E
too but it defaults to nil)

and these are independent from the encodings of source files, which use
the magic comments to declare their encoding.

You can read these using Encoding.default_external and
Encoding.default_internal, but there don't appear to be setters for
them.
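For the record, later 1.9 builds did grow a setter (Encoding.default_external=); a sketch that probes for it rather than assuming it exists, since the 2008-12-01 snapshot may lack it:

```ruby
p Encoding.default_external
p Encoding.default_internal   # nil unless -E or -U was given

# The setter appeared in later releases; probe rather than assume.
if Encoding.respond_to?(:default_external=)
  saved = Encoding.default_external
  Encoding.default_external = "ISO-8859-1"
  p Encoding.default_external   # => #<Encoding:ISO-8859-1>
  Encoding.default_external = saved
end
```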
Brian C. (Guest)
on 2008-12-15 16:02
Ah, there is a preview here:

http://books.google.co.uk/books?id=jcUbTcr5XWwC&pg...

Something like this may do the trick:

text = File.open("..") do |f|
  f.set_encoding("ISO-8859-1") rescue nil
  f.read
end

But then you may as well just do:

text.force_encoding("ISO-8859-1") rescue nil

I'm not sure in which way the regexp is incompatible with the data read.
I would have thought that a US-ASCII regexp should be able to match
ISO-8859-1 data, and perhaps vice versa, but it seems not.

I can't really replicate without a hexdump of your text.txt. But it
would be interesting to see the result of:

text.each_line do |line|
    p line.encoding
    p /foo/.encoding
    p line =~ /foo/
end

Maybe what's really needed is a sort of "anti-/u" option which means "my
regexp literals are meant to match byte-at-a-time, not
character-at-a-time"

Anyway, I'm afraid all this increases my inclination to stick with ruby
1.8.6 :-(
James G. (Guest)
on 2008-12-15 16:09
(Received via mailing list)
On Dec 15, 2008, at 6:10 AM, Brian C. wrote:

> But the upshot is that this is all pretty much undocumented so far.
> (Well it might be documented in the 3rd ed Pickaxe, but I'm not buying
> that yet)

The Pickaxe does cover a lot of the new encoding behavior.

James Edward G. II
James G. (Guest)
on 2008-12-15 16:12
(Received via mailing list)
On Dec 15, 2008, at 7:41 AM, Brian C. wrote:

> - default internal encoding (not sure what this is, you can set
> using -E
> too but it defaults to nil)

Default internal is the encoding IO objects will transcode incoming
data into, by default.  So you could set this for UTF-8 and then read
from various different encodings (specifying each type in the open()
call), but only work with Unicode in your script.
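In other words, an IO can carry two encodings, external:internal, and transcodes as it reads. A small sketch (the file is written here only to make the snippet runnable):

```ruby
require 'tempfile'

Tempfile.create("latin1") do |f|
  f.binmode
  f.write("Foo \xE4\xF6\xFC bar.")   # "Foo äöü bar." in Latin-1 bytes
  f.flush
  # external ISO-8859-1, internal UTF-8: data is transcoded on read
  text = File.open(f.path, "r:ISO-8859-1:UTF-8") { |io| io.read }
  p text.encoding   # => #<Encoding:UTF-8>
  p text            # => "Foo äöü bar."
end
```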

James Edward G. II
James G. (Guest)
on 2008-12-15 16:27
(Received via mailing list)
On Dec 15, 2008, at 7:55 AM, Brian C. wrote:

> I would have thought that a US-ASCII regexp should be able to match
> ISO-8859-1 data, and perhaps vice versa, but it seems not.

It does:

$ ruby_dev -e 'p "résumé".encode("ISO-8859-1") =~ /foo/'
nil
$ ruby_dev -e 'p "résumé foo".encode("ISO-8859-1") =~ /foo/'
7

> Maybe what's really needed is a sort of "anti-/u" option which means
> "my
> regexp literals are meant to match byte-at-a-time, not
> character-at-a-time"

That's what BINARY means.

> Anyway, I'm afraid all this increases my inclination to stick with
> ruby
> 1.8.6 :-(

Perhaps it's a bit early to make this judgement since you've just
started learning about the new system?

There's a lot going on here, so it's a lot to take in.  In places, the
behavior is a little complex.  However, the core team has put a lot of
effort into making the system easier to use.  It's getting there.

Also, even in its current draft form, the Pickaxe answers every
question you've thrown at both mailing lists.  Thus it should be a big
help when you decide the time is right to pick it up.

James Edward G. II
Brian C. (Guest)
on 2008-12-15 16:57
James G. wrote:
>> I would have thought that a US-ASCII regexp should be able to match
>> ISO-8859-1 data, and perhaps vice versa, but it seems not.
>
> It does:
>
> $ ruby_dev -e 'p "résumé".encode("ISO-8859-1") =~ /foo/'
> nil
> $ ruby_dev -e 'p "résumé foo".encode("ISO-8859-1") =~ /foo/'
> 7

I found that too, but was confused by the "broken US-ASCII string"
exception which the OP saw.

I suppose the external_encoding is defaulting to US-ASCII on that
system.

This means his program will break on every file passed into it which has
a character with the top bit set. You can argue that's "failsafe", in
the sense of bombing out rather than continuing processing with the
wrong encoding, and it therefore forces you to change your program or
the command-line args to specify the actual encoding in use.

However, that's pretty unforgiving. I can use Unix grep on a file with
unknown character set or broken UTF-8 characters and it works quite
happily.

Wouldn't it be kinder to default to BINARY if the encoding is
unspecified?

irb(main):011:0> s = "foo\xff\xff\xffbar".force_encoding("BINARY")
=> "foo\xFF\xFF\xFFbar"
irb(main):012:0> s =~ /foo/
=> 0

>> Maybe what's really needed is a sort of "anti-/u" option which means
>> "my
>> regexp literals are meant to match byte-at-a-time, not
>> character-at-a-time"
>
> That's what BINARY means.

On the String side, yes.

I was thinking of an option on the Regexp: /foo/b or somesuch.
(In contrast to /foo/u in 1.8 meaning 'this Regexp matches unicode')

Or can you set BINARY encoding on the Regexp too? I couldn't see
how.
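It turns out you can: a regexp compiled from an ASCII-8BIT source string is itself binary, and in 1.9 the /n flag keeps a 7-bit literal matching byte-at-a-time. A quick sketch:

```ruby
data = "foo\xFF\xFFbar".dup.force_encoding("ASCII-8BIT")

# A regexp built from a BINARY string carries BINARY encoding ...
bin = Regexp.new("\xFF".dup.force_encoding("ASCII-8BIT"))
p bin.encoding     # => #<Encoding:ASCII-8BIT>
p bin =~ data      # => 3

# ... and /n on a 7-bit literal also matches binary data without raising.
p(/foo/n =~ data)  # => 0
```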
Tom L. (Guest)
on 2008-12-15 17:15
(Received via mailing list)
> There's a lot going on here, so it's a lot to take in.  In places, the  
> behavior is a little complex.  However, the core team has put a lot of  
> effort into making the system easier to use.  It's getting there.

It would have been nice, though, if the defaults had been chosen so
that they don't break 1.8 scripts -- or used some 8-bit clean encoding
when the data contains 8-bit characters, instead of throwing an error.
James G. (Guest)
on 2008-12-15 17:28
(Received via mailing list)
On Dec 15, 2008, at 9:07 AM, Tom L. wrote:

>> There's a lot going on here, so it's a lot to take in.  In places,
>> the
>> behavior is a little complex.  However, the core team has put a lot
>> of
>> effort into making the system easier to use.  It's getting there.
>
> It would have been nice though if the defaults had been chosen so that
> they don't break 1.8 scripts -- or use some 8bit clean encoding if the
> data contains 8bit wide characters instead of throwing an error.

I think it's probably more important to get this encoding interface
right than to worry about 1.8 compatibility.  We knew 1.9 was going to
break some things, so the time was right.

Also, if you've been using the -KU switch in Ruby 1.8 and working with
UTF-8 data, 1.9 may work pretty well for you:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/...

That's a pretty common "best practice" in the Ruby community, from
what I've seen.  Even Rails pushes this approach now.

If you have gone this way though, you may want to migrate to the even
better -U switch in 1.9.

James Edward G. II
James G. (Guest)
on 2008-12-15 17:31
(Received via mailing list)
On Dec 15, 2008, at 8:50 AM, Brian C. wrote:

> Wouldn't it be kinder to default to BINARY if the encoding is
> unspecified?

The default encoding is pulled from your environment:  LANG or
LC_CTYPE, I believe.  This is very important and it makes simple
scripting fit in well with the environment.
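Ruby also registers the result of that environment lookup under an encoding alias, so you can inspect what it decided; a quick probe (the exact answer depends on your LANG/LC_CTYPE):

```ruby
# "locale" aliases the encoding Ruby derived from the environment;
# "external" aliases the current default external encoding.
p Encoding.find("locale")
p Encoding.find("external")
```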

James Edward G. II
Ollivier R. (Guest)
on 2008-12-15 17:55
(Received via mailing list)
In article <removed_email_address@domain.invalid>,
James G.  <removed_email_address@domain.invalid> wrote:
>Perhaps it's a bit early to make this judgement since you've just
>started learning about the new system?

From what I've seen in a few months of experimenting with 1.9, my main
gripe is that the whole encoding support is overly complex. I know m17n
is not solved by a magic unicode wand, but I'd love to have a simpler
way.
Brian C. (Guest)
on 2008-12-15 18:00
>> Wouldn't it be kinder to default to BINARY if the encoding is
>> unspecified?
>
> The default encoding is pulled from your environment:  LANG or
> LC_CTYPE, I believe.  This is very important and it makes simple
> scripting fit in well with the environment.

The code seems to say:
- if an encoding is chosen in the environment but is unknown to Ruby,
  use ASCII-8BIT (aka BINARY)
- if Ruby was built on a system where it doesn't know how to ask the
  environment for a language, then use US-ASCII

So I would read from this that the OP has either fallen foul of the
US-ASCII fallback (e.g. no langinfo.h when building under Cygwin), or
else his environment has explicitly picked US-ASCII.

There must have been a good reason why US-ASCII was chosen, rather than
ASCII-8BIT, for systems without langinfo.h.

Regards,

Brian.

rb_locale_encoding(void)
{
    VALUE charmap = rb_locale_charmap(rb_cEncoding);
    int idx;

    if (NIL_P(charmap))
        idx = rb_usascii_encindex();
    else if ((idx = rb_enc_find_index(StringValueCStr(charmap))) < 0)
        idx = rb_ascii8bit_encindex();

    if (rb_enc_registered("locale") < 0) enc_alias("locale", idx);

    return rb_enc_from_index(idx);
}

...

VALUE
rb_locale_charmap(VALUE klass)
{
#if defined NO_LOCALE_CHARMAP
    return rb_usascii_str_new2("ASCII-8BIT");
#elif defined HAVE_LANGINFO_H
    char *codeset;
    codeset = nl_langinfo(CODESET);
    return rb_usascii_str_new2(codeset);
#elif defined _WIN32
    return rb_sprintf("CP%d", GetACP());
#else
    return Qnil;
#endif
}
Yukihiro M. (Guest)
on 2008-12-15 18:13
(Received via mailing list)
Hi,

In message "Re: ruby 1.9.1: Encoding trouble: broken US-ASCII String"
    on Tue, 16 Dec 2008 00:47:37 +0900, Ollivier R.
<removed_email_address@domain.invalid> writes:

|From what I've seen and experimented with 1.9 for a few months, my main gripe
|is that the whole encoding support is overly complex. I know m17n is not
|solved by the magic unicode wand but I'd love to have a more simple way.

The whole picture must be complex, since encoding support itself is
VERY complex indeed.  History sucks.  But for daily use, just remember
to specify the encoding if you are not sure what the default encoding
is, e.g.

  f = open(path, "r:iso-8859-1")

or

  f = open(path, "r", encoding: "iso-8859-1")

Simple?  If you want to convert your data into Unicode every time you
read, just put -U at your shebang (#!) line, in addition.
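Both spellings given above are equivalent; a quick check against a throwaway Latin-1 file (created here only so the snippet runs):

```ruby
require 'tempfile'

Tempfile.create("sample") do |f|
  f.binmode
  f.write("gr\xFC\xDF")   # "grüß" in Latin-1 bytes
  f.flush
  a = File.open(f.path, "r:iso-8859-1") { |io| io.read }
  b = File.open(f.path, "r", :encoding => "iso-8859-1") { |io| io.read }
  p a.encoding   # => #<Encoding:ISO-8859-1>
  p a == b       # => true
end
```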

              matz.
Brian C. (Guest)
on 2008-12-15 18:24
Yukihiro M. wrote:
> The whole picture must be complex, since encoding support itself is
> VERY complex indeed.  History sucks.  But for daily use, just remember
> specifying encoding if you are not sure what is the default_encoding,
> e.g.
>
>   f = open(path, "r:iso-8859-1")

It seems to go against DRY to have to write "r:binary" or "rb:binary"
when opening lots of binary files. But if I remember to use
#!/usr/bin/ruby -Knw everywhere that should be OK.

However, I also don't like the unstated assumption that all Strings
contain text.

In RFC2045 (MIME), there is a distinction made between 7bit text, 8bit
text, and binary data.

But if you label a string as "binary", Ruby changes this to
"ASCII-8BIT". I think that is a misrepresentation of that data, if it is
not actually ASCII-based text. I would much rather it made no assertion
about the content than a wrong assertion.
Tom L. (Guest)
on 2008-12-15 18:30
(Received via mailing list)
> Also, if you've been using the -KU switch in Ruby 1.8 and working with
> UTF-8 data, 1.9 may work pretty well for you

Well, I'm still stuck with latin-1. It's interesting though that
according to B Candler the fallback for unknown encodings should be
8-bit clean and that US-ASCII should only be used as a last resort.
Maybe it's just a cygwin thing?

Could we/I please get more information on how exactly the charset is
chosen, depending on which environment variables, and whether this
applies to cygwin too? It appears to me that neither LANG nor LC_CTYPE
has any effect on charset selection. But maybe I'm doing it wrong.

Regards,
Thomas.
Dave T. (Guest)
on 2008-12-15 18:31
(Received via mailing list)
On Dec 15, 2008, at 10:16 AM, Brian C. wrote:

> It seems to go against DRY to have to write "r:binary" or "rb:binary"
> when opening lots of binary files. But if I remember to use
> #!/usr/bin/ruby -Knw everywhere that should be OK.

You used to have to do that. In recent HEADS, rb sets binary encoding
automatically (unless overridden).


Dave
Yukihiro M. (Guest)
on 2008-12-15 19:39
(Received via mailing list)
Hi,

In message "Re: ruby 1.9.1: Encoding trouble: broken US-ASCII String"
    on Tue, 16 Dec 2008 01:16:55 +0900, Brian C.
<removed_email_address@domain.invalid> writes:

|It seems to go against DRY to have to write "r:binary" or "rb:binary"
|when opening lots of binary files. But if I remember to use
|#!/usr/bin/ruby -Knw everywhere that should be OK.
|
|However, I also don't like the unstated assumption that all Strings
|contain text.

open(path, "rb") is your friend.  It sets encoding to binary.


|In RFC2045 (MIME), there is a distinction made between 7bit text, 8bit
|text, and binary data.
|
|But if you label a string as "binary", Ruby changes this to
|"ASCII-8BIT". I think that is a misrepresentation of that data, if it is
|not actually ASCII-based text. I would much rather it made no assertion
|about the content than a wrong assertion.
Brian C. (Guest)
on 2008-12-15 22:52
Yukihiro M. wrote:
> open(path, "rb") is your friend.  It sets encoding to binary.

Thanks.

"rb" is now performing two jobs then - prevent line-ending translation
(on those platforms which do it), and set encoding to binary. Something
to remember.
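Both effects are easy to confirm; a small sketch (the file contents are arbitrary, chosen only to show that CRLF survives):

```ruby
require 'tempfile'

Tempfile.create("bin") do |f|
  f.binmode
  f.write("a\r\nb")
  f.flush
  File.open(f.path, "rb") do |io|
    p io.external_encoding   # => #<Encoding:ASCII-8BIT>
    p io.read                # => "a\r\nb"  (CRLF untouched, even on Windows)
  end
end
```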
Tom L. (Guest)
on 2008-12-16 22:42
(Received via mailing list)
> So I would read from this that the OP has either fallen foul of the
> US-ASCII fallback (e.g. no langinfo.h when building under Cygwin), or
> else his environment has explicitly picked US-ASCII.

Somebody mentions on http://bugs.python.org/issue3824 that:
"And nl_langinfo(CODESET) is useless on cygwin because it's always US-
ASCII."

And here: http://svn.xiph.org/trunk/vorbis-tools/intl/localcharset.c
"Cygwin 2006 does not have locales.  nl_langinfo (CODESET) always
returns "US-ASCII"."

If I understood you right, this could cause the problems I
encountered.

Cygwin 1.7 is currently in beta. Maybe this improves things in this
respect?

Regards,
Thomas.