$KCODE and encodings


#1

I was searching for string encoding issues in Ruby. Here is a summary
of what I learnt, in case it's useful to anyone else, or if anyone has
any corrections to this.

Ruby 1.8 support for encoding:

  •     A comment like "# -*- coding: utf-8 -*-" at the start of the
        file is supposed to determine how to parse a .rb file, but I
        haven’t really figured out how to make this work. Non-ANSI
        characters cause an error while loading the file.

  •     ruby.exe -K<kcode> sets $KCODE (which can also be set
        programmatically)

  •     $KCODE affects the following:

          •     Determines the encoding to use to parse .rb files.
                Normally, identifiers have to be ANSI, but the
                limitation is removed if $KCODE is set to “UTF8”.

          •     Affects whether inspect escapes non-ASCII chars, or if
                it leaves them as is.

          •     Affects how regexps without an explicit encoding
                interpret the input string.
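
To make the regexp point concrete, here is a rough sketch. Note the assumption: it targets Ruby 1.9 or later (String#b needs 2.0), where $KCODE no longer has any effect and the string's own encoding decides how an unflagged regexp walks the input. It is not a 1.8 repro, only an illustration of "character by character" versus "byte by byte" matching.

```ruby
# Sketch for Ruby 1.9+ (assumption: $KCODE is not in play here).
utf8  = "\u00E9"   # one character (e with acute accent), two UTF-8 bytes
bytes = utf8.b     # the same two bytes, tagged as binary

puts utf8.scan(/./).length    # 1: matched as one UTF-8 character
puts utf8.scan(/./u).length   # 1: explicit UTF-8 flag, same result
puts bytes.scan(/./).length   # 2: matched byte by byte
```

In 1.8 the equivalent switch between the first and last result would come from $KCODE or an explicit /u or /n flag on the regexp.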

Ruby 1.9 support for encodings:

  •     Identifiers can be non-ANSI by default.
    

Ruby 2.0 support for encodings:

  •     Each string and symbol knows its own encoding, and
        String#force_encoding can change the encoding of an existing
        string.

  •     IO#encoding controls the encoding to use for reading/writing
        from disk.
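
A minimal sketch of the last two bullets, assuming a Ruby new enough to have per-string encodings (Tempfile.create needs 2.1 or later). As far as I know there is no method literally named IO#encoding; the accessors used below are IO#external_encoding and IO#internal_encoding. The temp file name is just a placeholder.

```ruby
require "tempfile"

utf8 = "caf\u00E9"        # \u escapes produce a UTF-8 string
puts utf8.encoding        # UTF-8
puts utf8.bytesize        # 5

# force_encoding relabels the existing bytes; it does not convert them.
raw = utf8.dup.force_encoding("ASCII-8BIT")
puts raw.encoding == Encoding::BINARY   # true
puts raw.bytes == utf8.bytes            # true
puts raw.length                         # 5 (counted in bytes now)

# encode actually transcodes the bytes.
latin1 = utf8.encode("ISO-8859-1")
puts latin1.bytesize      # 4

# "r:EXTERNAL:INTERNAL" tells the IO how to decode the file and what
# encoding to hand back to the program.
Tempfile.create("enc_demo") do |f|
  f.binmode
  f.write(latin1.b)
  f.flush
  File.open(f.path, "r:ISO-8859-1:UTF-8") do |io|
    puts io.external_encoding   # ISO-8859-1
    puts io.internal_encoding   # UTF-8
    puts io.read == utf8        # true
  end
end
```

The main thing to take away is the difference between force_encoding (same bytes, new label) and encode (new bytes).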


#2

On Fri, Feb 13, 2009 at 5:01 PM, Shri B.
removed_email_address@domain.invalid wrote:

Ruby 1.8 support for encoding:

· A comment like “# -*- coding: utf-8 -*-” at the start of the
file is supposed to determine how to parse a .rb file, but I haven’t really
figured out how to make this work. Non-ansi characters cause an error while
loading the file.

Did the utf-8 file(s) you tried have a BOM or not?

-Matthew


#3

If I use Notepad2’s menu to set the encoding to “UTF8 with signature”
and run either “ruby utf8_with_signature.rb” or “ruby -Ku
utf8_with_signature.rb”, the file fails to parse.

If I save the file with the encoding set to plain “UTF8”, the file is
3 bytes smaller, that is, it has no BOM. “ruby utf8.rb” fails, but
“ruby -Ku utf8.rb” works. With “-Ku”, things work even if I do not have
“# -*- coding: utf-8 -*-” in the file.

The repro files are attached.
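
For anyone wondering where the 3 bytes come from: the UTF-8 "signature" is the byte order mark EF BB BF written at the very start of the file. Below is a small sketch that checks for it on raw bytes. The helper name utf8_bom? and the sample source line are made up for illustration, and the code assumes Ruby 2.0+ for String#b.

```ruby
# The UTF-8 byte order mark ("signature") is the three bytes EF BB BF.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: true if the raw bytes start with a UTF-8 BOM.
def utf8_bom?(bytes)
  bytes.b.start_with?(UTF8_BOM)
end

plain  = "puts 'h\u00E9llo'\n".b   # file content saved as plain UTF-8
signed = UTF8_BOM + plain          # same content saved with signature

puts signed.bytesize - plain.bytesize   # 3
puts utf8_bom?(signed)                  # true
puts utf8_bom?(plain)                   # false
```

Running a check like this on the raw contents of the two repro files should tell them apart.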

From: removed_email_address@domain.invalid
[mailto:removed_email_address@domain.invalid] On Behalf Of Matthew Wilson
Sent: Friday, February 13, 2009 5:11 PM
To: removed_email_address@domain.invalid
Subject: Re: [Ironruby-core] $KCODE and encodings



#4

AFAIK Ruby 1.8 doesn’t support magic comments that specify encodings at
all; 1.9 does. Ruby 1.8 also doesn’t recognize a BOM.
Version 1.9 already has full encoding support, not just 2.0.
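
A quick way to see the magic comment at work on 1.9 or later is the sketch below. It assumes the file is saved as UTF-8 without a BOM and that the comment is on the first line (or the second, after a shebang). On Ruby 2.0+ the default source encoding is UTF-8 anyway, so the comment is redundant there.

```ruby
# -*- coding: utf-8 -*-
# __ENCODING__ reports the source encoding the parser used for this file.
puts __ENCODING__     # UTF-8

s = "héllo"           # non-ASCII literal, parsed using the source encoding
puts s.encoding       # UTF-8
puts s.length         # 5 characters
puts s.bytesize       # 6 bytes
```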

Tomas

From: removed_email_address@domain.invalid
[mailto:removed_email_address@domain.invalid] On Behalf Of Shri B.
Sent: Friday, February 13, 2009 3:01 PM
To: removed_email_address@domain.invalid
Subject: [Ironruby-core] $KCODE and encodings
