R1.9 mixed encoding in file

vo_x · August 7, 2009, 3:49pm

Hello

I wonder whether it is possible to enforce the encoding of a string in
Ruby 1.9. Let's say I have the following example:

C:\enc>echo p 'test'.encoding > encoding.rb
C:\enc>ruby encoding.rb
#<Encoding:US-ASCII>

That's fine. But what if I would like to have ASCII, UTF-8, or strings
in other encodings in a single file, i.e.

C:\enc>echo p 'zufällige_žluťoučký'.encoding > encoding.rb
C:\enc>ruby encoding.rb
encoding.rb:1: invalid multibyte char (US-ASCII)

I know that for this particular case I could use a directive at the top
of the file, but I would like to see something along these lines:

String.new 'zufällige_žluťoučký', Encoding.CP852

That is, read the content between the quotes as binary and interpret it
according to the specified encoding.

Vit

vo_x · August 7, 2009, 4:06pm

On Aug 7, 2009, at 8:49 AM, Vít Ondruch wrote:

Hello

Hello.

C:\enc>echo p 'zufällige_žluťoučký'.encoding > encoding.rb
according to specified encoding.

The problem with an idea like this is that before your String is ever
created, the code to create it must be read (correctly) by Ruby’s
parser and formed into a proper String literal. That would be
impossible to do if String literals could be in any random Encoding.

You have a couple of options though:

1. Just set an Encoding like UTF-8 for the source code, enter
   everything in UTF-8, and transcode it into the needed Encoding. This
   would make your example something like:

   # encoding: UTF-8

   cp852 = "zufällige_žluťoučký".encode("CP852")  # literal in UTF-8

2. Have one or more data files the program reads needed String objects
   from. Those files can be in any Encoding you need and you can specify
   it to IO operations, so your String objects are returned with that
   Encoding.
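
Both options can be sketched in a few lines. This is only an illustration, assuming a Ruby build whose transcoder knows CP852; the temporary file merely stands in for a real CP852 data file and is not part of the original suggestion.

```ruby
# encoding: UTF-8
require "tempfile"

# Option 1: write the literal in UTF-8 (the source encoding) and transcode it.
utf8  = "zufällige_žluťoučký"
cp852 = utf8.encode("CP852")

# Option 2: keep the text in a data file and name its encoding when reading.
# The temporary file below is a stand-in for a real CP852 data file.
from_file = nil
Tempfile.create("cp852_data") do |file|
  file.binmode
  file.write(cp852)
  file.flush
  from_file = File.read(file.path, encoding: "CP852")
end

p cp852.encoding
p cp852.bytesize        # CP852 is a single-byte encoding
p from_file == cp852    # same bytes, same encoding tag
```

Either way the resulting String carries the CP852 tag, while the source file itself stays in one encoding.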

I hope that helps.

James Edward G. II

vo_x · August 7, 2009, 4:47pm

James G. wrote:

On Aug 7, 2009, at 8:49 AM, Vít Ondruch wrote:

Hello

Hello.

C:\enc>echo p 'zufällige_žluťoučký'.encoding > encoding.rb
according to specified encoding.

The problem with an idea like this is that before your String is ever
created the code to create it must be read (correctly) by Ruby’s
parser and formed into a proper String literal. That would be
impossible to do if String literals could be in any random Encoding.

Yes, I understand that you have to parse the file. However, if I am
right, you still have to read the file as binary when you are looking
for an encoding directive at the top of the file. So from my point of
view, it shouldn’t be a big problem to read up to the first quote,
assuming the file is stored in the encoding declared at the top of the
file, then read whatever is between the quotes as binary and decide
later how to interpret that binary data, using the encoding given as the
second parameter of the string constructor.

You have a couple of options though:

1. Just set an Encoding like UTF-8 for the source code, enter
   everything in UTF-8, and transcode it into the needed Encoding. This
   would make your example something like:

   # encoding: UTF-8

   cp852 = "zufällige_žluťoučký".encode("CP852")  # literal in UTF-8

2. Have one or more data files the program reads needed String objects
   from. Those files can be in any Encoding you need and you can specify
   it to IO operations, so your String objects are returned with that
   Encoding.

Both of your suggestions are valid, of course, but I consider them far
from ideal. They bring far more complexity than desired.

I hope that helps.

James Edward G. II

Of course my idea could be considered naive, and there might be many
technical issues with the parser, etc., that prevent the implementation.
Nevertheless, it would be a nice feature.

Thank you for your suggestion anyway.

Vit

vo_x · August 7, 2009, 4:58pm

On Aug 7, 2009, at 9:47 AM, Vít Ondruch wrote:

created the code to create it must be read (correctly) by Ruby’s
parser and formed into a proper String literal. That would be
impossible to do if String literals could be in any random Encoding.

Yes, I understand that you have to parse the file. However, if I am
right, you still have to read the file binary in case you are looking
for some encoding directive on top of file.

You don’t really have to:

$ cat source_encoding.rb
# encoding: UTF-8

output = ""
open(__FILE__, "r:US-ASCII") do |source|
  first_line = source.gets
  if first_line =~ /coding:\s*(\S+)/
    source.set_encoding($1)
  else
    output << first_line
  end
  output << source.read
end
p [output.encoding, output[0...20] + "…"]
$ ruby_dev source_encoding.rb
[#<Encoding:UTF-8>, "\noutput = \"\"\nopen(__…"]

James Edward G. II

vo_x · August 7, 2009, 5:20pm

James G. wrote:

On Aug 7, 2009, at 9:47 AM, Vít Ondruch wrote:

You don’t really have to:

It is disturbing that this approach will fail as soon as the file is
UTF-16 encoded, or has a BOM for UTF-8, etc.

Vit

vo_x · August 7, 2009, 5:41pm

You are not allowed to set the source encoding to a non-ASCII
compatible encoding, if memory serves.

Where is it documented please?

That eliminates any issues
with encodings like UTF-16. This makes perfect sense as there’s no
way to reliably support the magic encoding comment unless we can count
on being able to read at least that far.

It should be said that XML parsers can handle such cases, i.e. when the
XML header is in a different encoding than the rest of the document.

A BOM could be handled similarly to what I showed. You need to open
the file in ASCII-8BIT and check the beginning bytes, then you could
switch to US-ASCII and finish reading the first line (or to the second
if a shebang line is included), then switch encodings again if needed
and finish processing.

Maybe this technique could be used for reading UTF-16 encoded files, if
needed? However, this is straying too far from my initial post.

James Edward G. II

Vit

vo_x · August 7, 2009, 5:30pm

On Aug 7, 2009, at 10:20 AM, Vít Ondruch wrote:

James G. wrote:

On Aug 7, 2009, at 9:47 AM, Vít Ondruch wrote:

You don’t really have to:

It is disturbing that this approach will fail as soon as the file is
UTF-16 encoded or it has BOM for UTF-8, etc.

You are not allowed to set the source encoding to a non-ASCII
compatible encoding, if memory serves. That eliminates any issues
with encodings like UTF-16. This makes perfect sense as there’s no
way to reliably support the magic encoding comment unless we can count
on being able to read at least that far.

A BOM could be handled similarly to what I showed. You need to open
the file in ASCII-8BIT and check the beginning bytes, then you could
switch to US-ASCII and finish reading the first line (or to the second
if a shebang line is included), then switch encodings again if needed
and finish processing.
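
A rough sketch of those steps follows. `read_source` is a hypothetical helper, not anything Ruby ships, and it simplifies one step: instead of switching the IO to US-ASCII for the first line, it keeps reading raw bytes and only tags the result at the end, which avoids errors if the first line holds non-ASCII bytes. It handles only the UTF-8 BOM and assumes UTF-8 when no magic comment is found.

```ruby
require "tempfile"

UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: returns the file's text (minus any UTF-8 BOM) tagged
# with the encoding named in its magic comment, or `default` without one.
def read_source(path, default: Encoding::UTF_8)
  File.open(path, "rb") do |io|             # "rb" yields ASCII-8BIT strings
    head = io.read(3)
    io.rewind unless head == UTF8_BOM       # not a BOM: those bytes are content

    # The magic comment sits on line one, or line two after a shebang line.
    lines = [io.gets.to_s]
    lines << io.gets.to_s if lines[0].start_with?("#!")

    name = lines.last[/coding[:=]\s*([\w.-]+)/, 1]
    encoding = name ? Encoding.find(name) : default
    (lines.join + io.read).force_encoding(encoding)
  end
end

Tempfile.create(["bom_demo", ".rb"]) do |file|
  file.binmode
  file.write(UTF8_BOM + "# encoding: CP852\np 'zuf\x84llige'\n".b)
  file.flush
  text = read_source(file.path)
  p text.encoding
  p text.valid_encoding?
end
```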

James Edward G. II

vo_x · August 7, 2009, 7:46pm

On Aug 7, 2009, at 10:41 AM, Vít Ondruch wrote:

You are not allowed to set the source encoding to a non-ASCII
compatible encoding, if memory serves.

Where is it documented please?

I’m not sure it’s officially documented yet.

Ruby does throw an error in this scenario though:

$ ruby_dev
# encoding: UTF-16BE
ruby_dev: UTF-16BE is not ASCII compatible (ArgumentError)

and:

$ ruby_dev -e 'puts "\uFEFF# encoding: UTF-16BE".encode("UTF-16BE")' |
ruby_dev
-:1: invalid multibyte char (UTF-8)

I believe this is the relevant code from Ruby’s parser:

static void
parser_set_encode(struct parser_params *parser, const char *name)
{
    int idx = rb_enc_find_index(name);
    rb_encoding *enc;

    if (idx < 0) {
        rb_raise(rb_eArgError, "unknown encoding name: %s", name);
    }
    enc = rb_enc_from_index(idx);
    if (!rb_enc_asciicompat(enc)) {
        rb_raise(rb_eArgError, "%s is not ASCII compatible",
                 rb_enc_name(enc));
    }
    parser->enc = enc;
}
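
The same two checks can be tried from Ruby itself. `check_source_encoding` below is a hypothetical helper written only to mirror the C function; `Encoding.find` and `Encoding#ascii_compatible?` are the real methods it leans on.

```ruby
# Hypothetical Ruby-level mirror of the two checks in the C function above.
def check_source_encoding(name)
  enc = Encoding.find(name)   # raises ArgumentError for an unknown name
  unless enc.ascii_compatible?
    raise ArgumentError, "#{enc.name} is not ASCII compatible"
  end
  enc
end

%w[UTF-8 CP852 Shift_JIS UTF-16BE UTF-32LE].each do |name|
  begin
    check_source_encoding(name)
    puts "#{name}: accepted"
  rescue ArgumentError => e
    puts "#{name}: rejected (#{e.message})"
  end
end
```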

That eliminates any issues
with encodings like UTF-16. This makes perfect sense as there’s no
way to reliably support the magic encoding comment unless we can count
on being able to read at least that far.

Needed to say that XML parsers can handle such cases, i.e. when xml
header is in different encoding than the rest of document.

I doubt we can say that universally.

Also, what you said isn’t very accurate. For example, “in different
encoding than the rest of document” is not a possible occurrence
according to the XML 1.1 specification (Extensible Markup Language
(XML) 1.1 (Second Edition)), which states:

“It is a fatal error when an XML processor encounters an entity with
an encoding that it is unable to process. It is a fatal error if an
XML entity is determined (via default, encoding declaration, or
higher-level protocol) to be in a certain encoding but contains byte
sequences that are not legal in that encoding.”

All XML parsers are required to assume UTF-8 unless told otherwise and
to be able to recognize UTF-16 by a required BOM. Beyond that, they
are not required to recognize any other encodings, though they may of
course. Their encoding declaration can be expressed in ASCII and,
since they assume UTF-8 by default, this is similar to what Ruby
does. It allows a switch to an ASCII-compatible encoding.

XML processors may do more. For example, they can accept a different
encoding from an external source to support things like HTTP headers
and MIME types. Ruby doesn’t really have access to such sources at
execution time, so that option doesn’t apply to the case we are
discussing. However, XML processors may also recognize other BOMs,
and Ruby could do this.

A BOM could be handled similarly to what I showed. You need to open
the file in ASCII-8BIT and check the beginning bytes, then you could
switch to US-ASCII and finish reading the first line (or to the second
if a shebang line is included), then switch encodings again if needed
and finish processing.

May be this technique could be used for reading UTF-16 encoded files,
if needed?

Yes, Ruby could recognize BOMs for non-ASCII compatible encodings to
support them. A BOM would be required in this case though, just as it
is in an XML processor that doesn’t have external information.

Ruby doesn’t currently do this, as near as I can tell.

Note that this would not give what you proposed in your initial
message: multiple encodings in the same file. Ruby doesn’t support
that and isn’t ever likely to. An XML processor that supports such
things is in violation of its specification as I understand it.

Besides, not many text editors that I’m aware of make it super easy to
edit in multiple encodings.

James Edward G. II

vo_x · August 7, 2009, 9:48pm

Vít Ondruch wrote:

I know that for this particular case I could use directive on top of the
file, but I would like to see something in following manner:

String.new 'zufällige_žluťoučký', Encoding.CP852

It’s not pretty, but

str = "zuf\x84llige_\xA7lu\x9Cou\x9Fk\xEC".force_encoding("CP852")

will probably do the job.
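
A quick check of what that line produces, as a sketch. It assumes a Ruby that can transcode CP852; the last part also assumes Ruby 2.3 or later, where `String.new` accepts an `encoding:` keyword, something that did not exist when this thread was written.

```ruby
bytes = "zuf\x84llige_\xA7lu\x9Cou\x9Fk\xEC"

# force_encoding only relabels the existing bytes; nothing is converted.
str = bytes.dup.force_encoding("CP852")
p str.valid_encoding?
p str.length

# encode, by contrast, converts the text into a new byte sequence.
utf8 = str.encode("UTF-8")
puts utf8
p utf8.bytesize

# Assumption: Ruby 2.3+, where String.new takes an encoding: keyword and
# tags the copied bytes without transcoding them.
same = String.new(bytes, encoding: "CP852")
p same == str
```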

vo_x · August 7, 2009, 7:49pm

On 8/7/09, Vít Ondruch wrote:

file, but I would like to see something in following manner:

String.new 'zufällige_žluťoučký', Encoding.CP852

You seem to be asking for the ability to have individual string
literals have encoding different from that of the program as a whole.
Why not this:

#encoding: ascii-8bit
'zufällige_žluťoučký'.force_encoding 'cp852'
'some utf8 data'.force_encoding 'utf-8'
'some sjis data'.force_encoding 'sjis'

I am far from an expert on encodings, but in my (admittedly minimalist
and perhaps inadequate) testing, this seems to basically work.

There are going to be holes in this; data in non-ASCII-compatible
encodings in particular may give trouble. However, if the string data
does not contain the bytes 0x27 (ascii ') or 0x5C (ascii \) there will
be no problem. Whether this will work in particular circumstances
given a known encoding and data to be represented in it is unknown in
general, but surely very often the case. If it’s the single quote
character that causes the problem, you can switch to a different
character using the %q[] quote syntax. In extremis, a single-quoted
here document may be called for:

<<-'end'
lotsa ' and \ here, but ruby don't care
end

This form of string has the advantage of having no special characters
at all, and you can choose the sequence of bytes that makes up the
string terminator to be anything you want. (but you do end up with an
extra (ascii) newline at the end…)
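
Whether a given string is exposed to the 0x27/0x5C problem can be checked ahead of time. The helper below is hypothetical and conservative: it flags any occurrence of either byte, even though a lone backslash is often harmless inside single quotes. The second half shows the single-quoted here document keeping both characters verbatim.

```ruby
# Hypothetical helper: true when `text`, once encoded as `encoding`, contains
# neither of the bytes that a single-quoted literal treats specially.
def safe_in_single_quotes?(text, encoding)
  (text.encode(encoding).bytes & [0x27, 0x5C]).empty?
end

p safe_in_single_quotes?("zufällige_žluťoučký", "CP852")
p safe_in_single_quotes?("ソ", "Shift_JIS")   # this character ends in byte 0x5C

# A single-quoted here document performs no escape processing at all.
raw = <<-'END'
lotsa ' and \ here, but ruby don't care
END
p raw
```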

Another challenge will be editing this file. There’s no editor out
there that could actually display this kind of thing correctly; you’ll
have to become proficient at editing it as binary, or at least find an
editor that can tolerate arbitrary binary chars in its ascii.

vo_x · August 7, 2009, 10:11pm

Caleb C. wrote:

On 8/7/09, Vít Ondruch wrote:

file, but I would like to see something in following manner:

String.new 'zufällige_žluťoučký', Encoding.CP852

You seem to be asking for the ability to have individual string
literals have encoding different from that of the program as a whole.
Why not this:

#encoding: ascii-8bit
'zufällige_žluťoučký'.force_encoding 'cp852'
'some utf8 data'.force_encoding 'utf-8'
'some sjis data'.force_encoding 'sjis'

Hmmm, that is a good idea!!!

Which leads me to the question: why is the default encoding US-ASCII
instead of ASCII-8BIT?

Another challenge will be editing this file. There’s no editor out
there that could actually display this kind of thing correctly; you’ll
have to become proficient at editing it as binary, or at least find an
editor that can tolerate arbitrary binary chars in its ascii.

It's almost the same challenge if you want to edit a single file in an
encoding different from your system encoding, so it's not relevant. On
the contrary, it could be even easier, because in my case I don’t care
much about the content, since I need multiple encodings for testing.