Unicode roadmap?

Hi,

In message “Re: Unicode roadmap?”
on Thu, 22 Jun 2006 08:46:08 +0900, “Michal S.”
[email protected] writes:

|I do not see how converting the strings on input will make the
|situation better than converting them later. The exact place where the
|text is garbled because it is converted incorrectly does not change
|the fact it is no longer usable, does it?

It does. But if you convert encodings lazily, you will have a hard time
tracking down the source of the offending data. It may be input data
from IO, or from some GUI toolkit, or the result of an operation over a
variety of sources.

|> Only in rare cases might there be a need to handle multiple encodings in
|> an application. I do want to allow it. But I am not sure how we can
|> help that kind of application, since they are fundamentally complex.
|> And we don’t have enough experience to design a framework for such
|> applications.
|
|I do not think it is that rare. Most people want new web (or any other)
|stuff in utf-8, but there is a need to interface legacy databases or
|applications. Sometimes converting the data to fit the new application
|is not practical. For one, the legacy application may still be in use as
|well.

I understand the challenge, but I don't think it is common to run one
part of your program in a legacy encoding (without conversion) and
another part in UTF-8. You need to convert them into a universal
encoding anyway in most cases. That's why I said it is rare.

						matz.

Hi,

In message “Re: Unicode roadmap?”
on Thu, 22 Jun 2006 02:17:53 +0900, “Dmitry S.”
[email protected] writes:

|Things shouldn’t be that complicated.

Agreed in principle. But it seems to be the fundamental complexity of a
world with multiple encodings. I don't think automatic conversion would
improve the situation. It would cause conversion errors almost
randomly. Do you have any idea to simplify things?

I am eager to hear.

						matz.

Hi,

In message “Re: Unicode roadmap?”
on Thu, 22 Jun 2006 15:55:18 +0900, “Lugovoi N.”
[email protected] writes:
|> I am eager to hear.
|
|So what will the semantics of the encoding tag be:
| a) a weak suggestion?
| b) a strong assertion?

Weak suggestion, if I understand you correctly.

|I’d prefer encoding tag as strong assertion, mostly for reliability reasons.

Hmm, your idea of combining a strong assertion with automatic
conversion seems too complex for me, but it may be worth considering.
Thank you for the idea.

|uhm, how to convert compiled extension library?

Every extension that does input/output needs to specify (either
explicitly or implicitly) the encoding it uses anyway. I will add
an encoding option to rb_tainted_str_new() and its family. If
possible, I'd like to allow extensions to declare their default
encoding in their initialization function (Init_xxx).

						matz.

2006/6/22, Yukihiro M. [email protected]:

randomly. Do you have any idea to simplify things?

I am eager to hear.

So what will the semantics of the encoding tag be:
a) a weak suggestion?
b) a strong assertion?

If the encoding tag is only a weak suggestion (and for now I see it will be
just that), it will imply:

  • a performance win (no need to check conformance to the declared encoding)
  • a win in having less complexity (most tasks use source code, text
    data input and output all in the same [default host] encoding)
  • portability drawbacks (assumptions made by the original coders will be
    implicit, but they have to be figured out when porting to another
    environment)
  • reliability drawbacks (weak suggestions are too often ignored, and
    you don’t know when, where and why they will hit your app, but someday
    they will!)

If the encoding tag is a strong assertion, it will imply:

  • a probable performance loss:
    • assuring that a string tagged with encoding = "none" (raw) actually
      represents a valid byte sequence in an asserted encoding costs about
      as much as String#length
    • a need to recode the bytes when changing the tag
  • slightly more complexity (developers will have to declare these
    assertions explicitly)
  • a portability win
  • a reliability win

What compromise on these issues would be acceptable?

I’d prefer encoding tag as strong assertion, mostly for reliability
reasons.
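
To make the difference concrete, here is a tiny hypothetical sketch; the
tagging methods are invented for illustration only and are not proposed API:

raw = "\xC3\x28"                    # not a valid UTF-8 byte sequence

# Weak suggestion: the tag is attached without any conformance check,
# so the damage surfaces later, far away from its source.
s = raw.with_encoding("UTF-8")      # hypothetical call; succeeds silently
s.length                            # fails or garbles only here

# Strong assertion: conformance is verified when the tag is applied.
s = raw.with_encoding!("UTF-8")     # hypothetical call; raises IncompatibleCharError at once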

And for operations on Strings with different encodings, I’d like
implicit automatic encoding coercion:

NOTES:

a) String#recode!(new_encoding) replaces the current internal byte
   representation with a new byte sequence, recoded from the current one.
   It must raise IncompatibleCharError if a char can't be converted to the
   destination encoding.

b) downgrading a string from some stated encoding to the "none" tag must
   be done only explicitly; it is not an option for implicit conversion.

c) $APPLICATION_UNIVERSAL_ENCODING is a global var, allowed to be set once
   and only once per application run.
   Intent: we want all strings which aren't raw bytes to be in one single
   predefined encoding, so all operations on strings must return strings
   in the conformant encoding. The desired encoding is the value of
   $APPLICATION_UNIVERSAL_ENCODING. If $APPLICATION_UNIVERSAL_ENCODING is
   nil, we go into "democracy mode", see below.

def coerce_encodings(str1, str2)
  enc1 = str1.encoding
  enc2 = str2.encoding

  # simple case, same encodings, will return fast in most cases
  return if enc1 == enc2

  # another simple but rare case: totally incompatible encodings, as they
  # represent incompatible charsets
  if fully_incompatible_charsets?(enc1, enc2)
    raise IncompatibleCharError,
          format("incompatible charsets %s and %s", enc1, enc2)
  end

  # uncertainty: handling "none" together with a preset encoding
  if enc1 == "none" || enc2 == "none"
    raise UnknownIntentEncodingError,
          format("can't implicitly coerce encodings %s and %s, use explicit conversion",
                 enc1, enc2)
  end

  # Tyranny mode:
  # we want all strings which aren't raw bytes to be in one single
  # predefined encoding
  if $APPLICATION_UNIVERSAL_ENCODING
    str1.recode!($APPLICATION_UNIVERSAL_ENCODING)
    str2.recode!($APPLICATION_UNIVERSAL_ENCODING)
    return
  end

  # Democracy mode:
  # first try to perform a non-loss conversion from one encoding to the other

  # 1) direct conversion, without loss, to the other encoding,
  #    e.g. UTF-8 <-> UTF-16
  if exists_direct_non_loss_conversion?(enc1, enc2)
    if exists_direct_non_loss_conversion?(enc2, enc1)
      # performance hint if both directions are available
      if str1.byte_length < str2.byte_length
        str1.recode!(enc2)
      else
        str2.recode!(enc1)
      end
    else
      str1.recode!(enc2)
    end
    return
  end
  if exists_direct_non_loss_conversion?(enc2, enc1)
    str2.recode!(enc1)
    return
  end

  # 2) non-loss conversion to a superset
  # (I see no reason to raise an exception on KOI8-R + CP1251; returning a
  # string in Unicode will be OK)
  if (superset_encoding = find_superset_non_loss_conversion?(enc1, enc2))
    str1.recode!(superset_encoding)
    str2.recode!(superset_encoding)
    return
  end

  # A case for incomplete compatibility:
  # check whether a subset of enc1 is also a subset of enc2, so some strings
  # in enc1 can be safely recoded to enc2, e.g. two pure ASCII strings,
  # whatever ASCII-compatible encodings they have
  if exists_partial_loss_conversion?(enc1, enc2)
    if exists_partial_loss_conversion?(enc2, enc1)
      # performance hint if both directions are available
      if str1.byte_length < str2.byte_length
        str1.recode!(enc2)
      else
        str2.recode!(enc1)
      end
    else
      str1.recode!(enc2)
    end
    return
  end

  # the last thing we can try
  str2.recode!(enc1)
end

So, when an operation involves two Strings, or a String and a Regexp, with
different encodings, automatic coercion should be done as described
above.
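
As a hypothetical usage example (the tagging calls are invented; the point
is only that the programmer writes a plain concatenation and the coercion
above runs implicitly):

a = "Hello, ".with_encoding("ISO-8859-1")              # hypothetical tagging
b = "\xD0\xBC\xD0\xB8\xD1\x80".with_encoding("UTF-8")  # the bytes of "мир"

c = a + b   # coerce_encodings(a, b) runs first; ISO-8859-1 converts to UTF-8
            # without loss, so the result would carry the UTF-8 tag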

That will probably solve the coding problems (no need to think about
encodings most of the time), but it can have the following impacts:

  1. after several operations, when one sends a string to external IO, it
     might be internally encoded in a superset of that IO's encoding. One
     has to remember that and perform the external IO accordingly, i.e.
     resolve to fail on invalid chars or to use replacement chars (like
     U+FFFD), but that is unavoidable.
  2. some performance hits, which I expect to be rare.

Besides, there can be another class of problems with automatic
coercion: how do we ensure consistent behaviour of character ranges in
Regexps and in String methods like count, delete, squeeze, tr, succ, next
and upto when encodings are coerced?
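
One concrete instance of that worry (plain Ruby; the encoding facts are
real, the implicit recoding scenario is the hypothetical part):

# A character class written with CP1251 in mind, where the lowercase
# Cyrillic letters а..я occupy one contiguous, alphabetically ordered
# byte range:
lowercase_cyrillic = /[а-я]+/

# If the subject string is implicitly recoded to KOI8-R (whose Cyrillic
# letters are NOT laid out in alphabetical byte order) or to UTF-8 (where
# each letter is two bytes), the same source range covers a different set
# of bytes. The identical question arises for String#count, #delete,
# #squeeze, #tr, #succ, #next and #upto.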

What I, as a Ruby user, wish for Unicode/M17N support:

  1. reliability and consistency:
    a) String should be an abstraction for a character sequence;
    b) String methods shouldn’t allow me to garble the internal
    representation;
    c) treating a String as a byte sequence is handy, but must be
    explicitly stated.
  2. coding comfort:
    a) no need to care what encodings strings have while working with
    them;
    b) no need to care what encodings the strings returned from
    third-party code have;
    c) using explicitly stated conversion options for external IO.
  3. on Unicode and i18n: at least to have a set of classes for
    Unicode-specific tasks (collation, normalization, string search,
    locale-aware formatting etc.) that would efficiently work with Ruby
    strings.

And, to everyone out there, just ask "Which charset/encoding will fit all
the [present and future] needs?" You know the exact answer: "NONE".

I understand the challenge, but I don't think it is common to run one
part of your program in a legacy encoding (without conversion) and
another part in UTF-8. You need to convert them into a universal
encoding anyway in most cases. That's why I said it is rare.

uhm, how to convert compiled extension library?

On 6/22/06, Yukihiro M. [email protected] wrote:

Weak suggestion, if I understand you correctly.

|I’d prefer encoding tag as strong assertion, mostly for reliability reasons.

Hmm, your idea of combining a strong assertion with automatic
conversion seems too complex for me, but it may be worth considering.
Thank you for the idea.

What I had in mind was much simpler. If the strings do not match, just
try to recode to the default encoding, which would be Unicode most of
the time. Or just try to find a superset.

|uhm, how to convert compiled extension library?

Every extension that does input/output needs to specify (either
explicitly or implicitly) the encoding it uses anyway. I will add
an encoding option to rb_tainted_str_new() and its family. If
possible, I'd like to allow extensions to declare their default
encoding in their initialization function (Init_xxx).

But if recoding is not automatic you still have to recode the strings
manually, both the input to the extension and the results. That is an
annoyance and repetitive code everywhere.

Thanks

Michal

On Wed, Jun 21, 2006 at 01:04:55AM +0900, Tim B. wrote:

details and, when you were ready to output, allowed you to say “Give
me that in ISO-8859 or UTF-8 or whatever”. -Tim

That’s what I suggested basically. The problem seems to be mainly
non-Unicode demands on the one hand, and performance issues on the other.
And it makes Strings useless as byte buffers, since you have to specify
the encoding of the external representation you create the String from at
creation time. To recap:

Private extensions to Unicode are deemed too complex to implement
(Matz).

Transforming legacy or special (non Unicode) data to a ruby-private
internal storage format on I/O is too performance/space intensive
(Matz).

Strings as byte buffers are important to some people, and they don’t
want to use another class or array for it, even if RegExp et al would
be extended to handle these too.

While it would be proper OO design, encapsulating the internal String
implementation hampers direct access to the “raw” data for C-hackers,
creating unwanted hurdles, and again performance issues.

I am still not convinced the arguments against this approach really
will hold in the long run, but since I am not the one implementing it
and can’t really participate there due to language barriers, I can
only lean back and wait for the first release of M17N. Learning
English was hard enough for me.

-Jürgen

Yukihiro M. wrote:

Alright, then what specific features are you (both) missing? I don’t
think it is a method to get the number of characters in a string. It
can’t be THAT crucial. I do want to cover “your missing features” in
the future M17N support in Ruby.

Sorry for maybe getting into this, but here are my 5 cents. When I first
found out about Ruby, I practically fell in love with the
language. Unfortunately, after some studying and experimenting I
suddenly found that it lacks proper Unicode support on win32, in
particular with file IO and OLE automation, i.e. in the two cases where I
had to interoperate with the rest of the world. Win32 really differs
from Linux and maybe other Unixes in its API, because in *nix you don’t
have to worry about Unicode/whatever, since all of the system depends on
your current locale. In Win32 there are two sets of APIs, ANSI and
Unicode; maybe that was a bad decision by Microsoft, but that’s the
reality. Now, I am Russian, and when I write scripts I have to worry
that not only Russian characters don’t get messed up, but characters of
other languages as well. So if I receive, say, an Excel file with a
lot of languages in it, and I have to process that file somehow, I have
to be sure that no letters will be lost or messed up; thus converting
it to the current codepage (1251) is not an option for me. The same goes
for filenames: the fact that I’m running Russian WinXP doesn’t mean that I
have only filenames that fall in the 1251 codepage. I also have filenames
with European characters (umlauts and such), as well as Japanese, and
when I want to write a script that processes these files, I have to
be able to work with them. At that time this caused me to move to Tcl
(it has utf-8 encoding everywhere, and it converts to the required
encoding when interoperating with the world). Since then I’m still waiting
for proper Unicode support in Ruby (read: proper interoperability with the
operating system and its components using the Unicode API versions, the
ones ending with W) and maybe a way to define in which locale (specific
code page, utf-8, etc.) the current script is running.

Hope that clarifies what is currently missing for me (and maybe others,
I don’t know).

On 22.6.2006, at 10:17, Yukihiro M. wrote:

|I’d prefer encoding tag as strong assertion, mostly for
reliability reasons.

Hmm, your idea of combining a strong assertion with automatic
conversion seems too complex for me, but it may be worth considering.

Strong assertion + auto conversion is the only solution which will
relieve programmers from manually checking/changing string encodings
in their programs.

Remember, the string input/output points in a program are not only the
system IO classes, but also all the third-party libraries/classes which
deal with strings, i.e. most of the existing Ruby libraries and external
(e.g. Java) libraries which can be used from Ruby.

The assumption that only system IO is the entry/exit point for string
encoding is very wrong. This assumption holds only for scripts which
use no third party libraries.

So we have two possibilities:
a) every programmer is forced to implement the above solution in
every program (this is starting to happen already, and current
experience tells us that the future in this direction is a disaster!)
b) the Ruby interpreter implements this solution, and programmers happily
ignore all the complexity.

So, it is true that we move the complexity into Ruby, but this is
(IMHO) much less complicated and much more needed than e.g.
infinitely big integers which we already have.

If Ruby wants to move forward, it needs transparent String support
and hopefully separation of String and ByteArray, since this un-
separation brought us code which is mostly wrong (currently most of
existing Ruby code breaks if string encoding is honoured, as can be
seen from experience of brave people who modified String class).

Ruby is my favourite language, and if it had String support as
suggested, software development would be just pure joy…

Please listen to the people who tell of disastrous experiences in
other languages. And as for good experience, I have developed in Cocoa on
Mac OS X for many, many years, and it has a great String class (ok, the
suggested Ruby class would be even better, but still). Plus it has
separate String and Byte array classes. The results are superb. There are
no problems, and nobody ever worries about strings and encodings. Ever.
You can check the mailing lists.

izidor

On 25-jun-2006, at 19:18, Izidor J. wrote:

Please listen to the people who tell of disastrous experiences in
other languages. And as for good experience, I have developed in Cocoa on
Mac OS X for many, many years, and it has a great String class (ok, the
suggested Ruby class would be even better, but still). Plus it has
separate String and Byte array classes. The results are superb. There
are no problems, and nobody ever worries about strings and encodings.
Ever. You can check the mailing lists.

The greatest thing about Cocoa is that I can expect that 99 percent
of the programs I use do The Right Thing when I want to input Russian
text there, and NOT because the programmer did something special to
make it work. Because if he had to, he wouldn’t. In contrast, 70
percent of Carbon applications are not even capable of displaying the
text properly (let alone letting me type it in).

On 6/25/06, Yukihiro M. [email protected] wrote:

Or would using the Win32 APIs ending with W allow you to live in
Unicode?

Matz,

I’ve mentioned it before, but I will be happy to make the Windows APIs
work with Unicode once the m17n Strings exist. Yes, I will be making
them use either UTF-8 (conversion required, most likely to be compatible
with existing code) or UTF-16 (no conversion required). It will work
well: I have done a similar implementation for code that I have written
at work.

-austin

Hi,

In message “Re: Unicode roadmap?”
on Sun, 25 Jun 2006 23:41:48 +0900, Snaury M. [email protected]
writes:

|Hope that clarifies what is currently missing for me (and maybe others,
|I don’t know).

Unfortunately, not. I understand Russian people having problems with
multiple encodings, but I don’t know how we can help you.

You said Tcl has Unicode support that works well for you, so I
think treating all of it as UTF-8 is OK for you. Then how can it
determine what should be in the current code page, and what in Unicode?
Or would using the Win32 APIs ending with W allow you to live in
Unicode?

						matz.

On 6/25/06, Austin Z. [email protected] wrote:

Ruby does not need a String with an internal representation in Unicode;
Ruby does not need a separate byte vector. An unencoded string can be
treated as a byte vector with no problems; if it is determined to have
textual meaning, it can be tagged with an encoding very simply and from
that point be treated as a meaningful string. There are times when the
encoding is not best treated in Unicode, especially if there are
potential conversion errors.

When is a ByteArray not a ByteArray? When is a String not a String? Is it
correct to mingle the two concepts perpetually, when they each have fairly
specific definitions? My problem with continuing to treat String as a byte
vector is that it forces two somewhat incompatible concepts on the same
class and the same methods. If you can use a String as both a byte vector
and as a sequence of characters by calling the same methods, then setting
or clearing encoding suddenly has the side-effect of changing how elements
of String are to be treated. If you are providing separate methods for
working with bytes as opposed to working with characters, then you are
already splitting the two concepts.
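
A hypothetical sketch of that "two behaviours on one class" shape (the
byte_* method names here are invented for illustration, not existing or
proposed API):

s = "héllo"        # imagine this tagged as UTF-8 under m17n

# character semantics
s.length           # => 5 characters
s[1]               # => "é"

# byte-vector semantics on the very same object
s.byte_length      # => 6 bytes
s.byte_at(1)       # => 0xC3, the first byte of the two-byte "é"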

(As an aside, does it make sense that I read from a binary file into a
String? Can I reliably assume that binary content in a String should be
logically manipulable as text strings are? Should my binary String work
anywhere and everywhere a text-based String does? I would think that
binary content neither walks nor quacks like a String.)

By your definition, a String can be treated as a ByteArray so long as its
internal string does not have an encoding. What do I use if I want to have
an encoding and still use byte vector semantics?

Is it appropriate that a String is no longer usable as a ByteArray as a
result of changing some state? If there exists any state where String
cannot be logically treated as a byte array, then String != ByteArray in
the general case either. The encoding of a String’s internal
representation should not dictate the outward behavior of the String.

If, however, you completely separate the two concepts, there’s no
dichotomy. In that case, a String deals with characters, and you do not
have guarantees about byte-boundaries or indexed elements. You only have
guarantees about characters, as it should be. Simultaneously, ByteArray
would allow you to always work with a vector (array) of bytes, regardless
of what those bytes contain.

I’ll end it off saying this: I think it’s a no-brainer that for dealing
with streams of bytes, there should be a non-string byte vector class. If
folks are insistent on keeping them the same class, you can’t logically
continue to call it a String and have it fulfill the dual purposes of byte
vector and character vector at the same time. If you plan to provide
methods for supporting both behaviors, you’re putting two distinct
behaviors into the same type.

I understand the unwillingness to move away from String as a byte vector,
but with multibyte support coming you really can’t have String == ByteArray
without causing problems somewhere. They simply don’t have the same
behavior, and trying to pretend they do is asking for trouble.

On 25.6.2006, at 21:12, Austin Z. wrote:

String and ByteArray.
Well, if it is a byte array, it is not a String (an array of
characters), is it?

If Ruby had RegEx operations on byte arrays, there would be no
need for an untyped quasi-String. An API that has two incompatible things
as one class is just plain ugly and wrong.

Reading a jpeg image into a String is totally wrong. You need bytes. You
get characters, but they aren’t really characters, they are bytes.
Until something happens (maybe) and they are characters (maybe), or
they are not (maybe). img_var[5] is what? 6th byte? 9th 2 bytes if
encoding is utf8? What exactly? Is this a clear API? There is no need
for bytes masquerading as Strings. None. This practice just confuses
the writer and the reader of the code. You need either bytes or
Strings. Never both in the same variable. They are semantically
totally different. At least they should be (we would not have
problems if people honoured this distinction).

Please don’t try to assume that the problem is this completely
unnecessary division. The problem is that existing strings are
completely unencoded and have no way of being flagged with an encoding
that is supported in any way across all of Ruby.

The problem is exactly this: the separation between bytes and
characters. This is the general problem we have and discuss right
now. API should help us solve the problem.

And you apparently missed all the attempts to extend String (also
with encodings a la 1.9) that failed because of existing software,
not because of Ruby.

Ruby does not need a String with an internal representation in
Unicode;

Nobody is saying at this point of the conversation that we need an
internal representation in Unicode for all strings. We just want to avoid
thinking about ANY encoding. We have other things to do. So having
transparent conversions between compatible encodings is a must.

Ruby does not need a separate byte vector. An unencoded string can be
treated as a byte vector with no problems; if it is determined to have
textual meaning, it can be tagged with an encoding very simply

It can be, but it is not and will not be. Do you read emails? The
problem is that people do not do things like that. And then other
people have problems. If all the code you run is yours, then you are
right. For many people that is not true.

There are times when the
encoding is not best treated in Unicode, especially if there are
potential conversion errors.

Why do you keep on about this?

Once again - WE DO NOT CARE WHAT ENCODING IS THERE. We just want the
string operations to work without any extra programming work when
operands have compatible encodings.

As written very well by Lugovoi N.:

b) no need to care what encodings the strings returned from
third-party code have;
c) using explicitly stated conversion options for external IO.
3) on Unicode and i18n: at least to have a set of classes for
Unicode-specific tasks (collation, normalization, string search,
locale-aware formatting etc.) that would efficiently work with Ruby
strings.

Me too, please.

izidor

On 6/25/06, Izidor J. [email protected] wrote:

If Ruby wants to move forward, it needs transparent String support and
hopefully separation of String and ByteArray, since this un-
separation brought us code which is mostly wrong (currently most of
existing Ruby code breaks if string encoding is honoured, as can be
seen from experience of brave people who modified String class).

This is an incorrect and unsupportable statement. It is completely
unnecessary to separate unencoded (e.g., binary) String support into
String and ByteArray.

Please don’t try to assume that the problem is this completely
unnecessary division. The problem is that existing strings are
completely unencoded and have no way of being flagged with an encoding
that is supported in any way across all of Ruby.

People are making really stupid assumptions based on what choices
other development teams have made, and it’s irritating.

Ruby does not need a String with an internal representation in Unicode;
Ruby does not need a separate byte vector. An unencoded string can be
treated as a byte vector with no problems; if it is determined to have
textual meaning, it can be tagged with an encoding very simply and from
that point be treated as a meaningful string. There are times when the
encoding is not best treated in Unicode, especially if there are
potential conversion errors.

-austin

On 6/25/06, Izidor J. [email protected] wrote:

This is an incorrect and unsupportable statement. It is completely
unnecessary to separate unencoded (e.g., binary) String support into
String and ByteArray.

Well, if it is a byte array, it is not a String (an array of
characters), is it?

If Ruby had RegEx operations on byte arrays, there would be no
need for an untyped quasi-String. An API that has two incompatible things
as one class is just plain ugly and wrong.

Here you contradict yourself. Regexes are string (character)
operations, and you want them on byte arrays. So the concepts aren’t
really separate. Similarly, when you read part of a file and use it
to determine what kind of file it is, you do not want to convert that
part into another class or re-read it because somebody decided String
and ByteVector are separate.

Plus this has already been mentioned here.

Michal

Here you contradict yourself. Regexes are string (character)
operations, and you want them on byte arrays. So the concepts aren’t
really separate. Similarly, when you read part of a file and use it
to determine what kind of file it is, you do not want to convert that
part into another class or re-read it because somebody decided String
and ByteVector are separate.

Why not? When I read CGI params I get them as strings, but if I want
to add them together I need to convert them to integers, because
someone decided that “1” != 1. This is a good thing, so you don’t get
“5 purple elephants”+“3 monkeys” = 8, like you do in PHP. Likewise,
when you read from a file/socket/whatever you might not be getting a
real string, you might be getting a byte array. They are fundamentally
different things, a byte array may happen to contain text at some
point, but some time later it may be just a stream of data. Conversely
a String always contains human-readable text in whatever encoding you
want.

As someone who has to work with Unicode in PHP, I’d say it’s important
to separate the types. If you want to display something to a user you
have to know what it is, but when you’re reading a file you don’t
care, unless you know what’s in it.

A Unicode String could be a subclass of the byte array with some
niceties for dealing with multibyte characters. Just a thought.

Hi,

In message “Re: Unicode roadmap?”
on Mon, 26 Jun 2006 05:38:46 +0900, “Charles O Nutter”
[email protected] writes:

|When is a ByteArray not a ByteArray? When is a String not a String? Is it
|correct to mingle the two concepts perpetually, when they each have fairly
|specific definitions? My problem with continuing to treat String as a byte
|vector is that it forces two somewhat incompatible concepts on the same
|class and the same methods.

A string is a sequence of data that can be represented by small
integers. Some may want to treat them as CharacterStrings, others may
want to treat them as ByteStrings. They are not as different as you say.
On many platforms, a file can contain text data or binary data. Is a
chunk of data read from an open file text, or binary? If you
separate ByteArray and (Character) String, you will need to have two
separate IO classes, BinaryIO and TextIO, etc. Or you will need an
explicit conversion from the read ByteArray to a CharacterString. That
makes Ruby programs look a lot like Java programs, which I don’t want
them to be.

One of the good properties of the Ruby class library is its small number
of classes. A class might have multiple roles. For example, a Ruby
Array can be treated as a Stack, a Queue, etc. And that is a good thing,
rather than having separate classes for each role. Why can’t Strings
be both sequences of text and binary data?
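
The Array point in plain, current Ruby, for illustration: one class serves
two roles depending only on which methods you call.

stack = []
stack.push(1); stack.push(2)
stack.pop     # => 2  (last in, first out)

queue = []
queue.push(1); queue.push(2)
queue.shift   # => 1  (first in, first out)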

						matz.

On Jun 25, 2006, at 1:45 PM, Izidor J. wrote:

Well, if it is a byte array, it is not a String (an array of
characters), is it?

+1 to this and to Nutter previously. Text strings and byte arrays
are different kinds of things and both are useful and I don’t see any
benefit from trying to pretend they’re the same thing. But some
apparently-smart people seem to think there is a benefit; perhaps
they could explain it in simple terms for those of us insufficiently-
clued to see it? -Tim

On 6/25/06, Izidor J. [email protected] wrote:

Well, if it is a byte array, it is not a String (an array of
characters), is it?

It could be indistinguishable from such. Even a Unicode string is
ultimately an array of bytes in memory. It just happens that there’s a
higher level abstraction that can be used to interpret that particular
array of bytes. What you’re asking for is rather like the difference
between std::string and std::vector. They represent the
same thing, but don’t work the same. If you’re going to have a String
and ByteVector that work the same (except that the String also has the
higher-level interpretation of characters), is it meaningfully a
different object?

I think not. Indeed, I think that having a separate object for these
would increase the overall complexity and reduce the usability overall.

Please don’t try to assume that the problem is this completely
unnecessary division. The problem is that existing strings are
completely unencoded and have no way of being flagged with an
encoding that is supported in any way across all of Ruby.
The problem is exactly this: the separation between bytes and
characters. This is the general problem we have and discuss right now.
API should help us solve the problem.

And you apparently missed all the attempts to extend String (also with
encodings a la 1.9) that failed because of existing software, not
because of Ruby.

Excuse me? You don’t know what you’re talking about here. No existing
version of Ruby has a String with encodings. Not even Ruby 1.9. Any
extension which tries to do this will fail because there is no way to
enforce this extension’s semantics on all of Ruby and all extensions.
Ruby 1.9 will be different because the m17n String will be a guaranteed
behaviour in Ruby.

The problem is not the separation between bytes and characters, but
that there’s no way in Ruby to distinguish between the two, at least
not reliably.

Ruby does not need a String with an internal representation in
Unicode;
Nobody is saying at this point of the conversation that we need an
internal representation in Unicode for all strings. We just want to avoid
thinking about ANY encoding. We have other things to do. So having
transparent conversions between compatible encodings is a must.

I think that you’re confusing me with someone else. Most people who have
advocated a separate ByteVector have been unable to articulate exactly
what this would buy us, and most have also advocated an internal Unicode
representation of Strings. I have been one of the ones who have
advocated transparent conversions all along. Frankly, with coercion, it
would be possible to upconvert to a compatible conversion between any
encoding.

Ruby does not need a separate byte vector. An unencoded string can be
treated as a byte vector with no problems; if it is determined to
have textual meaning, it can be tagged with an encoding very simply
It can be, but it is not and will not be. Do you read emails? The
problem is that people do not do things like that. And then other
people have problems. If all the code you run is yours, then you are
right. For many people that is not true.

“Is not” is a useless term. OF COURSE IT ISN’T – right now. In the
future, with the m17n Strings, it could be – and would be. And yes, I
have read every single one of these emails about Unicode. Most of them
have been ignorant of anything but their own narrow needs and clueless
about good API design.

There are times when the encoding is not best treated in Unicode,
especially if there are potential conversion errors.
Why do you keep on about this?

Once again - WE DO NOT CARE WHAT ENCODING IS THERE. We just want the
string operations to work without any extra programming work when
operands have compatible encodings.

I suggest you look through the Unicode threads again. You’ll find
your statement is untrue. There are a lot of people who (foolishly) want
Unicode to be the only internal representation of Strings in Ruby.

As written very well by Lugovoi N.:

What I, as a Ruby user, wish for Unicode/M17N support:

  1. reliability and consistency:
    a) String should be an abstraction for a character sequence;
    b) String methods shouldn’t allow me to garble the internal
    representation;
    c) treating a String as a byte sequence is handy, but must be
    explicitly stated.

An unencoded – raw – String would be only interpretable as a byte
sequence unless “recoded.” Aside from that, everything said above would
be true.

  2. coding comfort:
    a) no need to care what encodings strings have while working with
    them;
    b) no need to care what encodings the strings returned from
    third-party code have;
    c) using explicitly stated conversion options for external IO.

You’ll always need to care, even if you’re using Unicode. You can’t
not care and claim to be doing Unicode or m17n work. We can reduce
those concerns, but you CANNOT be ignorant of this at any time.

  3. on Unicode and i18n: at least to have a set of classes for
    Unicode-specific tasks (collation, normalization, string search,
    locale-aware formatting etc.) that would efficiently work with Ruby
    strings.

Me too, please.

That would be useful.

-austin

Sorry, but “reading” CGI params is a red herring. You may get it as one
thing and then convert it to something else.

Exactly.

Likewise, when you read from a file/socket/whatever you might not be
getting a real string, you might be getting a byte array. They are
fundamentally different things, a byte array may happen to contain
text at some point, but some time later it may be just a stream of
data. Conversely a String always contains human-readable text in
whatever encoding you want.

Okay. What class should I get here?

data = File.open("file.txt", "rb") { |f| f.read }

A byte vector. Unknown input, so you just get a stream of bytes.

Under the proposal of the people who want separate ByteVector and String
classes, I’ll need two APIs:

st = File.open("file.txt", "rb") { |f| f.read_string }
bv = File.open("file.txt", "rb") { |f| f.read_bytes }

Why? This looks needlessly complex.

string = File.open('file.txt', 'r') { |f| f.read.to_s(:utf8) }

Or possibly
string = File.open('file.txt', 'r') { |f| f.read(:utf8) }
bytes  = File.open('file.txt', 'r') { |f| f.read(:bytearray) }

with no argument assuming the default encoding. But with this
approach the same class could be used for both, which takes us full
circle ;)

As someone who has to work with Unicode in PHP, I’d say it’s important
to separate the types. If you want to display something to a user you
have to know what it is, but when you’re reading a file you don’t
care, unless you know what’s in it.

The problem here is not unification. The problem here is that PHP is
stupid. It is generally recognised that Ruby’s API decisions are much
smarter than those of most other languages, and this is a good example of
where that would show.

Hence why I’m using Ruby, but I’m paid for PHP. Ruby is by far the
nicer language.

The best approach to my untrained eye would be for some sort of global
setting for all libraries to operate on, and the developer has to
ensure that all data are read in that encoding. Hopefully it will make
dealing with legacy data easier. The ideal situation would be
for everything to be in one encoding, but that just doesn’t happen.
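
For comparison, Ruby 1.8 already ships a crude global of roughly this kind,
though it only covers regexp behaviour plus a few helper methods:

$KCODE = 'u'        # 1.8 global switch: treat strings/regexps as UTF-8
require 'jcode'     # adds multibyte-aware helpers to String

"résumé".length     # => 8  (still the byte count)
"résumé".jlength    # => 6  (character count via jcode)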