Unicode roadmap?

On 6/25/06, Phillip H. [email protected] wrote:

Here you contradict yourself. Regexes are string (character)
operations, and you want them on byte arrays. So the concepts aren’t
Similarly, when you read part of a file, and use it to determine
what kind of file it was you do not want to convert that part into
another class or re-read it because somebody decided String and
ByteVector are separate.
Why not? When I read CGI params I get them as strings, but if I want
to add them together I need to convert them to integers, because
someone decided that “1” != 1. This is a good thing, so you don’t get
“5 purple elephants”+“3 monkeys” = 7, like you do in PHP.

Sorry, but “reading” CGI params is a red herring. You may get it as one
thing and then convert it to something else.

Likewise, when you read from a file/socket/whatever you might not be
getting a real string, you might be getting a byte array. They are
fundamentally different things, a byte array may happen to contain
text at some point, but some time later it may be just a stream of
data. Conversely a String always contains human-readable text in
whatever encoding you want.

Okay. What class should I get here?

data = File.open("file.txt", "rb") { |f| f.read }

Under the people who want separate ByteVector and String class, I’ll
need two APIs:

st = File.open("file.txt", "rb") { |f| f.read_string }
bv = File.open("file.txt", "rb") { |f| f.read_bytes }

Stupid, stupid, stupid, stupid. If I have guessed wrong about the
contents of file.txt, I have to rewind and read it again. Better to
always read as bytes and then say, “this is actually UTF-8”. This
would be as stupid in C++, Java, or C#:

class File
{
bool read(string& st);
bool read(byte_vector& bv);
}

Yes, I can’t actually read into the item, but have to call an accessor.
Moronic design, mostly because I can’t do:

class File
{
string read(void);
byte_vector read(void);
}

That would help in static languages, but they can’t do that – and Ruby
can’t do it either, since variables are just labels.
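The "always read as bytes and then say, this is actually UTF-8" workflow argued for above can be sketched in later Ruby versions (1.9 and up), where a String carries an encoding tag. This is only an illustration written with hindsight: `read_then_tag` is a hypothetical helper, and `File.binread`, `String#force_encoding` and `String#valid_encoding?` were not available in the Ruby 1.8 under discussion.

```ruby
require "tempfile"

# Read the raw bytes once, then decide how to interpret them.
# No second read is needed if the first guess turns out to be wrong.
def read_then_tag(path)
  bytes = File.binread(path)                        # tagged as binary (ASCII-8BIT)
  text  = bytes.dup.force_encoding(Encoding::UTF_8) # same bytes, new tag
  text.valid_encoding? ? text : bytes               # fall back to the raw bytes
end

Tempfile.create("sample") do |f|
  f.binmode
  f.write("na\u00EFve")    # "naive" with a diaeresis: 5 characters, 6 bytes in UTF-8
  f.flush
  result = read_then_tag(f.path)
  puts result.encoding     # UTF-8
  puts result.length       # 5
  puts result.bytesize     # 6
end
```

The bytes are never copied into a different class; only the interpretation attached to them changes.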

As someone who has to work with Unicode in PHP, I’d say it’s important
to separate the types. If you want to display something to a user you
have to know what it is, but when you’re reading a file you don’t
care, unless you know what’s in it.

The problem here is not unification. The problem here is that PHP is
stupid. It is generally recognised that Ruby’s API decisions are much
smarter than most other languages, and this is a good example of where
this would happen.

A Unicode String could be a subclass of the byte array with some
niceties for dealing with multibyte characters. Just a thought.

Unnecessary and overcomplex.

-austin

Hi,

In message “Re: Unicode roadmap?”
on Mon, 26 Jun 2006 10:22:15 +0900, “Phillip H.”
[email protected] writes:

|> st = File.open("file.txt", "rb") { |f| f.read_string }
|> bv = File.open("file.txt", "rb") { |f| f.read_bytes }
|
|Why? This looks needlessly complex.
|
|string = File.open('file.txt', 'r') { |f| f.read.to_s(:utf8) }
|
|Or possibly
|string = File.open('file.txt', 'r') { |f| f.read(:utf8) }
|bytes = File.open('file.txt', 'r') { |f| f.read(:bytearray) }

They are equally more complex than the current design. If File can
return String or ByteArray, why shouldn’t a String with “no encoding”
behave as a sequence of bytes, instead of separating the two? Are there any
specific operations that should be in ByteArray but not in String, or
vice versa?

						matz.

One clarification I’d like to add to this: I’m not saying that a ByteArray
needs to be added, but if you’re going to treat String as a ByteArray, then
perhaps there should be another type for character vectors?

Perhaps through some logic (perhaps the fact that this is the “way it is” in
Ruby 1.8) String does == ByteArray. If I could play devil’s advocate for a
moment, maybe the new, fancy m17n String, however it’s implemented, should
be a different class?

String == ByteArray in form and function
CharString == a string of characters with some particular encoding,
character logic, and so on

Perhaps even CharString < String, so it retains byte-level read/write
operations.

There’s another obvious advantage here…APIs that currently return a byte
array String will continue to do so, as they work in Ruby 1.8. CharString
could also be implemented today for Ruby 1.8, providing an encoding and
character-aware String implementation for applications that need it.

My only point about the dichotomy between and is that at some
level, they imply different behaviors, different APIs, different interfaces.
Perhaps the answer is not to change existing Ruby code to use a m17n String
while trying to retain byte array capabilities in the same time…but maybe
it’s worth considering that the new behavior warrants a separate type?

String.to_cs(:utf8) => CharString
String retains current interface and semantics
CharString gains the [n] => character or single-char string rather than int,
etc.

I know you (matz) want to break as much as possible with the 2.0 release,
but I still don’t see the advantage of marrying the “byte array string” and
“char string” types in the same class when separate types and behaviors
would be more logical and break far less.

Hi,

In message “Re: Unicode roadmap?”
on Mon, 26 Jun 2006 11:37:45 +0900, “Charles O Nutter”
[email protected] writes:
|I know you (matz) want to break as much as possible with the 2.0 release,
|but I still don’t see the advantage of marrying the “byte array string” and
|“char string” types in the same class when separate types and behaviors
|would be more logical and break far less.

I still don’t see how separate types and behaviors would be more
logical and break far less. For example, if I want to check EXIF
conformance of a jpeg file, I do

def self.exif_file?(filename)
  exif_header = "\xff\xd8\xff\xe1"
  magic = File.open(filename) {|f| f.read(4) }
  magic == exif_header
end

I am not sure what you expect about separation, but I doubt separation
would make above code to “be more logical and break far less”.
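For readers trying the check above on a later Ruby (1.9 and up), here is a hedged sketch of the same idea. It assumes the encoding-tagged String that eventually shipped, where a length-limited `read` returns a binary string and a literal in a UTF-8 source file is tagged UTF-8, so the header constant is made binary explicitly with `String#b`. The constant name `EXIF_MAGIC` is an illustrative choice, not part of the original message.

```ruby
require "tempfile"

# The four magic bytes from the example above, tagged as binary so the
# comparison is byte-for-byte regardless of the source file's encoding.
EXIF_MAGIC = "\xFF\xD8\xFF\xE1".b

def exif_file?(filename)
  # "rb" avoids newline translation; read(4) returns at most 4 raw bytes,
  # or nil for an empty file (which simply compares unequal).
  magic = File.open(filename, "rb") { |f| f.read(4) }
  magic == EXIF_MAGIC
end

Tempfile.create("photo") do |f|
  f.binmode
  f.write(EXIF_MAGIC + "rest of file".b)
  f.flush
  puts exif_file?(f.path)   # true
end
```

No separate byte-array class is involved; the String simply carries a "binary" tag.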

						matz.

On Jun 25, 2006, at 6:11 PM, Austin Z. wrote:

Under the people who want separate ByteVector and String class, I’ll
need two APIs:

st = File.open("file.txt", "rb") { |f| f.read_string }
bv = File.open("file.txt", "rb") { |f| f.read_bytes }

Maybe I’m missing something, but in today’s networked heterogeneous
environment, that first call looks deeply dangerous to me. I don’t
see how you can expect to get a String out of a file in the general
case. Files contain bytes, strings contain characters, and
pretending you can get from one to the other without explicit
encoding specification or inference is unsound.

Pardon me if I’m missing something obvious. -Tim

On 6/25/06, Yukihiro M. [email protected] wrote:

| bytes = File.open('file.txt', 'r') { |f| f.read(:bytearray) }
They are equally more complex than the current design. If File can
return String or ByteArray, why shouldn’t String with “no encoding”
behave as sequence of bytes instead of separating? Are there any
specific operations that should be in ByteArray but not in String, or
vice versa?

There are operations for Strings (#each_character, perhaps) that make
less sense for ByteVectors than for character-based Strings. But
everything or nearly everything you would want to do with a ByteVector
you would want to do with a String, and some operations from Strings
make sense on ByteVectors (regexp operations).

I would much rather keep the API – and the class library – simple. I
would rather do this:

st = File.open("file.txt", "rb", :encoding => :utf8) { |f| f.read }

or

bv = File.open("file.txt", "rb") { |f| f.read }
st = bv.to_encoding(:utf8)
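The two styles proposed above (name the encoding when opening, or read bytes and tag them afterwards) both have close counterparts in Ruby 1.9 and later. The sketch below uses those later APIs, not the hypothetical `:encoding => :utf8` option or `to_encoding` method from the message; the helper names are illustrative.

```ruby
require "tempfile"

# Style 1: state the encoding when opening the file.
def read_tagged_at_open(path)
  File.open(path, "r:UTF-8") { |f| f.read }
end

# Style 2: read raw bytes first, then tag them as UTF-8.
def read_then_retag(path)
  bv = File.binread(path)               # binary-tagged String
  bv.force_encoding(Encoding::UTF_8)    # same bytes, now treated as UTF-8
end

Tempfile.create("sample") do |f|
  f.binmode
  f.write("caf\u00E9")
  f.flush
  a = read_tagged_at_open(f.path)
  b = read_then_retag(f.path)
  puts a == b          # true
  puts a.encoding      # UTF-8
end
```

Either way there is a single `read` call and a single class; only the tag differs.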

-austin

On Jun 25, 2006, at 7:21 PM, Yukihiro M. wrote:

Are there any
specific operations that should be in ByteArray but not in String, or
vice versa?

Well, on strings, indexing and substring operations and iterators and
regular expressions should (at least optionally) have character
rather than byte semantics, right? Another example is
encoding-normalization (combining diacritics, etc) which doesn’t apply
to byte arrays. -Tim
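A small illustration of the distinction drawn above, assuming Ruby 2.2 or later (where `String#unicode_normalize` is available). The helper `describe` is hypothetical; it reports the byte count, the character count, and the character count after NFC normalization, which only makes sense for text.

```ruby
# Byte semantics vs. character semantics vs. normalized characters.
def describe(str)
  {
    bytes:     str.bytesize,
    chars:     str.length,
    nfc_chars: str.unicode_normalize(:nfc).length
  }
end

decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT
p describe(decomposed)
# bytes: 3 (1 for "e", 2 for U+0301), chars: 2, nfc_chars: 1 (precomposed U+00E9)
```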

On 26.6.2006, at 5:01, Yukihiro M. wrote:

I am not sure what you expect about separation, but I doubt separation
would make above code to “be more logical and break far less”.

Above code assumes all file operations return byte arrays. What is
the code when we want to obtain String of characters?

What if there is some $KCODE (or equivalent) setting somewhere in the
program before these lines? What would be the effect of that?

The problem is the auto-magic encoding handling which is required to
have text processing be as simple as it is now. You can have either
text processing (which adds encoding handling for us, combines bytes
in characters etc.) or byte processing (which does not). How do we
distinguish between the two modes of operation?

The obvious way is by adding a ByteArray. But maybe there is better
way…

izidor

Yukihiro M. wrote:

I am not sure what you expect about separation, but I doubt separation
would make above code to “be more logical and break far less”.

Just jumping into the discussion here, I have to agree with Matz. A char-vector
is simply a higher-level representation of a byte-vector, not different enough
to warrant two entirely separate classes.

I think the real issue is not technical but rather a problem of perception and
education. Ever since C-style strings, programmers have learned to view a string
as an array of chars. So when we need to do char-string manipulation, we resort
to pointer arithmetic when in fact the “correct” and ruby-native way of
manipulating strings is with regular expressions. Instead of giving in to this
old string-as-array mentality, maybe we should teach people to use regular
expressions? Hmmm, probably impossible.

A string can be interpreted as either a sequence of bytes or a sequence of
characters, but the methods can be confusing. Obviously, upcase and downcase are
operations at the character level, but what is [] supposed to do? From the ruby
point of view, str[0…3] gives you the first 4 bytes and str.scan(/^…/) gives
you the first 4 characters. But for the majority with the string-as-array
mentality, [] is ambiguous; does it give you access to the bytes or to the
characters of the string? In the interest of facilitating education, there needs
to be a clear disambiguation; instead of str[0…3] it should be str.byte(0…3)
and str.char(0…3) – with maybe the latter one giving a warning along the lines
of “use regular expressions!” ;-) That way the ambiguity between byte-vector
and char-vector could be resolved.
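The explicit byte/character split suggested above has a rough counterpart in later Ruby (1.9.3 and up), sketched here. `str.byte(...)` and `str.char(...)` are the poster's hypothetical names; the sketch uses `String#byteslice` and `String#[]` instead, wrapped in illustrative helpers.

```ruby
# Explicit byte-level access: the first n bytes, whatever they mean.
def first_bytes(str, n)
  str.byteslice(0, n)
end

# Explicit character-level access: the first n characters.
def first_chars(str, n)
  str[0, n]
end

str = "日本語テキスト"             # 7 characters, 3 bytes each in UTF-8
puts first_bytes(str, 3)          # 日
puts first_chars(str, 4)          # 日本語テ
puts str[/\A.{4}/]                # 日本語テ (the regular-expression route)
```

Note that a byte slice which cuts through the middle of a character yields a string that is no longer valid in its encoding, which is one reason to keep the two operations visibly distinct.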

Daniel

On 6/25/06, Charles O Nutter [email protected] wrote:

One clarification I’d like to add to this: I’m not saying that a ByteArray
needs to be added, but if you’re going to treat String as a ByteArray, then
perhaps there should be another type for character vectors?

There’s no meaningful distinction between the division of
ByteArray/String and String/CharString. I do not believe that this
is a viable option. The sole argument in favour is that we could add
a CharString to Ruby 1.8 – but I believe that this would be
stampeding us in the wrong direction.

Even if CharString < String, there will be problems – people already
note that there are issues with subclasses of the built-in classes.

My only point about the dichotomy between and is that at some
level, they imply different behaviors, different APIs, different interfaces.
Perhaps the answer is not to change existing Ruby code to use a m17n String
while trying to retain byte array capabilities in the same time…but maybe
it’s worth considering that the new behavior warrants a separate type?

This is where I disagree with you completely. If I have a String that
contains ISO-8859-15 data, it happens that s#byte_count and s#length
are the same value. It differs with UTF-8 data, but the interpretation
of a Character is, at best, a trait of the data being stored. I have
really given this a lot of thought, and I really do think that Matz
is right about this and that the people who want Unicode-native
strings are wrong. This sort of sucks for JRuby because of problems
with Java. But I do not think that Sun made the right decision with
Java. If nothing else, they ended up backing a dead “standard” during
the initial phases, and have had to hack out since then.
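The point about byte count and length coinciding for ISO-8859-15 but not for UTF-8 can be checked in later Ruby (1.9 and up). In this sketch `String#bytesize` stands in for the `byte_count` named above, and `byte_and_char_counts` is an illustrative helper.

```ruby
# Returns [number of bytes, number of characters] for a string.
def byte_and_char_counts(str)
  [str.bytesize, str.length]
end

utf8   = "caf\u00E9 \u20AC5"            # "café €5": 7 characters
latin9 = utf8.encode("ISO-8859-15")     # same text, single-byte encoding

p byte_and_char_counts(utf8)     # [10, 7]
p byte_and_char_counts(latin9)   # [7, 7]
```

The character count is the same in both cases; only the relation between characters and bytes depends on the encoding attached to the data.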

I know you (matz) want to break as much as possible with the 2.0 release,
but I still don’t see the advantage of marrying the “byte array string” and
“char string” types in the same class when separate types and behaviors
would be more logical and break far less.

It isn’t more logical. It doubles the number of required APIs for
IO. It completely complicates things from that perspective, with
little value for the people who have to implement character-oriented
data routines.

-austin

On Jun 26, 2006, at 1:24 AM, Daniel DeLorme wrote:

I think the real issue is not technical but rather a problem of
perception and education. Ever since C-style strings, programmers
have learned to view a string as an array of chars. So when we need
to do char-string manipulation, we resort to pointer arithmetic
when in fact the “correct” and ruby-native way of manipulating
strings is with regular expressions. Instead of giving in to this
old string-as-array mentality, maybe we should teach people to use
regular expressions? Hmmm, probably impossible.

Regular expressions are a very powerful tool, but they do not
describe the entire set of operations one would reasonably want to
perform on a string. Or perhaps they do but in a needlessly complex
way. I want to get the first letter (character?) of a sentence, in
pure regexp terms I’d do this: str.match(/\A./)[0] It’s needlessly
cryptic. Note that I’m not trying to make a commentary on whether or
not character string/byte string should be separate, just trying to
point out that “use regular expressions” shouldn’t always be the answer.

Hello Tim,

TB> Well, on strings, indexing and substring operations and iterators and
TB> regular expressions should (at least optionally) have character
TB> rather than byte semantics, right?

For UTF-8, which hopefully will rule the world soon, the worst libraries
I have seen are the ones that try to do this. But it is not the intention
of the designers, and with an implementation that works on characters
you lose the ingenious encoding style of UTF-8.

Of course some operations are more difficult, but this is left for good
reasons to the application programmer. Only a few cases of string
manipulation need some special (non-ASCII) character handling.

Hi,

In message “Re: Unicode roadmap?”
on Mon, 26 Jun 2006 13:51:33 +0900, Izidor J.
[email protected] writes:

|>
|> I am not sure what you expect about separation, but I doubt separation
|> would make above code to “be more logical and break far less”.
|
|Above code assumes all file operations return byte arrays. What is
|the code when we want to obtain String of characters?

line = File.open(filename, "r", "utf8") {|f| f.gets }

|What if there is some $KCODE (or equivalent) setting somewhere in the
|program before these lines? What would be the effect of that?

I think IO#read shall always return “binary” string, since its
specified length should always be in bytes. Anyway, when in doubt,
you can explicitly specify “binary” encoding,

|The problem is the auto-magic encoding handling which is required to
|have text processing be as simple as it is now. You can have either
|text processing (which adds encoding handling for us, combines bytes
|in characters etc.) or byte processing (which does not). How do we
|distinguish between the two modes of operation?

By explicitly setting their encoding to “binary”, e.g.

text = obtain_string_data()
text.encoding = "binary"
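The `text.encoding = "binary"` assignment above is a proposal, not an existing method. In the Ruby that later shipped (1.9 and up), the closest equivalent is `String#force_encoding`, sketched below with an illustrative helper `as_binary`.

```ruby
# Return a copy of the string whose bytes are unchanged but whose
# encoding tag is binary, so every operation works on single bytes.
def as_binary(text)
  text.dup.force_encoding(Encoding::BINARY)
end

text = "r\u00E9sum\u00E9"     # 6 characters, 8 bytes in UTF-8
raw  = as_binary(text)

puts text.length              # 6
puts raw.length               # 8
puts raw.bytes == text.bytes  # true
```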

|The obvious way is by adding a ByteArray. But maybe there is better
|way…

Show me the pseudo code using ByteArray, I will show you its
counterpart using String with encoding tag.

						matz.

On 6/26/06, Tim B. [email protected] wrote:

pretending you can get from one to the other without explicit
encoding specification or inference is unsound.

Um. You’re not missing anything – I’m mocking the API pair that would
be required to make this work as certain advocates have suggested.

Pardon me if I’m missing something obvious. -Tim

You’re not. IO should be done on byte buffers. There’s no meaningful
and useful distinction between a byte buffer and a string at the most
basic level. There’s an additional interpretation that’s possible at a
higher level (giving character-oriented operations), but that in and
of itself does not imply a need for a separation of the two concepts.
(Indeed, I find myself infuriated in C++ when I have to do something
that would work well with std::vector and I’m actually
working with std::string – or vice versa.)

-austin

On 6/26/06, Tim B. [email protected] wrote:

On Jun 25, 2006, at 7:21 PM, Yukihiro M. wrote:

Are there any
specific operations that should be in ByteArray but not in String, or
vice versa?
Well, on strings, indexing and substring operations and iterators and
regular expressions should (at least optionally) have character
rather than byte semantics, right? Another example is
encoding-normalization (combining diacritics, etc) which doesn’t apply
to byte arrays. -Tim

Those are interpretations of the data underlying the String, though.
Nothing says we can’t use these sort of operations still, especially
with Ruby’s dynamic objects. But I firmly believe that it can be
done in a way so as to not require the separation of a String from a
Byte Array.

-austin

Logan C. wrote:

Regular expressions are a very powerful tool, but they do not describe
the entire set of operations one would reasonably want to perform on a
string. Or perhaps they do but in a needlessly complex way. I want to
get the first letter (character?) of a sentence, in pure regexp terms
I’d do this: str.match(/\A./)[0] It’s needlessly cryptic. Note that I’m
not trying to make a commentary on whether or not character string/byte
string should be separate, just trying to point out that “use regular
expressions” shouldn’t always be the answer.

irb(main):001:0> "It's needlessly cryptic."[/./]
=> “I”

Not disagreeing, just trying to get more credit for regexes.

irb(main):010:0> "It's needlessly cryptic."[/.{17}(.)/, 1]
=> “r”

That’s a bit more cryptic.
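For comparison, here is a sketch of the regular-expression lookup next to plain indexing as it behaves in Ruby 1.9 and later, where `str[n]` returns a one-character string. In the Ruby 1.8 of this thread, `str[n]` returned an Integer byte value instead, which is part of why the regexp form was being used. The helper names are illustrative.

```ruby
# Character at a given position via a regular expression.
def char_at_regex(str, index)
  str[/\A.{#{index}}(.)/m, 1]
end

# Character at a given position via indexing (Ruby 1.9+ semantics).
def char_at_index(str, index)
  str[index]
end

s = "It's needlessly cryptic."
puts char_at_regex(s, 0)    # I
puts char_at_regex(s, 17)   # r
puts char_at_index(s, 17)   # r
```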

On 6/26/06, Izidor J. [email protected] wrote:

I am not sure what you expect about separation, but I doubt separation
would make above code to “be more logical and break far less”.
Above code assumes all file operations return byte arrays. What is
the code when we want to obtain String of characters?

As Tim B. pointed out in a response to me, trying to get a String
from a file is a ludicrous operation. I was mocking the API required
(e.g., File#read_string or something equally bozonic). You need to
read your data and then mark it as a String with a particular
encoding. And if you globally change the interpretation of File#read
to be String, you will be breaking the ability to read truly binary
data.

The problem is the auto-magic encoding handling which is required to
have text processing be as simple as it is now. You can have either
text processing (which adds encoding handling for us, combines bytes
in characters etc.) or byte processing (which does not). How do we
distinguish between the two modes of operation?

The obvious way is by adding a ByteArray. But maybe there is better
way…

Yes. It’s to actually read what has been suggested. The m17n String
won’t be a magic bullet. But you’ll be able to do something like:

bv = File.open("file.txt", "rb") { |f| f.read }
sv = bv.with_encoding(:utf8)

Or something like that. And you can still do bv == "\xff\xd8\xff\xe1"
as appropriate.

-austin

Logan C. wrote:

Regular expressions are a very powerful tool, but they do not describe
the entire set of operations one would reasonably want to perform on a
string. Or perhaps they do but in a needlessly complex way. I want to
get the first letter (character?) of a sentence, in pure regexp terms
I’d do this: str.match(/\A./)[0] It’s needlessly cryptic. Note that I’m
not trying to make a commentary on whether or not character string/byte
string should be separate, just trying to point out that “use regular
expressions” shouldn’t always be the answer.

It’s funny, maybe I’m just dumb but I can’t think of a single real-world
example where you’d want to access particular characters of a string. Why do you
want the first char? In the context of a byte string there might be something
special at position n (e.g. exif header), but in the context of a human-readable
string what is there? For example, if you want that first char in order to check
if it’s a space or not, you should use str =~ /^ /, etc, etc. I honestly can’t
think of any real-world examples where regular expressions are less appropriate
than pointer arithmetic. Can you illuminate me with some?

Daniel

On 26.6.2006, at 8:08, Yukihiro M. wrote:

|What if there is some $KCODE (or equivalent) setting somewhere in the
|program before these lines? What would be the effect of that?

I think IO#read shall always return “binary” string, since its
specified length should always be in bytes. Anyway, when in doubt,
you can explicitly specify “binary” encoding,

Oh, I see. So basically IO always returns ByteArray, and one needs to
convert it to String of characters explicitly (or implicitly by
specifying a parameter to IO).

No magic tagging with encoding. Well, this is nice and easy to
understand.

But how will this influence the simplicity of small programs in Ruby
which deal with data in a known (single) encoding? I was under the
impression that there would be some magic global setting which will
enable such programs to use Strings in the correct encoding.

Thank you for clarifications. They are most welcome…

izidor

Hi,

In message “Re: Unicode roadmap?”
on Mon, 26 Jun 2006 15:58:30 +0900, Izidor J.
[email protected] writes:

|But how will this influence the simplicity of small programs in Ruby
|which deal with data in a known (single) encoding? I was under the
|impression that there would be some magic global setting which will
|enable such programs to use Strings in the correct encoding.

The detail is not fixed yet but it would honor locales for the default
encoding.
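As a point of reference, the Ruby that eventually shipped (1.9 and up) exposes a locale-derived encoding and a process-wide default for IO. The sketch below only inspects those settings; the actual values depend on the environment, so no output is shown. `default_text_encoding` is an illustrative helper.

```ruby
# The encoding applied by default to data read from IO.
def default_text_encoding
  Encoding.default_external
end

puts default_text_encoding      # environment-dependent
puts Encoding.find("locale")    # the encoding derived from the process locale
```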

						matz.