Unicode roadmap?

rhaus · June 28, 2006, 4:03am

Daniel M. wrote:

I’ll point you at my solution to ruby quiz #83: (short but unique)

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197973

How would you write the method string_similarity without access to
each character? (This method computes the length of the longest
common substring)

How would you compute the Levenshtein distance (edit distance) between
two strings without access to each character?

I’ll grant that I don’t have enough imagination and that there are
cases where you want character access. But it seems to me that the
main use case is for something like this:
str = "cogito <b>ergo</b> sum"
i = str.index("<b>") + 3
j = str.index("</b>", i)
str[i...j]
=> "ergo"
and for that common case, regexes are far more appropriate:
str.match(/<b>(.*?)<\/b>/)[1]
=> "ergo"

Advocating regexes-only for character manipulation is certainly extreme.
I’m just saying that byte access and character access need to have
different semantics. If you look at the current ruby String API, bytes
are accessed through integer positions and characters are accessed
through regexes. The byte and char APIs are quite distinct, it’s just
that everybody is using the byte API and expecting to get characters as
a result.
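A minimal sketch of the byte/character distinction, assuming Ruby 1.9 or later (released after this thread), where integer indexing on an encoded String selects characters and bytes are reached through explicit methods. The sample string is made up for illustration.

```ruby
# Assumes Ruby 1.9.3 or later, where every String carries an encoding.
str = "na\u00EFve"               # "naive" with a diaeresis; that letter is 2 bytes in UTF-8

char_view = str[2]               # integer index selects a character (a 1-char String)
byte_view = str.getbyte(2)       # explicit byte API returns an Integer in 0..255

puts str.length                  # number of characters
puts str.bytesize                # number of bytes
puts char_view
puts byte_view

# Slicing by bytes can cut a multi-byte character in half.
half = str.byteslice(0, 3)
puts half.valid_encoding?
```

The point of the sketch is only that the two views give different answers for the same position once a string contains multi-byte characters.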

From what I understand (and please correct me if I’m wrong), ruby2 will
fix that by changing the API so that integer positions represent
characters instead of bytes. For binary strings, those two concepts map
to the same reality so it won’t be such a backward-incompatible change.
I just wonder what will be the behavior of str[0]. Will it return a
0..255 integer in the case of a binary string and a 1-character string
in the case of an encoding-set string? Now that would be an API
nightmare.

How would you pull strings out of a file with fixed-width fields?
With regular expressions? Really? What if you had a hundred fields?

Hmm, fixed width records and fields were created for the purpose of fast
access to data, i.e. seek to position recnum*reclength and extract
reclength bytes; they only make sense in the case of single-byte
characters. So this is more a case of byte access.
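One way to sketch the byte-oriented reading of fixed-width records is String#unpack, which works on bytes regardless of encoding. The record layout, field widths, and sample data below are hypothetical, chosen only for illustration.

```ruby
# Hypothetical layout: 10-byte name, 3-byte age, 8-byte city (21 bytes per record).
RECORD_FORMAT = "A10A3A8"        # "A" = space-padded field, trailing spaces stripped
RECORD_LENGTH = 21

data = "Alice     030Kyoto   " + "Bob       041Toronto "

# Split the buffer into records by byte offset, then into fields by width.
def parse_records(data, length, format)
  (0...data.bytesize).step(length).map do |offset|
    data.byteslice(offset, length).unpack(format)
  end
end

records = parse_records(data, RECORD_LENGTH, RECORD_FORMAT)

# Direct access to record number 1, as in "seek to recnum * reclength".
second = data.byteslice(1 * RECORD_LENGTH, RECORD_LENGTH).unpack(RECORD_FORMAT)

p records
p second
```

With a hundred fields, only the format string grows; no regular expression is involved.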

Daniel

rhaus · June 28, 2006, 6:50pm

On 6/28/06, Izidor J. wrote:

On 27.6.2006, at 19:19, Austin Z. wrote:

As I said, most of my opposition is based on
(1) stupid statically typed languages and (2) an inability to tell
Ruby what type you want back from a method call (this is a good thing,
because it in part prevents #1 ;).

First, “most of my opposition” is not useful in discussion and is a
straw-man, because we are not counting people here, we try to
evaluate reasons for and against. One person with good reason should
overcome 1000 not-so-good posts. This is not about winning the
argument, it’s about having the best solution.

You have misread my English. I am not referring to people who oppose
my position; I am referring to my opposition to a separate ByteArray
class. However, I have yet to see even a mediocre reason for a
separate ByteArray.

About (2), inability to tell in advance in your program whether you
get bytes or characters from a method in core (or any other) API is
NOT a good thing. This causes innumerable problems and unexpected
behaviour if programmer expects one and code sometimes gets the
other. The API should prevent such errors, either by very simple and
strict rules that enable easy prediction, or by introducing
ByteArray, which makes prediction trivial. This is not about
duck-typing, it’s about randomly having semantically different results.

You’ll never get that without type hinting. And type hinting for
return types would be as bad as anything else for Ruby. Consider this
copy function:

def copy_file(inf, outf)
  open(inf, "rb") { |fin|
    File.open(outf, "wb") { |fout|
      fout.write fin.read
    }
  }
end

Why didn’t I use File.open? Because I can now do this:

require 'open-uri'
copy_file("http://www.ruby-lang.org/en", "ruby-lang-en.html")

I didn’t get a “File” object from Kernel#open; I got (in this case) a
Tempfile.

Since the rules are not fixed yet, nobody can say whether one or the
other solution is better. But if the API is not very clear or
requires lots of manual specifying in code, we will be in a mess,
similar to today.

Quite simply, you’re either wrong or you don’t understand the
parameters of the problem. I’d rather assume the latter.

However, if you want to ensure a particular class is returned from a
Ruby method, you must have a method which guarantees that it will only
return that class (or nil, perhaps). Therefore, with a separate
ByteArray class, we would of necessity see parallel File operations
or a separate IO class hierarchy or (worst of all!) constructors which
tell the File to return String or ByteArray depending on how it was
constructed.

There is no possible good argument for separating ByteArray from
String in Ruby. Not with what it would do to the rest of the API, and
I don’t think that anyone who wants a ByteArray is thinking beyond
String issues.

-austin

rhaus · June 28, 2006, 8:42pm

On 28.6.2006, at 18:48, Austin Z. wrote:

About (2), inability to tell in advance in your program whether you
get bytes or characters from a method in core (or any other) API is
NOT a good thing. This causes innumerable problems and unexpected
behaviour if programmer expects one and code sometimes gets the
other. The API should prevent such errors, either by very simple and
strict rules that enable easy prediction, or by introducing
ByteArray, which makes prediction trivial. This is not about
duck-typing, it’s about randomly having semantically different results.

You’ll never get that without type hinting.

I think you do not understand what the problem is, because your claim
is so obviously false.

I can get that with a very simple rule: all IO#read (and similar)
calls always return binary Strings.

No type hinting in sight, but I always know whether my code receives
Strings or binary Strings. But this simple option is clearly not
possible, because it complicates the text processing in simple
scripts. We’ll see how complicated the final rules will be.

An alternative (actually equivalent) to the above is: all IO#readbytes
calls return ByteArray objects, and we need a separate call,
IO#readstring, which always returns Strings with encoding.

izidor

rhaus · June 28, 2006, 8:45pm

On 6/28/06, Juergen S. wrote:

Any additional complexity here should be offset later, when doing
operations on the read data as appropriate for its type.

It won’t be. All of the complexity of the m17n String will be inside of
the String, not exposed (by default) to the user. Stop thinking of the
encoding of a String as something that makes the String a unique object;
instead it is a lens that gives meaning to the bytes of the String.
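The "lens" idea can be sketched with String#force_encoding as it exists in Ruby 1.9 and later (String#b needs 2.0 or later). Whether this matches the m17n design being discussed here in every detail is an assumption; the sample bytes are made up.

```ruby
# Six raw bytes, tagged as binary (ASCII-8BIT).
bytes = "\xE3\x81\x82\xE3\x81\x84".b

as_binary = bytes.dup
# Same bytes, viewed through a UTF-8 lens; nothing is converted or copied byte-wise.
as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)

puts as_binary.length                    # counts bytes
puts as_utf8.length                      # counts characters
puts as_binary.bytes == as_utf8.bytes    # the underlying data is identical
```

Only the interpretation changes between the two objects; the stored bytes stay the same.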

Of course, the first line should raise an exception if file.txt is not
utf8 encoded,

The internal format of String is not going to be Unicode by default.
Matz has already said that. I happen to agree with him.

this saves extra complexity down the line, and is a real difference
between the two. I imagine Bytevector would be implemented with
maximum performance and space efficiency in mind, while String is a
higher level class streamlined for ease of use.

These two items are not mutually exclusive. Think a little more about
humane design and you’ll see that two wholly separate classes require a
lot more than what you’re assuming and would end up in programmers
making even dumber assumptions than they do today, because they’d think
they’re “protected” during IO because they’re getting a String. This is
not a safe assumption. Ever.

The separate byte vector class is needlessly complex and solves exactly
nothing that isn’t already solved in a better way.

-austin

rhaus · June 28, 2006, 8:53pm

On 6/28/06, Izidor J. wrote:

I think you do not understand what the problem is, because your claim
is so obviously false.

Oh, bollocks. Go ahead, pull the other one.

IO#readstring which always returns Strings with encoding.

Nope. Not nearly equivalent and a lot dumber. I’ve just spent the last
week explaining in simple terms why it’s dumb. You want to at least
double the complexity of the IO API because you’re either unwilling or
incapable of considering anything but your ByteArray concept.

I, for one, am not willing to consider an extensively more complex API
because your imagination is lacking.

-austin

rhaus · June 28, 2006, 8:56pm

On 28.6.2006, at 18:48, Austin Z. wrote:

There is no possible good argument for separating ByteArray from
String in Ruby. Not with what it would do to the rest of the API, and
I don’t think that anyone who wants a ByteArray is thinking beyond
String issues.

Oh, really? So it is OK for this code to sometimes receive binary
String and sometimes String with encoding:
io = SomeIO.open( … )
v = io.read( 1000 )

This is the most problematic part of String handling. Because if my
code expects this ‘v’ to be a binary string, v[0..15] is the first 16
bytes (maybe a message header or something). If this is an encoded
string (because some setting changed outside of my code), v[0..15]
will be some random amount of data.

This is the error that happens right now and will happen in the
future also, if the rules are not clear.

izidor

rhaus · June 28, 2006, 7:48pm

On Mon, Jun 26, 2006 at 11:21:59AM +0900, Yukihiro M. wrote:

|string = File.open('file.txt', 'r') {f.read.to_s(:utf-8)}
  					matz.

Any additional complexity here should be offset later, when doing
operations on the read data as appropriate for its type.

Of course, the first line should raise an exception if file.txt is not
utf8 encoded, this saves extra complexity down the line, and is a real
difference between the two. I imagine Bytevector would be implemented
with maximum performance and space efficiency in mind, while String is
a higher level class streamlined for ease of use.

There could be an accessor for the Bytevector to convert it (or parts of
it) to a String, for cases where you really need to read mixed
string/data from somewhere.

string = bytes.to_str(:utf8)
string2 = bytes[1..5].to_str(:utf8)

Or maybe a StrStream-like interface:

bytes.stream_open("r") do |b|
  s = b.read(:utf8)
  …
end

-Jürgen

rhaus · June 28, 2006, 9:06pm

On 28.6.2006, at 20:43, Austin Z. wrote:

Think a little more about humane design and you’ll see that two wholly
separate classes require a lot more than what you’re assuming and would
end up in programmers making even dumber assumptions than they do today,
because they’d think they’re “protected” during IO because they’re
getting a String. This is not a safe assumption. Ever.

True. That’s why most solutions do not offer String IO, but only
ByteArray. But for a language where a large part of the usage is text
processing, this brings lots of conversions into the code, which, as
Matz said, makes it like Java. But it is the safe way. Just the way you
like it: no automatic conversion.

But most of us would not like a language which makes you type all
the conversions manually in code, even for single-line scripts. Which
we would not have any more. Scripts would be at least two lines, with
one line for the conversion code.

izidor

rhaus · June 28, 2006, 9:19pm

On Thu, 29 Jun 2006, Austin Z. wrote:

There is no possible good argument for separating ByteArray from String in
Ruby. Not with what it would do to the rest of the API, and I don’t think
that anyone who wants a ByteArray is thinking beyond String issues.

i wouldn’t go that far. i’m wanting a byte array and thinking beyond
string issues about every 1-2 hrs in my job. for example

f = open 'grayscale_image.dat'

n_rows.times do
  row = f.read n_cols

  # now i have to do this
  row = row.split(//).map{|char| char[0]}

  # because here i need to do
  avg_pixel_value = row[31,5].inject(0){|avg,n| avg += n} / 5.0

  if some_range.include? avg_pixel_value
    ...
  end

end

this may have nothing to do with unicode issues - but would love to have
‘array of bytes’ style io operations, though i’ve not thought about the
api for more than 1 second.

anyhow - we actually want byte arrays more often than strings.

regards.

-a

rhaus · June 28, 2006, 9:19pm

On 28.6.2006, at 20:51, Austin Z. wrote:

Nope. Not nearly equivalent and a lot dumber. I’ve just spent the last
week explaining in simple terms why it’s dumb.

Equivalent in their predictive power. This is the problem I am
discussing: both give 100% predictable results independent of
environment, and the ByteArray version is maybe even somewhat firmer,
because the result has a different class, not only a different encoding.

You have not given any solution to any of the problems in code
examples I have given, related to the problem of predicting the class/
encoding of result.

I’d say the solution would prove you know what the problem is.

Except if you say that this (random String encoding in result) is not
a problem. Then this discussion really can’t progress. And we can agree
that we disagree and stop right here.

izidor

rhaus · June 28, 2006, 9:47pm

On 28.6.2006, at 21:18, Izidor J. wrote:

I’d say the solution would prove you know what the problem is.

Except if you say that this (random String encoding in result) is
not a problem.
Then this discussion really can’t progress. And we can agree that
we disagree and stop right here.

And to clear the air - I am not advocating ByteArray unconditionally.
I have just explained one crucial problem, and the ByteArray is a
simplistic solution to the problem. I would much prefer some really
creative and simple String solution. But I do not have it and have
not seen it yet.

Hopefully Matz (or anybody, really) will surprise us with an elegant,
balanced solution.

izidor

rhaus · June 28, 2006, 10:19pm

On Jun 28, 2006, at 3:16 PM, [email protected] wrote:

# now i have to do this
row = row.split(//).map{|char| char[0]}

This is off on a tangent here but, ara, why not just

row = row.to_enum(:each_byte).to_a
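For comparison, a small sketch of the same byte-array conversion using String#unpack("C*"), which also exists in the Ruby of this thread, and String#bytes, which returns an Array in Ruby 2.0 and later. The pixel values and window position are made up; the packed row stands in for the result of f.read(n_cols).

```ruby
# Stand-in for one row read from a grayscale image file (hypothetical values).
row = [10, 20, 30, 40, 50, 200, 255].pack("C*")

pixels = row.unpack("C*")    # Array of Integers in 0..255
same   = row.bytes           # equivalent Array in Ruby 2.0 or later

# Average over a 5-pixel window starting at index 1.
window = pixels[1, 5]
avg_pixel_value = window.inject(0) { |sum, n| sum + n } / window.size.to_f

p pixels
p avg_pixel_value
```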

rhaus · June 29, 2006, 7:13am

On 6/28/06, Izidor J. wrote:

io = SomeIO.open( … )
v = io.read( 1000 )

This is the most problematic part of String handling. Because if my
code expects this ‘v’ to be a binary string, v[0..15] is the first 16
bytes (maybe a message header or something). If this is an encoded
string (because some setting changed outside of my code), v[0..15]
will be some random amount of data.

This is the error that happens right now and will happen in the
future also, if the rules are not clear.

I would think that STD* should use the locale (or equivalent) for the
default encoding. So should popen. And open should use the locale to
determine the encoding of file names. This might be different from the
encoding of STD* (i.e. on Windows).

For file IO it might be reasonable to set the default encoding from the
locale as well. However, there is no reason why the files should
contain text. So to make things clear, IO should be binary by
default for files, network, and anything else (except the pipes
mentioned above).

For short scripts one could change that by assigning some global that
specifies the default encoding. For anything else it is reasonable to
demand that everybody sets the encoding when calling open. Even issue
a warning about that. If you want to know what encoding you get, there
is no other way.
And it is not adding complexity. Today you do not specify the encoding,
but you also do not get anything that deals with it.

Thanks

Michal

rhaus · June 29, 2006, 8:45am

On Thu, Jun 29, 2006 at 03:43:54AM +0900, Austin Z. wrote:

On 6/28/06, Juergen S. wrote:

Any additional complexity here should be offset later, when doing
operations on the read data as appropriate for its type.

It won’t be. All of the complexity of the m17n String will be inside of
the String, not exposed (by default) to the user. Stop thinking of the
encoding of a String as something that makes the String a unique object;
instead it is a lens that gives meaning to the bytes of the String.

Having said lens adds complexity. I’ll always have to think of the
data and the lens. You are very absolute in denying this, I wonder
why.

Of course, the first line should raise an exception if file.txt is not
utf8 encoded,

The internal format of String is not going to be Unicode by default.
Matz has already said that. I happen to agree with him.

Please stop beating this dead horse. No one is disputing Matz’s right
to implement as he likes.

And this is not about the String per se, in that line of code clearly
something supposed to be UTF8 is read from a file, and if the file
doesn’t contain valid UTF8, I’ll expect an exception. Not getting that
exception adds complexity to my code, because I’ll have to verify it
later on manually, and it may obscure the source of the error if I
forget. Complexity added in both cases.

Prior point proven.
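The "fail early on invalid UTF-8" behaviour asked for above can be sketched with String#valid_encoding? from Ruby 1.9 and later. The helper name decode_utf8! is hypothetical, not a core API; in real use the bytes would come from something like File.binread(path).

```ruby
# Hypothetical helper: tag raw bytes as UTF-8 and raise if they are not valid.
def decode_utf8!(bytes, source = "input")
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "#{source} is not valid UTF-8" unless text.valid_encoding?
  text
end

good = decode_utf8!("caf\u00E9".b)       # valid UTF-8 bytes

bad_rejected =
  begin
    decode_utf8!("caf\xE9".b)            # Latin-1 bytes, invalid as UTF-8
    false
  rescue ArgumentError
    true
  end

puts good.encoding
puts bad_rejected
```

The check runs once at the boundary, so the source of a bad file is reported where it is read instead of later in the program.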

not a safe assumption. Ever.

The separate byte vector class is needlessly complex and solves exactly
nothing that isn’t already solved in a better way.

Without a prototype, this is speculation at best. Programmers would be
protected by exceptions from invalid String I/O operations. Human
interface design hinges on a lot more and different things than this
one special detail, I can’t imagine it will change a lot, and many
ruby programmers aren’t as dumb as you make them out to be. This is a
red herring.

OT, you should watch whom you call dumb, stupid or foolish here, even
by implication.

That said, I am waiting for M17N as Matz has decided on that, and I
suspect no one else is going to implement anything else for now. But
don’t tell me it’ll be just perfect for everyone, when discussed use
cases already show it won’t be. Matz himself said that in order to
cater to his own special interest group, he is willing to sacrifice
some convenience for others.

-Jürgen

rhaus · June 29, 2006, 2:07am

Hi,

In message “Re: Unicode roadmap?”
on Thu, 29 Jun 2006 03:53:52 +0900, Izidor J.
writes:

|Oh, really? So it is OK for this code to sometimes receive binary
|String and sometimes String with encoding:
|io = SomeIO.open( … )
|v = io.read( 1000 )

No, as I said before, reading with length specified shall always
return binary strings, since it counts in bytes, whereas gets,
readline etc. would return encoded strings.

						matz.
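A small sketch of the rule stated above, as it behaves in later Ruby releases (1.9 and onward) to the best of my knowledge: IO#read with a length counts bytes and returns a binary (ASCII-8BIT) String, while IO#gets returns a String tagged with the external encoding. The file contents and the "r:UTF-8" mode are choices made for this example.

```ruby
require "tempfile"

line = chunk = nil
Tempfile.create("m17n-demo") do |f|
  f.write("h\u00E9llo\n")        # the accented letter takes two bytes in UTF-8
  f.close

  # gets: line-oriented, returns an encoded String.
  File.open(f.path, "r:UTF-8") { |io| line = io.gets }
  # read(n): byte-oriented, returns a binary String of n bytes.
  File.open(f.path, "r:UTF-8") { |io| chunk = io.read(3) }
end

puts line.encoding
puts chunk.encoding
puts chunk.bytesize
```

The encoding of `chunk` does not depend on the mode string, which is the predictability asked for earlier in the thread.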

rhaus · June 29, 2006, 8:57am

Hi,

In message “Re: Unicode roadmap?”
on Thu, 29 Jun 2006 15:44:10 +0900, Juergen S.
writes:

|That said, I am waiting for M17N as Matz has decided on that, and I
|suspect no one else is going to implement anything else for now. But
|don’t tell me it’ll be just perfect for everyone, when discussed use
|cases already show it won’t be. Matz himself said, that in order to
|cater to his own special interest group, he is willing to sacrifice
|some convenience for others.

Did I say so? I am not going to sacrifice anybody. At least I am
trying not to, even though I cannot promise.

						matz.

rhaus · June 29, 2006, 9:35am

On Thu, Jun 29, 2006 at 03:56:55PM +0900, Yukihiro M. wrote:

|some convenience for others.

Did I say so? I am not going to sacrifice anybody. At least I am
trying not to, even though I cannot promise.
  					matz.

I don’t think you can possibly cater to everyone here. Simplicity,
Flexibility, Performance, take any two. My impression is that M17N is
going for maximum flexibility with good performance, but for
e.g. Unicode-only users there’ll be some extra complexity to be aware
of. I don’t think you’ll sacrifice Unicode users totally, but it is
not your top priority either.

And I understood you expressed this yourself in the following quote.

On Tue, Jun 27, 2006 at 05:21:27PM +0900, Yukihiro M. wrote:

|ever to come along. It just seems based on a lot of anecdotal evidence that
  					matz.

-Jürgen

rhaus · June 29, 2006, 9:44am

Hi,

In message “Re: Unicode roadmap?”
on Thu, 29 Jun 2006 16:33:19 +0900, Juergen S.
writes:

|I don’t think you can possibly cater to everyone here. Simplicity,
|Flexibility, Performance, take any two. My impression is that M17N is
|going for maximum flexibility with good performance, but for
|e.g. Unicode only users there’ll be some extra complexity to be aware
|of. I don’t think you’ll sacrifice Unicode users totally, but it is
|not your top priority either.

I can’t promise implementation simplicity. Because it would not be
inside. But I am trying to build “pseudo simplicity”, which means
simplicity in the appearance. For example, text processing code with
file I/O in Ruby will keep being much simpler than Java.

|And I understood you expressed this yourself in the following quote.

Don’t get me wrong without context. You’ve said that “this approach
is complex, and worth it for 10% or less of Ruby users”. And I said,
“unfortunately I am one of those 10% or less. You cannot stop Ruby
being (implementation) complex”. Clear?

						matz.

rhaus · June 29, 2006, 1:13pm

On 6/29/06, Juergen S. wrote:

I don’t think you can possibly cater to everyone here. Simplicity,
Flexibility, Performance, take any two. My impression is that M17N is
going for maximum flexibility with good performance, but for
e.g. Unicode only users there’ll be some extra complexity to be aware
of. I don’t think you’ll sacrifice Unicode users totally, but it is
not your top priority either.

Um. You make the same error, I think, that some others have. There are
two measures of complexity to be measured. The first is implementation
complexity. The second is use complexity. I fully expect that the
implementation of the m17nString is going to be complex. (I think it
will be simpler than most naysayers are suggesting, but it will
certainly be more complex than anything that currently exists.)
However, I believe that the use complexity – that is, the external
API in both C (for extensions) and Ruby — is going to be relatively
low. Maybe a little more complex than we have today.

The actual complexity in use is going to depend on your needs. If
you’re dealing with Unicode and binary data only – as will likely be
the case – you will find it much easier to use than someone who has
to deal with multiple encodings at once.

-austin

rhaus · June 30, 2006, 1:06am

On Thu, Jun 29, 2006 at 04:42:49PM +0900, Yukihiro M. wrote:

|not your top priority either.
“unfortunately I am one of those 10% or less. You cannot stop Ruby
being (implementation) complex”. Clear?
  					matz.

First, it wasn’t me who brought this up, the quote about the 10% is
from “Charles O Nutter”. Second, I know a complex implementation
doesn’t mean the interface has to be complex, on the contrary.

My fear is that the interface will still be more complex than really
necessary for me (not that I would expect this to be reason enough for
you to deviate from your plans). Voicing my own concerns and wishes about
the interface design is a thing I can do though, in the hope that such
feedback will be useful to you, or at least informative to other
readers.

I still think that you won’t be able to please everybody, that’s just
not possible. No evangelist will ever convince me. But I am eager to
see for myself how close you can come (and where you will compromise).

-Jürgen