Unicode roadmap?


#221

On 26-Jun-06, at 1:03 PM, Austin Z. wrote:

argument syntax we can do:
s[byte: 0]  # The first byte (of a string with some non-ASCII-compatible encoding)

I kinda like that.

Presumably this is general arm-waving, because s[/./] need not return
the first character of a non-empty string unless you mean s[/./m] or
some uglier alternative:

ratdog:~ mike$ irb --simple-prompt
>> "\nx"[/./]
=> "x"
>> "\nx"[/./m]
=> "\n"

Mike

Mike S. removed_email_address@domain.invalid
http://www.stok.ca/~mike/

The “`Stok’ disclaimers” apply.


#222

Charles O Nutter wrote:

Hey, the uber-string m17n impl might be the most amazing, remarkable thing
ever to come along. It just seems based on a lot of anecdotal evidence that
this approach is very complex and very dangerous, and arguably has never
been done right yet. matz and company are amazing hackers, but is it a good
risk to take? Is it worth it for 10% of Ruby users or less?

I’d like to point out that MySQL has m17n strings, and it rocks.

Daniel


#223

On Jun 26, 2006, at 10:16 PM, removed_email_address@domain.invalid wrote:

Gary W.

Must defend random syntax that I invented :wink:

It only has to allocate a hash depending on how the named-argument
interface is implemented, e.g. (not real Ruby syntax, AFAIK):

def [](char_index = nil, byte: nil)
  # ...
end


#224

On Jun 26, 2006, at 9:57 PM, Austin Z. wrote:

I’m referring to s[byte: 0]. It’s elegant.

It seems a bit weighty. It requires the allocation of a Hash simply
to index a byte vector.

s.byte(0)

seems just as readable without the overhead.

Gary W.
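
As an aside, the trade-off can be mocked up in modern Ruby (the byte:
form relies on hash-literal syntax that arrived later; String#byte and
the byte: key are hypothetical, not core API). The keyword-style call
wraps its argument in a Hash, which is exactly the allocation Gary is
pointing at; a minimal sketch:

class String
  alias_method :plain_index, :[]

  # s[byte: 0] style: byte: 0 arrives as a one-entry Hash,
  # so a Hash is allocated on every call.
  def [](*args)
    if args.first.is_a?(Hash) && args.first.key?(:byte)
      unpack("C*")[args.first[:byte]]
    else
      plain_index(*args)
    end
  end

  # s.byte(0) style: a plain Integer argument, no Hash involved.
  def byte(i)
    unpack("C*")[i]
  end
end

"abc"[byte: 0]   # => 97
"abc".byte(0)    # => 97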


#225

Hi,

In message “Re: Unicode roadmap?”
on Tue, 27 Jun 2006 06:52:14 +0900, “Austin Z.”
removed_email_address@domain.invalid writes:

|> Austin Z. wrote:
|> > d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }
|> Question: Does the encoding parameter specify the encoding of the file,
|> or the encoding of the strings you get back (my guess is both).
|
|I would assume both, based on what I’ve seen from Matz.

I think so.

|> Another Question: When you set the encoding, are you:
|>
|> (A) Just changing the encoding specifier without changing the
|> underlaying string.
|> (B) Re-encoding the string according to the new encoding specifier.
|
|> (B) seems to be implied by the attribute notation, but that seems a bit
|> dangerous in my mind.
|
|I personally consider it to be (A) because I believe that encoding is
|a lens. If you want (B) it should be s1.recode(:utf8). But #recode
|would not work on an encoding of “binary” (or “raw”); #recode would be
|similar to the Iconv steps you would use today.

str.encoding = "ascii" would cause (A).

						matz.
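
As an illustration of the (A)/(B) distinction: in the m17n design that
eventually shipped with Ruby 1.9, the two behaviours became two distinct
methods (shown here with that later API, which did not exist at the time
of this thread):

s = "caf\xC3\xA9".force_encoding("UTF-8")   # (A) relabel only: bytes untouched
s.bytesize                                  # => 5
t = s.encode("ISO-8859-1")                  # (B) recode: bytes are transcoded
t.bytesize                                  # => 4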

#226

On Jun 26, 2006, at 7:20 PM, Daniel DeLorme wrote:

I’d like to point out that MySQL has m17n strings, and it rocks.

I am often unable to get Unicode strings from Perl into MySQL and
back out without breaking them. Haven’t tried the Ruby/MySQL combo;
does it work better? -Tim


#227

Hi,

In message “Re: Unicode roadmap?”
on Tue, 27 Jun 2006 00:05:22 +0900, “Dmitry S.”
removed_email_address@domain.invalid writes:

|And what about minilanguages, incorporated in Ruby: regexp patterns,
|sprintf, strftime patterns etc.?

Good point. Currently they don’t support non-ASCII-compatible
encodings (including UTF-16 and UTF-32, but this is not a fundamental
restriction).

						matz.
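
To illustrate the restriction: with the Encoding API as it later shipped
in Ruby 1.9+ (which postdates this thread), matching an ASCII-compatible
regexp against a UTF-16 string is simply rejected:

s = "abc".encode("UTF-16LE")
s =~ /a/
# raises Encoding::CompatibilityError
# (incompatible encoding regexp match: US-ASCII regexp with UTF-16LE string)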

#228

Hi,

In message “Re: Unicode roadmap?”
on Tue, 27 Jun 2006 06:43:30 +0900, “Charles O Nutter”
removed_email_address@domain.invalid writes:
|All due respect to matz and company and the wondrous thing they have wrought,
|but nobody is perfect. Accepting a decision blindly based on who is making
|it is a recipe for trouble. My only concern is that while the proposed m17n
|implementation may make Ruby more perfect and more ideal for at least one
|person, it may (emphasis on ‘may’) make it harder for many thousands of
|others. Does that make sense? I’m sure there will be those who argue that
|Ruby is matz’s creation and matz’s creation alone, but there’s a lot of
|people with a vested interest in “the Ruby way”. A little critical analysis
|of the “benevolent dictator’s” decisions is always prudent.

Good point.

|If we get unicode and it’s a lot harder than people like, or if it causes
|unpleasant compatibility, portability, or interoperability issues, then
|we’re no better off.
|
|Hey, the uber-string m17n impl might be the most amazing, remarkable thing
|ever to come along. It just seems based on a lot of anecdotal evidence that
|this approach is very complex and very dangerous, and arguably has never
|been done right yet. matz and company are amazing hackers, but is it a good
|risk to take? Is it worth it for 10% of Ruby users or less?

But unfortunately, the implementer is living among those “10% or
less”. So it’s a risk already taken, choosing a language designed by
such a person. :wink:

Anyway, please give me a chance to be proven wrong (or right).
I will try not to make lives of thousands of others hard.

						matz.

#229

On 6/26/06, Charles O Nutter removed_email_address@domain.invalid wrote:

versus those that could. Would it be reasonable to say that if 90% of Ruby
users would never have a pressing need for a non-unicode-encodable String,
then an uber-String that’s entirely encoding-agnostic would be better
written as an extension for those special cases? Do we really need to
encumber all of Ruby for the needs of a relative few?

It’s been asked already.

Again: How does the possibility to store non-unicode characters in
strings encumber you?

Michal


#230

Charles O Nutter schrieb:

(…)
Hey, the uber-string m17n impl might be the most amazing, remarkable
thing ever to come along. It just seems based on a lot of anecdotal
evidence that this approach is very complex and very dangerous, and
arguably has never been done right yet. matz and company are amazing
hackers, but is it a good risk to take? Is it worth it for 10% of
Ruby users or less?
(…)

Charles, could it be that “the uber-string m17n implementation” would
make your life as JRuby implementer a lot harder? ;->

Regards,
Pit


#231

It won’t matter much either way to JRuby, since Java’s going to
internalize all strings as UTF-16 anyway. Those encodings that can’t be
represented in Unicode simply won’t work; that’s just a platform
limitation we’ll probably live with. There’s always the option of
building our own uber-string based on what matz creates (porting to Java
wouldn’t be impossible, or perhaps even difficult), but we’ll cross that
bridge when we come to it.

I’m just trying to play both sides of the fence here, since there seem
to be a number of people opposed to or doubtful of the m17n uberstring.
As a Ruby platform implementer of a sort, I’d like to make sure those
concerns are considered.


#232

On 6/27/06, Michal S. removed_email_address@domain.invalid wrote:

non-unicode-encodable String, then an uber-String that’s entirely
encoding-agnostic would be better written as an extension for those
special cases? Do we really need to encumber all of Ruby for the
needs of a relative few?
It’s been asked already.

Again: How does the possibility to store non-unicode characters in
strings encumber you?

To be fair to Charles, he would benefit immensely from a Unicode
internal representation because he could then simply and cleanly use
Java Strings as Ruby Strings in JRuby.

With an m17n String, he will need to have something else that isn’t
compatible with Java Strings, which hurts JRuby’s use as a Java glue
language. I think that there are ways around this. Maybe make the JRuby
String class have an internal something like:

class JRubyString {
    private java.lang.String unicode;    // used when the encoding is Unicode-based
    private ByteVector       m17n;       // raw bytes for any other encoding
    private java.lang.String encoding;   // name of the current encoding
    private boolean          isUnicode;  // which representation is active
}

That way, if it’s a Unicode encoding – regardless of what’s desired –
he could use the unicode member; otherwise internally he uses the
ByteVector. (Strictly speaking, for non-“raw” or “binary” encodings, he
could always use the unicode member and convert as necessary.)

-austin


#233

On 6/27/06, Yukihiro M. removed_email_address@domain.invalid wrote:

But unfortunately, the implementer is living among those “10% or
less”. So it’s a risk already taken, choosing a language designed by
such a person. :wink:

That’s certainly fair to say, and I’m optimistic that whatever the best
decision is, you’ll make it right. It’s especially heartening that you
are an active participant in this debate; I know certain other language
designers that are less open to comment and criticism.

Anyway, please give me a chance to be proven wrong (or right).

I will try not to make lives of thousands of others hard.

                                                    matz.

It seems you’re giving yourself the chance to be proven wrong already.
I’ll just watch that process as it moves forward and do what I can to
mix things up.


#234

On 6/27/06, Austin Z. removed_email_address@domain.invalid wrote:

    private java.lang.String encoding;
    private boolean          isUnicode;
}

This would certainly be an option once matz has solved all the hard
problems of an encoding-free String. Some minimal testing of a byte[]-based
UTF-8 Java String replacement has shown that there are very few general
performance issues arising from reimplementing String with a different
data structure (a testament to Java’s JIT, since most Java code runs
faster without native bits). When there’s something concrete in the m17n
plan, we shouldn’t have much difficulty supporting it. We could also run
with pure Unicode internally, for folks who don’t need any
Unicode-incompatible encodings. Without the m17n code ready for general
consumption, it’s hard to say which path will be best.

The other advantage of a byte[]- or ByteVector-based JRuby string is for
IO; currently we use Java’s StringBuffer for handling mutable string
operations. This works well, but StringBuffer maintains a char[]
internally, so for every byte of IO we waste a byte. We’re considering
various options to improve that, and the end result may be closer to the
UberString than to Java’s own.

So yes, there’s some ulterior motive in my support for pure Unicode and
ByteArray, but any path taken will be implementable in JRuby. However, I
support those because I feel they simplify rather than complicate, and
not because they might be easier to implement in Java.


#235

On 6/27/06, Charles O Nutter removed_email_address@domain.invalid wrote:

So yes, there’s some ulterior motive in my support for pure Unicode and
ByteArray, but any path taken will be implementable in JRuby. However, I
support those because I feel they simplify rather than complicate, and
not because they might be easier to implement in Java.

IME, more classes complicates. Sometimes the complexity is necessary
because it is simpler than the alternative, but I don’t believe that
this is the case here. As I said, most of my opposition is based on
(1) stupid statically typed languages and (2) an inability to tell
Ruby what type you want back from a method call (this is a good thing,
because it in part prevents #1 ;).

-austin


#236

On 6/26/06, Daniel DeLorme removed_email_address@domain.invalid wrote:

It’s funny, maybe I’m just dumb but I can’t think of a single real-world
example where you’d want to access particular characters of a string.

If that is the case, then why doesn’t Ruby remove all substring
notation? If everyone is so comfortable with manipulating strings
via regexps, then why does the language bother to support
my_str[a..b], my_str[a...b], and my_str[a, b]?
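
For reference, the three forms in question behave like this (plain Ruby,
nothing hypothetical here):

s = "hello world"
s[0..4]    # => "hello"   (inclusive range)
s[0...5]   # => "hello"   (exclusive range)
s[6, 5]    # => "world"   (start index, length)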

I don’t mean to sound all-worked-up over this, but it does seem
hard to believe that those method calls for String are never used
in real-world code.


#237

On 6/27/06, Yukihiro M. removed_email_address@domain.invalid wrote:

But unfortunately, the implementer is living among those “10% or
less”. So it’s a risk already taken, choosing a language designed by
such a person. :wink:

That also means that the implementor has a much better understanding of
internationalization issues than those who live in the US :wink:

This should give us at least a sound base String class. And since the
class is open in Ruby, automatic this or that can be added.

Thanks

Michal


#238

Tim B. wrote:

On Jun 26, 2006, at 7:20 PM, Daniel DeLorme wrote:

I’d like to point out that MySQL has m17n strings, and it rocks.

I am often unable to get Unicode strings from Perl into MySQL and back
out without breaking them. Haven’t tried the Ruby/MySQL combo; does it
work better? -Tim

I’ve never had any problems. You just have to make sure the client
correctly tells the server what encoding it is using. The only annoyance
is that MySQL will silently change inconvertible characters to ‘?’, but
that’s part of the MySQL design philosophy rather than inherent to m17n
strings.

Daniel


#239

Daniel DeLorme removed_email_address@domain.invalid writes:

It’s funny, maybe I’m just dumb but I can’t think of a single
real-world example where you’d want to access particular characters
of a string.

I’ll point you at my solution to ruby quiz #83: (short but unique)

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197973

How would you write the method string_similarity without access to
each character? (This method computes the length of the longest
common substring)

How would you compute the Levenshtein distance (edit distance) between
two strings without access to each character?

How would you pull strings out of a file with fixed-width fields?
With regular expressions? Really? What if you had a hundred fields?


#240

On 27.6.2006, at 19:19, Austin Z. wrote:

As I said, most of my opposition is based on
(1) stupid statically typed languages and (2) an inability to tell
Ruby what type you want back from a method call (this is a good thing,
because it in part prevents #1 ;).

First, “most of my opposition” is not a useful measure in this
discussion and is something of a straw man, because we are not counting
people here; we are trying to evaluate the reasons for and against. One
person with a good reason should outweigh 1000 not-so-good posts. This is
not about winning the argument, it’s about finding the best solution.

About (2), the inability to tell in advance, in your program, whether
you will get bytes or characters back from a method in the core (or any
other) API is NOT a good thing. It causes innumerable problems and
unexpected behaviour when the programmer expects one and the code
sometimes gets the other. The API should prevent such errors, either by
very simple and strict rules that make the result easy to predict, or by
introducing a ByteArray, which makes prediction trivial. This is not
about duck-typing, it’s about randomly getting semantically different
results.

Since the rules are not fixed yet, nobody can say whether one or the
other solution is better. But if the API is not very clear or
requires lots of manual specifying in code, we will be in a mess,
similar to today.

izidor