Unicode roadmap?


#181

On 6/26/06, Daniel DeLorme removed_email_address@domain.invalid wrote:

It’s funny, maybe I’m just dumb, but I can’t think of a single real-world
example where you’d want to access particular characters of a string. Why do you
want the first char? In the context of a byte string there might be something
special at position n (e.g. an EXIF header), but in the context of a human-readable
string, what is there? For example, if you want that first char in order to check
whether it’s a space, you should use str =~ /^ /, etc. I honestly can’t
think of any real-world examples where regular expressions are less appropriate
than pointer arithmetic. Can you enlighten me with some?

Substrings? Finding the occurrence of a string in another string? Why
shouldn’t str[0..3] work on characters (for a string with an encoding
set)? Maybe I want to do something like str[0] =
Unicode::upcase(str[0])? :slight_smile:

Isn’t that what Humane Interface Design
(http://www.martinfowler.com/bliki/HumaneInterface.html) is all about?
:wink:

Regular expressions are cryptic. They are powerful, but do I need a
sledgehammer when I need a paperclip?


#182

Snaury M. wrote:

Yukihiro M. wrote:

Then how can it
determine which should be in the current code page, and which in Unicode?
Or would using the Win32 APIs ending with W allow you to live in
Unicode?
Well, currently (just downloaded the latest CVS sources) Ruby uses the ANSI
versions of the CreateFile and FindFirstFile/FindNextFile APIs, so even if I
set, for example, KCODE to UTF-8 (not sure how you can currently make
Ruby work with UTF-8), the ANSI versions of the APIs are still called, and that
means that
The same goes for the win32ole extension: I can see a lot of ole_wc2mb/ole_mb2wc
there, which breaks things horribly when interoperating with, for
example, Excel and trying to work with Russian/Greek/Japanese and all the
other languages on the same sheet (after I process the sheet,
modifying all of the cells, it will just strip all languages except
Russian from it).

Ah, well, for OLE that’s not true; only now did I realize I can set the
codepage there to UTF-8. But still, something similar for win32 file I/O (and
maybe for other things where the win32 API or the win32 C runtime is used)
would be great.


#183

Yukihiro M. wrote:

You said Tcl has Unicode support that works well for you. So I
think treating all of them as UTF-8 is OK for you.

It’s actually not about treating everything as UTF-8; it’s that Tcl unifies
everything in a way that lets you have the full variety of characters
in strings.

Then how can it
determine which should be in the current code page, and which in Unicode?
Or would using the Win32 APIs ending with W allow you to live in
Unicode?

Well, currently (just downloaded the latest CVS sources) Ruby uses the ANSI
versions of the CreateFile and FindFirstFile/FindNextFile APIs, so even if I
set, for example, KCODE to UTF-8 (not sure how you can currently make
Ruby work with UTF-8), the ANSI versions of the APIs are still called, and that
means that:

  1. if there are filenames with characters that don’t fall in the range of
     the current codepage, I will receive ‘?’ in place of them when I enumerate
     directory contents;
  2. I receive filenames in the current code page, and not in UTF-8;
  3. there is no way for me to open a file with these characters using
     standard Ruby classes.

The same goes for the win32ole extension: I can see a lot of ole_wc2mb/ole_mb2wc
there, which breaks things horribly when interoperating with, for
example, Excel and trying to work with Russian/Greek/Japanese and all the
other languages on the same sheet (after I process the sheet,
modifying all of the cells, it will just strip all languages except
Russian from it).

In *nixes you can just change your locale to *.UTF-8 and you’re OK,
because everything you receive when enumerating a directory is
UTF-8, and File.open will expect UTF-8. Unfortunately, that is not possible
on Windows: MS already provides ‘wide’ versions of the APIs for those
who need them, and there is no UTF-8 ANSI codepage you can set as the
default (because the UTF-8 codepage in Windows is somewhat ‘virtual’, for
conversion purposes only).

In Tcl you have all of your strings in UTF-8, and when Tcl interoperates
with the rest of the world, it converts strings appropriately (for
example, on Win9x there are mostly no ‘wide’ APIs, so it converts
strings to the current code page and uses the ANSI APIs, but on WinNT it
converts them to Unicode and uses the ‘wide’ APIs). What I was thinking of
is a way to set the “current codepage” for Ruby on win32 (including the
possibility of setting it to UTF-8), so that when Ruby talks to the
world it would use the ‘wide’ APIs when possible, converting to and from
this codepage. Unlike Tcl, where it is hard-coded to UTF-8, there would be
a possibility to choose, because there is no other way for a user to do
this on Windows (the user can’t set the current codepage to UTF-8).


#184

Dmitrii D. wrote:

Substrings? Finding the occurrence of a string in another string?

Those operations are precisely what regexes are best at.

Why shouldn’t str[0..3] work on characters (for a string with an encoding
set)? Maybe I want to do something like
str[0] = Unicode::upcase(str[0])? :slight_smile:

What about

str.sub!(/^./){ |c| Unicode::upcase(c) }

That hardly seems more cryptic to me.

It’s not that I don’t understand the attraction; it’s just that I think
when handling char-strings it’s best to change your mental model to
something further away from char/byte arrays.

BTW, if str[0..3] returns the first 4 characters, then how do I get the
first 4 bytes?

Daniel
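
For concreteness, here is how the two styles compare in present-day Ruby (1.9+), where String#[] is character-based and byte access is separate; Unicode-aware upcasing needs Ruby 2.4+. This is an editorial sketch, not part of the original exchange:

```ruby
str = "école"

# Regex style, as Daniel suggests:
a = str.dup
a.sub!(/\A./) { |c| c.upcase }
a                 # => "École"  (Ruby 2.4+ for Unicode-aware upcase)

# Index style, as Dmitrii wants:
b = str.dup
b[0] = b[0].upcase
b                 # => "École"

# First 4 characters vs. first 4 bytes:
str[0..3]         # => "écol"
str.bytes[0..3]   # => [195, 169, 99, 111]  (é is the two bytes 0xC3 0xA9)
```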


#185

On 6/26/06, Daniel DeLorme removed_email_address@domain.invalid wrote:

It’s funny, maybe I’m just dumb, but I can’t think of a single real-world
example where you’d want to access particular characters of a string. Why do you
want the first char? In the context of a byte string there might be something
special at position n (e.g. an EXIF header), but in the context of a human-readable
string, what is there? For example, if you want that first char in order to check
whether it’s a space, you should use str =~ /^ /, etc. I honestly can’t
think of any real-world examples where regular expressions are less appropriate
than pointer arithmetic. Can you enlighten me with some?

Have you looked at the “short but unique” Ruby quiz?

Also, when you are building search trees or the like, you want access
to letters one by one.

Thanks

Michal
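
Michal’s search-tree point can be made concrete with a tiny character trie; per-character access (each_char below) is the natural operation and a regex does not really fit. An illustrative sketch with invented names:

```ruby
# A minimal character trie: each edge is one character, so we need
# per-character access rather than pattern matching.
class Trie
  def initialize
    @root = {}
  end

  def insert(word)
    node = @root
    word.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = true
  end

  def include?(word)
    node = @root
    word.each_char do |ch|
      node = node[ch] or return false
    end
    !!node[:end]
  end
end

t = Trie.new
t.insert("ruby")
t.insert("rübe")
t.include?("rübe")  # => true
t.include?("ru")    # => false (prefix only, not a stored word)
```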


#186

On 26-jun-2006, at 3:11, Austin Z. wrote:

Stupid, stupid, stupid, stupid. If I have guessed wrong about the
contents of file.txt, I have to rewind and read it again. Better to
always read as bytes and then say, “this is actually UTF-8”. This
would be as stupid in C++, Java, or C#:

Not so fast. Let’s say you read from a file:

st = File.open("file.txt", "rb") { |f| f.read(4056) }

and you receive PART of a Unicode string (because you cannot know
where to stop reading before you look into the structure).
The only way to make what you read valid now is to slide along the
byte length and try to catch the bytes that you skipped.
Should I continue?
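
The repair Julik alludes to can be sketched like this for UTF-8, using the fact that continuation bytes match the pattern 0b10xxxxxx. A hedged sketch; the helper name is invented:

```ruby
# Trim a byte buffer that may end mid-character to the last complete
# UTF-8 character, returning [valid_string, leftover_bytes].
def split_incomplete_utf8(buf)
  bytes = buf.bytes
  i = bytes.length
  # Walk back over continuation bytes (0b10xxxxxx).
  i -= 1 while i > 0 && (bytes[i - 1] & 0xC0) == 0x80
  lead = i > 0 ? bytes[i - 1] : nil
  if lead && lead >= 0xC0
    # Expected length of the sequence started by this lead byte.
    need = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : 2
    if bytes.length - (i - 1) < need
      # The trailing sequence is incomplete: cut before the lead byte.
      cut = i - 1
      return [buf.byteslice(0, cut).force_encoding("UTF-8"),
              buf.byteslice(cut, buf.bytesize - cut)]
    end
  end
  [buf.dup.force_encoding("UTF-8"), "".b]
end

buf = "héllo".b.byteslice(0, 2)     # "h" plus only the first byte of "é"
good, rest = split_incomplete_utf8(buf)
good        # => "h"
rest.bytes  # => [0xC3] (to be prepended to the next read)
```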


#187

On 26-jun-2006, at 8:27, Daniel DeLorme wrote:

It’s funny, maybe I’m just dumb but I can’t think of a single real-world
example where you’d want to access particular characters of a string.

Well, think again. You have a truncate(text) helper in Rails which
truncates the text to X characters and appends “dot dot dot”. That’s the
easiest example. Or you have excerpts… etc.
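
A character-aware truncate in the spirit of the Rails helper Julik mentions; this is an illustrative sketch, not the actual Rails implementation, and the keyword names are invented:

```ruby
# Truncate to at most `length` characters, appending `omission` if cut.
def truncate(text, length: 30, omission: "...")
  return text if text.length <= length
  text[0, length - omission.length] + omission
end

truncate("This is a rather long sentence about Unicode handling")
# => "This is a rather long sente..."
truncate("short")
# => "short"
```

Note that String#length and String#[] must count characters, not bytes, for this to be safe on multibyte text, which is exactly the point of the thread.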


#188

On 26-jun-2006, at 3:01, Austin Z. wrote:

I suggest you look through the Unicode threads again. You’ll find
your statement is untrue. There are a lot of people who (foolishly)
want Unicode to be the only internal representation of Strings in Ruby.

Let’s say there are people who not-so-foolishly believe that trying
to have strings in all possible encodings is not technically possible,
and the aforementioned people don’t understand how a system can reliably
handle them. Especially since the aforementioned people remember that
Strings in Ruby are mutable and can transition from being Unicode to
being “something else” in one method call.


#189

On 26-jun-2006, at 10:07, Daniel DeLorme wrote:

str.sub!(/^./){ |c| Unicode::upcase(c) }
That hardly seems more cryptic to me.

It does seem unnatural, and it hints that you are working with an
encoding-incapable language, because people who are lucky enough to be
in ASCII will be able to do

str[0] = str[0].upcase

but people who are not will have to invent silly workarounds.

It’s not that I don’t understand the attraction; it’s just that I
think when handling char-strings it’s best to change your mental
model to something further away from char/byte arrays.

BTW, if str[0..3] returns the first 4 characters, then how do I get
the first 4 bytes?

str.bytes[0..3] seems OK to me. That is: for Strings the character-based
routines are the base ones, and the byte routines are secondary (not the
“chars” accessor I had to bolt on right now). The problem is that you
have to PROTECT an ignorant programmer from things like normalization
and character unity, and NEVER allow him to cut into a character of a
multibyte string UNLESS he specifically says he wants it that way.


#190

On 6/26/06, Julian ‘Julik’ Tarkhanov removed_email_address@domain.invalid wrote:

need two APIs:

st = File.open("file.txt", "rb") { |f| f.read(4056) }

and you receive PART of a Unicode string (because you cannot know
where to stop reading before you look into the structure).
The only way to make what you read valid now is to slide along the
byte length and try to catch the bytes that you skipped.
Should I continue?

Why would you read 4096 bytes in the first place?

If you knew the file is in some weird multibyte encoding you should
have set it for the stream, and read something meaningful.

If it is “ascii compatible” (ISO-8859-, cp, utf-8, … ) you can just
use gets.

Otherwise there is no meaningful string content.

Note that 4096 bytes is always OK for UTF-32 (or similar plain wide
character encodings), and may at worst get you half of a surrogate
character for UTF-16. And strings will have to handle incomplete
characters anyway - they may result from some delays/buffering in
network IO or such.

Thanks

Michal


#191

On 26-jun-2006, at 15:27, Michal S. wrote:

Why would you read 4096 bytes in the first place?

This is a pattern. If a file has no line endings, but just one (very
long) stream of characters, can you really use gets?

If you knew the file is in some weird multibyte encoding you should
have set it for the stream, and read something meaningful.

Or there should be a facility that prevents you from reading
incomplete strings. But is it implied that if I set IO.encoding = foo
the IO objects will prevent me? Will they go out to the provider
of the IO and get the missing remaining bytes?
In the case of Unicode the absolute, rigorous minimum is to NEVER
EVER slice into a codepoint, and it can go anywhere you want in terms
of complexity (because slicing between codepoints is also not the way).

If it is “ascii compatible” (ISO-8859-, cp, utf-8, … ) you can
just use gets.

Otherwise there is no meaningful string content.

Note that 4096 bytes is always OK for UTF-32 (or similar plain wide
character encodings),

Of which UTF-32 is the only one that is relevant for Unicode, and if
you investigated the subject a little you would know that slicing
Unicode strings at codepoint boundaries is often NOT enough. That way
you can cut off part of a compound character, a modifier codepoint or
an RTL override remarkably easily, which will just give you a different
character altogether (or alter your string display in a particularly
nasty way, that is, reverse your string display for the remaining
output of your program if you remove an RTL override terminator).
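
The compound-character point is easy to demonstrate with a combining mark: a slice at a codepoint boundary is well-formed yet still mangles the text. Modern Ruby’s grapheme_clusters (2.5+) is shown for contrast in this editorial sketch:

```ruby
s = "caf\u00E9"    # "café", precomposed é: 4 codepoints
t = "cafe\u0301"   # "café", e + combining acute: 5 codepoints, same display

# Codepoint-based slicing gives different answers for the "same" text:
s[0, 4]                          # => "café"
t[0, 4]                          # => "cafe"  (the accent is silently dropped)

# Grapheme clusters give the humanly expected slice (Ruby 2.5+):
t.grapheme_clusters[0, 4].join   # => "café"  (still e + combining acute)
```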

and may at worst get you half of a surrogate
character for UTF-16. And strings will have to handle incomplete
characters anyway - they may result from some delays/buffering in
network IO or such.

This is exactly why the notion of having strings both as byte buffers
and character vectors seems a little difficult. 90 percent of my use
cases for Ruby need characters, not bytes, and I would love to hint it
specifically should that be needed. The problem right now is that Ruby
does not distinguish these at the moment.

#192

On 6/26/06, Julian ‘Julik’ Tarkhanov removed_email_address@domain.invalid wrote:

On 26-jun-2006, at 15:27, Michal S. wrote:

Why would you read 4096 bytes in the first place?

This is a pattern. If a file has no line endings, but just one (very
long) stream of characters, can you really use gets?

But can you work with the file in parts then? If there is no
meaningful internal structure you have to work with the file in its
entirety (or do a block copy, but then you should not be concerned with
characters). If there is a structure, you may use alternate line
endings.

of complexity (because
slicing between codepoints is also not the way).

At most you can expect it to hold incomplete codepoints until they are
read fully, I guess. However, incomplete codepoints are going to exist
anyway, so strings must deal with them one way or another.

you would know that slicing Unicode strings at codepoint boundaries
is often NOT enough. That way you can cut off part of a compound
character, a modifier codepoint or an RTL override remarkably easily,
which will just give you a different character altogether (or alter
your string display in a particularly nasty way, that is, reverse your
string display for the remaining output of your program if you remove
an RTL override terminator).

If the file has some meaningful structure (like line endings or XML)
you should get the complete parts. If it does not, you have to deal
with it, and nobody can do it for you except the one who chose the
format in which the file was saved.

problem right now is that Ruby does not distinguish these at the moment.

But the problem is that you cannot distinguish them, not that you do
not have separate classes for them.

Michal


#193

On 6/26/06, Julian ‘Julik’ Tarkhanov removed_email_address@domain.invalid wrote:

On 26-jun-2006, at 3:11, Austin Z. wrote:

st = File.open("file.txt", "rb") { |f| f.read(4056) }

and you receive PART of a Unicode string (because you cannot know
where to stop reading before you look into the structure).
The only way to make what you read valid now is to slide along the
byte length and try to catch the bytes that you skipped.
Should I continue?

Sure. It won’t make you any more correct. Let’s play with your example:

st = File.open("file.txt", "rb", :encoding => :utf8) { |f| f.read(4096) }

Okay. Am I reading 4096 bytes or 4096 characters? The correct and
least surprising behaviour is to read the specified number of bytes.
Instead it would be better to expose the minimum amount required to
work with this:

bv = File.open("file.txt", "rb") { |f| f.read(4096) }
bv.encoding = :utf8
bv.encoding_valid? # will return false if the whole string isn’t a
valid UTF-8 sequence.
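
For the record, the API Ruby 1.9 eventually shipped is very close to Austin’s sketch, spelled force_encoding and valid_encoding?:

```ruby
# A byte vector whose last UTF-8 sequence is truncated:
bv = [0x63, 0x61, 0x66, 0xC3].pack("C*")   # "caf" plus a dangling lead byte
s  = bv.force_encoding("UTF-8")            # relabel in place, no conversion
s.valid_encoding?                          # => false

ok = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack("C*").force_encoding("UTF-8")
ok.valid_encoding?                         # => true
ok                                         # => "café"
```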

You’re really looking for something that is, in the end, completely
unworkable and unnecessarily complex in doing so. The m17n String –
with byte vector characteristics retained – maintains a clear, simple
API with few exceptions that would have to be memorised or understood.
Adding another class doubles the size of the class hierarchy that
has to be understood, and if there are any variances between them
the number of exceptions effectively doubles. If there aren’t any
variances between the class APIs, then what’s the point of separating
them in the first place?

A string is an ordered sequence of characters. A byte vector is an
ordered sequence of bytes. If your string is suitably flexible, then
it can say that a byte vector is a string where each character is one
byte long and that collation (etc.) are determined by the byte value.
We’re not talking rocket science here. Stop trying to make it such.

-austin


#194

On 6/19/06, Yukihiro M. removed_email_address@domain.invalid wrote:

|- at present time Ruby parser can parse only sources in ASCII compatible
|encoding. Would it change?

No. Ruby would not allow scripts in EBCDIC or UTF-16, although it
allows processing of those encodings.

And what about the mini-languages incorporated in Ruby: regexp patterns,
sprintf and strftime patterns, etc.?

Regexp syntax uses several metacharacters ( []{}()+-*?.: ) and Latin
letters, lower and upper case. But there are charsets/encodings which
don’t have some of them, e.g.: GB_2312-80 has none of them, JIS_X0201
doesn’t have backslash, and ebcdic-cp-ar1 doesn’t have backslash, square
or curly brackets. So regexp patterns can’t be constructed in these
charsets/encodings.

#195

On 6/26/06, Julian ‘Julik’ Tarkhanov removed_email_address@domain.invalid wrote:

On 26-jun-2006, at 15:27, Michal S. wrote:

Why would you read 4096 bytes in the first place?

This is a pattern. If a file has no line endings, but just one (very
long) stream of characters, can you really use gets?

If you knew the file is in some weird multibyte encoding you should
have set it for the stream, and read something meaningful.

Or there should be a facility that prevents you from reading
incomplete strings. But is it implied that if I set IO.encoding = foo
the IO objects will prevent me? Will they go out to the provider of
the IO and get the missing remaining bytes? In the case of Unicode the
absolute, rigorous minimum is to NEVER EVER slice into a codepoint,
and it can go anywhere you want in terms of complexity (because
slicing between codepoints is also not the way).

Anyone who wants to set all IO operations to a particular encoding is
making a huge mistake. Individual IO operations or handles could be set
to a particular encoding, but you would have a high probability of
breaking code external to you that did any IO operations if you forced
all IO to use your encodings.

you can cut off part of a compound character, a modifier codepoint or
an RTL override remarkably easily, which will just give you a different
character altogether (or alter your string display in a particularly
nasty way, that is, reverse your string display for the remaining
output of your program if you remove an RTL override terminator).

Oh, I understand that very well. At least as well as you do. However,
that is independent of whether IO works on encoded or unencoded values.
It’s easy enough to check the validity of your encoding, too. If you’re
not checking external input for taintedness, then you’re doing silly
things, too. One cannot hide too much of the complexity from Unicode,
because to do so will increase the chance that programmers not as smart
as you are will, well, screw the pooch royally.

and may at worst get you half of a surrogate character for UTF-16.
And strings will have to handle incomplete characters anyway - they
may result from some delays/buffering in network IO or such.
This is exactly why the notion of having strings both as byte buffers
and character vectors seems a little difficult. 90 percent of my use
cases for Ruby need characters, not bytes, and I would love to hint it
specifically should that be needed. The problem right now is that
Ruby does not distinguish these at the moment.

Yes, and that’s where your opposition to maintaining this is
persistently misguided. Ruby will distinguish between a String without
an encoding and a String with an encoding. You’re basing your opposition
to tomorrow’s behaviour on today’s (known bad) behaviour. Please,
stop doing that.

And while most of your use cases deal with characters, code that I’ve
written deals with both bytes and characters in equal measures.

-austin


#196

On Monday 26 June 2006 11:54 am, Jim W. wrote:

I think it would be a great idea to prototype these ideas in real
code to understand the advantages and disadvantages of each.

+1^2


#197

I’ve been following this debate with some interest. Alas, since my
unicode/m17n experience is quite limited, I don’t have a strong opinion
on the matter.

But the following caught my eye:

Austin Z. wrote:

[…] Ruby will distinguish between a String without
an encoding and a String with an encoding. You’re basing your opposition
to tomorrow’s behaviour on today’s (known bad) behaviour.

Part of the problem is that we are basing our discussions on
descriptions of what will happen in the future, but that makes it
difficult to understand the issues involved without real code.

What I would like to see is prototype implementations of both
approaches, so we can see the differences in how they affect the code.
I’m more interested in answering questions like “How do I safely
concatenate strings with potentially different encodings” and “How do I
do I/O with encoded strings” than in addressing efficiency questions. In
other words, how do the different approaches affect the way I write code?

I think it would be a great idea to prototype these ideas in real code
to understand the advantages and disadvantages of each.

– Jim W.


#198

On Jun 26, 2006, at 2:13 AM, Joel VanderWerf wrote:

It’s funny, I’m always forgetting you can index by regexp. But this
brings up a good point: this is Ruby, and with the new Hash / named
argument syntax we can do:

"It's needlessly cryptic."[byte: 2]

This doesn’t add anything at all to the conversation, but I think it
looks good, and it’s in the “make similar things look similar” vein.

Indexing Strings

s[0]       # The first character
s[/./]     # The first character
s[byte: 0] # The first byte (of a string with some non-ASCII-compatible encoding)
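
The s[byte: 0] keyword form is only a proposal in this thread; present-day Ruby spells the same distinction with separate byte methods, as in this editorial sketch:

```ruby
s = "ésope"            # UTF-8; é is the two bytes 0xC3 0xA9

s[0]                   # => "é"   character indexing
s[/./]                 # => "é"   first character via regex
s.getbyte(0)           # => 195   first byte, as an Integer
s.byteslice(0, 1)      # first byte, as a one-byte string
```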


#199

On 6/26/06, Logan C. removed_email_address@domain.invalid wrote:

compatible encoding)

I kinda like that.

-austin


#200

“Austin Z.” removed_email_address@domain.invalid writes:

I would much rather keep the API – and the class library – simple. I
would rather do this:

st = File.open("file.txt", "rb", :encoding => :utf8) { |f| f.read }

or

bv = File.open("file.txt", "rb") { |f| f.read }
st = bv.to_encoding(:utf8)

Partly off-topic, but important nevertheless: then it’s the right
time to drop that damn “rb” by making it the default, and let the people
stuck in the \r\n age use :encoding => “win-ansi” or “dos” or whatever.
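
And indeed, Ruby 1.9+ folded the encoding into the open-mode string, so both of Austin’s variants have direct spellings today. An editorial sketch using a temp file (file contents are made up):

```ruby
require "tempfile"

Tempfile.create("demo") do |f|
  f.write("caf\u00E9\n")
  f.flush

  # Read raw bytes, then relabel:
  bv = File.binread(f.path)                    # ASCII-8BIT byte vector
  st = bv.force_encoding(Encoding::UTF_8)

  # Or declare the encoding on the stream itself:
  st2 = File.open(f.path, "rb:UTF-8") { |io| io.read }

  st == st2                                    # => true
end
```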