Re: Unicode roadmap?

real-world example where you’d want to access particular characters of a string.

If that is the case, then why doesn’t Ruby remove all
substring notation?

Because it would be a disaster. You want real world examples? Take a
look at any of the pure Ruby code in the Win32Utils examples where I
have to take slices out of character buffers and pack or unpack them
into the appropriate value. I’m guessing this might apply to Ruby/DL as
well.
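
Something along these lines (a simplified sketch - the buffer layout is invented here, not actual Win32Utils code):

buf = 0.chr * 8                          # 8-byte character buffer filled in by a C call
size  = buf[0, 4].unpack('V').first      # bytes 0..3 as a 32-bit little-endian value
flags = buf[4, 4].unpack('V').first      # bytes 4..7 likewise

If [0, 4] ever started meaning “four characters” instead of “four bytes”, code like that would silently break.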

I’m sure there are many people using character access in “real world”
code.

Regards,

Dan

On 6/27/06, Berger, Daniel [email protected] wrote:

I’m sure there are many people using character access in “real world”
code.

Raising my hand, but the question might be who does character access
on Unicode strings. I play with byte arrays all the time (sometimes
with embedded strings), but very rarely (but I do) use string slicing
against a Unicode string.

I am in the (unfortunate) position of dealing with many legacy binary
files, encoded into a wide variety of pieces and parts – I use string
slicing, but more exactly I use byte array slicing (don’t get me wrong
– I want to keep a single String class).
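
Something like this, say (a made-up record layout and a hypothetical io handle, just to show the kind of slicing I mean):

record = io.read(32)                        # one fixed-width record from a legacy file
name   = record[0, 16].strip                # bytes 0..15: text field
rec_id = record[16, 4].unpack('N').first    # bytes 16..19: big-endian integer
blob   = record[20, 12]                     # bytes 20..31: opaque binary payload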

pth

On Jun 27, 2006, at 3:05 PM, Berger, Daniel wrote:

If that is the case, then why doesn’t Ruby remove all
substring notation?

Because it would be a disaster. You want real world examples? Take a
look at any of the pure Ruby code in the Win32Utils examples where I
have to take slices out of character buffers and pack or unpack them
into the appropriate value. I’m guessing this might apply to Ruby/DL as well.

Point granted, but I bet the Win32 stuff assumes 8-bit “characters”
and thus fixed offsets. -Tim

Tim B. wrote:

Point granted, but I bet the Win32 stuff assumes 8-bit “characters”
and thus fixed offsets. -Tim

Maybe I should fork a version of Ruby tailored specifically to Windows.
I’ll replace all char pointer declarations with tchar pointers, set
MBCS, and automatically convert all strings to wide strings using
MultiByteToWideChar() behind the scenes, using whatever code page they
want, defaulting to CP_UTF8.

Right after I get my VC funding. :wink:

Regards,

Dan

On 6/27/06, Daniel B. [email protected] wrote:

Maybe I should fork a version of Ruby tailored specifically to Windows.
I’ll replace all char pointer declarations with tchar pointers, set
MBCS, and automatically convert all strings to wide strings using
MultiByteToWideChar() behind the scenes, using whatever code page they
want, defaulting to CP_UTF8.

Right after I get my VC funding. :wink:

Trust me. You don’t want to do that. TCHAR with -DUNICODE is pure evil.

-austin

On Jun 27, 2006, at 7:59 PM, Tim B. wrote:

DL as well.

Point granted, but I bet the Win32 stuff assumes 8-bit “characters”
and thus fixed offsets. -Tim

I have an example, but I’m sure most would consider it “cheating”.
Let’s say I need to write a Regexp engine… :wink:

Austin Z. wrote:

-austin

Well, it would be -DMBCS. :wink:

I cannot even begin to imagine what changes would be required for the
regex engine.

- Dan

On 6/28/06, Patrick H. [email protected] wrote:

On 6/27/06, Berger, Daniel [email protected] wrote:

I’m sure there are many people using character access in “real world”
code.

Raising my hand, but the question might be who does character access
on Unicode strings. I play with byte arrays all the time (sometimes
with embedded strings), but very rarely (but I do) use string slicing
against a Unicode string.

I guess that would be anyone in Eastern Europe with a Cyrillic-based alphabet :slight_smile: Especially those dealing with web apps. Sigh

On the other hand, I wonder who in his right mind would want to work with strings as a sequence of bytes? :wink: 90% of developers out there don’t even know how encodings work. So all this manual to-and-fro conversion, moving through bytes, etc. would only be perceived as voodoo magic. A deeper sigh

I am in the (unfortunate) position of dealing with many legacy binary
files, encoded into a wide variety of pieces and parts – I use string
slicing, but more exactly I use byte array slicing (don’t get me wrong
– I want to keep a single String class).

I wonder what is wrong with making all strings Unicode by default (this will ensure that most libraries don’t automagically break once Ruby is upgraded), while letting developers optionally decide whether they want a different encoding:

s = new String #=> unicode string
sj = new String(:encoding => 'jis')
scp = new String(:encoding => 'CP1251')
sb = new String(:binary => true) #=> work as ByteArray
sbf = new String(:encoding => 'funny encoding', :binary => true) #=> work as ByteArray

There are numerous performance issues, I suppose. And other problems
like assignment operations. Sigh

On 28/06/06, Dmitrii D. [email protected] wrote:

s = new String #=> unicode string
sj = new String(:encoding => 'jis')
scp = new String(:encoding => 'CP1251')
sb = new String(:binary => true) #=> work as ByteArray
sbf = new String(:encoding => 'funny encoding', :binary => true) #=> work as ByteArray

I’m confused - I thought we were talking about Ruby! :wink:

Paul.

On 28/06/06, Dmitrii D. [email protected] wrote:

s = new String #=> unicode string
sj = new String(:encoding => 'jis')
scp = new String(:encoding => 'CP1251')
sb = new String(:binary => true) #=> work as ByteArray
sbf = new String(:encoding => 'funny encoding', :binary => true) #=> work as ByteArray

I’m confused - I thought we were talking about Ruby! :wink:

:slight_smile: Sorry. I currently have to work with C++, Ruby and PHP simultaneously. It shows.
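
In Ruby syntax the idea would look roughly like this (hypothetical constructor options, of course - none of this exists today):

s   = String.new                               # unicode string (the default)
sj  = String.new('', :encoding => 'jis')
scp = String.new('', :encoding => 'CP1251')
sb  = String.new('', :binary => true)          # behaves as a ByteArray
sbf = String.new('', :encoding => 'funny encoding', :binary => true)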

I’ve come across a similar discussion on Unicode for Erlang. This post in particular sums up some of the problems with Unicode:
http://article.gmane.org/gmane.comp.lang.erlang.general/16021

There is, in particular, nice info on how SWI-Prolog handles Unicode and what problems they have.

Hi,

In message “Re: Unicode roadmap?”
on Wed, 28 Jun 2006 23:11:39 +0900, “Austin Z.”
[email protected] writes:

|I have suggested to Matz that we adopt the u"string" format so that we
|have a literal constructor for Unicode strings (which is by far the
|more common need).

I am not sure how much this is more useful than usual string literals
plus unicode encoding pragma.

						matz.

On 6/28/06, Dmitrii D. [email protected] wrote:

work as ByteArray

IIRC, any “unknown” encoding will be treated as a binary string where you’re responsible for dealing with “characters”.

I have suggested to Matz that we adopt the u"string" format so that we
have a literal constructor for Unicode strings (which is by far the
more common need).

-austin

On 6/28/06, Austin Z. [email protected] wrote:

I have suggested to Matz that we adopt the u"string" format so that we
have a literal constructor for Unicode strings (which is by far the
more common need).

In this regard I would love to see a user-definable string quote operator (see
http://redhanded.hobix.com/inspect/userDefinedLiteralsInSydney.html).
Then we could do it ourselves (and many other devious things as well).

pth

On 28-jun-2006, at 19:12, Yukihiro M. wrote:

I am not sure how much this is more useful than usual string literals
plus unicode encoding pragma.

Austin, I don’t understand why my strings are more “special” than
yours and thus need subclassing, special encoding
or a special literal before them. This is by far the worst thing I
fear - that multibyte strings are handled as being “special”.
They are not special, they are the default.

With the pragma there is one thing which makes me wonder - will that
mean that the libraries will have to check for the pragma
to do their job correctly?

On 28-jun-2006, at 19:33, Austin Z. wrote:

there’s a mismatch.)

Please no. Please please no.

What about:

raise "Not PNG." unless @top.bytes[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"

On 6/28/06, Yukihiro M. [email protected] wrote:

| I have suggested to Matz that we adopt the u"string" format so that
| we have a literal constructor for Unicode strings (which is by far
| the more common need).
I am not sure how much this is more useful than usual string literals
plus unicode encoding pragma.

I’m not sure I like the encoding pragma, personally, since it’s at the
file level. Consider this:

raise "Not PNG." unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"

If I understand the encoding pragma correctly, both the “Not PNG” and
the matching string will be treated as Unicode, and the test string is
not valid Unicode.
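
Spelling that out: the first byte of the signature is 0x89, which is a UTF-8 continuation byte with no lead byte in front of it, so the literal cannot be valid UTF-8:

"\x89PNG\x0d\x0a\x1a\x0a".unpack('C*')  #=> [137, 80, 78, 71, 13, 10, 26, 10]
# 137 (0x89) cannot start a UTF-8 sequence, so a UTF-8 pragma leaves this literal invalid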

Better, from my perspective:

raise u"Not PNG." unless @top[0, 8] == “\x89PNG\x0d\x0a\x1a\x0a”

That way, I mark the strings for which I want Unicode format. The
encoding pragma makes it hard to do mixed content files.

(This example, by the way, is specifically artificial, but the code
involved is real. It’s image matching code with error messages if
there’s a mismatch.)

-austin

On 6/28/06, Julian ‘Julik’ Tarkhanov [email protected] wrote:

yours and thus need subclassing,

Excuse me? I’m not the one who has been advocating separate classes. I
have, however, suggested that Unicode strings are going to be common
enough that rather than saying String.new("unicode-string", encoding:
:utf8) we have u"unicode-string". I already know that binary string
literals are quite common in Ruby. I use them a lot.

And I mix them with strings that would logically be represented in
Unicode.

If, however, you would prefer doing it the other way, we could go:

raise "UnicodeString" unless a[0, 8] == b"binarystring"

Either way, I don’t care. But neither needs to be a separate class. But
they should be mixable, and the pragma wouldn’t make things easily
mixable.

special encoding or a special literal before them. This is by far the
worst thing I fear - that multibyte strings are handled as being
“special”. They are not special, they are the default.

No, actually, they’re possibly the default. More to the point,
UNICODE strings are not even necessarily going to be the default with
multibyte strings. So even so I would suggest that this might be
useful:

raise u"UnicodeString" unless a[0, 8] == b"binarystring"
"other-multibyte-string-according-to-pragma"

With the pragma there is one thing which makes me wonder - will that
mean that the libraries will have to check for the pragma to do their
job correctly?

I think the pragma is going to be a problem for mixed-content strings.

-austin

On 6/28/06, Julian ‘Julik’ Tarkhanov [email protected] wrote:

Please no. Please please no.

What about:

raise "Not PNG." unless @top.bytes[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"

Except that @top is guaranteed to not have an encoding – at least it
damned well better not – and @top.bytes is redundant in this case. I
see no reason to access #bytes unless I know I’m dealing with a
multibyte String. Worse, why would “Not PNG.” be treated as Unicode
under your scheme but “\x89PNG\x0d\x0a\x1a\x0a” not be? I don’t think
you’re thinking this through.

@top[0, 8] is sufficient when you can guarantee that sizeof(char) ==
sizeof(byte). On “raw” strings, this is always the case. On all
strings, @top[0, 8] would return the appropriate number of characters
– not the number of bytes. It just so happens on binary strings that
the number of characters and bytes is exactly the same.
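
A quick sketch of what I mean, assuming UTF-8 for the multibyte case:

s = "héllo"    # multibyte string: five characters, six bytes ("é" is 0xC3 0xA9)
s[0, 2]        #=> "hé"     - two characters (three bytes)

b = "\x89PNG"  # binary string: character == byte
b[0, 2]        #=> "\x89P"  - two characters, two bytes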

What I’m arguing is that while the pragma may work for the less-common
encodings, both binary (non-)encoding and Unicode (probably UTF-8) are
going to be common enough that specific literal constructors are
probably a very good idea.

-austin

On 28-jun-2006, at 20:36, Austin Z. wrote:

Except that @top is guaranteed to not have an encoding – at least it
damned well better not – and @top.bytes is redundant in this case. I
see no reason to access #bytes unless I know I’m dealing with a
multibyte String.

You never know if you are, that’s the problem. And no, it’s NOT redundant. You should just get used to the fact that all strings might become multibyte.

Worse, why would “Not PNG.” be treated as Unicode
under your scheme but “\x89PNG\x0d\x0a\x1a\x0a” not be? I don’t think
you’re thinking this through.

@top[0, 8] is sufficient when you can guarantee that sizeof(char) ==
sizeof(byte).

You can NEVER guarantee that. N e v e r. More people, in more languages, use multibyte characters by default than all ASCII users combined.

It seems a great pity, but you still approach multibyte strings as something “special”.

On “raw” strings, this is always the case.

The only way to distinguish “raw” strings from multibyte strings is to subclass (which sucks for you as a byte user and for me as a string user).

On all
strings, @top[0, 8] would return the appropriate number of characters
– not the number of bytes. It just so happens on binary strings that
the number of characters and bytes is exactly the same.

This is a very leaky abstraction - you can never be sure what you will get. What’s the problem with having bytes as an accessor?

What I’m arguing is that while the pragma may work for the less-common
encodings, both binary (non-)encoding and Unicode (probably UTF-8) are
going to be common enough that specific literal constructors are
probably a very good idea.

Python proved that to be wrong - both the subclassing part and the literals part. The fact that you have to designate Unicode strings with literals is a bad decision, and I can only suspect that it has to do with compiler intolerance and the need to do preprocessing.

On 28-jun-2006, at 20:46, Julian ‘Julik’ Tarkhanov wrote:

The fact that you have to designate Unicode strings with literals is a bad decision, and I can only suspect that it has to do with compiler intolerance and the need to do preprocessing.

I meant C in this part, sorry.