Forum: Ruby Unicode roadmap?

Roman Hausner (rhaus)
on 2006-06-13 23:12
In my opinion, Ruby is practically useless for many applications without
proper Unicode support. How a modern language can ignore this issue is
really beyond me.

Is there a plan to get Unicode support into the language anytime soon?
Yukihiro Matsumoto (Guest)
on 2006-06-14 00:28
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner
<roman.hausner@gmail.com> writes:
|In my opinion, Ruby is practically useless for many applications without
|proper Unicode support. How a modern language can ignore this issue is
|really beyond me.

Define "proper Unicode support" first.

|Is there a plan to get Unicode support into the language anytime soon?

I'm planning to enhance Unicode support in 1.9 in a year or so
(finally).  But I'm not sure that conforms to your definition of
"proper Unicode support".  Note that 1.8 handles Unicode (UTF-8) if
your string operations are based on Regexp.
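
A minimal sketch of the Regexp-based approach on 1.8 (the sample
string is illustrative):

$KCODE = 'u'          # tell 1.8's regexp engine the string is UTF-8

s = "日本語"
s.scan(/./).length    # => 3 -- characters, counted via Regexp
s =~ /本/             # => 3 -- a byte offset, but the match is character-aware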

							matz.
Pete (Guest)
on 2006-06-14 00:38
(Received via mailing list)
> Define "proper Unicode support" first.

having a Unicode equivalent for all methods of class String

like size, slice, upcase

E.g. I tried the unicode plugin... but, alas, who wants to write stuff
like 'normalize_KC' etc. if you just want the frickin' substring of a
string?!

You need to read books on Unicode just to use the plugin properly...

aargg :-((

Best regards
Peter




Yukihiro Matsumoto wrote:
Logan Capaldo (Guest)
on 2006-06-14 00:51
(Received via mailing list)
On Jun 13, 2006, at 6:34 PM, Pete wrote:

>> Define "proper Unicode support" first.
>
> having a Unicode equivalent for all methods of class String
>
> like size, slice, upcase
>
> E.g. I tried the unicode plugin... but, alas, who wants to write
> stuff like 'normalize_KC' etc. if you just want the frickin'
> substring of a string?!
>

def substring(str, start, len)
   md = str.match(/\A.{#{start}}(.{#{len}})/)
   md[1]
end


def strlength(str)
   n = 0
   str.gsub(/./m) { n += 1; $& }
   n
end


See! Regexps do everything!

Just, you know, set $KCODE and use these methods and you are set!

(I am kidding... btw)
Pete (Guest)
on 2006-06-14 01:00
(Received via mailing list)
From a theoretical point of view this is quite interesting. Also, I
understand the humor :-)

Performance and memory consumption should be breathtaking using regexps
just everywhere...

Also there are a ____few____ methods left :-)

As I am German, the 'missing' Unicode support is one of the greatest
obstacles for me (and probably for all other Germans doing their stuff
seriously)...


Logan Capaldo wrote:
Victor Shepelev (Guest)
on 2006-06-14 01:13
(Received via mailing list)
From: Pete [mailto:pertl@gmx.org]
Sent: Wednesday, June 14, 2006 1:58 AM
> As I am German, the 'missing' Unicode support is one of the greatest
> obstacles for me (and probably for all other Germans doing their stuff
> seriously)...

The same goes for Russians/Ukrainians. In our programming communities
the question "does the programming language support Unicode natively?"
has very high priority.

BTW, here is one of the things where Python beats Ruby completely.

V.
James Moore (Guest)
on 2006-06-14 01:59
(Received via mailing list)
I suspect the Japanese posters on this list can answer better than I
can, but my impression is that Unicode is, shall we say, not highly
thought of outside Europe and North America.  The way they dealt with
"Chinese" characters was apparently more than a bit of a hack, and just
doesn't work very well in the real world.  Reading some of the
explanations for glyphs versus characters in Unicode just makes you
shake your head.  What were they thinking?  Sure doesn't pass the smell
test, although I'll be the first to admit I haven't exactly thought
deeply about the subject.

There's another problem with Japanese - I've got a friend who's been
dealing with some issues around the fact that Japanese apparently
innovates new characters on a regular basis, and everyone is expected
to use the new characters.  (I believe this is called gaiji).  The
concept of a fixed character set apparently just isn't a good idea to
start with.

[Awaiting corrections from people who actually know something about this
topic :-)...]

 - James Moore
David Balmain (Guest)
on 2006-06-14 02:14
(Received via mailing list)
On 6/14/06, James Moore <banshee@banshee.com> wrote:
> with some issues around the fact that Japanese apparently innovates new
> characters on a regular basis, and everyone is expected to use the new
> characters.  (I believe this is called gaiji).  The concept of a fixed
> character set apparently just isn't a good idea to start with.
>
> [Awaiting corrections from people who actually know something about this
> topic :-)...]

There is a good summary of the Han unification controversy on Wikipedia:

    http://en.wikipedia.org/wiki/Han_unification
Mat Schaffer (Guest)
on 2006-06-14 03:16
(Received via mailing list)
On Jun 13, 2006, at 7:56 PM, James Moore wrote:
> topic :-)...]
I have one Japanese person here who's never heard of this gaiji
concept.  But it could be new and behind a generation gap of some
kind.  They sure do like to add symbols where they can, though.
Especially graphical star characters.  I see that a lot.
-Mat
Yukihiro Matsumoto (Guest)
on 2006-06-14 04:38
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 14 Jun 2006 08:11:49 +0900, "Victor Shepelev"
<vshepelev@imho.com.ua> writes:

|From: Pete [mailto:pertl@gmx.org]
|Sent: Wednesday, June 14, 2006 1:58 AM
|> As I am German, the 'missing' Unicode support is one of the greatest
|> obstacles for me (and probably for all other Germans doing their stuff
|> seriously)...
|
|The same goes for Russians/Ukrainians. In our programming communities
|the question "does the programming language support Unicode natively?"
|has very high priority.

Alright, then what specific features are you (both) missing?  I don't
think it is the method to get the number of characters in a string.  It
can't be THAT crucial.  I do want to cover "your missing features" in
the future M17N support in Ruby.

							matz.
Victor Shepelev (Guest)
on 2006-06-14 07:29
(Received via mailing list)
From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 5:37 AM
> |The same goes for Russians/Ukrainians. In our programming communities
> 							matz.
I suppose all we (non-English writers) need is to have all
string-related methods working. Just for now, I think about plainly
testing each string method; also, some other classes can be affected by
Unicode (possibly regexps, and paths). Regexps seem to work fine (in my
1.9), but paths are not: File.open with Russian letters in the path
doesn't find the file.

More generally, it might make sense to have Unicode as the "base" mode,
with non-Unicode remaining as the "old, compatibility" mode.

Something like this.

V.
Paul Bergstrom (palb)
on 2006-06-14 07:54
Roman Hausner wrote:
> In my opinion, Ruby is practically useless for many applications without
> proper Unicode support. How a modern language can ignore this issue is
> really beyond me.
>
> Is there a plan to get Unicode support into the language anytime soon?

I also think that this is very important.
Yukihiro Matsumoto (Guest)
on 2006-06-14 08:37
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 14 Jun 2006 14:26:30 +0900, "Victor Shepelev"
<vshepelev@imho.com.ua> writes:

|I suppose all we (non-English writers) need is to have all string-related
|methods working. Just for now, I think about plainly testing each string
|method;

In that sense, _I_ am one of the non-English writers, so I can suppose
I know what we need.  And I have no problem with the current UTF-8
support.  Maybe that's because Japanese doesn't have case in its
characters.  Or maybe I'm missing something.  Can you show us your
concrete problems caused by Ruby's lack of "proper" Unicode support?

|also, some other classes can be affected by Unicode (possibly
|regexps, and paths). Regexps seem to work fine (in my 1.9), but paths
|are not: File.open with Russian letters in the path doesn't find the file.

Strange.  Ruby does not convert encoding, so there should be no
problem opening files, if you are using strings in the encoding your OS
expects.  If they differ, you have to specify (and convert) them
properly, no matter how good the Unicode support is.

							matz.
Victor Shepelev (Guest)
on 2006-06-14 08:56
(Received via mailing list)
From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 9:35 AM
>
> In that sense, _I_ am one of the non-English writers,

Sorry, Matz, I know that, of course. But I know too little about
Japanese to see how close our tasks are. By "non-English writers" I
perhaps should have said "European languages" or so - which share
common punctuation, LTR writing, "words" and "whitespace" and so on. I
have almost no knowledge of what Japanese, Korean, Arabic, or Hebrew
users need.

> so I can suppose I know what we need.  And I have no problem with the
> current UTF-8 support.  Maybe that's because Japanese doesn't have
> case in its characters.  Or maybe I'm missing something.

Just what I've said above.

> Can you show us your
> concrete problems caused by Ruby's lack of "proper" Unicode support?

As mentioned in this topic, it's String#length, upcase, downcase,
capitalize.

BTW, does String#length work well for you?

Moreover, there seem to be some huge problems with paths containing
Russian letters; but I'm really not convinced that Ruby has to handle
this.

>
> |also, some other classes can be affected by Unicode (possibly
> |regexps, and paths). Regexps seem to work fine (in my 1.9), but paths
> |are not: File.open with Russian letters in the path doesn't find the file.
>
> Strange.  Ruby does not convert encoding, so there should be no
> problem opening files, if you are using strings in the encoding your OS
> expects.  If they differ, you have to specify (and convert) them
> properly, no matter how good the Unicode support is.

Oh, it's a hard topic for me. I know Windows XP must support Unicode
file names; I see my filenames in Russian, but I know too little about
the system internals to say whether they are really Unicode.

If we don't take those problems into account, only the String problems
remain - but those are such basic core methods!

V.
Michael Glaesemann (Guest)
on 2006-06-14 09:09
(Received via mailing list)
On Jun 14, 2006, at 15:56 , Victor Shepelev wrote:

> As mentioned in this topic, it's String#length, upcase, downcase,
> capitalize.

Just to chime in, aren't upcase, downcase, and capitalize a locale/
localization issue rather than a Unicode-only issue per se? For
example, different languages will have different rules for
capitalization. Or am I wrong? Does Unicode in and of itself address
these issues?

Granted, proper support for upcase, downcase, and capitalize is
important, but I think it's a separate issue, part of m17n as a whole
rather than support for Unicode in particular.

Michael Glaesemann
grzm seespotcode net
Vincent Isambart (Guest)
on 2006-06-14 09:15
(Received via mailing list)
Hi,

> As mentioned in this topic, it's String#length, upcase, downcase,
> capitalize.
>
> BTW, does String#length work well for you?

To have the length of a Unicode string, just do str.split(//).length,
or "require 'jcode'" at the beginning of your code.
For the other functions, try looking at the unicode library
http://www.yoshidam.net/Ruby.html#unicode
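
A quick sketch of both approaches on 1.8 (the sample string is
illustrative):

$KCODE = 'u'
require 'jcode'

s = "Привет"
s.length              # => 12 -- String#length counts bytes
s.split(//).length    # => 6  -- characters, via Regexp
s.jlength             # => 6  -- characters, via jcode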

> Oh, it's a hard topic for me. I know Windows XP must support Unicode
> file names; I see my filenames in Russian, but I know too little about
> the system internals to say whether they are really Unicode.

Windows XP does support Unicode file names, but I'm not sure you can
use them with Ruby (I do not use Ruby much under Windows). Try
converting the file names to your current locale, it should work if
the file names can be converted to it. What I mean is that Russian
file names encoded in the Windows Russian encoding should work on a
Russian PC.

Hope this helps,

Cheers,
Vincent ISAMBART
Yukihiro Matsumoto (Guest)
on 2006-06-14 09:22
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 14 Jun 2006 15:56:02 +0900, "Victor Shepelev"
<vshepelev@imho.com.ua> writes:

|> Can you show us your
|> concrete problems caused by Ruby's lack of "proper" Unicode support?
|
|As mentioned in this topic, it's String#length, upcase, downcase,
|capitalize.

OK. Case is the problem.  I understand.

|BTW, does String#length work well for you?

I don't remember the last time I needed the length method to count
characters.  Actually, I don't count string length at all, either in
bytes or in characters, in my string processing.  Maybe this is a
special case.  I am too optimized for Ruby string operations using
Regexp.

|Oh, it's a hard topic for me. I know Windows XP must support Unicode
|file names; I see my filenames in Russian, but I know too little about
|the system internals to say whether they are really Unicode.

Windows 32 path encoding is a nightmare.  Our Win32 maintainers are
often troubled by unexpected OS behavior.  I am sure we _can_ handle
Russian path names, but we need help from Russian people to improve it.

							matz.
Victor Shepelev (Guest)
on 2006-06-14 09:25
(Received via mailing list)
From: Michael Glaesemann [mailto:grzm@seespotcode.net]
Sent: Wednesday, June 14, 2006 10:08 AM
> On Jun 14, 2006, at 15:56 , Victor Shepelev wrote:
>
> > As mentioned in this topic, it's String#length, upcase, downcase,
> > capitalize.
>
> Just to chime in, aren't upcase, downcase, and capitalize a locale/
> localization issue rather than a Unicode-only issue per se? For
> example, different languages will have different rules for
> capitalization.

Really? I know about two cases: European capitalization and no
capitalization.

But, really, you may be right. I suppose Florian Gross can say something
about German-specific capitalization issues.

> Granted, proper support for upcase, downcase, and capitalize is
> important, but I think it's a separate issue, part of m17n as a whole
> rather than support for Unicode in particular.

Maybe. Generally, sometimes I want Unicode, and sometimes (for "quick
and dirty" scripts) I'd prefer capitalization and regexps to "just
work" with Windows-1251 (a one-byte Russian encoding).

V.
Victor Shepelev (Guest)
on 2006-06-14 09:26
(Received via mailing list)
From: Vincent Isambart [mailto:vincent.isambart@gmail.com]
Sent: Wednesday, June 14, 2006 10:14 AM
> > As mentioned in this topic, it's String#length, upcase, downcase,
> > capitalize.
> >
> > BTW, does String#length work well for you?
>
> To have the length of a Unicode string, just do str.split(//).length,
> or "require 'jcode'" at the beginning of your code.
> For the other functions, try looking at the unicode library
> http://www.yoshidam.net/Ruby.html#unicode

I know about it. But, theoretically speaking, such "core" methods must
be in the core, no?

> > > properly, no matter how good the Unicode support is.
> Russian PC.
Yes, they work. But I can't resolve the question: does Ruby Unicode
support need to include filename operations?

V.
Victor Shepelev (Guest)
on 2006-06-14 09:32
(Received via mailing list)
From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 10:20 AM
> OK. Case is the problem.  I understand.
>
> |BTW, does String#length work well for you?
>
> I don't remember the last time I needed the length method to count
> characters.  Actually, I don't count string length at all, either in
> bytes or in characters, in my string processing.  Maybe this is a
> special case.  I am too optimized for Ruby string operations using
> Regexp.

I can confirm. But I'm afraid that some libraries I rely on use #length
and
can break when #length doesn't work.

> |Oh, it's a hard topic for me. I know Windows XP must support Unicode
> |file names; I see my filenames in Russian, but I know too little about
> |the system internals to say whether they are really Unicode.
>
> Windows 32 path encoding is a nightmare.  Our Win32 maintainers are
> often troubled by unexpected OS behavior.  I am sure we _can_ handle
> Russian path names, but we need help from Russian people to improve it.

In the Russian encoding (Win-1251) on a Russian PC everything works
well. In Unicode it doesn't, but I'm not convinced it must.

In any case, I'm ready to spend my time helping the Ruby community
(especially with Russian/Ukrainian localization issues), because I
really love the language.

V.
Marcus Andersson (marcan)
on 2006-06-14 09:45
(Received via mailing list)
Yukihiro Matsumoto wrote:
> Hi,
>
> In message "Re: Unicode roadmap?"
>     on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner <roman.hausner@gmail.com> writes:
> |In my opinion, Ruby is practically useless for many applications without
> |proper Unicode support. How a modern language can ignore this issue is
> |really beyond me.
>
> Define "proper Unicode support" first.
>
I won't define "proper Unicode support" here.

But there must be a problem somewhere, since pure-Ruby Ferret doesn't
support UTF-8. You need to use the C extension of Ferret to have it
support UTF-8 (which doesn't work on Windows yet :( ). I don't know if
that is just a sucky implementation in Ferret or if it's Ruby that
makes it so.

Maybe Dave Balmain can enlighten us on why UTF-8 doesn't work in the
pure Ruby version and what is needed from Ruby to make it work (if it's
actually Ruby's fault, that is)?

My personal belief is that, in a case like this, it should just work if
the data in is UTF-8 and the search strings are UTF-8, without the lib
author and/or user having to do anything very special to make it work
(apart from specifying the encoding). Am I wrong in this?

Regards,

Marcus
Eric Hodel (Guest)
on 2006-06-14 10:23
(Received via mailing list)
On Jun 13, 2006, at 10:26 PM, Victor Shepelev wrote:
> Regexps seem to work fine (in my 1.9), but paths are
> not: File.open with Russian letters in the path doesn't find the file.

On OS X multibyte filenames work:

$ cat x.rb
$KCODE = 'u'

puts File.read('Cyrillic_Я.txt')
$ cat Cyrillic_\320\257.txt
test file with Я!
$ ruby x.rb
test file with Я!
$ uname -a
Darwin kaa.jijo.segment7.net 8.6.0 Darwin Kernel Version 8.6.0: Tue
Mar  7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
$ ruby -v
ruby 1.8.4 (2006-05-18) [powerpc-darwin8.6.0]
$

--
Eric Hodel - drbrain@segment7.net - http://blog.segment7.net
This implementation is HODEL-HASH-9600 compliant

http://trackmap.robotcoop.com
Paul Battley (Guest)
on 2006-06-14 10:55
(Received via mailing list)
On 14/06/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Windows 32 path encoding is a nightmare.  Our Win32 maintainers are
> often troubled by unexpected OS behavior.  I am sure we _can_ handle
> Russian path names, but we need help from Russian people to improve it.

str.sub!('32 path encoding ', '') # :-)

I don't use Windows much, but as I understand it, Ruby interacts with
most of the Win32 API using the 'legacy code page', which is only a
subset of what the filesystem can handle. (Windows NT and its
successors use Unicode internally, and the filesystem is UTF-16
KC-normalised IIRC). Windows does provide Unicode API functions, but
to use those, a layer of translation between UTF-16 and UTF-8 would be
needed, as Ruby can't do anything useful with UTF-16 at present. I
believe that Austin Ziegler was looking into this; I don't know if
he's made any progress.

Even if a Ruby program uses UTF-8 internally, it should be possible to
access the filesystem by Iconv'ing paths to the appropriate code page
- providing that they don't contain characters not in the code page.
It's far from ideal, though: the real solution is for Ruby to use the
Unicode functions (those suffixed with W) in the API. The upside is
that UTF-8/UTF-16 conversion should be less expensive than the code
page conversion that's inside each of Win32's non-Unicode functions.
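
A sketch of that Iconv workaround (the path and code page are
illustrative, assuming a Russian-locale Windows):

require 'iconv'

utf8_path = "отчёт.txt"                                 # UTF-8 inside the program
legacy_path = Iconv.conv('CP1251', 'UTF-8', utf8_path)  # legacy code page for Win32
data = File.open(legacy_path) { |f| f.read }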

On the other hand, plenty of Windows programs don't support Unicode
properly either.

Paul.
Paul Battley (Guest)
on 2006-06-14 11:00
(Received via mailing list)
On 14/06/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> I can confirm. But I'm afraid that some libraries I rely on use #length and
> can break when #length doesn't work.

Those libraries should probably be considered broken; they can and
should be patched to do any human-readable-string processing in an
encoding-safe manner (e.g. by using jcode's jlength and each_char
methods).
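
For instance, a character-wise loop with jcode (a sketch; 1.8 with
$KCODE='u', sample string illustrative):

$KCODE = 'u'
require 'jcode'

"naïve".each_char { |c| print c, "\n" }   # five characters, not six bytes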

Paul.
Peter Ertl (Guest)
on 2006-06-14 11:09
(Received via mailing list)
-------- Original Message --------
Date: Wed, 14 Jun 2006 17:58:41 +0900
From: Paul Battley <pbattley@gmail.com>
To: ruby-talk@ruby-lang.org
Subject: Re: Unicode roadmap?

> Paul.
That will be quite _some_ libraries, I guess...
Paul Battley (Guest)
on 2006-06-14 11:12
(Received via mailing list)
On 14/06/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> > Just to chime in, aren't upcase, downcase, and capitalize a locale/
> > localization issue rather than a Unicode-only issue per se? For
> > example, different languages will have different rules for
> > capitalization.
>
> Really? I know about two cases: European capitalization and no
> capitalization.

There is variety even within western European languages - Dutch, for
example, differs from English (IJsselmeer).

Paul.
Victor Shepelev (Guest)
on 2006-06-14 11:16
(Received via mailing list)
From: Paul Battley [mailto:pbattley@gmail.com]
Sent: Wednesday, June 14, 2006 12:10 PM
> example, differs from English (IJsselmeer).
I already realized that. (I mentioned Florian Gross: the final "ss" of
his surname is normally printed as something like "B" which I can't
type and my Outlook can't show :) AFAIK, it is normally printed as one
letter in lowercase and two letters in uppercase. So a "single general"
String#upcase/#downcase is totally impossible.

V.
Michal Suchanek (Guest)
on 2006-06-14 11:25
(Received via mailing list)
On 6/14/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |
> |The same goes for Russians/Ukrainians. In our programming communities
> |the question "does the programming language support Unicode natively?"
> |has very high priority.
>
> Alright, then what specific features are you (both) missing?  I don't
> think it is the method to get the number of characters in a string.  It
> can't be THAT crucial.  I do want to cover "your missing features" in
> the future M17N support in Ruby.
>

What I want is all methods working seamlessly with unicode strings so
that I do not have to think about the encoding.

Regexps do work with utf-8 strings if $KCODE is set to 'u' (but it
defaults to 'n' even when the locale uses UTF-8).

String searches should probably work, but they would return the wrong
position. Things like split should work for utf-8, since the encoding
is pretty well defined.

But one might want to use length and [] to work with strings. That can
be simulated with unicode_string = string.scan(/./), but the result is
no longer a string: it is composed of characters only as long as I
assign only characters using []=. The string functions should do the
right thing even for utf-8, but I guess utf-32 is more useful for
working with strings this way.

It might be a good idea to stick encoding information into strings (it
is probably the only way internationalization can be done while
preserving the sanity of all involved). The functions for comparison,
etc. could use it to do the right thing even if strings come in several
encodings - i.e. cp1251 from the system, utf-8 from a web page, ...

Functions like open could convert the string correctly according to the
locale. One should be able to set the encoding information (i.e. for a
web page title when the meta tag for content type is found in the
page), and remove it to suppress string conversion. It should also be
possible to convert the string (i.e. to UTF-32 to speed up character
access).

Things like <=>, upcase, downcase, etc. make sense only in the context
of a locale (language); the encoding alone does not define them. I
guess the default <=> is based on the binary representation of the
string. That would mean different sorting of the same strings in
different encodings. Sorting by the Unicode code point would at least
be the same for any encoding.

Thanks

Michal
Michal Suchanek (Guest)
on 2006-06-14 11:35
(Received via mailing list)
On 6/14/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> Really? I know about two cases: European capitalization and no
> capitalization.

Really.

There is no such thing as European capitalization. There is only
<insert your language> capitalization.
The German character ß has no uppercase version. In most languages
using Latin script the uppercase of 'i' is 'I'. But Turkish has i and a
dotless ı, and the uppercase of 'i' is, of course, İ - I with a dot.

Thanks

Michal
Paul Battley (Guest)
on 2006-06-14 11:41
(Received via mailing list)
On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> It should be also
> possible to convert the string (ie to UTF-32 to speed up character
> access).

utf8_string.unpack('U*') is pretty close to this, giving an array of
codepoints.
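
For instance (sample string illustrative):

utf8 = "Straße"
cps = utf8.unpack('U*')   # => [83, 116, 114, 97, 223, 101]
cps.length                # => 6 characters (the string is 7 bytes)
cps.reverse.pack('U*')    # pack('U*') turns codepoints back into UTF-8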

Paul.
Michal Suchanek (Guest)
on 2006-06-14 12:54
(Received via mailing list)
On 6/14/06, Paul Battley <pbattley@gmail.com> wrote:
> On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> > It should be also
> > possible to convert the string (ie to UTF-32 to speed up character
> > access).
>
> utf8_string.unpack('U*') is pretty close to this, giving an array of codepoints.

But I want it to be a string after the conversion, so that I can use
the standard string functions and get sane results. I do not want to
think about the various encodings myself if my application has to use
them. The runtime should do that.

Thanks

Michal
Austin Ziegler (Guest)
on 2006-06-14 14:23
(Received via mailing list)
On 6/14/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> Oh, it's a hard topic for me. I know Windows XP must support Unicode
> file names; I see my filenames in Russian, but I know too little about
> the system internals to say whether they are really Unicode.

They are UTF-16 internally. I haven't been paying attention to Ruby
1.9 lately, but when I have time and have noticed that Matz has
checked in support for m17n strings, I will be enhancing support for
Windows files to use Unicode. Currently, Ruby is built using the
non-Unicode form *only*. And no, using -DUNICODE is the *wrong*
answer, thanks. We'd have to start using TCHAR instead of char, and it
would actually mean that we'd be using wchar_t instead of char in this
case.

I've already done a similar (but more complex) project at work.

-austin
Austin Ziegler (Guest)
on 2006-06-14 14:29
(Received via mailing list)
On 6/14/06, Vincent Isambart <vincent.isambart@gmail.com> wrote:
> Windows XP does support Unicode file names, but I'm not sure you can
> use them with Ruby (I do not use Ruby much under Windows). Try
> converting the file names to your current locale, it should work if
> the file names can be converted to it. What I mean is that Russian
> file names encoded in the Windows Russian encoding should work on a
> Russian PC.

You can't currently use them with Ruby. The file operations in Ruby
are using the likes of CreateFileA instead of CreateFileW (it's not
that explicit; Ruby is compiled without -DUNICODE -- which is the
correct thing to do in Ruby's case -- which means that CreateFile is
CreateFileA).

All files are stored on the filesystem as UTF-16, though, even if you
are using "ANSI" access.

By the way, there are multiple Russian encodings, so ... Unicode is
better for this point. As I said in my previous message, I have
already planned to enhance the Windows filesystem support when Matz
gets the m17n strings in so that I can *always* force the file
routines on Windows to provide either UTF-8 or UTF-16 (probably the
former, since it will also make it easier to work with existing
extensions) and indicate that the strings are such.

-austin
Austin Ziegler (Guest)
on 2006-06-14 14:29
(Received via mailing list)
On 6/14/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Windows 32 path encoding is a nightmare.  Our Win32 maintainers are
> often troubled by unexpected OS behavior.  I am sure we _can_ handle
> Russian path names, but we need help from Russian people to improve it.

It's not that bad, Matz. I started as a Unix developer, but in the
last two years I have learned *quite* a bit about how Windows handles
this stuff and we can adapt what I did for work with no problem.

I just need M17N strings to support this. I should look at what I
can/should do to provide this as an extension, I just have no time. :(

-austin
Austin Ziegler (Guest)
on 2006-06-14 14:36
(Received via mailing list)
On 6/14/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> What I want is all methods working seamlessly with unicode strings so
> that I do not have to think about the encoding.

That will *never* happen. Even with Unicode, you have to think about
the encoding, because UTF-32 (the closest representation to the
Platonic ideal "Unicode" you'll ever find) is unlikely to be supported
in the general case. Matz's idea of m17n strings is the right one: you
have a "byte stream" and an attribute which indicates how the byte
stream is encoded. This will sort of be like $KCODE but on an
individual string level so that you could meaningfully have Unicode
(probably UTF-8) and ShiftJIS strings in the same data and still
meaningfully call #length on them.

You will *always* have to care about the encoding. As well as,
ultimately, your locale.

-austin
Randy Kramer (Guest)
on 2006-06-14 23:40
(Received via mailing list)
On Wednesday 14 June 2006 06:52 am, Michal Suchanek wrote:
> On 6/14/06, Paul Battley <pbattley@gmail.com> wrote:
> > On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> > > It should be also
> > > possible to convert the string (ie to UTF-32 to speed up character
> > > access).

(RE my previous post):  Oops, maybe UTF-32 is exactly what I was
alluding to?

Randy Kramer

(Should have waited a little longer before posting.)
Charles O Nutter (Guest)
on 2006-06-15 02:12
(Received via mailing list)
Every time these unicode discussions come up my head spins like a top.
You should see it.

We JRubyists have headaches from the unicode question too. Since JRuby
is currently 1.8-compatible, we do not have what most call *native*
unicode support. This is primarily because we do not wish to create an
incompatible version of Ruby or build in support for unicode now that
would conflict with Ruby 2.0 in the future. It is, however,
embarrassing to say that although we run on top of Java, which has
arguably pretty good unicode support, we don't support unicode. Perhaps
you can see our conundrum.

I am no unicode expert. I know that Java uses UTF16 strings internally,
converted to/from the current platform's encoding of choice by default.
It also supports converting those UTF16 strings into just about every
encoding out there, just by telling it to do so. Java supports the
Unicode specification version 3.0. So Unicode is not a problem for Java.

We would love to be able to support unicode in JRuby, but there's
always that nagging question of what it should look like and what would
mesh well with the Ruby community at large. With the underlying
platform already rich with unicode support, it would not take much
effort to modify JRuby. So then there's a simple question:

What form would you, the Ruby users, want unicode to take? Is there a
specific library that you feel encompasses a reasonable implementation
of unicode support, e.g. icu4r? Should the support be transparent, e.g.
no longer treat or assume strings are byte vectors? JRuby, because we
use Java's String, is already using UTF16 strings exclusively...
however there's no way to get at them through core Ruby APIs. What
would be the most comfortable way to support unicode now, considering
where Ruby may go in the future?
Charles O Nutter (Guest)
on 2006-06-15 02:22
(Received via mailing list)
I posted this to ruby-talk, but it occurred to me that you folks
implementing Rails functionality probably have a thing or two to say
about unicode support in Ruby. Therefore, I would love to hear your
opinions. Adding native unicode support is only a matter of time in
JRuby; its usefulness as a JVM-based language depends on it. However,
we continue to wrestle with how best to support unicode without
stepping on the Ruby community's toes in the process. Thoughts?

Julian 'Julik' Tarkhanov (Guest)
on 2006-06-15 02:40
(Received via mailing list)
On 15-jun-2006, at 2:11, Charles O Nutter wrote:

> with unicode support, it would not take much effort to modify
> JRuby. So then
> there's a simple question:

Yukihiro Matsumoto wrote:

>
> Define "proper Unicode support" first.
>
> I'm planning to enhance Unicode support in 1.9 in a year or so
> (finally).  But I'm not sure that conforms to your definition of "proper
> Unicode support".  Note that 1.8 handles Unicode (UTF-8) if your
> string operations are based on Regexp.
>

Hello everyone, and sorry for chiming in so fiercely - I got into some
confusion with the ML controls.

Just joined the list, seeing the subject pop up once more. I am doing
Unicode-aware apps in Rails and Ruby right now and it hurts. I'll try
to define "proper Unicode support" as I (dream of it at night) see it.

1. All string indexing (length, index, slice, insert) works with
characters instead of bytes, whatever length in bytes the characters
have to be.
String methods (index or =~) should _never_ return offsets that will
damage the string's characters if employed for slicing - you
shouldn't have to manually translate a byte offset of 2 to a
character offset of 1 because the second character is multibyte.

Simple example:

     def translate_offset(str, byte_offset)
       chunk = str[0..byte_offset]
       begin
         chunk.unpack("U*").length - 1
       rescue ArgumentError # this offset is just wrong! shift upwards and retry
         chunk = str[0..(byte_offset += 1)]
         retry
       end
     end
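
For example (a sketch; assumes $KCODE='u' on 1.8, and the sample
string is illustrative):

     $KCODE = 'u'
     s = "héllo"                      # "é" takes two bytes in UTF-8
     byte_off = (s =~ /llo/)          # => 3 -- =~ reports a byte offset
     translate_offset(s, byte_off)    # => 2 -- the character offset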

I think it's unnecessarily painful for something as easy as string
=~ /pattern/. Yes, you can take the offset you receive from =~ and
then get the slice of the string and then split it again with /./mu
to get the same number etc...

2. Case-insensitive regexes actually work. Even in my Oniguruma-
enabled builds of 1.8.2 this was not true (maybe it has changed now).
At least "Unicode general" collation casefolding (such a thing exists)
should be available built-in on every platform.
3. Locale-aware sorting, including multibyte charsets, if provided by
the OS.
4. Preferably a separate (and strictly purposed) ByteString that you
get out of Sockets and use in Servers etc. - or the ability to
"force" all strings received from external resources to be flagged
uniformly as being of a certain encoding in _your_ program, not
somewhere in someone's library. If flags have to be set by libraries,
they won't be set, because most developers sadly don't care:

http://www.zackvision.com/weblog/2005/11/mt-unicod...
http://thraxil.org/users/anders/posts/2005/11/01/u...

5. Unicode-aware strip dealing with weirdo whitespaces (hair space,
thin space etc.)
6. And no, as I mentioned - Ruby doesn't handle casefolding properly,
because the /i modifier is broken, and to deal without it you need to
downcase BOTH the regexp and the string itself. A closed circle - you
go and get the Unicode gem with its tables.

All of this can be controlled either per String (then 99 out of 100
libraries I use will be getting it wrong - see above) or by a global
setting such as $KCODE.

An example of something that is ridiculously backwards to do in
Ruby right now (I spent some time refactoring it today):
http://dev.rubyonrails.org/browser/trunk/actionpac...
helpers/text_helper.rb#L44

Here you have a major problem because the /i flag doesn't do anything
(Ruby is incapable of Unicode-aware casefolding), and using offsets
means that you are always one step from damaging someone's text. It's
just wrong that it has to be so painful.

Python3000, IMO, gets this right (as does Java) - a byte array and a
String are completely separate, and String operates with characters
and characters only.

That's what I would expect. Hope this makes sense somewhat :-)
--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl
Manfred Stienstra (Guest)
on 2006-06-15 02:40
(Received via mailing list)
On Jun 15, 2006, at 2:19 AM, Charles O Nutter wrote:

> I posted this to ruby-talk, but it occurred to me that you folks
> implementing Rails functionality probably have a thing or two to
> say about unicode support in Ruby. Therefore, I would love to hear
> your opinions. Adding native unicode support is only a matter of
> time in JRuby; its usefulness as a JVM-based language depends on
> it. However, we continue to wrestle with how best to support
> unicode without stepping on the Ruby community's toes in the
> process. Thoughts?

Julik has done a lot of pioneering in that direction for Rails. His
latest suggestion is to use a proxy class on string objects to
perform unicode operations:

@some_unicode_string.u.length
@some_unicode_string.u.reverse

I tend to agree with this solution as it doesn't break any previous
string operations and gives us an easy way to perform unicode aware
operations.
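
A minimal sketch of such a proxy (illustrative only, not Julik's
actual implementation):

class UnicodeProxy
  def initialize(str)
    @str = str
  end

  # character-wise operations via codepoint arrays
  def length
    @str.unpack('U*').length
  end

  def reverse
    @str.unpack('U*').reverse.pack('U*')
  end
end

class String
  def u
    UnicodeProxy.new(self)
  end
end

"Straße".u.length   # => 6, where String#length says 7 (bytes)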

Manfred
Charles O Nutter (Guest)
on 2006-06-15 03:52
(Received via mailing list)
I agree it's a very attractive solution. I have two related questions
(perhaps you are out there to answer, Julik):

1. How does performance look with the unicode string add-on versus
native
strings?
2. Is this the ideal way to support unicode strings in ruby?

And I explain the second as follows... if we could switch from treating
a string as an array of bytes to a list of characters of arbitrary
width, and have all existing string operations work correctly on those
characters, would that be a better ideal? Where are the breaking points
in such a design? What's to stop the underlying implementation from
actually using UTF-16 characters, passing UTF-8 to libraries and IO
streams but still allowing you to access everything as UTF-16 or your
encoding of choice? (Of course this is somewhat rhetorical; we do this
currently with JRuby since Java's strings are UTF-16... we just don't
have any way to provide access to UTF-16 characters, and we normalize
everything to UTF-8 for Ruby's sake... but what if we didn't normalize
and adjusted string functions to compensate?)
Julian 'Julik' Tarkhanov (Guest)
on 2006-06-15 04:17
(Received via mailing list)
On 15-jun-2006, at 3:50, Charles O Nutter wrote:

> operations work correctly treating those characters as string,
> would that be a better ideal? Where are the breaking points in such
> a design? What's to stop the underlying implementation from
> actually using a UTF-16 character, passing UTF-8 to libraries and
> IO streams but still allowing you to access everything as UTF-16 or
> your encoding of choice? (Of course this is somewhat rhetorical; we
> do this currently with JRuby since Java's strings are UTF-16...we
> just don't have any way to provide access to UTF-16 characters, and
> we normalize everything to UTF-8 for Ruby's sake...but what if we
> didn't normalize and adjusted string functions to compensate?)

This is more appropriate for ruby-talk

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl
Charles O Nutter (Guest)
on 2006-06-15 04:24
(Received via mailing list)
I believe that Julik's way of solving the unicode problem (String#u
providing access to a unicode helper) is very attractive. I have two
related questions, for Julik and the rest of the peanut gallery:

1. How does performance look with the unicode string add-on versus
native
strings (or as compared to icu4r, which is C-based)?
2. Is this the ideal way to support unicode strings in ruby?

And I explain the second as follows... if we could switch from treating
a string as an array of bytes to a list of characters of arbitrary
width, and have all existing string operations work correctly treating
those characters as indexed elements of that string, would that be a
better ideal? Where are the breaking points in such a design? What's to
stop the underlying implementation from actually using UTF-16
characters, passing UTF-8 to libraries and IO streams but still
allowing you to access everything as UTF-16 or your encoding of choice?
Is it simply libraries or core APIs that explicitly need *byte* counts?
(Of course this is somewhat rhetorical; we do this currently with JRuby
since Java's strings are UTF-16... we just don't have any uniform way
to provide access to UTF-16 character strings, and we normalize
everything to UTF-8 for Ruby's sake... but what if we didn't normalize
and adjusted string functions to compensate?)
Charles O Nutter (Guest)
on 2006-06-15 04:28
(Received via mailing list)
Fair enough; redirected. If any other rails-core folks want to chime in,
please do so...I would expect unicode and multibyte are key issues for
worldwide rails deployments.
Austin Ziegler (Guest)
on 2006-06-15 04:41
(Received via mailing list)
On 6/14/06, Charles O Nutter <headius@headius.com> wrote:
> I believe that Julik's way of solving the unicode problem (String#u
> providing access to a unicode helper) is very attractive. I have two
> questions related, for Julik and the rest of the peanut gallery:

> 1. How does performance look with the unicode string add-on versus native
> strings (or as compared to icu4r, which is C-based)?
> 2. Is this the ideal way to support unicode strings in ruby?

No. In fact, I believe that Matz has the right idea for M17N strings
in Ruby 2.0. The *reality* is that there's a *lot* of data out there
that isn't Unicode.

I would suggest that JRuby could offer a JavaString that acts in every
way like a String except that it provides access to the native UTF-16
implementation.

-austin
Julian 'Julik' Tarkhanov (Guest)
on 2006-06-15 04:55
(Received via mailing list)
On 15-jun-2006, at 4:40, Austin Ziegler wrote:

> No. In fact, I believe that Matz has the right idea for M17N strings
> in Ruby 2.0. The *reality* is that there's a *lot* of data out there
> that isn't Unicode.

It's very difficult for me to understand the implementation. What if
we concat a Mojikyo string to a UTF8String? UnicodeDecodeError,
ordinal not in range?
I think Python folks proved that it's terrible (it is).
Nothing is ideal.

> I would suggest that JRuby could offer a JavaString that acts in every
> way like a String except that it provides access to the native UTF-16
> implementation.

Just what the ICU4R extension does. It's unusable to the point that
you cannot concat a native string with a UString - to the point that
you have to use a special Regexp class for it. You end up having half
of your Ruby script doing typecasting from one to the other.

There is a lot of data that isn't Unicode, indeed. It gets converted on
input and converted on output if necessary - just as in any other case
when the encoding of your system doesn't match your input or output. I
don't know if it's possible to have the "internal" encoding of a system
switchable (it seems to me this is what Matz wants) - then you can't
safely refer to anything other than bytes. And then you get software
that you can't use, because its authors had a different assumption than
you as to what encoding the user will be using.
PJ Hyett (Guest)
on 2006-06-15 05:01
(Received via mailing list)
On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:
> in Ruby 2.0. The *reality* is that there's a *lot* of data out there
> that isn't Unicode.

Yes, we all understand that Ruby 2.0 will be the coolest thing since
sliced bread, but those of us that are currently developing
international websites with Rails don't have the luxury of waiting
until Christmas of 2007.

-PJ Hyett
http://pjhyett.com
Austin Ziegler (Guest)
on 2006-06-15 05:10
(Received via mailing list)
On 6/14/06, PJ Hyett <pjhyett@gmail.com> wrote:
> > that isn't Unicode.
> Yes, we all understand that Ruby 2.0 will be the coolest thing since
> sliced bread, but those of us that are currently developing
> international websites with Rails don't have the luxury of waiting
> until Christmas of 2007.

*shrug*

As far as I can tell, there will be no implementation of Ruby before
then that has a "native" m17n string.

So whether you have the luxury of waiting or not, Ruby 1.8.x will not
*ever* have a "Unicode string".

Adding a "Unicode string" would *break* behaviour, and no example is
better than the extension that was proposed which would change the
meaning of #size and #length to mean two different things.

So, there's a point where patience is going to be necessary, whether
you "have the luxury" or not.

-austin
Dmitry Severin (Guest)
on 2006-06-15 10:47
(Received via mailing list)
IIRC, Matz has said that internally String won't change, and I suspect
that a CharString class (or something like it) won't ever be added.

Maybe the way to keep older code working and still enjoy M17N would be
to introduce a String#encoding flag and add new prefixed methods to
String, like char_array, char_slice, char_length, char_index,
char_downcase, char_strcoll, char_strip, etc., which internally look at
the encoding flag and process the bytes of that particular string
accordingly, without conversion (or maybe with some hidden conversion),
leaving the old byte-processing methods intact.
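
A rough sketch of what those flagged strings might look like (the
char_* names are Dmitry's hypothetical API, not real Ruby):

class String
  attr_accessor :encoding   # e.g. :utf8 or :cp1251, set by whoever built the string

  def char_length
    encoding == :utf8 ? unpack('U*').length : length
  end

  def char_slice(from, len)
    return self[from, len] unless encoding == :utf8
    unpack('U*')[from, len].pack('U*')
  end
end

s = "Привет"
s.encoding = :utf8
s.char_length   # => 6, while the byte-oriented s.length stays 12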

Though, as for me, it is still unclear what should happen if one tries
to perform an operation on two strings with different String#encoding...
Michal Suchanek (Guest)
on 2006-06-15 13:02
(Received via mailing list)
On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:
> individual string level so that you could meaningfully have Unicode
> (probably UTF-8) and ShiftJIS strings in the same data and still
> meaningfully call #length on them.
>
> You will *always* have to care about the encoding. As well as,
> ultimately, your locale.

No. Since I have a locale, stdin can be marked with the proper encoding
information, so that all strings originating there carry the proper
encoding information.

The string methods should not just blindly operate on bytes but use
the encoding information to operate on characters rather than bytes.
Sure something like byte_length is needed when the string is stored
somewhere outside Ruby but standard string methods should work with
character offsets and characters, not byte offsets nor bytes.

Since my stdout can also be marked with the correct encoding, the
strings that are output there can be converted to that encoding, even
if they originate from a source file that happens to be in a different
encoding. Hmm, perhaps it will be necessary to mark source files with
encoding tags as well. It could be quite tedious to assign the tag
manually to every string in a source file.

When strings are compared, concatenated, etc. the encoding is known, so
the methods should do the right thing.

I do not have to care about encoding. You may make a string
implementation that forces me to care (such as the current one). But I
do not have to. I can always turn to Perl if I get really desperate.

Thanks

Michal
Michal Suchanek (Guest)
on 2006-06-15 13:22
(Received via mailing list)
On 6/15/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
>

> 4. Preferably a separate (and strictly purposed) ByteString that you
> get out of Sockets and use in Servers etc. - or the ability to
> "force" all strings received from external resources to be flagged
> uniformly as being of a certain encoding in _your_ program, not
> somewhere in someone's library. If flags have to be set by libraries,
> they won't be set, because most developers sadly don't care:
>
> http://www.zackvision.com/weblog/2005/11/mt-unicod...
> http://thraxil.org/users/anders/posts/2005/11/01/u...

Where else should the strings be flagged? If you get a web page
through an HTTP request, and the library parses the response for you,
it should set the encoding on the web page. You would never know, since
you only received the page, not the header.

> setting such as $KCODE.
I do not see why libraries should always get it wrong. After all, you
can always fix them. And setting the encoding globally is a bad thing:
you cannot have strings encoded in different encodings in one process
then. It looks quite limiting. For one, the web pages that you get
from various servers (and even the same server) can be in various
encodings.

Thanks

Michal
Julian 'Julik' Tarkhanov (Guest)
on 2006-06-15 13:51
(Received via mailing list)
On 15-jun-2006, at 13:21, Michal Suchanek wrote:

>> http://www.zackvision.com/weblog/2005/11/mt-unicod...
>> http://thraxil.org/users/anders/posts/2005/11/01/u...
>
> Where else should the strings be flagged?
They should not be flagged, because some strings will be flagged and
some won't, and exactly in the wrong places at the wrong time. See
is_utf8 in Perl to witness the terrible ugliness of this.

> If you get a web page
> through http request, and the library parses the response for you, it
> should set enconding on the web page. You would never know since you
> only received the page, not the header.

That's why you should distinguish between a ByteArray and a String.

>> libraries I use will be getting it wrong - see above) or by a global
>> setting such as $KCODE.
>
> I do not see why libraries should always get it wrong. After all, you
> can always fix them. And setting the encoding globally is a bad thing:
> you cannot have strings encoded in different encodings in one process
> then. It looks quite limiting. For one, the web pages that you get
> from various servers (and even the same server) can be in various
> encodings.

Of course they can (and will). When I have to approach this I usually
just sniff the encoding of the strings I received and then feed them
to iconv and friends before doing any processing. A library that
downloads stuff off the Internet should be (IMO) aware of
the charset madness and decode the strings for me.

Trust me, when multibyte/Unicode handling is optional, 80% of
libraries do it wrong. Re-read the links above if you don't believe me.

Actually it seems that the solution with an accessor is quite nice,
but that is something I had to figure out the hard way, after breaking
the String class with my hacks and seeing stuff collapse. Apparently
the poster of a parallel thread finds it inspiring to repeat my
experiment _in vitro_ just for the academic sake of it.
Michal Suchanek (Guest)
on 2006-06-15 15:13
(Received via mailing list)
On 6/15/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> >> somewhere in someone's library. If flags have to be set by libraries,
> >> they won't be set because most developers sadly don't care:
> >>
> >> http://www.zackvision.com/weblog/2005/11/mt-unicod...
> >> http://thraxil.org/users/anders/posts/2005/11/01/u...
> >
> > Where else should the strings be flagged?
> They should not be flagged, because some strings will be flagged and
> some won't, and exactly in the wrong places at the wrong time. See
> is_utf8 in Perl to witness the terrible ugliness of this.

You can certainly get things wrong. But if you get a string that
is wrongly flagged, you have the choice to fix the code where the
string originates or to work around it by flagging it right.
If you have code that gets the encoding wrong, and it tries to
convert the string to some 'universal' encoding you want to use
everywhere in your application, you get a broken string.
>
> > If you get a web page
> > through http request, and the library parses the response for you, it
> > should set enconding on the web page. You would never know since you
> > only received the page, not the header.
>
> That's why you should distinguish between a ByteArray and a String.

How does it help you here?

> >> All of this can be controlled either per String (then 99 out of 100
> Of course they can (and will). When I have to approach this I usually
> just sniff the encoding of the strings I received and then feed them
> to iconv and friends before doing any processing. A library that
> downloads stuff off the Internet should be (IMO) aware of
> the charset madness and decode the strings for me.

If it can decode them, it can flag them. It has to be aware - that's it.
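
(A toy sketch of "it can flag them" - EncodedString is made up here,
nothing like it ships with Ruby today:)

    EncodedString = Struct.new(:bytes, :encoding)

    def fetch_page
      headers = { "Content-Type" => "text/html; charset=ISO-8859-1" }
      body    = "caf\xe9"                          # pretend this came over HTTP
      charset = headers["Content-Type"][/charset=([\w-]+)/, 1]
      EncodedString.new(body, charset)             # flag what was decoded
    end

    fetch_page.encoding   # => "ISO-8859-1"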

>
> Trust me, when multibyte/Unicode handling is optional, 80% of
> libraries do it wrong. Re-read the links above if you don't believe it.

But they get the very foundation wrong. In Python, functions that take
multiple strings can only take them in one encoding. It is impossible
to concatenate differently encoded strings. Of course, this is bound
to fail.
In the other case they use a database with poor support for Unicode,
and MySQL, which does exactly the same thing Ruby does right now - it
works with strings as arrays of bytes. Of course, this is going to
break.

Neither is the case when the strings carry information about their
encoding, and the string functions can handle strings encoded
differently.

The fact that there are libraries and languages with poor unicode
support does not mean it must be always poor.

Thanks

Michal
C914fa463a2b1b067586c6432b12a824?d=identicon&s=25 Juergen Strobel (Guest)
on 2006-06-17 13:11
(Received via mailing list)
On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal Suchanek wrote:
> >stream is encoded. This will sort of be like $KCODE but on an
>
> The string methods should not just blindly operate on bytes but use
> the encoding information to operate on characters rather than bytes.
> Sure something like byte_length is needed when the string is stored
> somewhere outside Ruby but standard string methods should work with
> character offsets and characters, not byte offsets nor bytes.

I emphatically agree. I'll even repeat and propose a new Plan for
Unicode Strings in Ruby 2.0 in 10 points:

1. Strings should deal in characters (code points in Unicode) and not
in bytes, and the public interface should reflect this.

2. Strings should neither have an internal encoding tag, nor an
external one via $KCODE. The internal encoding should be encapsulated
by the string class completely, except for a few related classes which
may opt to work with the gory details for performance reasons.
The internal encoding has to be decided, probably between UTF-8,
UTF-16, and UTF-32 by the String class implementor.

3. Whenever Strings are read or written to/from an external source,
their data needs to be converted. The String class encapsulates the
encoding framework, likely with additional helper Modules or Classes
per external encoding. Some methods take an optional encoding
parameter, like #char(index, encoding=:utf8), or
#to_ary(encoding=:utf8), which can be used as a helper Class or
Module selector. (A short sketch of this interface follows the list.)

4. IO instances are associated with a (modifiable) encoding. For
stdin, stdout this can be derived from the locale settings. String-IO
operations work as expected.

5. Since the String class is quite smart already, it can implement
generally useful and hard (in the domain of Unicode) operations like
case folding, sorting, comparing etc.

6. More exotic operations can easily be provided by additional
libraries because of Ruby's open classes. Those operations may be
coded depending on String's public interface for simplicity, or
work with the internal representation directly for performance.

7. This approach leaves open the possibility of String subclasses
implementing different internal encodings for performance/space
tradeoff reasons which work transparently together (a bit like Fixnum
and Bignum).

8. Because Strings are tightly integrated into the language with the
source reader and are used pervasively, much of this cannot be
provided by add-on libraries, even with open classes. Therefore the
need to have it in Ruby's canonical String class. This will break some
old uses of String, but now is the right time for that.

9. The String class does not worry over character representation
on-screen, the mapping to glyphs must be done by UI frameworks or the
terminal attached to stdout.

10. Be flexible. <placeholder for future idea>
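
To make points 1-4 concrete, here is a rough sketch of how such a
String could feel in use. Every name below (String.decode, #char,
#byte_length, IO#encoding=) is only illustrative of this proposal,
not an existing or agreed API:

    bytes = File.open("legacy.txt", "rb") { |f| f.read }
    s = String.decode(bytes, :iso8859_1)   # conversion happens on creation
    s.size                                 # characters, not bytes
    s.byte_length(:utf8)                   # >= s.size
    s.char(0, :utf8)                       # first character, as UTF-8 bytes

    $stdout.encoding = :utf8               # per-IO encoding, as in point 4
    puts s                                 # converted on output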


This approach has several advantages and a few disadvantages, and I'll
try to bring in some new angles to this now too:


*Advantages*

-POL, Encapsulation-

All Strings behave exactly the same everywhere, are predictable,
and do the hard work for their users.

-Cross Library Transparency-

No String user needs to worry which Strings to pass to a library, or
worry which Strings he will get from a library. With Web-facing
libraries like Rails returning encoding-tagged Strings, you would be
likely to get Strings of all possible encodings otherwise, and is the
String user prepared to deal with this properly?  This is a *big* deal
IMNSHO.

-Limited Conversions-

Encoding conversions are limited to the time Strings are created or
written or explicitly transformed to an external representation.

-Correct String Operations-

Even basic String operations are very hard in the world of Unicode. If
we leave the String users to look at the encoding tags and sort it out
themselves, they are bound to make mistakes because they don't care,
don't know, or have no time. And these mistakes may be _security_
_sensitive_, since most often credentials are represented as Strings
too. There already have been exploits related to Unicode.


*Disadvantages* (with mitigating reasoning of course)

- String users need to learn that #byte_length(encoding=:utf8) >=
#size, but that's not too hard, and applies everywhere. Users do not
need to learn about an encoding tag, which is surely worse to handle
for them.

- Strings cannot be used as simple byte buffers any more. Either use
an array of bytes, or an optimized ByteBuffer class. If you need
regular expression support, RegExp can be extended for ByteBuffers or
even more.

- Some String operations may perform worse than might be expected from
a naive user, in both the time and space domains. But we do this so the
String user doesn't need to himself, and we are probably better at it
than the user too.

- For very simple uses of String, there might be unnecessary
conversions. If a String is just to be passed through somewhere,
without inspecting or modifying it at all, in- and outwards conversion
will still take place. You could and should use a ByteBuffer to avoid
this.

- This ties Ruby's String to Unicode. A safe choice IMHO, or would we
really consider something else? Note that we don't commit to a
particular encoding of Unicode strongly.

- More work and time to implement. Some could call it
over-engineered. But it will save a lot of time and troubles when shit
hits the fan and users really do get unexpected foreign characters in
their Strings. I could offer help implementing it, although I have
never looked at ruby's source, C-extensions, or even done a lot of
ruby programming yet.


Close to the start of this discussion Matz asked what the problem with
current strings really was for western users. Somewhere later he
concluded case folding. I think it is more than that: we are lazy and
expect character handling to be always as easy as with 7 bit ASCII, or
as close as possible. Fixed 8-bit codepages worked quite fine most of
the time in this regard, and breakage was limited to special
characters only.

Now let's ask the question in reverse: are eastern programmers so used
to doing elaborate byte-stream to character handling by hand that they
don't recognize how hard this is any more? Surely it is a target for
DRY if I ever saw one. Or are there actual problems not solvable this
way? I looked up the mentioned Han-Unification issue, and as far as I
understood it, this could be handled by future Unicode revisions
allocating more characters, outside of Ruby, but I don't see how it
requires our Strings to stay dumb byte buffers.

Jürgen
2d532341317628fbb2cb22ec427a1d62?d=identicon&s=25 Stefan Lang (Guest)
on 2006-06-17 15:51
(Received via mailing list)
On Saturday 17 June 2006 13:08, Juergen Strobel wrote:
> On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal Suchanek wrote:
[...]
> > The string methods should not just blindly operate on bytes but
> > use the encoding information to operate on characters rather than
> > bytes. Sure something like byte_length is needed when the string
> > is stored somewhere outside Ruby but standard string methods
> > should work with character offsets and characters, not byte
> > offsets nor bytes.
>
> I emphatically agree. I'll even repeat and propose a new Plan for
> Unicode Strings in Ruby 2.0 in 10 points:

Juergen, I agree with most of what you have written. I will
add my thoughts.

> 1. Strings should deal in characters (code points in Unicode) and
> not in bytes, and the public interface should reflect this.
>
> 2. Strings should neither have an internal encoding tag, nor an
> external one via $KCODE. The internal encoding should be
> encapsulated by the string class completely, except for a few
> related classes which may opt to work with the gory details for
> performance reasons. The internal encoding has to be decided,
> probably between UTF-8, UTF-16, and UTF-32 by the String class
> implementor.

Full ACK. Ruby programs shouldn't need to care about the
*internal* string encoding. External string data is treated as
a sequence of bytes and is converted to Ruby strings through
an encoding API.

> 3. Whenever Strings are read or written to/from an external source,
> their data needs to be converted. The String class encapsulates the
> encoding framework, likely with additional helper Modules or
> Classes per external encoding. Some methods take an optional
> encoding parameter, like #char(index, encoding=:utf8), or
> #to_ary(encoding=:utf8), which can be used as helper Class or
> Module selector.

I think the encoding/decoding API should be separated from the
String class. IMO, the most important change is to strictly
differentiate between arbitrary binary data and character
data. Character data is represented by an instance of the
String class.

I propose adding a new core class, maybe call it ByteString
(or ByteBuffer, or Buffer, whatever) to handle strings of
bytes.

Given a specific encoding, the encoding API converts
ByteStrings to Strings and vice versa.

This could look like:

    my_character_str = Encoding::UTF8.encode(my_byte_buffer)
    buffer = Encoding::UTF8.decode(my_character_str)

> 4. IO instances are associated with a (modifiable) encoding. For
> stdin, stdout this can be derived from the locale settings.
> String-IO operations work as expected.

I propose one of:

1) A low level IO API that reads/writes ByteBuffers. String IO
   can be implemented on top of this byte-oriented API.

   The basic binary IO methods could look like:

   binfile = BinaryIO.new("/some/file", "r")
   buffer = binfile.read_buffer(1024) # read 1K of binary data

   binfile = BinaryIO.new("/some/file", "w")
   binfile.write_buffer(buffer) # Write the byte buffer

   The standard File class (or IO module, whatever) has an
   encoding attribute. The default value is set by the
   constructor by querying OS settings (on my Linux system
   this could be $LANG):

   # read strings from /some/file, assuming it is encoded
   # in the systems default encoding.
   text_file = File.new("/some/file", "r")
   contents = text_file.read

   # alternatively one can explicitly set an encoding before
   # the first read/write:
   text_file = File.new("/some/file", "r")
   text_file.encoding = Encoding::UTF8

   The File class (or IO module) will probably use a BinaryIO
   instance internally.

2) The File class/IO module as of current Ruby just gets
   additional methods for binary IO (through ByteBuffers) and
   an encoding attribute. The methods that do binary IO don't
   need to care about the encoding attribute.

I think 1) is cleaner.

> 5. Since the String class is quite smart already, it can implement
> generally useful and hard (in the domain of Unicode) operations
> like case folding, sorting, comparing etc.

If the strings are represented as a sequence of Unicode
codepoints, it is possible for external libraries to implement
more advanced Unicode operations.

Since IMO a new "character" class would be overkill, I propose
that the String class provides codepoint-wise iteration (and
indexing) by representing a codepoint as a Fixnum. AFAIK a
Fixnum consists of 31 bits on a 32 bit machine, which is
enough to represent the whole range of Unicode codepoints.
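
In code that could look like the following; each_codepoint and
codepoint_at are invented names for the sake of the example:

    "Straße".each_codepoint { |cp| print cp, " " }   # 83 116 114 97 223 101
    "Straße".codepoint_at(4)                         # => 223, i.e. U+00DF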

> 6. More exotic operations can easily be provided by additional
> libraries because of Ruby's open classes. Those operations may be
> coded depending on String's public interface for simplicity,
> or work with the internal representation directly for performance.
>
> 7. This approach leaves open the possibility of String subclasses
> implementing different internal encodings for performance/space
> tradeoff reasons which work transparently together (a bit like
> Fixnum and Bignum).

I think providing different internal String representations
would be too much work, especially for maintenance in the long
run.

> 10. Be flexible. <placeholder for future idea>

The advantages of this proposal over the current situation and
tagging a string with an encoding are:

* There is only one internal string (where string means a
  string of characters) representation. String operations
  don't need to be written for different encodings.

* No need for $KCODE.

* Higher abstraction.

* Separation of concerns. I always found it strange that most
  dynamic languages simply mix handling of character and
  arbitrary binary data (just think of pack/unpack).

* Reading of character data in one encoding and representing
  it in other encoding(s) would be easy.

It seems that the main argument against using Unicode strings
in Ruby is because Unicode doesn't work well for eastern
countries. Perhaps there is another character set that works
better that we could use instead of Unicode. The important
point here is that there is only *one* representation of
character data in Ruby.

If Unicode is chosen as the character set, there is the
question which encoding to use internally. UTF-32 would be a
good choice with regards to simplicity in implementation,
since each codepoint takes a fixed number of bytes. Consider
indexing of Strings:

        "some string"[4]

If UTF-32 is used, this operation can internally be
implemented as a simple, constant array lookup. If UTF-16 or
UTF-8 is used, this is not possible to implement as an array
lookup, since any codepoint before the fifth could occupy more
than one (8 bit or 16 bit) unit. Of course there is the
argument against UTF-32 that it takes too much memory. But I
think that most text-processing done in Ruby spends much more
memory on other data structures than on actual character data
(just consider a REXML document), but I haven't measured that
;)
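
(To illustrate why: with UTF-32 every codepoint occupies exactly four
bytes, so indexing is plain arithmetic. A sketch over a raw byte
string, assuming big-endian UTF-32 without a BOM:)

    def utf32be_codepoint_at(bytes, i)
      bytes[i * 4, 4].unpack("N").first   # one constant-time lookup
    end

    utf32 = "\000\000\000s\000\000\000o\000\000\000m\000\000\000e"  # "some"
    utf32be_codepoint_at(utf32, 3)        # => 101, the codepoint of "e"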

An advantage of using UTF-8 would be that for pure ASCII files
no conversion would be necessary for IO.

Thank you for reading so far. Just in case Matz decides to
implement something similar to this proposal, I am willing to
help with Ruby development (although I don't know much about
Ruby's internals and not too much about Unicode either).

I do not have a CS degree and I'm not a Unicode expert, so
perhaps the proposal is garbage, in this case please tell me
what is wrong about it or why it is not realistic to implement
it.
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-17 15:54
(Received via mailing list)
On 6/17/06, Juergen Strobel <strobel@secure.at> wrote:
> I emphatically agree. I'll even repeat and propose a new Plan for
> Unicode Strings in Ruby 2.0 in 10 points:
>
> 1. Strings should deal in characters (code points in Unicode) and not
> in bytes, and the public interface should reflect this.

Agree, mostly. Strings should have a way to indicate the buffer size of
the String.

> 2. Strings should neither have an internal encoding tag, nor an
> external one via $KCODE. The internal encoding should be encapsulated
> by the string class completely, except for a few related classes which
> may opt to work with the gory details for performance reasons.
> The internal encoding has to be decided, probably between UTF-8,
> UTF-16, and UTF-32 by the String class implementor.

Completely disagree. Matz has the right choice on this one. You can't
think in just terms of a pure Ruby implementation -- you *must* think
in terms of the Ruby/C interface for extensions as well.

> 3. Whenever Strings are read or written to/from an external source,
> their data needs to be converted. The String class encapsulates the
> encoding framework, likely with additional helper Modules or Classes
> per external encoding. Some methods take an optional encoding
> parameter, like #char(index, encoding=:utf8), or
> #to_ary(encoding=:utf8), which can be used as helper Class or Module
> selector.

Conversion should be possible at any time. An "external source" may be
an extension that your Ruby program can't distinguish. Again, this point
fails because your #2 is unacceptable.

> 4. IO instances are associated with a (modifiable) encoding. For
> stdin, stdout this can be derived from the locale settings. String-IO
> operations work as expected.

Agree, realising that the internal implementation of String must be
completely different than you've suggested. It is also important to
retain *raw* reading; a JPEG should not be interpreted as Unicode.

> 5. Since the String class is quite smart already, it can implement
> generally useful and hard (in the domain of Unicode) operations like
> case folding, sorting, comparing etc.

Agreed, but this would be expected regardless of the actual encoding of
a String.

> 6. More exotic operations can easily be provided by additional
> libraries because of Ruby's open classes. Those operations may be
> coded depending on String's public interface for simplicity, or
> work with the internal representation directly for performance.

Agreed.

> 7. This approach leaves open the possibility of String subclasses
> implementing different internal encodings for performance/space
> tradeoff reasons which work transparently together (a bit like
> Fixnum and Bignum).

Um. Disagree. Matz's proposed approach does this; yours does not. Yours,
in fact, makes things *much* harder.

> 8. Because Strings are tightly integrated into the language with the
> source reader and are used pervasively, much of this cannot be
> provided by add-on libraries, even with open classes. Therefore the
> need to have it in Ruby's canonical String class. This will break some
> old uses of String, but now is the right time for that.

"Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1.

> 9. The String class does not worry over character representation
> on-screen, the mapping to glyphs must be done by UI frameworks or the
> terminal attached to stdout.

The String class doesn't worry about that now.

> 10. Be flexible. <placeholder for future idea>

And little is more flexible than Matz's m17n String.

> This approach has several advantages and a few disadvantages, and I'll
> try to bring in some new angles to this now too:
>
> *Advantages*
>
> -POL, Encapsulation-
>
> All Strings behave exactly the same everywhere, are predictable,
> and do the hard work for their users.

Remember: POLS is not an acceptable reason for anything. Matz's m17n
Strings would be predictable, too. a + b would be possible if and only
if a and b are the same encoding or one of them is "raw" (which would
mean that the other is treated as the defined encoding) *or* there is a
built-in conversion for them.

> -Cross Library Transparency-
> No String user needs to worry which Strings to pass to a library, or
> worry which Strings he will get from a library. With Web-facing
> libraries like rails returning encoding-tagged Strings, you would be
> likely to get Strings of all possible encodings otherwise, and isthe
> String user prepared to deal with this properly?  This is a *big* deal
> IMNSHO.

This will be true with m17n strings. However, your proposal does *not*
work for Ruby/C interfaced items. Sorry.

> -Limited Conversions-
>
> Encoding conversions are limited to the time Strings are created or
> written or explicitly transformed to an external representation.

This is a mistake. I may need to know the internal representation of a
particular encoding of a String inside of a program. Trust me on this
one: I *have* done some low-level encoding work. Additionally, even
though I might have marked a network object as "UTF-8", I may not know
whether it's *actually* UTF-8 or not until I get HTTP headers -- or
worse, a <meta http-equiv> tag. Assuming UTF-8 reading in today's world
is doomed to failure.

> -Correct String Operations-
> Even basic String operations are very hard in the world of Unicode. If
> we leave the String users to look at the encoding tags and sort it out
> themselves, they are bound to make mistakes because they don't care,
> don't know, or have no time. And these mistakes may be _security_
> _sensitive_, since most often credentials are represented as Strings
> too. There already have been exploits related to Unicode.

This is a misunderstanding on your part. Nothing about Matz's m17n
Strings suggests that String users would have to look at the encoding
tags. Merely that they *could*. I suspect that there will be pragma-
like behaviours to enforce a particular internal representation at all
times.

> *Disadvantages* (with mitigating reasoning of course)
> - String users need to learn that #byte_length(encoding=:utf8) >=
> #size, but that's not too hard, and applies everywhere. Users do not
> need to learn about an encoding tag, which is surely worse to handle
> for them.

True, but the encoding tag is not worse. Anyone who assumes that
developers can ignore encoding at any time simply *doesn't* know about
the level of problems that can be encountered.

> - Strings cannot be used as simple byte buffers any more. Either use
> an array of bytes, or an optimized ByteBuffer class. If you need
> regular expression support, RegExp can be extended for ByteBuffers or
> even more.

I see no reason for this.

> - Some String operations may perform worse than might be expected from
> a naive user, in both the time and space domains. But we do this so the
> String user doesn't need to himself, and we are probably better at it
> than the user too.

This is a wash.

> - For very simple uses of String, there might be unnecessary
> conversions. If a String is just to be passed through somewhere,
> without inspecting or modifying it at all, in- and outwards conversion
> will still take place. You could and should use a ByteBuffer to avoid
> this.

This is a wash.

> - This ties Ruby's String to Unicode. A safe choice IMHO, or would we
> really consider something else? Note that we don't commit to a
> particular encoding of Unicode strongly.

This is a wash. I think that it's better to leave the options open.
After all, it *is* a hope of mine to have Ruby running on iSeries
(AS/400) and *that* still uses EBCDIC.

> - More work and time to implement. Some could call it over-engineered.
> But it will save a lot of time and troubles when shit hits the fan and
> users really do get unexpected foreign characters in their Strings. I
> could offer help implementing it, although I have never looked at
> ruby's source, C-extensions, or even done a lot of ruby programming
> yet.

I would call it the amount of work necessary. But the work needs to be
done for a *variety* of encodings, and not just Unicode. *Especially*
because of C extensions.

> Close to the start of this discussion Matz asked what the problem with
> current strings really was for western users. Somewhere later he
> concluded case folding. I think it is more than that: we are lazy and
> expect character handling to be always as easy as with 7 bit ASCII, or
> as close as possible. Fixed 8-bit codepages worked quite fine most of
> the time in this regard, and breakage was limited to special
> characters only.

> Now let's ask the question in reverse: are eastern programmers so used
> to doing elaborate byte-stream to character handling by hand that they
> don't recognize how hard this is any more? Surely it is a target for
> DRY if I ever saw one. Or are there actual problems not solvable this
> way? I looked up the mentioned Han-Unification issue, and as far as I
> understood it, this could be handled by future Unicode revisions
> allocating more characters, outside of Ruby, but I don't see how it
> requires our Strings to stay dumb byte buffers.

No one has ever suggested that Ruby Strings stay byte buffers. However,
blindly choosing Unicode *adds* unnecessary complexity to the situation.

-austin
A2b2f4ee23989dc68529baef9cbddcd6?d=identicon&s=25 Julian 'Julik' Tarkhanov (Guest)
on 2006-06-17 16:16
(Received via mailing list)
On 17-jun-2006, at 15:52, Austin Ziegler wrote:
>> 8. Because Strings are tightly integrated into the language with the
>> source reader and are used pervasively, much of this cannot be
>> provided by add-on libraries, even with open classes. Therefore the
>> need to have it in Ruby's canonical String class. This will break
>> some
>> old uses of String, but now is the right time for that.
>
> "Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1.

Most probably wise, but I need casefolding and character classes to
work since yesteryear.
Oniguruma is there, but even if you compile with it (which is still
not the default) you don't get char classes (AFAIK)
and you don't get casefolding. Case-insensitive search/replace
quickly becomes bondage.

I am maintaining a gem whose test fails due to different regexps in
Oniguruma, but I would be able to fix it quickly if I knew that
Oniguruma was in stable now.
>> 10. Be flexible. <placeholder for future idea>
>
> And little is more flexible than Matz's m17n String.

I couldn't find a proper description of that - as I said already, the
thing I'd least prefer would be

# get a string from the database
p str + my_unicode_chars # Ok, bail out with an ugly exception
because the author of the DB adaptor didn't care to send me proper
Strings...

If strings in the system are allowed to have varying encodings, I
don't understand how the engine is going to upgrade/downgrade strings
automatically.
Especially remembering that the receiver is on the left, so I
actually might get different exceptions depending on whether I do

p my_unicode_chars + mojikyo_str # who wins?

or

p mojikyo_str + my_unicode_chars # who wins?

or (especially)

p mojikyo_str +
bytestring_that_i_just_grabbed_by_http_and_i_know_it_is_mojikyo_but_its_
not # who wins?
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-17 16:19
(Received via mailing list)
On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote:
> Full ACK. Ruby programs shouldn't need to care about the
> *internal* string encoding. External string data is treated as
> a sequence of bytes and is converted to Ruby strings through
> an encoding API.

This is incorrect. *Most* Ruby programs won't need to care about the
internal string encoding. Experience suggests, however, that it is
*most*. Definitely not all.

> Given a specific encoding, the encoding API converts
> ByteStrings to Strings and vice versa.
>
> This could look like:
>
>     my_character_str = Encoding::UTF8.encode(my_byte_buffer)
>     buffer = Encoding::UTF8.decode(my_character_str)

Unnecessarily complex and inflexible. Before you go too much further, I
*really* suggest that you look in the archives and Google to find more
about Matz's m17n String proposal. It's a really good one, as it allows
developers (both pure Ruby and extension) to choose what is appropriate
with the ability to transparently convert as well.

>> 4. IO instances are associated with a (modifiable) encoding. For
>> stdin, stdout this can be derived from the locale settings.
>> String-IO operations work as expected.
>
> I propose one of:
>
> 1) A low level IO API that reads/writes ByteBuffers. String IO
>    can be implemented on top of this byte-oriented API.

[...]

> 2) The File class/IO module as of current Ruby just gets
>    additional methods for binary IO (through ByteBuffers) and
>    an encoding attribute. The methods that do binary IO don't
>    need to care about the encoding attribute.
>
> I think 1) is cleaner.

I think neither is necessary and both would be a mistake. It is, as I
indicated to Juergen, sometimes *impossible* to determine the encoding
to be used for an IO until you have some data from the IO already.

>> 5. Since the String class is quite smart already, it can implement
>> generally useful and hard (in the domain of Unicode) operations like
>> case folding, sorting, comparing etc.
> If the strings are represented as a sequence of Unicode codepoints, it
> is possible for external libraries to implement more advanced Unicode
> operations.

This would be true regardless of the encoding.

> Since IMO a new "character" class would be overkill, I propose that
> the String class provides codepoint-wise iteration (and indexing) by
> representing a codepoint as a Fixnum. AFAIK a Fixnum consists of 31
> bits on a 32 bit machine, which is enough to represent the whole range
> of unicode codepoints.

This does not match what Matz will be doing.

  str = "Fran\303\247ais"
  str[5] # -> "\303\247"

This is better than doing a Fixnum representation. It is character
iteration, but each character is, itself, a String.

>> 7. This approach leaves open the possibility of String subclasses
>> implementing different internal encodings for performance/space
>> tradeoff reasons which work transparently together (a bit like
>> Fixnum and Bignum).
> I think providing different internal String representations
> would be too much work, especially for maintenance in the long
> run.

If you're depending on classes to do that, especially given that Ruby's
String, Array, and Hash classes don't inherit well, you're right.

> The advantages of this proposal over the current situation and
> tagging a string with an encoding are:

The problem, of course, is that this proposal -- and your take on it --
don't account for the m17n String that Matz has planned. The current
situation is a mess. But the current situation is *not* what is planned.
I've had to do some encoding work for work in the last two years, and
while I *prefer* a UTF-8/UTF-16 internal representation, I also know
that's *impossible* in some situations and you have to be flexible. I
also know that POSIX handles this situation worse than any other
setup.

With the work that I've done on this, Matz is *right* about this, and
the people claiming that Unicode is the Only Way ... are wrong. In an
ideal world, Unicode would be the correct and only way. In the real
world, however, it's a lot messier, and Ruby has to be aware of that.

We can *still* make it as easy as possible for the common case (which
will be UTF-8 encoding data and filenames). But we shouldn't make the
mistake of assuming that the common case is all that Ruby should handle.

> * There is only one internal string (where string means a
>   string of characters) representation. String operations
>   don't need to be written for different encodings.

This is still (mostly) correct under the m17n String proposal.

> * No need for $KCODE.

This is true under the m17n String.

> * Higher abstraction.

This is true under the m17n String.

> * Separation of concerns. I always found it strange that most dynamic
>   languages simply mix handling of character and arbitrary binary data
>   (just think of pack/unpack).

The separation makes things harder most of the time.

> * Reading of character data in one encoding and representing it in
>   other encoding(s) would be easy.

This is true under the m17n String.

> It seems that the main argument against using Unicode strings in Ruby
> is because Unicode doesn't work well for eastern countries. Perhaps
> there is another character set that works better that we could use
> instead of Unicode. The important point here is that there is only
> *one* representation of character data in Ruby.

This is a mistake.

> If Unicode is chosen as the character set, there is the question which
> encoding to use internally. UTF-32 would be a good choice with regards
> to simplicity in implementation, since each codepoint takes a fixed
> number of bytes. Consider indexing of Strings:

Yes, but this would be very hard on memory requirements. There are
people who are trying to get Ruby to fit into small-memory environments.
This would destroy any chance of that.

[...]

> Thank you for reading so far. Just in case Matz decides to implement
> something similar to this proposal, I am willing to help with Ruby
> development (although I don't know much about Ruby's internals and not
> too much about Unicode either).

I would suggest that you look for discussions about m17n Strings in
Ruby. Matz has this one right.

> I do not have a CS degree and I'm not a Unicode expert, so perhaps the
> proposal is garbage, in this case please tell me what is wrong about
> it or why it is not realistic to implement it.

I don't have a CS degree either, but I have been in the business for a
*long* time and I've been immersed in Unicode and encoding issues for
the last two years. If everyone used Unicode -- and POSIX weren't stupid
-- your proposal would be much more realistic. I *agree* that Ruby
should encourage the use of Unicode as much as is practical. But it also
shouldn't tie our hands like other programming languages do.

-austin
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-17 16:26
(Received via mailing list)
On 6/17/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> (AFAIK) and you don't get casefolding. Case-insensitive search/replace
> quickly becomes bondage.

I don't disagree. But you're *not* going to get those features, in all
likelihood, in a Ruby 1.8.x release. It would be a breaking release.
Oniguruma is the default for Ruby 1.9+. If there are things missing,
work with the developer.

> I am maintaining a gem whose test fails due to different regexps in
> Oniguruma, but I would be able to quickly fix it knowing that
> Oniguruma is in stable now.

I don't think that Oniguruma is in stable (1.8.x); I *don't* think it
will be enabled as default in stable. Again, it's a breaking change.

>>> 10. Be flexible. <placeholder for future idea>
>> And little is more flexible than Matz's m17n String.
> I couldn't find a proper description of that - as I told already, the
> thing I'd least prefer would be

> # get a string from the database
> p str + my_unicode_chars # Ok, bail out with an ugly exception
> because the author of the DB adaptor didn't care to send me proper
> Strings...

The DB adaptor, of course, will have to look at the encoding that the DB
is using.

> p mojikyo_str + my_unicode_chars # who wins?
>
> or (especially)
>
> p mojikyo_str +
> bytestring_that_i_just_grabbed_by_http_and_i_know_it_is_mojikyo_but_its_
> not # who wins?

Consider coercion in Numerics (ri Numeric#coerce). A similar framework
can be built for Strings.
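
(A sketch of the shape such coercion could take, using iconv for the
actual conversion; since today's strings carry no tag, the encodings
are passed in explicitly here:)

    require 'iconv'

    # pick a common encoding for two differently-encoded strings,
    # analogous to Numeric#coerce
    def coerce_strings(a, a_enc, b, b_enc)
      return [a, b] if a_enc == b_enc
      [a, Iconv.conv(a_enc, b_enc, b)]   # convert b into a's encoding
    end

    utf8   = "caf\303\251"
    latin1 = Iconv.conv("ISO-8859-1", "UTF-8", utf8)
    coerce_strings(utf8, "UTF-8", latin1, "ISO-8859-1")
    # => ["caf\303\251", "caf\303\251"]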

-austin
E7559e558ececa67c40f452483b9ac8c?d=identicon&s=25 unknown (Guest)
on 2006-06-17 17:00
(Received via mailing list)
On Jun 17, 2006, at 9:50 AM, Stefan Lang wrote:

> *internal* string encoding. External string data is treated as
> a sequence of bytes and is converted to Ruby strings through
> an encoding API.

I don't claim to be a Unicode expert, but shouldn't the goal be to
have Ruby work with *any* text encoding on a per-string basis?  Why
would you want to force all strings into Unicode, for example in a
context where you aren't using Unicode?  (The internal encoding has
to be....).  And of course even in the Unicode world you have several
different encodings (UTF-8, UTF-16, and so on).  Juergen, when you
say 'internal encoding' are you talking about the text encoding of
Ruby source code?

It seems to me that irrespective of any particular text encoding
scheme you need clean support of a simple byte vector data structure
completely unencumbered with any notion of text encoding or locale.
Right now that is done by the String class, whose name I think
certainly creates much confusion.  If the class had been called
Vector and then had methods like:

	Vector#size		# size in bytes
	Vector#str_size 	# size in characters (encoding and locale considered)

I think this discussion would be clearer because it would be the
behavior of the str* methods that would need to understand text
encodings and/or locale settings while the underlying byte vector
methods remained oblivious.  The #[] method is the most confusing
since sometimes you want to extract bytes and sometimes you want to
extract sub-strings (i.e. consider the encoding).  One method, two
interpretations, bad headache.
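
(Today's 1.8 behaviour shows exactly this conflation; the byte-level
and regexp-level views of the same data disagree:)

    $KCODE = "u"
    s = "caf\303\251"     # "café" as UTF-8 bytes
    s.size                # => 5, in bytes
    s[3]                  # => 195, a byte value, not a character
    s.scan(/./u)[3]       # => "\303\251", the character, via the regexp engine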

It seems that three distinct behaviors are being shoehorned (with
good reason) into a single class framework (String):

	byte vector
	text encoding (encoded sequence of code points)
	locale	      (cultural interpretations of the encoded sequence of
code points)

I'm just suggesting that these distinctions seem to be lost in much
of this discussion, especially for folks (like myself) who have a
practical interest in this but certainly aren't text-encoding gurus.


Gary Wright
2abf5beb51d5d66211d525a72c5cb39d?d=identicon&s=25 Paul Battley (Guest)
on 2006-06-17 18:04
(Received via mailing list)
On 17/06/06, Austin Ziegler <halostatue@gmail.com> wrote:
> > - This ties Ruby's String to Unicode. A safe choice IMHO, or would we
> > really consider something else? Note that we don't commit to a
> > particular encoding of Unicode strongly.
>
> This is a wash. I think that it's better to leave the options open.
> After all, it *is* a hope of mine to have Ruby running on iSeries
> (AS/400) and *that* still uses EBCDIC.

Not to mention that Matz has explicitly stated in the past that he
wants Ruby to support other encodings (TRON, Mojikyo, etc.) that
aren't compatible with a Unicode internal representation.

Not tying String to Unicode is also the right thing to do: it allows
for future developments. Java's weird encoding system is entirely down
to the fact that it standardised on UCS-2; when codepoints beyond
65535 arrived, they had to be shoehorned in via an ugly hack. As far
as possible, Ruby should avoid that trap.

Paul.
2d532341317628fbb2cb22ec427a1d62?d=identicon&s=25 Stefan Lang (Guest)
on 2006-06-17 18:17
(Received via mailing list)
On Saturday 17 June 2006 16:58, gwtmp01@mac.com wrote:
> > Full ACK. Ruby programs shouldn't need to care about the
> when you say 'internal encoding' are you talking about the text
> encoding of Ruby source code?

I'm not Juergen, but since you responded to my message...

First of all, Unicode is a character set and UTF-8, UTF-16, etc.
are encodings; that is, they specify how a Unicode character is
represented as a series of bits.

At least *I* am not talking about the encoding of Ruby source
code. The main point of the proposal is to use a single
universal character encoding for all Ruby character strings
(instances of the String class). Assuming there is an ideal
character set that is really sufficient to represent any
text in this world, it could be used to construct a String
class that abstracts the underlying representation completely
away.

Consider the "float" data type you will find in most
programming languages: The programmer doesn't think in terms
of the bits that represent a floating point value. He just
uses the operators provided for floats. He can choose between
different serialization strategies if he needs to serialize
floats. But the *operators* on floats the programming language
provides don't care about the different serialization formats,
they all work using the same internal representation.
Conversion is done on IO. Ideally, the same level of
abstraction should be there for character data.

If you have a universal character set (Unicode is an attempt
at this), and an encoding for it, the programming language can
abstract the underlying String representation away. For IO, it
provides methods (i.e. through Encoding objects) that
serialize Strings to a stream of bytes and vice versa.

> It seems to me that irrespective of any particular text encoding
> scheme you need clean support of a simple byte vector data
> structure completely unencumbered with any notion of text encoding
> or locale.

I have proposed that further below as Buffer or ByteString.

> Right now that is done by the String class, whose name I
> think certainly creates much confusion.  If the class had been
> called Vector and then had methods like:
>
> 	Vector#size		# size in bytes
> 	Vector#str_size 	# size in characters (encoding and locale
> considered)

By providing str_size you are already mixing up the purpose of
your simple byte vector and character strings.
E7559e558ececa67c40f452483b9ac8c?d=identicon&s=25 unknown (Guest)
on 2006-06-17 18:38
(Received via mailing list)
On Jun 17, 2006, at 12:16 PM, Stefan Lang wrote:
> Assuming there is an ideal
> character set that is really sufficient to represent any
> text in this world, it could be used to construct a String
> class that abstracts the underlying representation completely
> away.

So all we need is an ideal character set?  That sounds simple.  :-)

> By providing str_size you are already mixing up the purpose of
> your simple byte vector and character strings.

Yes.  I was pointing out that there were multiple concerns that were
being solved by a single class and I said that there were good
reasons for this.  My point was that even if you choose to handle all
those concerns in a single class it was important to keep the
concerns distinct during discussion.  Something that I thought wasn't
happening in this discussion.

I think this is another example of the Humane Interface discussion
started by Martin Fowler
(http://www.martinfowler.com/bliki/HumaneInterface.html)

In Ruby, arrays have an interface that allows them to be used as pure
arrays, as lists, as queues, as stacks and so on, instead of having
lots of additional classes.
Similarly I think it makes sense for all M17N issues to be packaged
up in a single class (String) instead of breaking up those concerns
into a class hierarchy.


Gary Wright
2d532341317628fbb2cb22ec427a1d62?d=identicon&s=25 Stefan Lang (Guest)
on 2006-06-17 19:37
(Received via mailing list)
On Saturday 17 June 2006 16:16, Austin Ziegler wrote:
> On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote:
> > Full ACK. Ruby programs shouldn't need to care about the
> > *internal* string encoding. External string data is treated as
> > a sequence of bytes and is converted to Ruby strings through
> > an encoding API.
>
> This is incorrect. *Most* Ruby programs won't need to care about
> the internal string encoding. Experience suggests, however, that it
> is *most*. Definitely not all.

As long as one treats a character string as a character
string, the internal encoding is irrelevant, and as soon as a
decision for an internal string encoding is made, every
programmer can read in the docs "Ruby internally encodes
strings using the XYZ encoding".

[...]
> Unnecessarily complex and inflexible. Before you go too much
> further, I *really* suggest that you look in the archives and
> Google to find more about Matz's m17n String proposal. It's a
> really good one, as it allows developers (both pure Ruby and
> extension) to choose what is appropriate with the ability to
> transparently convert as well.

I couldn't find much (in English, I don't understand
Japanese), do you have a link at hand?

[...]
> already.
That is easy to handle with the proposed scheme: read as much
as you need with the binary interface until you know the
encoding, and then do the conversion of the byte buffer to a
string. For file input, you can close the file when you have
determined the encoding and reopen it using the "normal"
(character-oriented) interface.
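
In terms of the proposed API this would be roughly (BinaryIO and the
encoding attribute are the hypothetical names from my earlier post,
and the sniffing helper is made up too):

    bin  = BinaryIO.new("page.html", "r")
    head = bin.read_buffer(1024)          # enough bytes to find the charset
    enc  = charset_from_meta_tag(head)    # hypothetical sniffing helper
    bin.close

    text_file = File.new("page.html", "r")
    text_file.encoding = enc              # from here on, characters, not bytes
    contents = text_file.read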

Or do you mean Ruby should determine the encoding
automatically? IMO, that would be bad magic and error-prone.

[...]
> > If the strings are represented as a sequence of Unicode
> > codepoints, it is possible for external libraries to implement
> > more advanced Unicode operations.
>
> This would be true regardless of the encoding.

But a conversion from [insert arbitrary encoding here] to
Unicode codepoints would be needed.

>
> This is better than doing a Fixnum representation. It is character
> iteration, but each character is, itself, a String.

I wouldn't mind additionally having:

    str.codepoint_at(5)     => a Fixnum

[...]
> and the people claiming that Unicode is the Only Way ... are wrong.
> >   string of characters) representation. String operations
> >   don't need to be written for different encodings.
>
> This is still (mostly) correct under the m17n String proposal.

How does the regular expression engine work then? And all
String methods that have to combine two or more strings in
some way?

[...]
> > * Separation of concerns. I always found it strange that most
> > dynamic languages simply mix handling of character and arbitrary
> > binary data (just think of pack/unpack).
>
> The separation makes things harder most of the time.

Why? In which cases?

[...]
> > It seems that the main argument against using Unicode strings in
> > Ruby is because Unicode doesn't work well for eastern countries.
> > Perhaps there is another character set that works better that we
> > could use instead of Unicode. The important point here is that
> > there is only *one* representation of character data in Ruby.
>
> This is a mistake.

OK, Unicode was enough for me until now, but I see that
Unicode is not enough for everyone.

> > If Unicode is chosen as the character set, there is the question
> > which encoding to use internally. UTF-32 would be a good choice
> > with regards to simplicity in implementation, since each
> > codepoint takes a fixed number of bytes. Consider indexing of
> > Strings:
>
> Yes, but this would be very hard on memory requirements. There are
> people who are trying to get Ruby to fit into small-memory
> environments. This would destroy any chance of that.

I can hardly believe that. There is still the binary IO
interface and ByteString that I proposed. And I still think
that the memory used for pure character data is a small
fraction of the overall memory consumption of typical Ruby
programs.
C914fa463a2b1b067586c6432b12a824?d=identicon&s=25 Juergen Strobel (Guest)
on 2006-06-17 22:34
(Received via mailing list)
On Sun, Jun 18, 2006 at 01:02:39AM +0900, Paul Battley wrote:
> On 17/06/06, Austin Ziegler <halostatue@gmail.com> wrote:
> >> - This ties Ruby's String to Unicode. A safe choice IMHO, or would we
> >> really consider something else? Note that we don't commit to a
> >> particular encoding of Unicode strongly.
> >
> >This is a wash. I think that it's better to leave the options open.
> >After all, it *is* a hope of mine to have Ruby running on iSeries
> >(AS/400) and *that* still uses EBCDIC.

AFAIK, EBCDIC can be losslessly converted to Unicode and back. Right?
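
(Where the platform's iconv knows the codepage, that roundtrip can be
checked today; whether the "EBCDIC-US" name is available depends on
your iconv build:)

    require 'iconv'

    ebcdic = Iconv.conv("EBCDIC-US", "ASCII", "HELLO")
    utf8   = Iconv.conv("UTF-8", "EBCDIC-US", ebcdic)
    back   = Iconv.conv("EBCDIC-US", "UTF-8", utf8)
    back == ebcdic   # => true, lossless both ways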

On the other hand, do you really trust all Ruby library writers to
accept your strings tagged with EBCDIC encoding? Or do you look
forward to a lot of manual conversions?

> Paul.
That's why I explicitly stated it ties Ruby's String class to Unicode
Character Code Points, but not to a particular Unicode encoding or
character class, and *that* was Java's main folly. (UCS-2 is a
strictly 16 bit per character encoding, but new Unicode standards
specify 21 bit characters, so they had to "extend" it).

I am unaware of unsolvable problems with Unicode and Eastern
languages; I asked specifically about it. If you think Unicode is
unfixably flawed in this respect, I guess we all should write off
Unicode now rather than later? Can you detail why Unicode is
unacceptable as a single world-wide unifying character set?
Especially, are there character sets which cannot be converted to
Unicode and back, which is the main requirement to have Unicode
Strings in a non-Unicode environment?

Jürgen
C914fa463a2b1b067586c6432b12a824?d=identicon&s=25 Juergen Strobel (Guest)
on 2006-06-17 22:37
(Received via mailing list)
On Sun, Jun 18, 2006 at 01:16:12AM +0900, Stefan Lang wrote:
> > >
> > several different encodings (UTF-8, UTF-16, and so on).  Juergen,
> code. The main point of the proposal is to use a single
> universal character encoding for all Ruby character strings
> (instances of the String class). Assuming there is an ideal
> character set that is really sufficient to represent any
> text in this world, it could be used to construct a String
> class that abstracts the underlying representation completely
> away.

That's what I meant, yes. And that is the most important point too.

Jürgen
2abf5beb51d5d66211d525a72c5cb39d?d=identicon&s=25 Paul Battley (Guest)
on 2006-06-17 23:02
(Received via mailing list)
On 17/06/06, Juergen Strobel <strobel@secure.at> wrote:
> I am unaware of unsolveable problems with Unicode and Eastern
> languages, I asked specifically about it. If you think Unicode is
> unfixably flawed in this respect, I guess we all should write off
> Unicode now rather than later? Can you detail why Unicode is
> unacceptable as a single world wide unifying character set?
> Especially, are there character sets which cannot be converted to
> Unicode and back, which is the main requirement to have Unicode
> Strings in a non-Unicode environment?

They aren't so much unsolvable problems as mutually incompatible
approaches. Unicode is concerned with the semantic meaning of a
character, and ignores glyph variations through the 'Han unification'
process. TRON encoding doesn't use Han unification: it encodes the
historically-same Chinese character differently for different
languages/regions where they are written differently today. Mojikyo
encodes each graphically distinct character differently and includes a
very wide range of historical characters, and is therefore
particularly suited to certain linguistic and literary niches.

In spite of this, I think that Unicode is an excellent choice for
everyday usage. Unicode does have a solution to the problem of
character variants, but it's not a universal back end for all
encodings.

Incidentally, it is said that TRON is the world's most widely-used
operating system, so supporting that encoding is not necessarily a
minor concern.

Paul.
C914fa463a2b1b067586c6432b12a824?d=identicon&s=25 Juergen Strobel (Guest)
on 2006-06-17 23:51
(Received via mailing list)
On Sat, Jun 17, 2006 at 10:52:24PM +0900, Austin Ziegler wrote:
> >2. Strings should neither have an internal encoding tag, nor an
> >external one via $KCODE. The internal encoding should be encapsulated
> >by the string class completely, except for a few related classes which
> >may opt to work with the gory details for performance reasons.
> >The internal encoding has to be decided, probably between UTF-8,
> >UTF-16, and UTF-32 by the String class implementor.
>
> Completely disagree. Matz has the right choice on this one. You can't
> think in just terms of a pure Ruby implementation -- you *must* think
> in terms of the Ruby/C interface for extensions as well.

I admit I don't know about Ruby's C extensions. Are they unable to
access String's methods? That is all that is needed to work with them.

And since this String class does not have a parametric encoding
attribute, it should be even easier to crunch in C.

> fails because your #2 is unacceptable.
Note that explicit conversion to characters, arrays, etc. is possible
for any supported character set and encoding. I have even given method
examples. "External" is to be seen in the context of the String class.

> >case folding, sorting, comparing etc.
>
> Agreed, but this would be expected regardless of the actual encoding of
> a String.

I am unaware of Matz's exact plan. Any good English-language links?

I was under the impression that users of Matz's String instances need
to look at the encoding tag to implement e.g. #version_sort. If that
is not the case, our proposals are not that much different, only
Matz's one is even more complex to implement than mine.

> >tradeoff reasons which work transparently together (a bit like
> >Fixnum and Bignum).
>
> Um. Disagree. Matz's proposed approach does this; yours does not. Yours,
> in fact, makes things *much* harder.

If Matz's approach requires looking at the encoding tag from the
outside, it is not as transparent as mine. If it doesn't, it just
boils down to a parametric class versus subclass hierarchy design
decision, and I don't see much difference and would be happy with
either one.

>
> >8. Because Strings are tightly integrated into the language with the
> >source reader and are used pervasively, much of this cannot be
> >provided by add-on libraries, even with open classes. Therefore the
> >need to have it in Ruby's canonical String class. This will break some
> >old uses of String, but now is the right time for that.
>
> "Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1.

My original title, somewhere snipped out, was "A Plan for Unicode
Strings in Ruby 2.0". I don't want to rush things or break 1.8 either.

>
> >9. The String class does not worry over character representation
> >on-screen, the mapping to glyphs must be done by UI frameworks or the
> >terminal attached to stdout.
>
> The String class doesn't worry about that now.

I was just playing safe here.

> >10. Be flexible. <placeholder for future idea>
>
> And little is more flexible than Matz's m17n String.

I've had flexibility with respect to Unicode standards in mind, to not
fall into traps similar to Java's. A simple to use String class,
powerful enough to include every character of the world, was my goal,
with the ability to convert to and from other external (from the
String class's point of view) representations.

The flexibility to have parametric String encodings inside the String
class was not what I was going for; rather I would have that
inaccessible, or at least unnecessary to access, for the common String
user, and I provided a somewhat weaker but maybe still sufficient
technique via subclassing.

> Remember: POLS is not an acceptable reason for anything. Matz's m17n
> Strings would be predictable, too. a + b would be possible if and only
> if a and b are the same encoding or one of them is "raw" (which would
> mean that the other is treated as the defined encoding) *or* there is a
> built-in conversion for them.

Since I probably cannot control which Strings I get from libraries,
and don't want to worry which ones I'll have to provide to them, this
is weaker than my approach in this respect; see my next point.

> work for Ruby/C interfaced items. Sorry.
Please elaborate on this or provide pointers. I cannot believe C
cannot crunch my Strings, which are less parametric than Matz's are.

> whether it's *actually* UTF-8 or not until I get HTTP headers -- or
> worse, a <meta http-equiv> tag. Assuming UTF-8 reading in today's world
> is doomed to failure.

Read it as binary, and decide later. These problems should be locally
containable, and methods are still able to return Strings after
determining the encoding.

> tags. Merely that they *could*. I suspect that there will be pragma-
> like behaviours to enforce a particular internal representation at all
> times.

Previously you stated users need to look at the encoding to determine
if simple operations like a + b work.

Can you point to more info? I am interested in how this pragma stuff
works, and whether not doing it "right" can break things.

> >*Disadvantages* (with mitigating reasoning of course)
> >- String users need to learn that #byte_length(encoding=:utf8) >=
> >#size, but that's not too hard, and applies everywhere. Users do not
> >need to learn about an encoding tag, which is surely worse to handle
> >for them.
>
> True, but the encoding tag is not worse. Anyone who assumes that
> developers can ignore encoding at any time simply *doesn't* know about
> the level of problems that can be encountered.

For String concatenation, substring access, searching, etc., I expect
to be able to ignore encoding totally. Only when interfacing with
non-String-class objects (I/O and/or explicit conversion) would I need
encoding info.

> >- Strings cannot be used as simple byte buffers any more. Either use
> >an array of bytes, or an optimized ByteBuffer class. If you need
> >regular expression support, RegExp can be extended for ByteBuffers or
> >even more.
>
> I see no reason for this.

In my proposal, Unicode Strings cannot represent arbitrary binary data
in their internal representation, since not everything would be valid
characters. In fact, you cannot set the internal representation
directly.

The interface could accept a code point sequence of values
(0..255), but that would be wasteful compared to an array of bytes.

> >- Some String operations may perform worse than might be expected from
> >a naive user, in both the time and space domains. But we do this so the
> >String user doesn't need to himself, and are probably better at it
> >than the user too.
>
> This is a wash.

Only trying to refute weak arguments in advance.

> >- For very simple uses of String, there might be unnecessary
> >conversions. If a String is just to be passed through somewhere,
> >without inspecting or modifying it at all, inward and outward conversion
> >will still take place. You could and should use a ByteBuffer to avoid
> >this.
>
> This is a wash.

Not a big problem either, but someone was bound to bring it up.

> >users really do get unexpected foreign characters in their Strings. I
> >concluded case folding. I think it is more than that: we are lazy and
> >understood this could be handled by future Unicode revisions
The way I see it, we have to choose a character set. I proposed
Unicode, because its official goal is to be the one unifying set, and
if it isn't there yet, I hope it will be someday.

If that is not enough, we will effectively create our own character
set, call it RubyCode, which will contain characters from the union of
Unicode and a few other sets. Each String will have a particular
encoding, which will determine which characters of RubyCode are valid
in that particular String instance. Hopefully many characters will be
valid in multiple encodings. But it doesn't sound like a very clear
design to me.

Jürgen
Michal Suchanek (Guest)
on 2006-06-17 23:57
(Received via mailing list)
On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote:
>
> As long as one treats a character string as a character
> string, the internal encoding is irrelevant, and as soon as a

No, it is not.

First, for reasons of efficiency. If an application is going to perform
lots of slicing and poking on strings, it will want some encoding that
is suitable for that, such as UTF-32. If an application runs on a
system with little memory, it will want a space-efficient encoding
(i.e. UTF-8, or UTF-16 for Asian languages). And if an application runs
on a system that uses some legacy codepage, it can read, write, and
process all strings in that codepage. And in JRuby it will be useful to
convert strings to UTF-16 so that the native Java functions can be used
for manipulation.
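To sketch the flexibility being argued for here (using conversion
methods in the style of what a later Ruby might provide; the names are
assumptions, not a real 1.8 API):

  utf8  = "más"                    # compact, good for storage and interchange
  utf32 = utf8.encode("UTF-32BE")  # fixed width: character n starts at byte 4*n
  utf32.bytesize / 4               # => 3 characters, O(1) index arithmetic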

Second, not all characters are equal. If you lived in a world where
everything was Unicode you would be fine. But it is not so. Unicode is
suboptimal for encoding CJK characters. So some people might want to
use another encoding for their texts (IIRC, TRON, mentioned earlier, is
one such encoding). In your model you can modify Ruby to use
strings composed of TRON characters instead of Unicode characters. But
how would Unicode Ruby and TRON Ruby exchange strings?
And how would you write an application that handles _both_ TRON and
Unicode? (I suspect TRON would not be much good, e.g., for Runic
script.) Such an application has to be written very carefully, because
neither character set would be a subset of the other, so it is not
possible to convert strings back and forth without thinking. But in
your model such an application is not possible at all.

> decision for an internal string encoding is made, every
> programmer can read in the docs "Ruby internally encodes
> strings using the XYZ encoding".
>
> [...]

> > I indicated to Juergen, sometimes *impossible* to determine the
> > encoding to be used for an IO until you have some data from the IO
> > already.
>
> That is easy to handle with the proposed scheme: Read as much
> as you need with the binary interface until you know the
> encoding and then do the conversion of the byte buffer to
> string. For file input, you can close the file when you have
> determined the encoding and reopen it using the "normal"
> (character oriented) interface.

Why reopen or convert if you can simply tag a string that you
had to read anyway?
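Something like this sketch, say (force_encoding/encode are assumed
names in the style of later Ruby versions, not 1.8 methods):

  raw = File.open("page.html", "rb") { |f| f.read }  # bytes, encoding unknown
  enc = raw[/charset=([\w.-]+)/, 1] || "ISO-8859-1"  # naive sniff, illustration
  tagged  = raw.force_encoding(enc)   # just relabel the bytes already read
  recoded = raw.encode("UTF-8", enc)  # the convert-and-copy alternative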

>
> Or do you mean Ruby should determine the encoding
> automatically? IMO, that would be bad magic and error-prone.

No. But if you read part of an HTML/XML document before the encoding
was specified, there is no reason why that part has to be converted or
reread. You apparently got it right if you were able to determine the
encoding from what you read.

>
> [...]
> > > If the strings are represented as a sequence of Unicode
> > > codepoints, it is possible for external libraries to implement
> > > more advanced Unicode operations.
> >
> > This would be true regardless of the encoding.
>
> But a conversion from [insert arbitrary encoding here] to
> unicode codepoints would be needed.

That will be needed anyway. You cannot expect all libraries to use the
arbitrary encoding you chose for Ruby strings.

But if you can choose the encoding of your strings there is nothing
stopping you from converting your strings so that they best suit your
library of choice.


> >
> > > * There is only one internal string (where string means a
> > >   string of characters) representation. String operations
> > >   don't need to be written for different encodings.
> >
> > This is still (mostly) correct under the m17n String proposal.
>
> How does the regular expression engine work then? And all
> String methods that have to combine two or more strings in
> some way?

If they are both subsets of Unicode, I see no problem with converting
both to Unicode. If they are incompatible, things may break. But that
is because of a real incompatibility, not because of some restriction
of the approach.

>
> [...]
> > > * Separation of concerns. I always found it strange that most
> > > dynamic languages simply mix handling of character and arbitrary
> > > binary data (just think of pack/unpack).
> >
> > The separation makes things harder most of the time.
>
> Why? In which cases?

Such as when you have to read the start of an HTML page as a ByteBuffer
and then convert it to a String once you determine the encoding.
Especially if string operations do not exist on the ByteBuffer to
allow parsing it.

>
> I can hardly believe that. There is still the binary IO
> interface and ByteString that I proposed. And I still think
> that the memory used for pure character data is a small
> fraction of the overall memory consumption of typical Ruby
> programs.

It depends on the program. For programs that do only text processing
the portion of memory taken by text may be large.

Michal
Austin Ziegler (austin)
on 2006-06-18 00:16
(Received via mailing list)
On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote:
> internal string encoding is made, every programmer can read in the
> docs "Ruby internally encodes strings using the XYZ encoding".

And I'm saying that it's a mistake to do that (standardize on a single
encoding). Every programmer will instead be able to read:

  "Ruby supports encoded strings in a variety of encodings. The
  default behaviour for all strings is XYZ, but this can be
  changed and individual strings may be recoded for performance
  or compatibility reasons."

Language and character encodings are hard. Hiding that fact is a
mistake. That doesn't mean we have to make the APIs difficult, but that
we aren't going to be buzzworded into compliance, either.

> [...]
>> Unnecessarily complex and inflexible. Before you go too much further,
>> I *really* suggest that you look in the archives and Google to find
>> more about Matz's m17n String proposal. It's a really good one, as it
>> allows developers (both pure Ruby and extension) to choose what is
>> appropriate with the ability to transparently convert as well.
> I couldn't find much (in English, I don't understand Japanese), do you
> have a link at hand?

I do not. I've been reading about this, talking about this, and
discussing it with Matz for the last two years or so, and I've been
dealing with Unicode and other character encoding issues extensively at
work. However, the gist of it is that every String is still a byte
vector. Each string will also have an encoding flag. Substrings of a
single character width will always return the String required for the
*character*. The supported encodings will probably start with UTF-8,
UTF-16, various ISO-8859-* encodings, EUC-JP, SJIS, and other Asian
encodings.
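In rough terms, and only as a sketch of how I understand it (none of
these method names are committed):

  s = "résumé"             # internally still a byte vector...
  s.encoding               # => UTF-8  ...plus an encoding tag
  s[0]                     # => "r", a one-character String, never a bare byte
  s.encode("ISO-8859-1")   # recode between tagged encodings on demand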

>
> Or do you mean Ruby should determine the encoding automatically? IMO,
> that would be bad magic and error-prone.

I mean that what you're suggesting *exposes* problems with encoding
stuff extensively and unnecessarily. I certainly wouldn't want to
program in it if the API involved were as stupid as you're suggesting it
should be.


> [...]
>>> If the strings are represented as a sequence of Unicode codepoints,
>>> it is possible for external libraries to implement more advanced
>>> Unicode operations.
>> This would be true regardless of the encoding.
> But a conversion from [insert arbitrary encoding here] to unicode
> codepoints would be needed.

Why? What if the library that I'm interfacing with requires EUC-JP?
Sorry, but Unicode is *not necessarily* the right answer.

>> This is better than doing a Fixnum representation. It is character
>> iteration, but each character is, itself, a String.
> I wouldn't mind additionally having:
>
>     str.codepoint_at(5)     => a Fixnum

Since Ruby isn't *only* using Unicode, this isn't necessarily going to
be possible or meaningful.
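For what it's worth, for strings known to be UTF-8 this can already be
approximated in 1.8 with pack/unpack (codepoint_at itself is Stefan's
hypothetical name):

  def codepoint_at(str, i)
    str.unpack("U*")[i]   # "U*" decodes UTF-8 bytes into codepoints
  end

  codepoint_at("más", 1)  # => 225, i.e. U+00E1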

> [...]
>>> * There is only one internal string (where string means a
>>>   string of characters) representation. String operations
>>>   don't need to be written for different encodings.
>> This is still (mostly) correct under the m17n String proposal.
> How does the regular expression engine work then? And all
> String methods that have to combine two or more strings in
> some way?

Matz will have that figured out and detailed before he starts writing it.

> [...]
>>> * Separation of concerns. I always found it strange that most
>>> dynamic languages simply mix handling of character and arbitrary
>>> binary data (just think of pack/unpack).
>> The separation makes things harder most of the time.
> Why? In which cases?

In *reality*, the separation is not nearly as clean as people who
advocate such separations would like to pretend. It's less of a problem
in dynamic languages like Ruby, but it's also far less necessary in
dynamic languages like Ruby. I have found it far more useful to not have
to care whether I'm reading a binary or string value. I despise dealing
with C++ and Java where I am forced to care because of stupid API
design.

> [...]
>>> It seems that the main argument against using Unicode strings in
>>> Ruby is because Unicode doesn't work well for eastern countries.
>>> Perhaps there is another character set that works better that we
>>> could use instead of Unicode. The important point here is that there
>>> is only *one* representation of character data Ruby.
>> This is a mistake.
> OK, Unicode was enough for me until now, but I see that Unicode is not
> enough for everyone.

Thank you. Unicode needs to -- will -- work *very* well. I know enough
about Unicode handling to make sure that what I deal with *will*. But I
have come to believe that choosing a single encoding as your String
representation is a mistake, even if it means making your job harder by
defining and implementing rules for mixed-encoding handling.

> consumption of typical Ruby programs.
I can believe it; it's very domain and program specific, but you've just
proposed multiplying the memory usage of that amount of space by four.
(Rails would suffer terribly under your proposal to use UTF-32.)

-austin
Austin Ziegler (austin)
on 2006-06-18 00:22
(Received via mailing list)
On 6/17/06, Juergen Strobel <strobel@secure.at> wrote:
> On Sun, Jun 18, 2006 at 01:02:39AM +0900, Paul Battley wrote:
>> On 17/06/06, Austin Ziegler <halostatue@gmail.com> wrote:
>>>> - This ties Ruby's String to Unicode. A safe choice IMHO, or would
>>>> we really consider something else? Note that we don't commit to a
>>>> particular encoding of Unicode strongly.
>>> This is a wash. I think that it's better to leave the options open.
>>> After all, it *is* a hope of mine to have Ruby running on iSeries
>>> (AS/400) and *that* still uses EBCDIC.
> AFAIK, EBCDIC can be losslessly converted to Unicode and back. Right?

Which code page? EBCDIC has as many code pages (including a UTF-EBCDIC)
as exist in other 8-bit encodings.

> On the other hand, do you really trust all ruby library writers to
> accept your strings tagged with EBCDIC encoding? Or do you look
> forward to a lot of manual conversions?

It depends on the purpose of the library. Very few libraries end up
using byte vectors for strings or completely treat them as such. I would
expect that some of the libraries that I've written would work without
any problems in EBCDIC.

> Character Code Points, but not to a particular Unicode encoding or
> character class, and *that* was Java's main folly. (UCS-2 is a
> strictly 16 bit per character encoding, but new Unicode standards
> specify 21 bit characters, so they had to "extend" it).

Um. Do you mean UTF-32? Because there's *no* binary representation of
Unicode Character Code Points that isn't an encoding of some sort. If
that's the case, it's unacceptable as a memory representation.

> I am unaware of unsolveable problems with Unicode and Eastern
> languages, I asked specifically about it. If you think Unicode is
> unfixably flawed in this respect, I guess we all should write off
> Unicode now rather than later? Can you detail why Unicode is
> unacceptable as a single world wide unifying character set?
> Especially, are there character sets which cannot be converted to
> Unicode and back, which is the main requirement to have Unicode
> Strings in a non-Unicode environment?

Legacy data and performance.

-austin
unknown (Guest)
on 2006-06-18 00:25
(Received via mailing list)
On Jun 17, 2006, at 5:48 PM, Juergen Strobel wrote:
> The way I see it we have to choose a character set.

What leads you to this conclusion?  I don't think it can be refuted
that there exists today an almost endless number of character sets
and text encodings in use. I don't understand why the core facilities
of a language should be intimately tied to any one of those
representations.  Once you do that you've decided that all other
representations are second class citizens.  Why not have the language
be agnostic about these things but still provide a coherent framework
for building libraries and applications that can be locale and
encoding-aware?

Gary Wright
Michal Suchanek (Guest)
on 2006-06-18 00:49
(Received via mailing list)
On 6/17/06, Juergen Strobel <strobel@secure.at> wrote:
> On Sat, Jun 17, 2006 at 10:52:24PM +0900, Austin Ziegler wrote:
> > On 6/17/06, Juergen Strobel <strobel@secure.at> wrote:

> > mean that the other is treated as the defined encoding) *or* there is a
> > built-in conversion for them.
>
> Since I probably cannot control which Strings I get from libraries,
> and dont't want to worry which ones I'll have to provide to them, this
> is weaker than my approach in this respect, see my next point.

It's apparent from the explanation above.
You do not have to look at string encodings or worry about which
encoding they are in, as long as they are compatible (e.g. ISO-8859-1
and UTF-8) - there is a conversion for them. The string methods have to
use (internally) the encoding tag, and you can look if you are
interested. If the strings are incompatible, it is a real problem. Not
one created by the implementation, but one originating from the fact
that the strings cannot be automatically converted from one encoding to
another. But you can keep all your strings, even if they are in several
incompatible encodings. You are not limited to using just one
encoding.
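A sketch of the idea (using the tag-style method names later adopted by
1.9; note that the version that eventually shipped raises on
incompatible pairs rather than converting):

  a = "caf\xC3\xA9".force_encoding("UTF-8")
  b = "!"                  # US-ASCII, compatible with UTF-8
  (a + b).encoding         # => UTF-8, and nobody had to inspect the tags
  c = "\xE9".force_encoding("ISO-8859-1")
  a + c                    # incompatible bytes: an error, or a conversion
                           # if the implementation supplies one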


Michal
Julian 'Julik' Tarkhanov (Guest)
on 2006-06-18 01:17
(Received via mailing list)
On 17-jun-2006, at 23:55, Michal Suchanek wrote:

>
> First for reasons of efficiency. If an application is going to perform
> lots of slicing and poking on strings it will want some encoding that
> is suiatble for that such as UTF-32.
I would much prefer UTF-8 in a language such as Ruby, which is often
used as glue between other systems. UTF-8 is used for interchange, and
that's indisputable. If you go for UTF-16 or UTF-32, you are most
likely to convert every single character of the text files you read
(among text files in the wild, AFAIK, UTF-16 and UTF-32 are a minority,
thanks to the BOM and other setbacks).

> If an application runs on system
> with little memory it will want space-efficient encoding (ie UTF-8 or
> UTF-16 for Asian languages). And if an appliaction runs on system that
> uses some legacy codepage it can read, write, and process all strings
> in that codepage. And in JRuby it will be useful to convert strings to
> UTF-16 so that the native Java functions can be used for manipulation.
>
> In your model you can modify Ruby to use
> strings composed of TRON characters instead of Unicode characters. But
> how would Unicode Ruby and TRON Ruby exchange strings?

I think Alan Little summed it up very well. The problem with Unicode
in Ruby is the striving for perfection
(i.e. satisfying the users of every conceivable or needed encoding).
It's very noble, and I personally can't imagine it
(even with the "democratic coerce" approach Austin cited). I just
don't know whether a system with this type of handling can be built at
all, or how it will interoperate.

Up until now, all the scripting languages I have used (Perl, Python,
Ruby) allowed all encodings in strings, and doing Unicode in them hurts.

Bluntly put, I am selfish and I don't believe in the "saving grace"
of the M17N (because I just can't wrap it around my head and I sure
as hell know it's going to be VERY complex).
It's also the thing that bothers me most about Ruby's "unicode
discussions" (I've read all of them on this list dating back to 2002,
because I need it to work NOW): they
always degenerate into a kind of religious discussion in the spirit
of "but your encoding is not good enough", "but my bad encoding isn't
that one and I still need it to work", etc.

For me, the greatest thing about Unicode is that it's Just Good
Enough. And it doesn't seem Unicode is indeed THAT useless for CJK
languages either
(although I'm sure Paul can correct me - all 4 languages I have
command of use only 2 scripting systems, with some odd additions here
and there).

And no, I didn't have a chance to see a TRON system in the wild. If
someone would show me one within 200 km distance I would be glad to
take a look.
Julian 'Julik' Tarkhanov (Guest)
on 2006-06-18 01:20
(Received via mailing list)
On 18-jun-2006, at 0:21, Austin Ziegler wrote:
> Legacy data and performance.
Yes, you will spend those cycles to count the letters in my language
RIGHT :-)) (evil grin)
It's actually the most common way apps damage strings in my
language - their authors wanted to be smart
and _conserve_. And yes, normalization etc. is complex, and you DO
need to have a case-conversion table in memory. Please do have one
(Ruby doesn't).

No offense, just observation.
Stefan Lang (Guest)
on 2006-06-18 02:18
(Received via mailing list)
On Saturday 17 June 2006 23:55, Michal Suchanek wrote:
> On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote:
[...]
> And if an appliaction runs on system that uses some legacy codepage
> it can read, write, and process all strings in that codepage. And
> in JRuby it will be useful to convert strings to UTF-16 so that the
> native Java functions can be used for manipulation.

If you really need this level of efficiency, Ruby is probably
the wrong language anyway. Regarding JRuby: Of course each
implementation would be free to choose an internal Unicode
encoding. If somebody has enough time and motivation he can
even implement support for multiple encodings and let the user
choose at build-time.

[...]
> > Or do you mean Ruby should determine the encoding
> > automatically? IMO, that would be bad magic and error-prone.
>
> No. But if you read  part of html/xml document before the encoding
> was specified there  is no reason why that part hes to be converted
> or reread. You apparently got it right if you were able to
> determine the encoding from what you read.

The conversion would be done anyway, iff a single internal
encoding was chosen and iff the encoding of the input doesn't
match the internal encoding.

>
> That will be needed anyway. You cannot expect all libraries to use
> the arbitrary encoding you chose for Ruby strings.

I assume you mean C libraries here.
Austin Ziegler (austin)
on 2006-06-18 04:38
(Received via mailing list)
On 6/17/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> Yes, you will spend those cycles to count the letters in my language
> RIGHT :-)) (evil grin) It's actually the most common case when apps
> damage strings in my language - their authors wanted to be smart and
> _conserve_. And yes, normalization etc. is complex and you DO need to
> have a case-conversion table in memory. Please do have one (Ruby
> doesn't).

I think you're overthinking the problem. Let's consider the guarantees
that an m17n String would make:

  * #size and #length would return the number of glyphs
  * #[] would return glyphs

Presumably, in Regexen with an m17n String, \w would indicate only
"word" glyphs. Other guarantees *would* be made along that line.

Therefore, if your input data is UTF-8, anything that deals with #size,
#length, and character-based indexing *will just work*. The same will
apply to SJIS or any other encoding. The number of times that people are
dealing with mixed-encoding data is vanishingly small, and even when
a developer must, they will probably use a Unicode encoding to
deal with that. But if you're using SJIS, you're just going to want use
*that*.

That's what the m17n String is about. It's not about dictating a single
encoding, but enabling people to use Strings intelligently.
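Concretely, for UTF-8 input the guarantees would amount to something
like this (illustrative only, not a committed API):

  s = "más"
  s.size       # => 3 glyphs, even though s.bytesize would be 4
  s[1]         # => "á", a one-character String rather than a byte
  s.length     # => 3, same contract as #size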

> No offense, just observation.

I agree -- we *need* full Unicode support. But not at the cost of legacy
code pages in favour of Unicode. It's not always appropriate.

-austin
Tim Bray (Guest)
on 2006-06-18 06:19
(Received via mailing list)
On Jun 17, 2006, at 4:08 AM, Juergen Strobel wrote:

> 1. Strings should deal in characters (code points in Unicode) and not
> in bytes, and the public interface should reflect this.

Be careful.  People who care about this stuff might want to read
http://www.w3.org/TR/2005/REC-charmod-20050215/ It turns out that
characters do not correspond one-to-one with units of sound, or units
of input, or units of display.  Except for low-level stuff like
regexps, it's very difficult to write any code that goes
character-at-a-time that doesn't contain horrible i18n bugs. For
practical purposes, a String is a more useful basic tool than a
character.

> 5. Since the String class is quite smart already, it can implement
> generally useful and hard (in the domain of Unicode) operations like
> case folding, sorting, comparing etc.

Be careful.  Case folding is a horrible can of worms, is rarely
implemented correctly, and when it is (the Java library tries really
hard) is insanely expensive.  The reason is that case conversion is
not only language-sensitive but jurisdiction sensitive (in some
respects different in France & Québec).  Trying to do case-folding on
text that is not known to be ASCII is likely a symptom of a bug.

> - This ties Ruby's String to Unicode. A safe choice IMHO, or would we
> really consider something else? Note that we don't commit to a
> particular encoding of Unicode strongly.

For information: The XML view is that Shift-JIS, KOI8-R, EBCDIC, and
many others are all encodings of Unicode and a best effort should be
made to accept and emit all sane encodings on demand.  Most XML
software sticks to a single encoding, internally.

  -Tim
Tim Bray (Guest)
on 2006-06-18 06:29
(Received via mailing list)
On Jun 17, 2006, at 6:50 AM, Stefan Lang wrote:

> It seems that the main argument against using Unicode strings
> in Ruby is because Unicode doesn't work well for eastern
> countries.

Point of information: there are highly successful word-processing
products selling well in countries whose writing systems include Han
characters, which internally use Unicode.   So while the
Han-unification problems have been much discussed and are regarded as
important by people who are not fools, in fact there is existence
proof that Unicode does work well enough for wide deployment in
commercial software.

> If Unicode is choosen as character set, there is the
> question which encoding to use internally. UTF-32 would be a
> good choice with regards to simplicity in implementation,

UTF-32 has a practical problem in that in C code, you can't use
strcmp() and friends because it's full of null bytes.  Of course if
you're careful to code everything using wchar_t you'll be OK, but lots
of code isn't.  (UTF-8 doesn't have this problem and is much more
compact.)

> Consider
> indexing of Strings:
>
>         "some string"[4]
>
> If UTF-32 is used, this operation can internally be
> implemented as a simple, constant array lookup. If UTF-16 or
> UTF-8 is used, this is not possible to implement as an array

Correct.  But in practice this seems not to be too huge a problem,
since in practice text is most often accessed sequentially.  The
times that you really need true random access to the N'th character
are rare enough that for some problems, the advantages of UTF-8 are
big enough to compensate for this problem.  Note that in a
variable-length character encoding, there's no trouble whatever with a
table of pointers into text; the *only* problem is when you need to
find the Nth character cheaply.
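Today's 1.8 shows the trade-off nicely: indexing is byte-wise, and
character access needs a sequential scan:

  $KCODE = 'u'      # Ruby 1.8: make regexps treat strings as UTF-8
  s = "más"
  s[1]              # => 195, a raw byte from the middle of the two-byte "á"
  s.scan(/./)[1]    # => "á", because the regexp walks character by character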

> An advantage of using UTF-8 would be that for pure ASCII files
> no conversion would be necessary for IO.

Be careful.  There are almost no pure ASCII files left.  Café.
Ordoñez. “Smart quotes”.

  -Tim
Tim Bray (Guest)
on 2006-06-18 06:36
(Received via mailing list)
On Jun 17, 2006, at 6:52 AM, Austin Ziegler wrote:

>> The internal encoding has to be decided, probably between UTF-8,
>> UTF-16, and UTF-32 by the String class implementor.
>
> Completely disagree. Matz has the right choice on this one. You can't
> think in just terms of a pure Ruby implementation -- you *must* think
> in terms of the Ruby/C interface for extensions as well.

Point of information: Of all the widely-used methods of encoding
international strings, UTF-8 is by far the easiest to deal with in C.

> Trust me on this
> one: I *have* done some low-level encoding work. Additionally, even
> though I might have marked a network object as "UTF-8", I may not know
> whether it's *actually* UTF-8 or not until

That's an incredibly important point in a networked world.  One of
the reasons XML has had so much success, probably more than it
deserves, is that its encoding is self-descriptive.  To quote Larry
Wall: "An XML document knows what encoding it's in."  Since HTTP
headers are (sigh) known to be wrong on occasion, this is a pretty
big value-add.

>> - This ties Ruby's String to Unicode. A safe choice IMHO, or would we
>> really consider something else? Note that we don't commit to a
>> particular encoding of Unicode strongly.
>
> This is a wash. I think that it's better to leave the options open.
> After all, it *is* a hope of mine to have Ruby running on iSeries
> (AS/400) and *that* still uses EBCDIC.

EBCDIC is in fact an encoding of Unicode.  Just saying that it's
necessary to be clear both as to what character set is being
supported, and what limitations on encoding are enforced.

-Tim
Tim Bray (Guest)
on 2006-06-18 06:45
(Received via mailing list)
On Jun 17, 2006, at 10:34 AM, Stefan Lang wrote:

> Or do you mean Ruby should determine the encoding
> automatically? IMO, that would be bad magic and error-prone.

Not possible in the general case.  There are a few data formats
including XML and ASN.1, which make it possible to reliably infer the
encoding from the instance, but a lot of Web processing these days is
best-guess, and often fails.

> How does the regular expression engine work then?

The two sane options are
(a) have a fixed encoding for Strings and compile the regex in such a
way that it runs directly on the encoding.  This has been done for
both UTF-8 and UTF-16 and is insanely efficient, but it locks you
into the fixed encoding.
(b) have an iterator which produces abstract characters from whatever
encoding is in use and run the regex over the characters, not the
bytes of the representation.  The implementation is trickier and
performance is an issue, but you're not locked to an encoding.

-Tim
Tim Bray (Guest)
on 2006-06-18 06:51
(Received via mailing list)
On Jun 17, 2006, at 2:55 PM, Michal Suchanek wrote:

> First for reasons of efficiency. If an application is going to perform
> lots of slicing and poking on strings it will want some encoding that
> is suiatble for that such as UTF-32. If an application runs on system
> with little memory it will want space-efficient encoding (ie UTF-8 or
> UTF-16 for Asian languages).

Um, the practical experience is that the code required to unpack a
UTF-8 stream into a sequence of integer codepoints (and reverse the
process) is easy and very efficient; to the point that for "slicing
and poking", UTF-8 vs UTF-16 vs UTF-32 is pretty well a wash.
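Indeed, a complete decoder fits in a dozen lines of Ruby; this sketch
assumes well-formed UTF-8 and does no validation:

  def utf8_codepoints(s)
    bytes = s.unpack("C*")
    out, i = [], 0
    while i < bytes.size
      b = bytes[i]
      if    b < 0x80 then n, cp = 1, b          # single-byte (ASCII)
      elsif b < 0xE0 then n, cp = 2, b & 0x1F   # two-byte lead
      elsif b < 0xF0 then n, cp = 3, b & 0x0F   # three-byte lead
      else                n, cp = 4, b & 0x07   # four-byte lead
      end
      (1...n).each { |k| cp = (cp << 6) | (bytes[i + k] & 0x3F) }
      out << cp
      i += n
    end
    out
  end

  utf8_codepoints("más")  # => [109, 225, 115]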

  -Tim
Tim Bray (Guest)
on 2006-06-18 06:57
(Received via mailing list)
On Jun 17, 2006, at 3:15 PM, Austin Ziegler wrote:

> Why? What if the library that I'm interfacing with requires EUC-JP?
> Sorry, but Unicode is *not necessarily* the right answer.

Indeed it's not, but this argument escapes me.  If you try to feed that
library an Arabic string, something will break, because EUC-JP can't
represent Arabic.  So what?  Whatever character set(s) you
standardize on, there is going to be existing software that won't be
able to handle all of it... I'm just not following your argument.

  -Tim
Tim Bray (Guest)
on 2006-06-18 07:00
(Received via mailing list)
On Jun 17, 2006, at 3:22 PM, gwtmp01@mac.com wrote:

> be locale and encoding-aware?
I'm not close enough to Ruby to have a useful opinion, but for many
other software systems, the designers decided that the performance
and interoperability gains achievable by limiting themselves to
Unicode were a compelling enough argument, and so chose.

In particular, these days, both the W3C and the IETF overwhelmingly
specify the use of Unicode characters when text is to be included in
protocols or data delivery formats.  So even if you can handle lots
of non-Unicode stuff, the Net may have difficulty getting it to you.

  -Tim
Charles O Nutter (Guest)
on 2006-06-18 07:03
(Received via mailing list)
I'll chime back in with my not-so-expert opinion, so it's known where I
stand. Take it for whatever it's worth.

- I almost entirely agree with Juergen's longer post on what unicode
support should look like in 2.0. I won't go into the details of what I
disagree with because I'm a little squishy in those areas.

- I believe that supporting encoding-tagged strings would be a
horrible, horrible mess for both Ruby VM/interpreter implementers and
extension implementers, while not adding any serious benefits for Ruby
the language. When it comes down to it, you're going to have string A
using encoding X and string B using encoding Y, and in order to work
with them both together you'll have to find some common ground. Settle
on common ground early, or you pay the price to do it EVERY time you
work with strings later.

- I have no intention to ever write a C extension for Ruby. I know many
out there do. However, I think the important thing about Ruby is Ruby,
and making the language bend over backwards to make life easier for C
hackers is absurd. Making unicode support needlessly complex in Ruby
(the language) only ends up hurting its usability. I for one would not
want to sacrifice the beauty and simplicity of Ruby solely to appease
the C community. Flame on if you will, but The Ruby Way should rule
here.

- In the end, I should not have to care what encoding strings use
internally unless I absolutely have to know. Every time questions come
up about unicode support in Java, I have to look it up... UTF-8?
UTF-16? UCS-2? I rarely need to know this information, and I rarely
remember it. That's exactly the point. Make the one internal encoding
whatever is deemed most flexible, most performant, and above all *most
global*. Nobody writing Ruby code should have to care.

- I so rarely work with Strings on a character-by-character basis, and
when I do, all I should have to say is get_character and know that what
I have represents a full and complete character. If you're dealing with
bytes, call it what it is -- the aforementioned ByteBuffer. Ruby needs
to support the concepts of Strings and ByteBuffers independently.

I think it all comes back to a simple question: which method of
supporting unicode would feel the most "Ruby"? Which one is DRY and
KISS and all the other lovely acronyms this community holds so dear?
Figure that out, and there's your answer. I'd be willing to bet it's
not every-string-can-encode-differently, because I don't see how that
would ever help me write better Ruby code... and improving Ruby is the
point of all this, right?
Tim Bray (Guest)
on 2006-06-18 07:07
(Received via mailing list)
On Jun 17, 2006, at 4:15 PM, Julian 'Julik' Tarkhanov wrote:

> I would much rather prefer UTF-8 in a language such as Ruby which
> is often used as glue between
> other systems. UTF-8 is used for interchange and it's indisputable.
> If you go for UTF-16 or UTF-32, you are most likely
> to convert every single character of text files you read (in text
> files present in the wild AFAIK UTF-16 and UTF-32 are a minority,
> thanks to the BOM and other setbacks).

There's a lot of UTF-16 out there.  There's more ISO-8859-* than
that, and more Microsoft code-page-* text than everything else put
together.  Yes, with UTF-16 & -32 you do a lot of byte swapping but
it's pretty cheap and pretty reliable.  (I like UTF-8 too, but it's
not without issues).

  -Tim
Michal Suchanek (Guest)
on 2006-06-18 12:54
(Received via mailing list)
On 6/18/06, Stefan Lang <langstefan@gmx.at> wrote:
> > encoding that is suiatble for that such as UTF-32. If an
> encoding. If somebody has enough time and motivation he can
> even implement support for multiple encodings and let the user
> choose at build-time.

Why? Ruby can already handle UTF-8 strings or arrays of Unicode
codepoints. They just do not feel like strings in Ruby 1.8. What I
want is glue in the String class that makes them feel so.

> > had to read anyway?
> encoding was choosen and iff the encoding of the input doesn't
> match the internal encoding.

However, if you can choose the encoding, there is no need to recode at
all. You just keep the string as is, and there is a good chance the
output encoding will match the input encoding.

And in case you do need to recode the string, you have the encoding
information, and the recoding can be done automatically, and only when
needed.

Michal
Michal Suchanek (Guest)
on 2006-06-18 13:09
(Received via mailing list)
On 6/18/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> If you go for UTF-16 or UTF-32, you are most likely
> to convert every single character of text files you read (in text
> files present in the wild AFAIK UTF-16 and UTF-32 are a minority,
> thanks to the BOM and other setbacks).

Here you go. You can have the strings in UTF-8, and I can have them in
UTF-32. That is the flexibility of the solution without a fixed
encoding.

> > how would Unicode Ruby and TRON Ruby exchange strings?
>
> I think Alan Little summed it up very well. The problem with Unicode
> in Ruby is strive for perfection
> (i.e. satisfy the users of every conceivable or needed encoding).
> It's very noble and I personally can't imagine it
> (even with the "democratic coerce" approach Austin cited). The only
> thing I don't know if a system having this type of handling can be
> built at all and how it will interoperate.

But quite a few people here look like they do know. I do not know much
about regexes but I can imagine just about any other string operation.
And the current regexes already do operate on multiple encodings.

>
> Up until now all scripting languages I used somewhat (Perl, Python,
> Ruby) allowed all encodings in strings and doing Unicode in them hurts.

And how does that lead to the conclusion that there should be only one
encoding?

>
> Bluntly put, I am selfish and I don't believe in the "saving grace"
> of the M17N (because I just can't wrap it around my head and I sure
> as hell know it's going to be VERY complex).

That's the point. If it is wrapped into the string class you do not
have to wrap it around your head.

> It's also something that bothers me the most about Ruby's "unicode
> discussions" (I've read all of them on this list dating back to 2002
> because I need it to work NOW) and they
> always transcend into this kind of religious discussion in the spirit
> of "but your encoding is not good enough", "but my bad encoding isn't
> that one and I still need it to work" etc.

And that is exactly why a fixed encoding is bad. If strings can be
encoded in any way, there is no point in religious discussions about
which encoding you like the most.

>
> While for me the greatest thing about Unicode is that it's Just Good
> Enough. And it doesn't seem Unicode is indeed THAT useless for CJK
> languages either
> (although I'm sure Paul can correct me - all the 4 languages I am in
> control of use only 2 scripting systems with some odd additions here
> and there).

It is Just Good Enough for most cases, but not for all. It is not
useless for CJK, just suboptimal because of the Han unification. And it
also does not try to include the historic characters.

>
> And no, I didn't have a chance to see a TRON system in the wild. If
> someone would show me one within 200 km distance I would be glad to
> take a look.

I do not care. Some people find that encoding useful. Since the
potential to support any encoding, including TRON, does not get in the
way when I deal with my text, I am fine with that.

Michal
Juergen Strobel (Guest)
on 2006-06-18 16:18
(Received via mailing list)
On Sun, Jun 18, 2006 at 07:21:25AM +0900, Austin Ziegler wrote:
>
> Which code page? EBCDIC has as many code pages (including a UTF-EBCDIC)
> as exist in other 8-bit encodings.

Obviously, EBCDIC -> Unicode -> the same EBCDIC code page as before.

> >character class, and *that* was Java's main folly. (UCS-2 is a
> >strictly 16 bit per character encoding, but new Unicode standards
> >specify 21 bit characters, so they had to "extend" it).
>
> Um. Do you mean UTF-32? Because there's *no* binary representaiton of
> Unicode Character Code Points that isn't an encoding of some sort. If
> that's the case, that's unacceptable from a memory representation.

Yes, I do mean the String *interface* to be UTF-32, or pure code
points, which is the same but less susceptible to standard changes, if
accessed at the character level. If accessed at the substring level, a
substring of a String is obviously a String, and you don't need a
bitwise representation at all.

According to my proposal, Strings do not need an encoding from the
String user's point of view when working just with Strings, and users
won't care apart from memory/performance consumption, which I believe
can be made good enough with a totally encapsulated, internal storage
format to be decided later. I will avoid a premature optimization
debate here now.

Of course encoding matters when Strings are read or written somewhere,
or converted to bit-/bytewise representation explicitly. The Encoding
Framework, however it'll look, needs to be able to convert to and from
Unicode code points for these operations only, and not between
arbitrary encodings. (You *may* code this to recode directly from
the internal storage format for performance reasons, but that'll be
transparent to the String user.)
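A sketch of the shape I have in mind (all names hypothetical, not a
proposal for concrete method names):

  s = String.decode(bytes, "EUC-JP")  # encodings appear only at the borders
  c = s[0]              # => a one-character String; no encoding parameter
                        #    anywhere in pure String-String operations
  s.encode_to("UTF-8")  # => bytes again, only on the way out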

This breaks down for characters not represented in Unicode at all, and
is a nuisance for some characters affected by the Han Unification
issue.  But Unicode set out to prevent exactly this, and if we
believe in Unicode at all, we can only hope they'll fix this in an
upcoming revision. Meanwhile we could map any additional characters
(or sets of them) we need to higher, unused Unicode planes; that'll be
no worse than having different, possibly incompatible kinds of Strings.

We'll need an additional class for pure byte vectors, or just use
Array for this kind of work, and I think this is cleaner.

Regarding Java, they switched from UCS-2 to UTF-16 (mostly). UCS-2 is
a pure 16-bits-per-character encoding and cannot represent codepoints
above 0xFFFF. UTF-16 works like UTF-8, but with 16-bit chunks.  But
their abstraction of a single character, the class Char(acter), is
still only 16 bits wide, which leads to confusion, and is similar to
the C type char, which cannot represent all real characters either. It
is even worse than in C, because C explicitly defines char to be a
memory cell of 8 bits or more, whereas Java really meant Char to be a
character.
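To illustrate with a character outside the Basic Multilingual Plane
(pack("U") builds UTF-8 in today's Ruby, while the encode call is an
assumed 1.9-style conversion):

  clef = [0x1D11E].pack("U")        # MUSICAL SYMBOL G CLEF, above 0xFFFF
  clef.unpack("U*").size            # => 1 codepoint, one real character
  clef.encode("UTF-16BE").bytesize  # => 4: a surrogate pair, i.e. two of
                                    #    Java's 16-bit Chars for one character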

> >I am unaware of unsolveable problems with Unicode and Eastern
> >languages, I asked specifically about it. If you think Unicode is
> >unfixably flawed in this respect, I guess we all should write off
> >Unicode now rather than later? Can you detail why Unicode is
> >unacceptable as a single world wide unifying character set?
> >Especially, are there character sets which cannot be converted to
> >Unicode and back, which is the main requirement to have Unicode
> >Strings in a non-Unicode environment?
>
> Legacy data and performance.

Map legacy data, that is, characters still not in Unicode, to a high
plane in Unicode. That way all characters can be used together all the
time. When Unicode includes them, we can change that to the official
code points. Note there are no files in String's internal storage
format, so we don't have to worry about reencoding them.

I am not worried about performance. I'd code in C if I were, or
Lisp.

For one, Moore's law is at work, and my whole proposal was for 2.0. My
proposal only adds a constant factor to String handling; it doesn't
have higher-order complexity.

On the other hand, conversions need to be done at other times with my
proposal than with M17N Strings, and it depends on the application
whether that is more or less often.  String-String operations never
need to do recoding, as opposed to M17N Strings. I/O always needs
conversion, and may need conversion with M17N too. I have a hunch that
allowing different kinds of Strings around (as in M17N, presumably)
would require recoding far more often.

Jürgen
Juergen Strobel (Guest)
on 2006-06-18 16:49
(Received via mailing list)
On Sun, Jun 18, 2006 at 07:22:34AM +0900, gwtmp01@mac.com wrote:
> be agnostic about these things but still provide a coherent framework
> for building libraries and applications that can be locale and
> encoding-aware?
>
> Gary Wright
>

Maybe I was unclear. I didn't mean Ruby has to choose an existing
standard, but Ruby has to choose which set of characters to handle in
Strings, in the mathematical sense.

Language implementation, and usage of the String class should be
easier if this set is

- well defined

Unicode code points are pretty good in this respect, better than the
union of all characters in all encodings of possible M17N Strings.
And we may use private extensions to Unicode for legacy characters not
included in Unicode already.

- All characters are equally allowed in all Strings.

M17N fails this one. a[5] = b[3] if their encodings are incompatible?

At best it'll coerce a to an encoding which can handle both, which
would be Unicode 98% of the time anyway, 1% something else, and 1%
a total failure. Don't nail me down on the numbers.

Mathematically, String functions should be defined on the whole set,
not subsets, or their application becomes a chore.

Jürgen
Julian 'Julik' Tarkhanov (Guest)
on 2006-06-18 16:56
(Received via mailing list)
On 18-jun-2006, at 6:17, Tim Bray wrote:
>
> Be careful.  Case folding is a horrible can of worms, is rarely
> implemented correctly, and when it is (the Java library tries
> really hard) is insanely expensive.  The reason is that case
> conversion is not only language-sensitive but jurisdiction
> sensitive (in some respects different in France & Québec).  Trying
> to do case-folding on text that is not known to be ASCII is likely
> a symptom of a bug.

Let's write a specification.
Austin Ziegler (austin)
on 2006-06-18 17:32
(Received via mailing list)
On 6/18/06, Juergen Strobel <strobel@secure.at> wrote:
> On Sun, Jun 18, 2006 at 07:21:25AM +0900, Austin Ziegler wrote:
>> Um. Do you mean UTF-32? Because there's *no* binary representaiton of
>> Unicode Character Code Points that isn't an encoding of some sort. If
>> that's the case, that's unacceptable from a memory representation.
> Yes, I do mean the String *interface* to be UTF-32, or pure code
> points which is the same but less suscept to to standard changes, if
> accessed at character level. If accessed at substring level, a
> substring of a String is obviously a String, and you don't need a
> bitwise representation at all.

Again, this is completely unacceptable from a memory usage perspective.
I certainly don't want my programs taking up 4x the memory just for
string handling.

But "pure code points" is a red herring and a mistake in any case. Code
points aren't sufficient. You need glyphs, and some glyphs can be
produced with multiple code points (e.g., LOWERCASE A + COMBINING ACUTE
ACCENT as opposed to A ACUTE). Indeed, some glyphs can *only* be
produced with multiple code points. Dealing with this intelligently
requires a *lot* of smarts, but it's precisely what we should do.
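For instance, with 1.9-style escapes (an assumption at the time of
writing):

  "\u00E1".unpack("U*").length   # => 1 codepoint: A ACUTE
  "a\u0301".unpack("U*").length  # => 2 codepoints: A + COMBINING ACUTE
                                 #    ACCENT, yet a single glyph on screen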

> According to my proposal, Strings do not need an encoding from the
> String user's point of view when working just with Strings, and users
> won't care apart from memory/performance consumption, which I believe
> can be made good enough with a totally encapsulted, internal storage
> format to be decided later. I will avoid a premature optimization
> debate here now.

Again, you are incorrect. I *do* care about the encoding of each String
that I deal with, because only that allows me (or String) to deal with
conversions appropriately. Granted, *most* of the time, I won't care.
But I do work with legacy code page stuff from time to time, and
pronouncements that I won't care are just arrogance or ignorance.

> Of course encoding matters when Strings are read or written somewhere,
> or converted to bit-/bytewise representation explicitly. The Encoding
> Framework, however it'll look, needs to be able to convert to and from
> Unicode code points for these operations only, and not between
> arbitrary encodings. (You *may* code this to recode directly from
> the internal storage format for performance reasons, but that'll be
> transparent to the String user.)

I prefer arbitrary encoding conversion capability.

> This breaks down for characters not represented in Unicode at all, and
> is a nuisance for some characters affected by the Han Unification
> issue.  But Unicode set out to prevent exactly this, and if we
> beleieve in Unicode at all, we can only hope they'll fix this in an
> upcoming revision. Meanwhile we could map any additional characters
> (or sets of) we need to higher, unused Unicode plains, that'll be no
> worse than having different, possibly incompatible kinds of Strings.

Those choices aren't ours to make.

> We'll need an additional class for pure byte vectors, or just use
> Array for this kind of work, and I think this is cleaner.

I don't. Such an additional class adds unnecessary complexity to
interfaces. This is the *main* reason that I oppose the foolish choice
to pick a fixed encoding for Ruby Strings.

>> Legacy data and performance.
> Map legacy data, that is characters still not in Unicode, to a high
> Plane in Unicode. That way all characters can be used together all the
> time. When Unicode includes them we can change that to the official
> code points. Note there are no files in String's internal storage
> format, so we don't have to worry about reencoding them.

Um. This is the statement of someone who is ignoring legacy issues.
Performance *is* a big issue when you're dealing with enough legacy
data. Don't punish people because of your own arrogance about encoding
choices.

Again: Unicode Is Not Always The Right Choice. Anyone who tells you
otherwise is selling you a Unicode toolkit and only has their wallet in
mind. Unicode is *often* the right choice, but it's *not* the only
choice and there are times when having the *flexibility* to work in
other encodings without having to work through Unicode as an
intermediary is the right choice. And from an API perspective,
separating String and "ByteVector" is a mistake.

> On the other hand, conversions needs to be done at other times with my
> proposal than for M17N Strings, and it depends on the application if
> that is more or less often.  String-String operations never need to do
> recoding, as opposed to M17N Strings. I/O always needs conversion, and
> may need conversion with M17N too. I havea a hunch that allowing
> different kinds of Strings around (as in M17N presumely) should
> require recoding far more often.

Unlikely. Mixed-encoding data handling is uncommon.

-austin
Julian 'Julik' Tarkhanov (Guest)
on 2006-06-18 17:32
(Received via mailing list)
On 18-jun-2006, at 13:08, Michal Suchanek wrote:
>
> But quite a few people here look like they do know. I do not know much
> about regexes but I can imagine just about any other string operation.
> And the current regexes already do operate on multiple encodings.
Oh, lord... Have you at least tried it before making such assumptions?
In other words, tell me, can Ruby's regexes cope with the following:

/[а-я]/
/[а-я]/i

or something like this:
http://rubyforge.org/cgi-bin/viewvc.cgi/icu4r/samp...?
revision=1.2&root=icu4r&view=markup
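For reference, a quick check under 1.8: the case-insensitive one is the
killer, since /i folding there is ASCII-only:

  $KCODE = 'u'
  "ПРИВЕТ" =~ /[а-я]/i   # => nil; uppercase Cyrillic never matches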

>
>
> And how that leads to the conclusion that there should be only one
> encoding?
Very simply - I use many pieces of software written in many languages
all the time, with non-Latin text.
I know that when they want to get "historically compatible" problems
arise. And the software that settles on Unicode
internally or somehow enforces it on the programmer usually works
best (all Cocoa and all C#. And to a certain extens yes, Java).

>
>>
>> Bluntly put, I am selfish and I don't believe in the "saving grace"
>> of the M17N (because I just can't wrap it around my head and I sure
>> as hell know it's going to be VERY complex).
>
> That's the point. If it is wrapped into the string class you do not
> have to wrap it around your head.

This is rather naive.
>
> And that is eaxctly why a fixed encoding is bad. If strings can be
> encoded in any way there is no point i religious discussions which
> encoding you like the most.

Yes, it just becomes hard and error-prone to process them.
>
> It is JustGoodEnouhg for most cases but not for all. It is not useless
> for CJK, just suboptimal because of the Han unification. And it also
> does not try to include the historic characters.

I think this thread is going to end the same way the one in 2002 did.
Yukihiro Matsumoto (Guest)
on 2006-06-18 18:37
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Sun, 18 Jun 2006 23:46:40 +0900, Juergen Strobel
<strobel@secure.at> writes:

|Language implementation, and usage of the String class should be
|easier if this set is
|
|- well defined
|- All characters are equally allowed in all Strings.

I understand these attributes might make implementation easier.   But
who cares if I don't care.  And I am not sure how these make usage
easier, really.

Somebody who owns gigabytes of text data in a legacy encoding (e.g. me)
wants to avoid encoding conversion back and forth between Unicode and
the legacy encoding every time.  Another somebody wants to do text
processing on historical text whose character set is far bigger than
Unicode.  The "well-defined" simple implementation just prohibits those
demands.  On the contrary, the M17N approach does not prevent a
Universal Character Set solution.  You just need to choose Unicode
(UTF-8 or UTF-16) as the internal string representation, and convert
encodings on I/O, as you might have done in Unicode-centric languages.
Nothing lost.

You may worry about implementation difficulty (and performance), but
don't.  It's _my_ concern.  I made a prototype, and am convinced
that I can implement it with acceptable performance.

|Unicode code points are pretty good in this respect, better than the
|union of all characters in all encodings of possible M17N Strings.
|And we may use private extensions to Unicode for legacy characters not
|included in Unicode already.

"private extensions".  No.  It just cause another nightmare.

							matz.
Tim Bray (Guest)
on 2006-06-18 19:27
(Received via mailing list)
On Jun 18, 2006, at 8:29 AM, Austin Ziegler wrote:

> You need glyphs, and some glyphs can be
> produced with multiple code points (e.g., LOWERCASE A + COMBINING
> ACUTE ACCENT as opposed to A ACUTE).

This is another thing you need your String class to be smart about.
You want an equality test between "más" and "más" to always be true
even if their "á" characters are encoded differently.  The right way to
solve this is called "Early Uniform Normalization" (see
http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization); the
idea is you normalize the composed characters at the time you create
the string; then the internal equality test can be done with strcmp()
or equivalent.
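A sketch of the two spellings, using 1.9-style escapes and a
unicode_normalize helper in the style of much later Rubies (both are
assumptions here):

  precomposed = "\u00E1"    # á as one codepoint, A ACUTE
  decomposed  = "a\u0301"   # a + COMBINING ACUTE ACCENT
  precomposed == decomposed                           # => false, bytewise
  decomposed.unicode_normalize(:nfc) == precomposed   # => true once normalized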

>> Map legacy data, that is characters still not in Unicode, to a high
>> Plane in Unicode. That way all characters can be used together all
>> the
>> time. When Unicode includes them we can change that to the official
>> code points. Note there are no files in String's internal storage
>> format, so we don't have to worry about reencoding them.
>
> Um. This is the statement of someone who is ignoring legacy issues.
> Performance *is* a big issue when you're dealing with enough legacy
> data.

Note that you don't have to use a high plane.  The Private Use Area
in the Basic Multilingual Plane has 6,400 code points, which is quite
a few.  Even if you did use a high plane, it's not obvious there'd be
a detectable runtime performance penalty.

>  Unicode is *often* the right choice, but it's *not* the only
> choice and there are times when having the *flexibility* to work in
> other encodings without having to work through Unicode as an
> intermediary is the right choice.

That may be the case.  You need to do a cost-benefit analysis; you
could buy a lot of simplicity by decreeing all-Unicode-internally;
would the benefits of allowing non-Unicode characters be big enough
to to compensate for the loss of simplicity?  I don't know the
answer, but it needs thinking about.

  -Tim
Christian Neukirchen (Guest)
on 2006-06-18 21:21
(Received via mailing list)
Tim Bray <tbray@textuality.com> writes:

> is you normalize the composed characters at the time you create the
> string, then the internal equality test can be done with strcmp() or
> equivalent.

Does that mean that  binary.to_unicode.to_binary != binary  is possible?
That could turn out pretty bad, no?
A2b2f4ee23989dc68529baef9cbddcd6?d=identicon&s=25 Julian 'Julik' Tarkhanov (Guest)
on 2006-06-18 21:33
(Received via mailing list)
On 18-jun-2006, at 21:17, Christian Neukirchen wrote:

>> solve this is called "Early Uniform Normalization" (see http://
>> www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization); the idea
>> is you normalize the composed characters at the time you create the
>> string, then the internal equality test can be done with strcmp() or
>> equivalent.
>
> Does that mean that  binary.to_unicode.to_binary != binary  is
> possible?
> That could turn out pretty bad, no?

And it does, as long as you are not careful. One of the things I do is
normalize everything that comes IN into something that is suitable and
predictable.
87fe25bf0272d8ad886dda793bdcbbd9?d=identicon&s=25 Tim Bray (Guest)
on 2006-06-18 22:53
(Received via mailing list)
On Jun 18, 2006, at 12:17 PM, Christian Neukirchen wrote:

> possible?
> That could turn out pretty bad, no?

Yes, but having "más" != "más" is pretty bad too; the alternative is
normalizing at comparison time, which would really hurt, for example,
in a big sort, so you'd need to cache the normalized form, which
would be a lot more code.
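
A sketch of the cached variant (unicode_normalize is an assumed name):

   names = ["ma\u0301s", "m\u00E1s", "mas"]
   # sort_by computes each normalized key exactly once, instead of
   # re-normalizing inside every comparison of the sort
   sorted = names.sort_by { |s| s.unicode_normalize(:nfc) }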

binary.to_unicode looks a little weird to me... can you do that
without knowing what the binary is?  If it's text in a known
encoding, no breakage should occur.  If it's unknown bit patterns,
you can't really expect anything sensible to happen... or am I
missing an obvious scenario?  -Tim
C914fa463a2b1b067586c6432b12a824?d=identicon&s=25 Juergen Strobel (Guest)
on 2006-06-18 22:53
(Received via mailing list)
On Sat, Jun 17, 2006 at 11:24:45PM +0900, Austin Ziegler wrote:
> On 6/17/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> >On 17-jun-2006, at 15:52, Austin Ziegler wrote:
> >>>8. Because Strings are tightly integrated into the language with the
> >>>source reader and are used pervasively, much of this cannot be
> >>>provided by add-on libraries, even with open classes. Therefore the
> >>>need to have it in Ruby's canonical String class. This will break
> >>>some old uses of String, but now is the right time for that.
> >>"Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1.

My title was "A Plan for Unicode Strings in Ruby 2.0". I don't want to
rush things or break 1.8.

Jürgen
C914fa463a2b1b067586c6432b12a824?d=identicon&s=25 Juergen Strobel (Guest)
on 2006-06-18 23:15
(Received via mailing list)
On Mon, Jun 19, 2006 at 01:33:54AM +0900, Yukihiro Matsumoto wrote:
>
> solution.  You just need to choose Unicode (UTF-8 or UTF-16) as
> internal string representation, and convert encoding on I/O as you
> might have done in Unicode centric languages.  Nothing lost.
>
> You may worry about implementation difficulty (and performance), but
> don't.  It's _my_ concern.  I made a prototype, and have convinced
> that I can implement it with acceptable performance.

I never worried about performance much, that's Austin. :P

Thanks for clarifying that. So far I could not find much info on how
exactly M17N will work, especially on the role of the encoding tag, so
I had to guess a lot.

Given your explanation, it seems our ways are quite similar on the
interface side of things, as far as Unicode is concerned. You chose a
more powerful (and more complex) parametric class design where I
would have left open only the possibility of transparently usable
subclasses for performance reasons.

I am happy we've worked that out now. And you are right, I am not that
much interested in the implementation, thank you for doing it. My
concern was with the interface of the String class, but several
posters misunderstood me and tried to draw me into implementation
issues.

Jürgen
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-19 01:02
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 19 Jun 2006 00:29:46 +0900, Julian 'Julik' Tarkhanov
<listbox@julik.nl> writes:

|In other words, tell me, can Ruby's regexes cope with the following:
|
|/[а-я]/
|/[а-я]/i

1.9 Oniguruma regexp engine should handle these, otherwise it's a bug.

							matz.
A2b2f4ee23989dc68529baef9cbddcd6?d=identicon&s=25 Julian 'Julik' Tarkhanov (Guest)
on 2006-06-19 01:11
(Received via mailing list)
On 19-jun-2006, at 1:00, Yukihiro Matsumoto wrote:

>
> 1.9 Oniguruma regexp engine should handle these, otherwise it's a bug.


I'll try to check. Oniguruma on 1.8.4 didn't cope, but maybe it just
wasn't hooked in properly.
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-19 01:57
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 19 Jun 2006 08:09:29 +0900, Julian 'Julik' Tarkhanov
<listbox@julik.nl> writes:

|> |/[а-я]/
|> |/[а-я]/i
|>
|> 1.9 Oniguruma regexp engine should handle these, otherwise it's a bug.
|
|I'll try to check. Oniguruma on 1.8.4 didn't cope, but maybe it just
|wasn't hooked in properly.

If you have any problem, send us a report with what you expect and
what you get.

							matz.
A2b2f4ee23989dc68529baef9cbddcd6?d=identicon&s=25 Julian 'Julik' Tarkhanov (Guest)
on 2006-06-19 03:34
(Received via mailing list)
On 19-jun-2006, at 1:56, Yukihiro Matsumoto wrote:

> a bug.
> |
> |I'll try to check. Oniguruma on 1.8.4. didn't cope, but maybe it just
> |weren't hooked in properly.
>
> If you have any problem, send us a report with what you expect and
> what you get.

Well, I tried on the CVS latest (1.9) and I get:

irb(main):011:0> "НЕБлагодарНая" =~ /[а-я]/i
=> 6 (should be zero)

That is - character classes work, casefolding doesn't.
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-19 06:06
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 19 Jun 2006 10:32:08 +0900, Julian 'Julik' Tarkhanov
<listbox@julik.nl> writes:

|Well, I tried on the CVS latest (1.9) and I get:
|
|irb(main):011:0> "НЕБлагодарНая" =~ /[а-я]/i
|=> 6 (should be zero)
|
|That is - character classes work, casefolding doesn't.

I found out that Oniguruma casefolding works only for characters
within iso8859-*.  Considering the size of the casefolding table, it
is a compromise for the time being.  I will fix this in the future.

							matz.
A2b2f4ee23989dc68529baef9cbddcd6?d=identicon&s=25 Julian 'Julik' Tarkhanov (Guest)
on 2006-06-19 07:24
(Received via mailing list)
On 19-jun-2006, at 6:05, Yukihiro Matsumoto wrote:
> |
> |That is - character classes work, casefolding doesn't.
>
> I found out that Oniguruma casefolding works only for characters
> within iso8859-*.  Considering the size of the casefolding table, it
> is a compromise for the time being.  I will fix this in the future.

Thanks for the clarification :-)
D57f4a4788599a38494865a121f16bbe?d=identicon&s=25 Dmitry Severin (Guest)
on 2006-06-19 07:58
(Received via mailing list)
Correct me if I'm wrong, but for Matz's plan on M17N, the summary is:
1. String internally will remain the same: char *ptr, long len - in bytes
2. String instances will have an encoding tag
3. All String/Regexp methods will respect that encoding tag and return
char (glyph) indexes
4. Methods like byte_size, codepoints, each_char, each_codepoint will be
introduced(?)
5. slice will always accept char indices and return substrings

I'd say that WOULD BE GOOD, and with methods like
String#enforce_encoding!(encoding) and String#coerce_encoding!(otherstring)
it won't require developers (for C extensions also) to look at the
encoding tag, just set it when needed.
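
For illustration, the intended division of labour between the two
(names and semantics hypothetical, following the summary above):

   s = "caf\xE9"                       # raw bytes, e.g. from a socket
   s.enforce_encoding!("ISO-8859-1")   # just sets the tag, bytes untouched
   u = "caf\u00E9"                     # a UTF-8-tagged string
   s.coerce_encoding!(u)               # transcodes s so it can mix with u
   (s + u).encoding                    # => "UTF-8"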

But I can see several implementation issues and possible options that
should be considered:
- what will happen if one tries to perform str1.operation(str2) on two
strings with different encodings:
  a) raise an exception
  b) silently coerce one or both strings to some "compatible"
charset/encoding, update the encoding of the result, replacing
non-convertible chars using fallback mappings? (ouch, this can be split
into a set of options)
  c) same as b) but raise an exception if lossless conversion is not
possible?
  d) same as b) but warn if lossless conversion is not possible?
  e) downgrade the encoding tag of the acceptor to "raw/bytes" and
process it?

- what will happen if one changes the encoding tag of a String instance:
  a) check and raise an exception if the current bytes don't represent a
valid encoding sequence?
  b) just set the new tag?
  c) convert the byte sequence to the given encoding, using fallback
mappings?

- what to do with IO:
  a) IO will return strings in "raw/bytes"?
  b) IO can be tagged and will return Strings with the given encoding tag?
  c) IO can be tagged and is by default tagged with a global encoding tag?
  d) IO can be tagged, but is not tagged by default, although methods
returning strings (such as read, readlines) will use the global encoding
tag?
  e) if IO is tagged and one tries to write to it a String with a
different encoding, what will happen?

- what will be the default encoding tag for new Strings:
  a) "raw/bytes"
  b) derived from system properties of the host platform
  c) option b), overridable in the application (btw, $KCODE, as present,
must definitely go away!!!)

- how to process source code files:
  a) restrict them to ASCII and require all non-ASCII strings to be
externalized?
  b) process them as "raw/bytes"?
  c) introduce some kind of commented pragma for source files allowing
one to set the encoding?

- at present the Ruby parser can parse only sources in an ASCII
compatible encoding.  Would that change?

- what encodings will Numeric#to_s, Time#to_s etc. produce, and what
encodings must String#to_f, String#to_i accept?

On Unicode:
- case-independent canonical string matches/searches DO MATTER.  And for
encodings that encode variants of glyphs at different codepoints, a
"variant-insensitive" search is, in my view, desirable.  Will there be
such functionality?

- string comparison: will <=> use at least UCA rules for Unicode
strings, or will only byte-order comparisons stay?

- is_digit, is_space, is_alpha, is_foobarbaz etc. could matter when
writing a custom parser.  Will those methods be provided for one-char
strings?


Yes, this is a short and incomplete list, but you should get my point:
it's not that easy -- there are dozens of decisions, with their pros
and cons, to be made and implemented :(
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-19 09:57
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 19 Jun 2006 14:57:22 +0900, "Dmitry Severin"
<dmitry.severin@gmail.com> writes:

|But I can see several implementation issues and possible options that
|should be considered:

Thank you for the ideas.

|- what will happen if one tries to perform str1.operation(str2) on two
|strings with different encodings:
|  a) raise an exception
|  b) silently coerce one or both strings to some "compatible"
|charset/encoding, update the encoding of the result, replacing
|non-convertible chars using fallback mappings? (ouch, this can be split
|into a set of options)
|  c) same as b) but raise an exception if lossless conversion is not possible?
|  d) same as b) but warn if lossless conversion is not possible?
|  e) downgrade the encoding tag of the acceptor to "raw/bytes" and process it?

a), unless either of the strings is "ascii" and the other is "ascii"
compatible.  This point is arguable.

|- what will happen if one changes the encoding tag of a String instance:
|  a) check and raise an exception if the current bytes don't represent a
|valid encoding sequence?
|  b) just set the new tag?
|  c) convert the byte sequence to the given encoding, using fallback mappings?

b), encoding conformance checking shall be done lazily.  I think
there's a need for an explicit encoding conformance check method.
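
So retagging stays cheap, and validity is paid for only when asked
for, e.g. (both method names are assumptions):

   s = "\xE9 alone"
   s.enforce_encoding!("UTF-8")   # b): just sets the tag, no scan
   s.valid_encoding?              # => false, 0xE9 starts an incomplete
                                  #    UTF-8 sequence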

|- what to do with IO:
|  a) IO will return strings in "raw/bytes"?
|  b) IO can be tagged and will return Strings with the given encoding tag?
|  c) IO can be tagged and is by default tagged with a global encoding tag?
|  d) IO can be tagged, but is not tagged by default, although methods
|returning strings (such as read, readlines) will use the global encoding tag?
|  e) if IO is tagged and one tries to write to it a String with a different
|encoding, what will happen?

c), the global default shall be set from the locale setting.
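
For illustration (the tagging API here is an assumption):

   io   = File.open("data.txt", "r:EUC-JP")  # explicitly tagged IO
   line = io.gets                            # => String tagged EUC-JP
   sin  = STDIN.gets                         # => tagged per the locale default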

|- what will be the default encoding tag for new Strings:
|  a) "raw/bytes"
|  b) derived from system properties of the host platform
|  c) option b), overridable in the application (btw, $KCODE, as present,
|must definitely go away!!!)

Encodings for literal strings are set by a pragma.

|- how to process source code files:
|  a) restrict them to ASCII and require all non-ASCII strings to be
|externalized?
|  b) process them as "raw/bytes"?
|  c) introduce some kind of commented pragma for source files allowing
|one to set the encoding?

1.9 already has an encoding pragma a la Python's PEP 263.
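
For example, a file could start with:

   # -*- coding: utf-8 -*-
   s = "grüße"   # this literal is tagged UTF-8 because of the pragma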

|- at present the Ruby parser can parse only sources in an ASCII
|compatible encoding.  Would that change?

No.  Ruby would not allow scripts in EBCDIC or UTF-16, although it
allows processing of those encodings.

|- what encodings will Numeric#to_s, Time#to_s etc. produce, and what
|encodings must String#to_f, String#to_i accept?

Good point.  Currently, I think they should work on ASCII.

|On Unicode:
|- case-independent canonical string matches/searches DO MATTER.  And for
|encodings that encode variants of glyphs at different codepoints, a
|"variant-insensitive" search is, in my view, desirable.  Will there be
|such functionality?

Casefold search/match will be provided for Regexp.  "variant
insensitive" search should be accomplished by explicit normalization
or collation.

|- string comparison: will <=> use at least UCA rules for Unicode
|strings, or will only byte-order comparisons stay?

Byte order comparison.  UCA rules or such should be applied explicitly
via normalization or collation.
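
A one-line illustration of the difference (the \u escape is assumed):

   ["z", "\u00E9"].sort   # => ["z", "é"] - byte order puts é after z,
                          #    though a collation would not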

|- is_digit, is_space, is_alpha, is_foobarbaz etc. could matter, when writing
|a custom parser. Will those methods be provided for one-char strings?

Those functions will be provided via Regexp.  I am not sure if we will
provide character classification methods for strings.

							matz.
7264fb16beeea92b89bb42023738259d?d=identicon&s=25 Christian Neukirchen (Guest)
on 2006-06-19 13:17
(Received via mailing list)
Tim Bray <tbray@textuality.com> writes:

>>
> without knowing what the binary is?  If it's text in a known
> encoding, no breakage should occur.  If it's unknown bit patterns,
> you can't really expect anything sensible to happen... or am I
> missing an obvious scenario?  -Tim

Those were just fictive method calls.  But let's say I read from
a pipe and I know it contains UTF-16 with BOM, then .to_unicode
would make perfect sense, no?

In case of binary bit patterns, I would sooner or later expect some
kind of EncodingError, given this API.  (I haven't yet seen drafts of
what the API will really look like.)
F889bf17449ffbf62345d2b2d316a937?d=identicon&s=25 Michal Suchanek (Guest)
on 2006-06-19 14:40
(Received via mailing list)
On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |- what will happen if one tries to perfom str1.operation(str2) on two
> compatible.  This point is arguable.
What is "ascii"? Specifically I would like string operations to suceed
in cases when both strings are encoded as different subset of Unicode
(or anything else). ie concatenating an ISO-8859-2 and an ISO-8859-1
string sould result in UTF-* string, not an error.

However, this would make the errors from incompatible encodings more
surprising as they would be very infrequent.

I wonder what operations on raw strings (ones without specified
encoding) would do. Or where one of the strings is raw, and the other
is not.


> c), the global default shall be set from locale setting.
>

I am not sure this is good for network IO as well. For diagnostics it
might be useful to set the default to none, and have string raise an
exception when such strings are combined with other strings.

It is only obvious for STDIN and STDOUT that they should follow the
locale setting.

Hmm, but one would need to consider carefully which operations should
work on raw strings and which should not. Perhaps it is not as nice as
it looks at first glance.

Thanks

Michal
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-19 15:02
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 19 Jun 2006 21:39:33 +0900, "Michal Suchanek"
<hramrach@centrum.cz> writes:

|> a), unless either of the strings is "ascii" and the other is "ascii"
|> compatible.  This point is arguable.
|
|What is "ascii"? Specifically, I would like string operations to succeed
|in cases when both strings are encoded as different subsets of Unicode
|(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
|string should result in a UTF-* string, not an error.

Every encoding has an attribute named ascii_compat.  EUC_JP, SJIS,
ISO-8859-* and UTF-8 are declared ascii compatible, whereas EBCDIC,
UTF-16 and UTF-32 are not.  No other auto conversion shall be done,
since we don't particularly encourage a mixed encoding model.
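
For illustration, the attribute might be queried like this (the
Encoding class and the accessor name are assumptions):

   Encoding::UTF_8.ascii_compatible?     # => true
   Encoding::EUC_JP.ascii_compatible?    # => true
   Encoding::UTF_16BE.ascii_compatible?  # => false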

|> |- what to do with IO:
|> |  a) IO will return strings in "raw/bytes"?
|> |  b) IO can be tagged and will return Strings with the given encoding tag?
|> |  c) IO can be tagged and is by default tagged with global encoding tag?
|> |  d) IO can be tagged, but is not tagged by default, although methods
|> |returning strings (such as read, readlines) will use global encoding tag?
|> |  e) if IO is tagged and one tries to write to it a String with different
|> |encoding, what will happen?
|>
|> c), the global default shall be set from locale setting.
|
|I am not sure this is good for network IO as well. For diagnostics it
|might be useful to set the default to none, and have string raise an
|exception when such strings are combined with other strings.
|
|It is only obvious for STDIN and STDOUT that they should follow the
|locale setting.

Restricting the locale-derived default encoding to STDIO may be a good
idea.  There are still open issues, since the default encoding from the
locale is not covered by the prototype, so we need more experience.

							matz.
2f717e37b332d7816dbf732b3fc8ee72?d=identicon&s=25 Dmitrii Dimandt (Guest)
on 2006-06-19 15:28
(Received via mailing list)
On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
> |string should result in a UTF-* string, not an error.
>
> Every encoding has an attribute named ascii_compat.  EUC_JP, SJIS,
> ISO-8859-* and UTF-8 are declared ascii compatible, where EBCDIC,
> UTF-16 and UTF-32 are not.  No other auto conversion shall be done,
> since we don't particularly encourage mixed encoding model.
>

I wonder: why can't Strings throughout Ruby be _always_ represented
as Unicode, and why not let ICU handle the conversion between various
encodings for incoming and outgoing data?
(http://www.ibm.com/software/globalization/icu/). I know, it is a
long-standing issue with Unicode's Han unification process, but without
proper Unicode support Ruby is destined to be a toy for
English-speaking and Japanese communities only. (And as I'm gearing up
to prepare a web-site in Russian, Turkish and English, I feel that
using Ruby could prove to be a major pain in the nether regions of my
body :) )
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-19 15:34
(Received via mailing list)
On 6/19/06, Dmitrii Dimandt <dmitriid@gmail.com> wrote:
> I wonder. Why cannot Strings throughout Ruby be _always_ represented
> as Unicode and why no let ICU handle the conversion between various
> encodings for incoming and outgoing data?
> (http://www.ibm.com/software/globalization/icu/). I know, it is a
> long-stanbding issue on Unicode's Han unification process, but without
> proper Unicode support Ruby is destined to be a toy for
> English-speaking and Japanese communities only. (And as I'm gearing up
> to prepare a web-site in Russian, Turkish and English, I feel that
> using Ruby could prove to be a major pain in the nether regions of my
> body :) )

This entire discussion is centered around a proposal to do exactly
that. There are many *very good* reasons to avoid doing this. Unicode
Is Not Always The Answer.

It's *usually* the answer, but there are times when it's just easier
to work with data in an established code page.

-austin
2f717e37b332d7816dbf732b3fc8ee72?d=identicon&s=25 Dmitrii Dimandt (Guest)
on 2006-06-19 15:47
(Received via mailing list)
On 6/19/06, Austin Ziegler <halostatue@gmail.com> wrote:
> > body :) )
>
> This entire discussion is centered around a proposal to do exactly
> that. There are many *very good* reasons to avoid doing this. Unicode
> Is Not Always The Answer.
>
> It's *usually* the answer, but there are times when it's just easier
> to work with data in an established code page.
>

I totally agree with that. IMO, the point lies exactly in this
"*usually* the answer". When was the last time 90% of developers had to
wonder what encoding their data was in? ;-) And with the advent of
Unicode (and storage becoming cheaper and cheaper, and developers
becoming lazier and lazier), more and more of that data is going to be
Unicode.

So, since Unicode is *usually* the answer, make it as painless as
possible. Make all String methods and any other functions that work
with strings accept Unicode straight out of the box without any
worries on the developer's part. And provide alternatives (or optional
parameters?) that would allow the few more encoding-aware gurus :) do
whatever they want with encodings.

Because otherwise we are at risk of ending up with incompatible
extensions to strings that "simplify" a developer's life (and the
trend has already begun). I wouldn't want a C/C++ scenario with string
class upon string class upon extension upon extension that aim to do
something String should do from the start.

All is IMHO, of course :)
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-19 16:35
(Received via mailing list)
On 6/19/06, Dmitrii Dimandt <dmitriid@gmail.com> wrote:
> Because otherwise we are at risk of ending up with incompatible
> extensions to strings that "simplify" a developer's life (and the
> trend has already begun). I wouldn't want a C/C++ scenario with string
> class upon string class upon extension upon extension that aim to do
> something String should do from the start.

I think that's more likely with (a) what we have now and (b) a
Unicode-internal approach. (Indeed, a Unicode-internal approach
*requires* separating a byte vector from String, which doubles
interface complexity.) I would suggest that you look through the whole
discussion and pay particular attention to Matz's statements.

-austin
87fe25bf0272d8ad886dda793bdcbbd9?d=identicon&s=25 Tim Bray (Guest)
on 2006-06-19 18:09
(Received via mailing list)
On Jun 19, 2006, at 4:16 AM, Christian Neukirchen wrote:

>> without knowing what the binary is?  If it's text in a known
>> encoding, no breakage should occur.  If it's unknown bit patterns,
>> you can't really expect anything sensible to happen... or am I
>> missing an obvious scenario?  -Tim
>
> Those were just fictive method calls.  But let's say I read from
> a pipe and I know it contains UTF-16 with BOM, then .to_unicode
> would make perfect sense, no?

Yep.  And yes, calling to_unicode on it might in fact change the bit
patterns if you adopted Early Uniform Normalization (which would be a
good thing to do).  -Tim
F889bf17449ffbf62345d2b2d316a937?d=identicon&s=25 Michal Suchanek (Guest)
on 2006-06-19 19:22
(Received via mailing list)
On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
> |string should result in a UTF-* string, not an error.
>
> Every encoding has an attribute named ascii_compat.  EUC_JP, SJIS,
> ISO-8859-* and UTF-8 are declared ascii compatible, where EBCDIC,
> UTF-16 and UTF-32 are not.  No other auto conversion shall be done,
> since we don't particularly encourage mixed encoding model.

Reading what you said, it appears it would only be possible to add
ascii strings to ascii-compatible strings. That does not sound very
useful.
If the intended meaning was rather that operations on two
ascii-compatible strings should always be possible, and that the result
is again ascii-compatible, that would sound better.

But it makes these "ascii" encodings a special case. In particular, it
makes UTF-32 less convenient to use.
I guess that for a calculation so complex that it would really benefit
from the fast random access of UTF-32, it is reasonable to create a
wrapper that converts the arguments and results. However, if one wants
to perform several such (different) consecutive calculations, there are
going to be several useless conversions. It is certainly possible to
make the input interface clever enough to get it right for both UTF-32
and ascii strings, but requiring the user to do the conversion on
results does not look nice.

The compatibility could also be just a general value that specifies the
encoding family, e.g.:

 " ".compatibility => :ascii

 ASCII = "".encode(:utf8).compatibility

 raise "Incompatible encoding #{str.encoding}" unless str.compatibility == ASCII

But different families could be possible. I am not sure if any other
encoding families of any significance exist, though.

Thanks

Michal
87fe25bf0272d8ad886dda793bdcbbd9?d=identicon&s=25 Tim Bray (Guest)
on 2006-06-19 19:48
(Received via mailing list)
On Jun 19, 2006, at 6:31 AM, Austin Ziegler wrote:

> This entire discussion is centered around a proposal to do exactly
> that. There are many *very good* reasons to avoid doing this. Unicode
> Is Not Always The Answer.
>
> It's *usually* the answer, but there are times when it's just easier
> to work with data in an established code page.

To enlighten the ignorant, could you describe one or two scenarios
where a Unicode-based String class would get in the way?  To use your
words, make things less easy?  I would probably not agree that there
are "*many good*" reasons to avoid this, but probably that's just
because I've been fortunate enough to not encounter the problem
scenarios.  This material would have application in a far larger
domain than just Ruby, obviously.  -Tim
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-19 20:35
(Received via mailing list)
On 6/19/06, Tim Bray <tbray@textuality.com> wrote:
> are "*many good*" reasons to avoid this, but probably that's just
> because I've been fortunate enough to not encounter the problem
> scenarios.  This material would have application in a far larger
> domain than just Ruby, obviously.  -Tim

I've found that a Unicode-based string class gets in the way when it
forces you to work around it. For most text-processing purposes, it
*isn't* an issue. But when you've got text whose origin encoding you
don't *know* (and you're probably working in a different code page),
a Unicode-based string class usually guesses wrong.

Transparent Unicode conversion only works when it is guaranteed that the
starting code page and the ending code page are identical. It's
*definitely* a legacy data issue, and doesn't affect most people, but it
has affected me in dealing with (in a non-Ruby context) NetWare.
Additionally, the overhead of converting to Unicode if your entire data
set is in ISO-8859-1 is unnecessary; again, this is a specialized case.

More problematic, from the Ruby perspective, is that a Unicode-based
string class would require a wholly separate byte vector class; I am
not sure that is necessary or wise. The first time I read a JPG into a
String, I was delighted -- the interface presented was so clean and
nice as opposed to having to muck around in languages that force
multiple interfaces because of such a presentation.

Like I said, I'm not anti-Unicode, and I want Ruby's Unicode support to
be the best, bar none. I'm not willing to compromise on API or
flexibility to gain that, though.

-austin
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-20 01:40
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Tue, 20 Jun 2006 02:20:10 +0900, "Michal Suchanek"
<hramrach@centrum.cz> writes:

|Reading what you said, it appears it would only be possible to add
|ascii strings to ascii-compatible strings. That does not sound very
|useful.

You will have all your strings in the encoding you chose as the
internal encoding in the usual case, so that you will have few
compatibility problems.  Only if you want to handle multiple encodings
at a time do you need explicit code conversion for mixed encoding
operations.

|I guess that for calculation so complex that it would really benefit
|form the fast random access of UTF-32 it is reasonable to create a
|wrapper that converts the arguments and results. However, If one wants
|to perform several such (different) consecutive calculations there are
|going to be several useless conversions.

I am not sure what you mean.  I feel like my plan does not have
anything against UTF-32 in this regard.  Perhaps I am missing
something.  What is going to cause useless conversions?

							matz.
F889bf17449ffbf62345d2b2d316a937?d=identicon&s=25 Michal Suchanek (Guest)
on 2006-06-20 14:13
(Received via mailing list)
On 6/20/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> internal encoding in the usual case, so that you will have a few
> compatibility problem.  Only if you want to handle multiple encodings
> at a time, you need explicit code conversion for mix encoding
> operations.

If I read pieces of text from web pages, they can be in different
encodings. I do not see any reason why such pieces of text could not
be automatically concatenated as long as they are all subsets of
Unicode.

It was the complaint of one of the people here that in Python strings
with different encodings exist but operations on them fail. And it
makes the life of anybody working with such strings unnecessarily
hard. They have to be converted explicitly.

>
> |I guess that for calculation so complex that it would really benefit
> |form the fast random access of UTF-32 it is reasonable to create a
> |wrapper that converts the arguments and results. However, If one wants
> |to perform several such (different) consecutive calculations there are
> |going to be several useless conversions.
>
> I am not sure what you mean.  I feel like that my plan does not have
> anything against UTF-32 in this regard.  Perhaps, I am missing
> something.  What is going to cause useless conversions?

If automatic conversions aren't implemented at all, utf-32 does not
really stand out in this regard.

Thanks

Michal
24d3102d656a4654db23d28382a2d6f0?d=identicon&s=25 Timothy Bennett (Guest)
on 2006-06-20 15:57
(Received via mailing list)
On 6/20/06, Michal Suchanek <hramrach@centrum.cz> wrote:
>
>
> If I read pieces of text from web pages they can be in different
> encodings. I do not see any reason why such pieces of text could not
> be automatically concatenated as long as they are all subset of
> unicode.


Having different encodings on one web page is a good way to make sure
that the page won't display correctly, since all the browsers I know
of display all text on a page using just one encoding.  Granted, if
the encoding is a subset of unicode, it may still manage to work out,
but personally I keep running into pages that display some of the
characters as garbage no matter what encoding I instruct my browser
to use.  So, no, I don't think it should be valid to concatenate
strings with different encodings.
31af45939fec7e3c4ed8a798c0bd9b1a?d=identicon&s=25 Matthew Smillie (notmatt)
on 2006-06-20 16:34
(Received via mailing list)
On Jun 20, 2006, at 14:54, Timothy Bennett wrote:

> So, no, I don't think it should be valid to concatenate strings with
> different encodings.
So we shouldn't do it because it doesn't work in web browsers?

Hopefully we don't apply that criterion globally, or we'd never get
anything done.
F889bf17449ffbf62345d2b2d316a937?d=identicon&s=25 Michal Suchanek (Guest)
on 2006-06-20 16:41
(Received via mailing list)
On 6/20/06, Timothy Bennett <timothy.s.bennett@gmail.com> wrote:
> the page won't display correctly, since all the browsers I know of display
> all text on a page using just one encoding.  Granted, if the encoding is a
> subset of unicode, it may still manage to work out, but personally I keep
> running in to pages that display some of the characters as garbage no matter
> what encoding I instruct my browser to use.  So, no, I don't think it should
> be valid to concatenate strings with different encodings.

No, I meant that the strings are, of course, converted to a common
encoding such as utf-8 before they are concatenated.
The point is that you do not have to care in which encoding you
obtained the pieces and convert them manually to a common encoding if
the string class can do it automatically for you.

Thanks

Michal
E7559e558ececa67c40f452483b9ac8c?d=identicon&s=25 unknown (Guest)
on 2006-06-20 17:46
(Received via mailing list)
On Jun 20, 2006, at 8:09 AM, Michal Suchanek wrote:
> If I read pieces of text from web pages they can be in different
> encodings. I do not see any reason why such pieces of text could not
> be automatically concatenated as long as they are all subset of
> unicode.

I'm not sure I understand what 'subset of unicode' means.

Do you mean two different encodings of Unicode code points?
As in 'UTF-8 and UTF-16 are subsets of Unicode'?

That usage seems unusual to me.  Are you using 'subset' and 'encoding'
as synonyms, or am I missing a subtle difference?



Gary Wright
87fe25bf0272d8ad886dda793bdcbbd9?d=identicon&s=25 Tim Bray (Guest)
on 2006-06-20 18:08
(Received via mailing list)
On Jun 20, 2006, at 6:54 AM, Timothy Bennett wrote:

> Having different encodings on one web page is a good way to make
> sure that
> the page won't display correctly
...
>   So, no, I don't think it should
> be valid to concatenate strings with different encodings.

Well, unless you had a String class that took care of the encoding
details and, when you were ready to output, allowed you to say "Give
me that in ISO-8859 or UTF-8 or whatever". -Tim
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-20 18:20
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Tue, 20 Jun 2006 23:33:43 +0900, "Michal Suchanek"
<hramrach@centrum.cz> writes:

|No, I meant that the strings are, of course, converted to a common
|encoding such as utf-8 before they are concatenated.
|The point is that you do not have to care in which encoding you
|obtained the pieces and convert them manually to a common encoding if
|the string class can do it automatically for you.

If you choose to convert all input text data into Unicode (and convert
it back on output), there's no need for unreliable automatic
conversion.
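
For illustration, that discipline in code (the transcoding IO mode
string is an assumed API):

   # transcode exactly once, at each boundary
   text = File.open("in.txt", "r:ISO-8859-2:UTF-8") { |f| f.read }
   # ... all internal processing sees only UTF-8 ...
   File.open("out.txt", "w:ISO-8859-2") { |f| f.write(text) }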

							matz.
F889bf17449ffbf62345d2b2d316a937?d=identicon&s=25 Michal Suchanek (Guest)
on 2006-06-20 19:52
(Received via mailing list)
On 6/20/06, gwtmp01@mac.com <gwtmp01@mac.com> wrote:
> As in 'UTF-8 and UTF-16 are subsets of Unicode'?
>
> That usage seems unusual to me.  Are you using 'subset' and 'encoding'
> as synonyms or am I missing subtle difference?
>
I mean that the iso-8859-1 and iso-8859-2 encodings (as well as many
others) encode a subset of the characters available in Unicode, and in
any of its utf-* encodings. Thus any string encoded using such an
encoding can be losslessly and automatically converted to an encoding
of full Unicode such as utf-8, and operations on several such
converted strings make sense even if the strings were encoded using
different encodings before the conversion.

The automatic conversion would simplify things if you get strings in
different encodings from outside sources such as various web pages,
databases, etc.
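
A sketch of that lossless case, reusing the hypothetical method names
from earlier in this thread:

   a = "caf\xE9".enforce_encoding!("ISO-8859-1")      # é is 0xE9 in Latin-1
   b = "\xB3\xF3dka".enforce_encoding!("ISO-8859-2")  # "łódka": ł=0xB3, ó=0xF3
   (a.recode("UTF-8") + b.recode("UTF-8")).encoding   # => "UTF-8", nothing lost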

Thanks

Michal
F889bf17449ffbf62345d2b2d316a937?d=identicon&s=25 Michal Suchanek (Guest)
on 2006-06-21 13:46
(Received via mailing list)
On 6/20/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
>
> If you choose to convert all input text data into Unicode (and convert
> it back on output), there's no need for unreliable automatic
> conversion.

Well, it's actually you who chose the conversion on input for me.
Since the strings aren't automatically converted, I have to ensure that
I always have strings encoded using the same encoding. And the only
reasonable way I can think of is to convert any string that enters my
application (or class) to an arbitrary encoding I choose in advance.

This is no more reliable than automatic conversion. The reliability or
(un)reliability of the conversion depends on the (un)reliability with
which the actual encoding of the string is determined when it is
obtained. If the encoding tag is wrong, the string will be converted
incorrectly. That is the only cause of incorrect conversion, whether it
happens manually or automatically.

If conversion was done automatically by the string class, it could be
performed lazily. The strings are kept in the encoding in which they
were obtained, and only converted when needed because they are
combined with a string in a different encoding. And users of the
strings still have the choice to convert them explicitly when they see
fit.

When such automatic conversion is not available it makes interfacing
with libraries that fetch external data more difficult.

a) I could instruct the library that fetches data from a database or
the web to return it always in the encoding I chose for representing
strings in my application, regardless of the encoding the data was
originally obtained in.
The disadvantage is that if the encoding was determined incorrectly on
input to the library, the data is already garbled.

b) I could get the data from the library in the original encoding in
which it was obtained, either because I would like to check that the
encoding is correct before converting the data, or because the library
does not implement the interface for (a).
The disadvantage is that I have to traverse a potentially complex data
structure and convert all strings so that they work with the other
strings inside my application.

c) Every time I perform a string operation, I should first check
(manually) that the two strings are compatible (or catch the exception
very near the operation so that I can convert the arguments and retry).
I do not think this is a reasonable option for the common case, which
should be made as simple as possible: the strings can be represented
in Unicode. This may be necessary to some extent in applications
dealing with encodings that are incompatible with Unicode, but it
should not be required for the common case.

The people with experience from other languages are complaining that
they have to do (b) or (c) because (a) is usually not implemented. And
ensuring any of the three does look like an additional problem that
could be solved elsewhere - in the string class.

Thanks

Michal
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-21 16:04
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 21 Jun 2006 20:45:38 +0900, "Michal Suchanek"
<hramrach@centrum.cz> writes:

|> If you choose to convert all input text data into Unicode (and convert
|> it back on output), there's no need for unreliable automatic
|> conversion.
|
|Well, it's actually you who chose the conversion on input for me.
|Since the strings aren't automatically converted, I have to ensure that
|I always have strings encoded using the same encoding. And the only
|reasonable way I can think of is to convert any string that enters my
|application (or class) to an arbitrary encoding I choose in advance.

Agreed.  It is me.  Perhaps you don't know how terrible code
conversion can be.  In the ideal world, lazy conversion seems
attractive, but reality bites.  Conversions fail so easily.
Characters get lost, text gets broken.  Failures cannot be avoided for
various reasons, mostly historical ones we can't fix anymore.  When
errors happen (often), it's good to detect them as early as possible,
i.e. on input/output.  So I encourage the universal character set
model as far as it is applicable.  You may use UTF-8 or ISO8859-1 for
the universal character set.  I may use EUC-JP for it.

Only in rare cases might there be a need to handle multiple encodings
in an application.  I do want to allow it.  But I am not sure how we
can help that kind of application, since they are fundamentally
complex.  And we don't have enough experience to design a framework
for such applications.

							matz.
D57f4a4788599a38494865a121f16bbe?d=identicon&s=25 Dmitry Severin (Guest)
on 2006-06-21 16:59
(Received via mailing list)
On 6/21/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
>
>
> Only in rare cases might there be a need to handle multiple encodings
> in an application.  I do want to allow it.  But I am not sure how we
> can help that kind of application, since they are fundamentally
> complex.  And we don't have enough experience to design a framework
> for such applications.
>
>

I can see one more problem with setting the encoding per file and
tagging string literals in it accordingly.
If operations on strings with different encodings always throw an
exception, problems can arise when one calls such a third-party
library from a script with a different encoding.

Here's small example:

library code in file some_utility.rb:
# -*- coding: EUC-JP -*-
module SomeUtility
  def SomeUtility.fancy_format(str)
    "<text>" + str + "</text>" # these literals are tagged as EUC-JP, right?
  end
end

application code in file my_app.rb:
# -*- coding: UTF-8 -*-
require 'some_utility'
puts SomeUtility.fancy_format("an utf8 string")  # this literal is tagged as UTF-8

If the last call throws some kind of EncodingMismatchError, how does
one deal with that?
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-21 17:19
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 21 Jun 2006 23:56:47 +0900, "Dmitry Severin"
<dmitry.severin@gmail.com> writes:

|I can see one more problem with setting the encoding per file and
|tagging string literals in it accordingly.

Indeed.

|Here's small example:
|
|library code in file some_utility.rb:
|# -*- coding: EUC-JP -*-
|module SomeUtility
|  def SomeUtility.fancy_format(str)
|    "<text>" + str + "</text>" # these literals are tagged as EUC-JP, right?
|  end
|end
|
|application code in file my_app.rb:
|# -*- coding: UTF-8 -*-
|require 'some_utility'
|puts SomeUtility.fancy_format("an utf8 string")  # this literal is tagged as UTF-8
|
|If the last call throws some kind of EncodingMismatchError, how does one
|deal with that?

I recommend using the "ascii" encoding, which is the default, for
library files, unless you are sure what encoding your input data are
in.  For localization, tools like gettext would help in dealing with
strings in the native encoding.

							matz.
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-21 17:36
(Received via mailing list)
On 6/21/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> I recommend using "ascii" encoding, which is default, for library
> files, unless you are sure in what encoding your input data are.
> For localization, tools like gettext would help dealing with strings
> in the native encoding.

Just a thought. Might it be possible to have a new String literal for
what will be, I think, the most common encoding chosen (UTF-8)? That is,
in addition to:

  # -*- coding: EUC-JP -*-
  "<text>" # tagged as EUC-JP

We allow:

  # -*- coding: EUC-JP -*-
  "<text>" # tagged as EUC-JP
  u"<text>" # tagged as UTF-8

Despite my belief that we should avoid an enforced universal encoding as
the String representation, I *do* plan on making most of my applications
and libraries UTF-8 friendly and aware. It's extremely important that we
be able to work with this cleanly, and if I can simply do either u"foo"
or U"foo" I would find it much easier to deal with in those places where
I need UTF-8/Unicode support.
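
Failing a literal prefix, even a tiny helper would give much the same
ergonomics (enforce_encoding! as hypothesized earlier in the thread):

   def u(str)
     str.enforce_encoding!("UTF-8")
   end

   title = u("<text>")   # tagged UTF-8 regardless of the file's pragma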

-austin
A2b2f4ee23989dc68529baef9cbddcd6?d=identicon&s=25 Julian 'Julik' Tarkhanov (Guest)
on 2006-06-21 17:42
(Received via mailing list)
On 21-jun-2006, at 17:18, Yukihiro Matsumoto wrote:
> |If the last call will throw some kind of EncodingMismatchError,
> how to deal
> |with that?
>
> I recommend using "ascii" encoding, which is default, for library
> files, unless you are sure in what encoding your input data are.
> For localization, tools like gettext would help dealing with strings
> in the native encoding.

Matz, this would be a disaster (if in such a situation a library
throws).  It's gonna be like Python, because it means that 99 percent
of the libraries will throw.
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-21 18:21
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Thu, 22 Jun 2006 00:41:02 +0900, Julian 'Julik' Tarkhanov
<listbox@julik.nl> writes:

|Matz, this would be a disaster (if in such a situation a library
|throws).  It's gonna be like Python, because it means that 99 percent
|of the libraries will throw.

Can you elaborate?  I don't want to see disaster whatever it is.

							matz.
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-21 18:47
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Thu, 22 Jun 2006 00:34:27 +0900, "Austin Ziegler"
<halostatue@gmail.com> writes:

|Just a thought. Might it be possible to have a new String literal for
|what will be, I think, the most common encoding chosen (UTF-8)? That is,
|in addition to:
|
|  # -*- coding: EUC-JP -*-
|  "<text>" # tagged as EUC-JP
|
|We allow:
|
|  # -*- coding: EUC-JP -*-
|  "<text>" # tagged as EUC-JP
|  u"<text>" # tagged as UTF-8

I am not sure yet whether this is a good idea.  If your "u" text
contains only ASCII characters, I see no need to tag it "UTF-8", and
if it doesn't, how do we prepare such literals?  I think, for example,

   u"\346\235\276\346\234\254" => my family name in Kanji

is too ugly.

							matz.
D57f4a4788599a38494865a121f16bbe?d=identicon&s=25 Dmitry Severin (Guest)
on 2006-06-21 19:18
(Received via mailing list)
On 6/21/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
>
> Can you elaborate?  I don't want to see disaster whatever it is.
>
>                                                         matz.
>
>

Single scripts and small self-contained applications are almost always
written in the same codepage. Usually text data processing is also
done for the same codepage, which simplifies life a lot even with the
current String as byte vector. So recoding is an overhead here, and
external data is only recoded on input/output in a relatively small
number of well-defined places, using a known subset of source and
target encodings.
In this case, when you know what to expect from your file/network IO,
things are OK.

It is also OK when part of a script is extracted and evolves into a
library, as long as you use it in the same environment.

But let's view a case when several third-party libraries are used, all
returning strings with different encodings. gettext for libraries
won't solve everything, as even externalized strings will have some
particular encoding. E.g. localization libraries can't fit in ASCII
only.

And now calls to methods will behave like some kind of IO with respect
to the encoding of passed parameters.
The number of I/O points grows drastically.

How can it be solved in a consistent and reliable manner?
a) just declare in the documentation: "Methods in these classes
*require* strings to be in UTF-16, you've been warned!!!"

  So users of that code will have to remember those constraints and
enforce the encoding of their data before calling those methods. With
the dynamic nature of Ruby, things will break in unexpected places.
No, I dislike the idea of writing:

     str.enforce_encoding!(BooClass::INTERNAL_ENCODING)
     b = BooClass.new(str)

b) take care in called methods to enforce the encoding:
     def process_formatting(str)
        str.enforce_encoding!(MY_INTERNAL_ENCODING)
        # now it is compatible with the rest of my code
        # and I can do something with it
     end

  This is also too error-prone :(

And what about processing the results of calls? Take care of it in the
caller code?
       res_str = SomeUtil.fancy_format( str )
       res_str.enforce_encoding!(MY_INTERNAL_ENCODING)

On input parameters and returned results which represent complex
structures with String fields, things will get even worse.

Who will ever cope with these issues?
Probably this is what Julik meant by "disaster"?

Things shouldn't be that complicated.
A2b2f4ee23989dc68529baef9cbddcd6?d=identicon&s=25 Julian 'Julik' Tarkhanov (Guest)
on 2006-06-21 21:20
(Received via mailing list)
On 21-jun-2006, at 18:20, Yukihiro Matsumoto wrote:

> Can you elaborate?  I don't want to see disaster whatever it is.
I imagine that in the case mentioned, the encoding assumed for a
library will depend on the pragma in its source.

For instance, I am writing a program that needs to work with UTF-8
data, but one of the libraries I am using has ASCII in the pragma.
What is going to happen if I ship this library UTF-8 strings? Python
libraries just throw, because they do all kinds of non-Unicode-aware
operations on strings or request Unicode strings explicitly. So
anytime you want to ship something to a library (or get something
from STDIN) you have to decode and encode.
As soon as you forget to, you get exceptions everywhere.
A2b2f4ee23989dc68529baef9cbddcd6?d=identicon&s=25 Julian 'Julik' Tarkhanov (Guest)
on 2006-06-21 21:23
(Received via mailing list)
On 21-jun-2006, at 19:17, Dmitry Severin wrote:
> .
>
> Who will ever cope with this issues?
> Probably this is what Julik meant  by "disaster"?
>
> Things shouldn't be that complicated.

What I meant is the description of how you get a Python program
assembled from different libraries to be Unicode-aware.
If Ruby works like that, I won't be happy. Basically, some libraries
accept Unicode in Python's 16-bit form, some accept utf-8 bytestrings,
and some can only grok ASCII and will throw up anyway. These are not
going to work in Python 3000, as I understand.
F889bf17449ffbf62345d2b2d316a937?d=identicon&s=25 Michal Suchanek (Guest)
on 2006-06-22 01:47
(Received via mailing list)
On 6/21/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |Since the strings aren't automatically converted I have to ensure that
> i.e. on input/output.  So I encourage universal character set model as
> far as it is applicable.  You may use UTF-8 or ISO8859-1 for universal
> character set.  I may use EUC-JP for it.

I do not see how converting the strings on input will make the
situation better than converting them later. The exact place where the
text is garbled because it is converted incorrectly does not change
the fact that it is no longer usable, does it?
Well, it may be possible to detect characters that are invalid for a
certain encoding, either by scanning the string or by attempting a
conversion. But I would rather like optional checks that can be added
when something breaks or is likely to break, rather than forced
conversion.

Or to put it another way: if I get a string from somewhere with its
encoding marked incorrectly, it is wrong and it should be expected to
fail. And I can do some checks if I think my source of data is not
reliable in this respect. But if I get a string that is marked
correctly and it fails because I did not manually convert it, that is
frustrating. And needlessly so.

>
> For only rare case, there might be need to handle multiple encoding in
> an application.  I do want to allow it.  But I am not sure how we can
> help that kind of applications, since they are fundamentally complex.
> And we don't have enough experience to design a framework for such
> applications.

I do not think it is that rare. Most people want new web (or any
other) stuff in utf-8, but there is a need to interface legacy
databases or applications. Sometimes converting the data to fit the
new application is not practical. For one, the legacy application may
still be in use as well.

Anyway, Ruby being as dynamic as it is, I should be able to add
support for automatic recoding myself quite easily. The problem is I
would not be able to use it in libraries (should I ever write some)
without risking a clash with a similar feature added by somebody else.

Thanks

Michal
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-22 04:36
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Thu, 22 Jun 2006 02:17:53 +0900, "Dmitry Severin"
<dmitry.severin@gmail.com> writes:

|Things shouldn't be that complicated.

Agreed in principle.  But it seems to be fundamental complexity of
the world of multiple encodings.  I don't think automatic conversion
would improve the situation.  It would cause conversion errors almost
randomly.  Do you have any idea to simplify things?

I am eager to hear.

							matz.
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-22 04:46
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Thu, 22 Jun 2006 08:46:08 +0900, "Michal Suchanek"
<hramrach@centrum.cz> writes:

|I do not see how converting the strings on input will make the
|situation better than converting them later. The exact place where the
|text is garbled because it is converted incorrectly does not change
|the fact that it is no longer usable, does it?

It does.  But if you convert encodings lazily, you will have a hard
time tracking down the source of the error-causing data.  It may be
input data from IO, or from some GUI toolkit, or the result of an
operation with a variety of sources.

|> For only rare case, there might be need to handle multiple encoding in
|> an application.  I do want to allow it.  But I am not sure how we can
|> help that kind of applications, since they are fundamentally complex.
|> And we don't have enough experience to design a framework for such
|> applications.
|
|I do not think it is that rare. Most people want new web (or any
|other) stuff in utf-8, but there is a need to interface legacy
|databases or applications. Sometimes converting the data to fit the
|new application is not practical. For one, the legacy application may
|still be in use as well.

I understand the challenge, but I don't think it is common to run
some part of your program in a legacy encoding (without conversion),
and another part in UTF-8.  You need to convert them into a universal
encoding anyway for most cases.  That's why I said it is rare.

							matz.
5c19f2d52879a1e10670c7334ba4c7e3?d=identicon&s=25 Lugovoi Nikolai (Guest)
on 2006-06-22 08:58
(Received via mailing list)
2006/6/22, Yukihiro Matsumoto <matz@ruby-lang.org>:
> randomly.  Do you have any idea to simplify things?
>
> I am eager to hear.
>



So what will be the semantics of the encoding tag:
 a) a weak suggestion?
 b) a strong assertion?

If the encoding tag is only a weak suggestion (and for now I see it
will be just that), it will imply:
  - performance win (no need to check conformance to the declared
encoding)
  - win in having less complexity (most tasks use source code, text
data input and output all in the same [default host] encoding)
  - portability drawbacks (assumptions made by the original coders
will be implicit, but they have to be figured out when porting to
another environment)
  - reliability drawbacks (weak suggestions are too often ignored, and
you don't know when, where and why they will hit your app, but someday
they will!)

If the encoding tag is a strong assertion, it will imply:
  - probable performance loss:
     * assuring that a string tagged "none" (raw) represents a valid
byte sequence in a given encoding costs about the same as
String#length
     * need to recode bytes when changing the tag
  - slightly more complexity (developers will have to declare these
assertions explicitly)
  - portability win
  - reliability win

What compromise on these issues would be acceptable?

I'd prefer the encoding tag as a strong assertion, mostly for
reliability reasons.

And for operations on Strings with different encodings, I'd like
implicit automatic encoding coercion:
-------------------------------
#
# NOTES:
#  a) String#recode!(new_encoding) replaces the current internal byte
#     representation with a new byte sequence, recoded from the current
#     one.  It must raise IncompatibleCharError if a char cannot be
#     converted to the destination encoding.
#  b) downgrading a string from some stated encoding to the "none" tag
#     must be done only explicitly; it is not an option for implicit
#     conversion.
#  c) $APPLICATION_UNIVERSAL_ENCODING is a global var, allowed to be
#     set once and only once per application run.
#     Intent: we want all strings which aren't raw bytes to be in one
#     single predefined encoding, so all operations on strings must
#     return strings in a conformant encoding.
#     The desired encoding is the value of $APPLICATION_UNIVERSAL_ENCODING.
#     If $APPLICATION_UNIVERSAL_ENCODING is nil, we go into "democracy
#     mode", see below.
#
def coerce_encodings(str1, str2)
  enc1 = str1.encoding
  enc2 = str2.encoding

  # simple case, same encodings, will return fast in most cases
  return if enc1 == enc2

  # another simple but rare case: totally incompatible encodings, as
  # they represent incompatible charsets
  if fully_incompatible_charsets?(enc1, enc2)
    raise IncompatibleCharError,
          "incompatible charsets #{enc1} and #{enc2}"
  end

  # uncertainty: handling "none" vs. a preset encoding
  if enc1 == "none" || enc2 == "none"
    raise UnknownIntentEncodingError,
          "can't implicitly coerce encodings #{enc1} and #{enc2}, " \
          "use explicit conversion"
  end

  # Tyranny mode:
  # we want all strings which aren't raw bytes to be in one single
  # predefined encoding
  if $APPLICATION_UNIVERSAL_ENCODING
    str1.recode!($APPLICATION_UNIVERSAL_ENCODING)
    str2.recode!($APPLICATION_UNIVERSAL_ENCODING)
    return
  end

  # Democracy mode:
  # first try to perform a lossless conversion from one encoding to
  # the other:
  # 1) direct lossless conversion to the other encoding, e.g. between
  #    UTF-8 and UTF-16
  if exists_direct_non_loss_conversion?(enc1, enc2)
    if exists_direct_non_loss_conversion?(enc2, enc1)
      # performance hint if both directions are available
      if str1.byte_length < str2.byte_length
        str1.recode!(enc2)
      else
        str2.recode!(enc1)
      end
    else
      str1.recode!(enc2)
    end
    return
  end
  if exists_direct_non_loss_conversion?(enc2, enc1)
    str2.recode!(enc1)
    return
  end

  # 2) lossless conversion to a common superset
  #    (I see no reason to raise an exception on KOI8-R + CP1251;
  #    returning a string in Unicode will be OK)
  if superset_encoding = find_superset_non_loss_conversion?(enc1, enc2)
    str1.recode!(superset_encoding)
    str2.recode!(superset_encoding)
    return
  end

  # A case of incomplete compatibility:
  # check if a subset of enc1 is also a subset of enc2, so some strings
  # in enc1 can be safely recoded to enc2, e.g. two pure ASCII strings,
  # whatever ASCII-compatible encodings they have
  if exists_partial_loss_conversion?(enc1, enc2)
    if exists_partial_loss_conversion?(enc2, enc1)
      # performance hint if both directions are available
      if str1.byte_length < str2.byte_length
        str1.recode!(enc2)
      else
        str2.recode!(enc1)
      end
    else
      str1.recode!(enc2)
    end
    return
  end

  # the last thing we can try
  str2.recode!(enc1)
end
---------------------------

So, when an operation involves two Strings, or a String and a Regexp,
with different encodings, automatic coercion should be done as
described above.
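For example, if the coercion above were wired into String#+ (the
encoding tags here are hypothetical):

  # a is tagged "UTF-8", b is tagged "KOI8-R"
  c = a + b
  # coerce_encodings(a, b) runs first: KOI8-R -> UTF-8 is the lossless
  # direction, so b is recoded to UTF-8 and c comes out tagged "UTF-8"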

That will probably solve the coding problems (no need to think about
encodings most of the time), but it can have the following impacts:
1) after several operations, when one sends a string to external IO,
it might be internally encoded in a superset of that IO's encoding.
One has to remember that and perform the external IO accordingly,
i.e. decide whether to fail on invalid chars or to use replacement
chars (like U+FFFD) - but that is unavoidable.
2) some performance hits, which I expect to be rare.

Besides, there can be another class of problems with automatic
coercion: how do we ensure that character ranges in Regexps and in
String methods like [count, delete, squeeze, tr, succ, next, upto]
keep working consistently when encodings are coerced?
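For example, with a real 1.8 method (a Cyrillic range, purely for
illustration):

  # In KOI8-R the byte values of а..я do not follow the alphabet
  # order, while in Unicode they do, so which characters this range
  # covers depends on the encoding the operands were coerced into:
  str.tr("а-я", "А-Я")   # uppercase Cyrillic - under which encoding?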

What I, as a Ruby user, wish from Unicode/M17N support:
1) reliability and consistency:
  a) String should be an abstraction for a character sequence;
  b) String methods shouldn't allow me to garble the internal
representation;
  c) treating a String as a byte sequence is handy, but must be
explicitly stated.
2) coding comfort:
  a) no need to care what encodings strings have while working with
them;
  b) no need to care what encodings the strings returned from
third-party code have;
  c) using explicitly stated conversion options for external IO.
3) on Unicode and i18n: at least a set of classes for Unicode-specific
tasks (collation, normalization, string search, locale-aware
formatting etc.) that would efficiently work with Ruby strings.
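Item 3 might look something like this (class and method names invented
purely for illustration):

  coll = Unicode::Collator.new("sv")         # Swedish collation rules
  names.sort! { |a, b| coll.compare(a, b) }

  Unicode.normalize(str, :nfc)               # canonical composition
  Unicode.normalize(str, :nfkc)              # compatibility composition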

And, for everyone out there, just ask "Which charset/encoding will fit
all the [present and future] needs?".  You know the exact answer:
"NONE".

> I understand the challenge, but I don't think it is common to run some
> part of your program in a legacy encoding (without conversion), and
> another part in UTF-8.  You need to convert them into a universal
> encoding anyway in most of the cases.  That's why I said it is rare.

uhm, how to convert a compiled extension library?
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-22 10:18
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Thu, 22 Jun 2006 15:55:18 +0900, "Lugovoi Nikolai"
<meadow.nnick@gmail.com> writes:
|> I am eager to hear.
|
|So what will be semantic for encoding tag:
| a) weak suggestion?
| b) strong assertion?

Weak suggestion, if I understand you correctly.

|I'd prefer encoding tag as strong assertion, mostly for reliability reasons.

Hmm, your idea of combination of strong assertion and automatic
conversion seems too complex for me, but it may be worth considering.
Thank you for idea.

|uhm, how to convert a compiled extension library?

Every extension that does input/output needs to specify (either
explicitly or implicitly) the encoding it uses anyway.  I will add
an encoding option to rb_tainted_str_new() and its family.  If
possible, I'd like to allow extensions to declare their default
encoding in their initialization function (Init_xxx).

							matz.
F889bf17449ffbf62345d2b2d316a937?d=identicon&s=25 Michal Suchanek (Guest)
on 2006-06-22 12:42
(Received via mailing list)
On 6/22/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Weak suggestion, if I understand you correctly.
>
> |I'd prefer encoding tag as strong assertion, mostly for reliability reasons.
>
> Hmm, your idea of combination of strong assertion and automatic
> conversion seems too complex for me, but it may be worth considering.
> Thank you for idea.

What I had in mind was much simpler: if the strings do not match, just
try to recode to the default encoding, which would be unicode most of
the time. Or just try to find a superset.

>
> |uhm, how to convert compiled extension library?
>
> Every extension that does input/output need to specify (either
> explicitly or implicitly) encoding it uses anyway.  I will add
> an encoding option to rb_tainted_str_new() and its family.  If it's
> possible, I'd like to allow extensions to declare their default
> encoding in their initialize function (Init_xxx).
>

But if recoding is not automatic you still have to recode the strings
manually - both the input to the extension and the results. That is an
annoyance and repetitive code everywhere.
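Something like this around every call (the recode method and the
library encoding are hypothetical, picked for the example):

  # without automatic recoding, every boundary needs boilerplate:
  arg    = text.recode("UTF-8", "EUC-JP")    # into the library's encoding
  result = legacy_search(arg)
  hit    = result.recode("EUC-JP", "UTF-8")  # and back again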

Thanks

Michal
C914fa463a2b1b067586c6432b12a824?d=identicon&s=25 Juergen Strobel (Guest)
on 2006-06-22 17:32
(Received via mailing list)
On Wed, Jun 21, 2006 at 01:04:55AM +0900, Tim Bray wrote:
> details and, when you were ready to output, allowed you to say "Give
> me that in ISO-8859 or UTF-8 or whatever". -Tim

That's basically what I suggested. The problem seems to be mainly
non-Unicode demands on the one hand, and performance issues on the
other. And it makes Strings useless as byte buffers, since you have to
specify the encoding of the external representation you create the
String from at creation time. To recap:

Private extensions to Unicode are deemed too complex to implement
(Matz).

Transforming legacy or special (non Unicode) data to a ruby-private
internal storage format on I/O is too performance/space intensive
(Matz).

Strings as byte buffers are important to some people, and they don't
want to use another class or array for it, even if RegExp et al would
be extended to handle these too.

While it would be proper OO design, encapsulating the internal String
implementation hampers direct access to the "raw" data for C-hackers,
creating unwanted hurdles, and again performance issues.


I am still not convinced the arguments against this approach will
really hold in the long run, but since I am not the one implementing
it and can't really participate there due to language barriers, I can
only lean back and wait for the first release of M17N. Learning
English was hard enough for me.

-Jürgen
Dfa842ab64f794363e66d7cce85ba277?d=identicon&s=25 Alexey Borzenkov (snaury)
on 2006-06-25 16:41
Yukihiro Matsumoto wrote:
> Alright, then what specific features are you (both) missing?  I don't
> think it is a method to get number of characters in a string.  It
> can't be THAT crucial.  I do want to cover "your missing features" in
> the future M17N support in Ruby.

Sorry for maybe butting in, but here are my 5 cents. When I first
found out about ruby, I practically fell in love with the language.
Unfortunately, after some studying and experimenting I suddenly found
that it lacks proper unicode support on win32, in particular with file
IO and ole automation, i.e. in the two cases where I had to
interoperate with the rest of the world.

Win32 really differs from Linux and maybe other Unixes in its API,
because on *nix you don't have to worry about unicode/whatever, since
the whole system depends on your current locale. On win32 there are
two sets of APIs, ansi and unicode; maybe that was a bad decision on
Microsoft's part, but that's a reality.

Now, I am Russian, and when I write scripts I have to make sure that
not only Russian characters don't get messed up, but characters of
other languages as well. So if I receive, say, an Excel file with a
lot of languages in it, and I have to process that file somehow, I
have to be sure that no letters will be lost or messed up, so
converting it to the current codepage (1251) is no option for me. The
same goes for filenames: the fact that I'm running Russian WinXP
doesn't mean that I only have filenames that fall in the 1251
codepage. I also have filenames with European characters (umlauts and
such), as well as Japanese, and when I want to write some script that
processes these files, I have to be able to work with them.

At that time this caused me to move to Tcl (it has utf-8 encoding
everywhere, and it converts to the required encoding when
interoperating with the world). Since then I'm still waiting for
proper unicode support in ruby (read: proper interoperability with the
operating system and its components using the unicode API versions:
the ones ending with W) and maybe a way to define in which locale
(specific code page, utf-8, etc) the current script is running.

Hope that clarifies what is currently missing for me (and maybe others,
I don't know).
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-25 17:11
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Sun, 25 Jun 2006 23:41:48 +0900, Snaury Miyoto <snaury@gmail.com>
writes:

|Hope that clarifies what is currently missing for me (and maybe others,
|I don't know).

Unfortunately, not.  I understand Russian people having problems with
multiple encodings, but I don't know how we can help you.

You said Tcl has Unicode support that works well for you.  So I think
treating all of them in UTF-8 is OK for you.  Then how can it
determine which should be in the current code page, and which in
Unicode?  Or would using the Win32 APIs ending with W allow you to
live in Unicode?

							matz.
2a321daf565791ad30ac5ee945abf59a?d=identicon&s=25 Izidor Jerebic (Guest)
on 2006-06-25 19:19
(Received via mailing list)
On 22.6.2006, at 10:17, Yukihiro Matsumoto wrote:

>
> |I'd prefer encoding tag as strong assertion, mostly for
> reliability reasons.
>
> Hmm, your idea of combination of strong assertion and automatic
> conversion seems too complex for me, but it may be worth considering.

Strong assertion + auto conversion is the only solution which will
relieve programmers from manually checking/changing string encodings
in their programs.

Remember, the string input/output points in a program are not only the
system IO classes, but also all the third-party libraries/classes
which deal with strings - that is, most of the existing Ruby and
external (e.g. Java) libraries which can be used from Ruby.

The assumption that only system IO is the entry/exit point for string
encoding is very wrong. This assumption holds only for scripts which
use no third party libraries.

So we have two possibilities:
a) every programmer is forced to implement the above solution in
every program (this is starting to happen already, and current
experience tells us that the future in this direction is a disaster!)
b) the Ruby interpreter implements this solution, and programmers
happily ignore all the complexity.

So, it is true that we move the complexity into Ruby, but this is
(IMHO) much less complicated and much more needed than e.g.
infinitely big integers which we already have.

If Ruby wants to move forward, it needs transparent String support
and hopefully a separation of String and ByteArray, since this un-
separation brought us code which is mostly wrong (currently most
existing Ruby code breaks if string encoding is honoured, as can be
seen from the experience of the brave people who modified the String
class).

Ruby is my favourite language, and if it had String support as
suggested, software development would be just pure joy...

Please listen to the people who tell of disastrous experiences in
other languages. And for a good experience: I have developed in Cocoa
on Mac OS X for many, many years, and it has a great String class (ok,
the suggested Ruby class would be even better, but still). Plus it has
separate String and byte array classes. The results are superb. There
are no problems, and nobody ever worries about strings and encodings.
Ever. You can check the mailing lists.


izidor
A2b2f4ee23989dc68529baef9cbddcd6?d=identicon&s=25 Julian 'Julik' Tarkhanov (Guest)
on 2006-06-25 19:28
(Received via mailing list)
On 25-jun-2006, at 19:18, Izidor Jerebic wrote:
>
> Please listen to the people which tell of disastrous experience in
> other languages. And for good experience, I develop in Cocoa in Mac
> OS X for many many years, and it has great String class (ok, the
> suggested Ruby class would be even better, but still). Plus it has
> separated String and Byte array. The results are superb. There is
> no problems, and nobody ever worries about strings and encodings.
> Ever. You can check the mailing lists.

The greatest thing about Cocoa is that I can expect 99 percent of the
programs I use to do The Right Thing when I want to input Russian
text, and NOT because the programmer did something special to make it
work. Because if he had to, he wouldn't. In contrast, 70 percent of
Carbon applications are not even capable of displaying the text
properly (let alone letting me type it in).
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-25 21:08
(Received via mailing list)
On 6/25/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Or using Win32 API ending with W could allow you living in the
> Unicode?

Matz,

I've mentioned it before, but I will be happy to make the Windows APIs
work with Unicode once the m17n Strings exist. Yes, I will be making
them use either UTF-8 (conversion required, most likely to be compatible
with existing code) or UTF-16 (no conversion required). It will work
well: I have done a similar implementation for code that I have written
at work.

-austin
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-25 21:15
(Received via mailing list)
On 6/25/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> If Ruby wants to move forward, it needs transparent String support and
> hopefully separation of String and ByteArray, since this un-
> separation brought us code which is mostly wrong (currently most of
> existing Ruby code breaks if string encoding is honoured, as can be
> seen from experience of brave people who modified String class).

This is an incorrect and unsupportable statement. It is completely
unnecessary to separate unencoded (e.g., binary) String support into
String and ByteArray.

Please don't try to assume that the problem is this completely
unnecessary division. The problem is that existing strings are
completely unencoded and have no way of being flagged with an encoding
that is supported in any way across all of Ruby.

People are making really *stupid* assumptions based on what choices
other development teams have made, and it's irritating.

Ruby does not need a String with an internal representation in Unicode;
Ruby does not need a separate byte vector. An unencoded string can be
treated as a byte vector with no problems; if it is determined to have
textual meaning, it can be tagged with an encoding very simply and from
that point be treated as a meaningful string. There are times when the
encoding is *not* best treated in Unicode, especially if there are
potential conversion errors.

-austin
F1d37642fdaa1662ff46e4c65731e9ab?d=identicon&s=25 Charles O Nutter (Guest)
on 2006-06-25 22:40
(Received via mailing list)
On 6/25/06, Austin Ziegler <halostatue@gmail.com> wrote:
>
> Ruby does not need a String with an internal representation in Unicode;
> Ruby does not need a separate byte vector. An unencoded string can be
> treated as a byte vector with no problems; if it is determined to have
> textual meaning, it can be tagged with an encoding very simply and from
> that point be treated as a meaningful string. There are times when the
> encoding is *not* best treated in Unicode, especially if there are
> potential conversion errors.
>

When is a ByteArray not a ByteArray? When is a String not a String?
Is it correct to mingle the two concepts perpetually, when they each
have fairly specific definitions? My problem with continuing to treat
String as a byte vector is that it forces two somewhat incompatible
concepts on the same class and the same methods. If you can use a
String as both a byte vector and as a sequence of characters by
calling the same methods, then setting or clearing the encoding
suddenly has the side-effect of changing how elements of the String
are to be treated. If you are providing separate methods for working
with bytes as opposed to working with characters, then you are
already splitting the two concepts.

(As an aside, does it make sense that I read from a binary file into
a String? Can I reliably assume that binary content in a String
should be logically manipulable as text strings are? Should my binary
String work anywhere and everywhere a text-based String does? I would
think that binary content neither walks nor quacks like a String.)

By your definition, a String can be treated as a ByteArray so long as
its internal string does not have an encoding. What do I use if I
want to have an encoding and still use byte vector semantics?

Is it appropriate that a String is no longer usable as a ByteArray as
a result of changing some state? If there exists any state where a
String cannot be logically treated as a byte array, then String !=
ByteArray in the general case either. The encoding of a String's
internal representation should not dictate the outward behavior of
the String.

If, however, you completely separate the two concepts, there's no
dichotomy. In that case, a String deals with characters, and you do
not have guarantees about byte boundaries or indexed elements. You
only have guarantees about characters, as it should be.
Simultaneously, a ByteArray would allow you to always work with a
vector (array) of bytes, regardless of what those bytes contain.

I'll end by saying this: I think it's a no-brainer that for dealing
with streams of bytes, there should be a non-string byte vector
class. If folks are insistent on keeping them the same class, you
can't logically continue to call it a String and have it fulfill the
dual purposes of byte vector and character vector at the same time.
If you plan to provide methods for supporting both behaviors, you're
putting two distinct behaviors into the same type.

I understand the unwillingness to move away from String as a byte
vector, but with multibyte support coming you really can't have
String == ByteArray without causing problems somewhere. They simply
don't have the same behavior, and trying to pretend they do is asking
for trouble.
2a321daf565791ad30ac5ee945abf59a?d=identicon&s=25 Izidor Jerebic (Guest)
on 2006-06-25 22:46
(Received via mailing list)
On 25.6.2006, at 21:12, Austin Ziegler wrote:

> This is an incorrect and unsupportable statement. It is completely
> unnecessary to separate unencoded (e.g., binary) String support into
> String and ByteArray.
Well, if it is a byte array, it is not a String (an array of
characters), is it?

If Ruby had RegEx operations on byte arrays, there would be no need
for an untyped quasi-String. An API that has two incompatible things
as one class is just plain ugly and wrong.

Reading a jpeg image into a String is totally wrong.  You need bytes.
You get characters, but they aren't really characters, they are
bytes. Until something happens (maybe) and they are characters
(maybe), or they are not (maybe). img_var[5] is what? The 6th byte?
The 6th character, which may be 2 bytes wide if the encoding is utf8?
What exactly? Is this a clear API? There is no need for bytes
masquerading as Strings. None. This practice just confuses the writer
and the reader of the code. You need either bytes or Strings. Never
both in the same variable. They are semantically totally different.
At least they should be (we would not have problems if people
honoured this distinction).

>
> Please don't try to assume that the problem is this completely
> unnecessary division. The problem is that existing strings are
> completely unencoded and have no way of being flagged with an encoding
> that is supported in any way across all of Ruby.

The problem is exactly this: the separation between bytes and
characters. This is the general problem we have and are discussing
right now. The API should help us solve the problem.

And you apparently missed all the attempts to extend String (also
with encodings a la 1.9) that failed because of existing software,
not because of Ruby.
>
> Ruby does not need a String with an internal representation in
> Unicode;

Nobody says at this point of conversation that we need internal
representation in unicode for all strings. We just want to avoid
thinking about ANY encoding. We have other things to do. So having a
transparent conversions between compatible encodings is a must.

> Ruby does not need a separate byte vector. An unencoded string can be
> treated as a byte vector with no problems; if it is determined to
> have textual meaning, it can be tagged with an encoding very simply

It can be, but it is not and will not be. Do you read emails? The
problem is that people do not do things like that. And then other
people have problems. If all the code you run is yours, then you are
right. For many people that is not true.

> There are times when the
> encoding is *not* best treated in Unicode, especially if there are
> potential conversion errors.

Why do you keep on about this?

Once again - WE DO NOT CARE WHAT ENCODING IS THERE. We just want the
string operations to work without any extra programming work when
operands have compatible encodings.

As written very well by Lugovoi Nikolai:

>  b) no need to care what encodings the strings returned from
> third-party code have;
>  c) using explicitly stated conversion options for external IO.
> 3) on Unicode and i18n: at least a set of classes for
> Unicode-specific tasks (collation, normalization, string search,
> locale-aware formatting etc.) that would efficiently work with Ruby
> strings.

Me too, please.


izidor
F889bf17449ffbf62345d2b2d316a937?d=identicon&s=25 Michal Suchanek (Guest)
on 2006-06-26 00:16
(Received via mailing list)
On 6/25/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> >
> > This is an incorrect and unsupportable statement. It is completely
> > unnecessary to separate unencoded (e.g., binary) String support into
> > String and ByteArray.
>
> Well, if it is a byte array, it is not a String (an array of
> characters), is it?
>
> If Ruby would have RegEx operations on byte arrays, there would be no
> need for untyped quasi String. API that has two incompatible things
> as one class is just plain ugly and wrong.

Here you contradict yourself. Regexes are string (character)
operations, and you want them on byte arrays. So the concepts aren't
really separate. Similarly, when you read part of a file and use it
to determine what kind of file it is, you do not want to convert that
part into another class or re-read it because somebody decided String
and ByteVector are separate.

Plus, this has already been mentioned here.

Michal
B1102f65359ee629df508c7857f03b1c?d=identicon&s=25 Phillip Hutchings (Guest)
on 2006-06-26 00:25
(Received via mailing list)
> Here you contradict yourself. Regexes are string (character)
> operations, and you want them on byte arrays. So the concepts aren't
> really separate. Similarly, when you read part of a file and use it
> to determine what kind of file it is, you do not want to convert that
> part into another class or re-read it because somebody decided String
> and ByteVector are separate.

Why not? When I read CGI params I get them as strings, but if I want
to add them together I need to convert them to integers, because
someone decided that "1" != 1. This is a good thing, so you don't get
"5 purple elephants"+"3 monkeys" = 7, like you do in PHP. Likewise,
when you read from a file/socket/whatever you might not be getting a
real string, you might be getting a byte array. They are fundamentally
different things: a byte array may happen to contain text at some
point, but some time later it may be just a stream of data.
Conversely, a String _always_ contains human-readable text in whatever
encoding you want.

As someone who has to work with Unicode in PHP, I'd say it's important
to separate the types. If you want to display something to a user you
have to know what it is, but when you're reading a file you don't
care, unless you know what's in it.

A Unicode String could be a subclass of the byte array with some
niceties for dealing with multibyte characters. Just a thought.
87fe25bf0272d8ad886dda793bdcbbd9?d=identicon&s=25 Tim Bray (Guest)
on 2006-06-26 00:37
(Received via mailing list)
On Jun 25, 2006, at 1:45 PM, Izidor Jerebic wrote:

> Well, if it is a byte array, it is not a String (an array of
> characters), is it?

+1 to this and to Nutter previously.  Text strings and byte arrays
are different kinds of things and both are useful and I don't see any
benefit from trying to pretend they're the same thing.  But some
apparently-smart people seem to think there is a benefit; perhaps
they could explain it in simple terms for those of us insufficiently-
clued to see it? -Tim
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-26 00:59
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 26 Jun 2006 05:38:46 +0900, "Charles O Nutter"
<headius@headius.com> writes:

|When is a ByteArray not a ByteArray? When is a String not a String? Is it
|correct to mingle the two concepts perpetually, when they each have fairly
|specific definitions? My problem with continuing to treat String as a byte
|vector is that it forces two somewhat incompatible concepts on the same
|class and the same methods.

A string is a sequence of data that can be represented by small
integers.  Some may want to treat them as CharacterStrings, others may
want to treat them as ByteStrings.  They are not as different as you
say.  On many platforms, a file can contain text data or binary data.
Is a chunk of data read from an open file text, or binary?  If you
separate ByteArray and (Character) String, you will need to have two
separate IO classes, BinaryIO and TextIO, etc.  Or you will need an
explicit conversion from a read ByteArray to a CharacterString.  That
makes Ruby programs look a lot like Java programs, which I don't want
them to be.

One of the good properties of the Ruby class library is its small
number of classes.  A class might have multiple roles.  For example, a
Ruby Array can be treated as a Stack, a Queue, etc.  And that is a
good thing, rather than having separate classes for each role.  Why
can't Strings be both sequences of text and binary data?
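For instance, the multiple-roles point with plain Array methods
(standard behaviour):

  a = []
  a.push 1; a.push 2
  a.pop     # => 2  (LIFO: the Array acting as a Stack)
  a.push 3
  a.shift   # => 1  (FIFO: the Array acting as a Queue)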

							matz.
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-26 03:02
(Received via mailing list)
On 6/25/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> Well, if it is a byte array, it is not a String (an array of
> characters), is it?

It could be indistinguishable from such. Even a Unicode string is
ultimately an array of bytes in memory. It just happens that there's a
higher level abstraction that can be used to interpret that particular
array of bytes. What you're asking for is rather like the difference
between std::string and std::vector<unsigned char>. They represent the
same thing, but don't work the same. If you're going to have a String
and ByteVector that work the same (except that the String also has the
higher-level interpretation of characters), is it meaningfully a
different object?

I think not. Indeed, I think that having a separate object for these
would increase the overall complexity and reduce the usability overall.

>> Please don't try to assume that the problem is this completley
>> unnecessary division. The problem is that existing strings are
>> completely unencoded and have no way of being flagged with an
>> encoding that is supported in any way across all of Ruby.
> The problem is exactly this: the separation between bytes and
> characters. This is the general problem we have and discuss right now.
> API should help us solve the problem.

> And you apparently missed all the attempts to extend String (also with
> encodings a la 1.9) that failed because of existing software, not
> because of Ruby.

Excuse me? You don't know what you're talking about here. No existing
version of Ruby has a String with encodings. Not even Ruby 1.9. Any
extension which tries to do this *will fail* because there is no way to
enforce this extension's semantics on all of Ruby and all extensions.
Ruby 1.9 will be different because the m17n String will be a guaranteed
behaviour in Ruby.

The problem is not the separation between bytes and characters, but
that there's no way *in Ruby* to distinguish between the two, at least
not reliably.

>> Ruby does not need a String with an internal representation in
>> Unicode;
> Nobody says at this point of conversation that we need internal
> representation in unicode for all strings. We just want to avoid
> thinking about ANY encoding. We have other things to do. So having a
> transparent conversions between compatible encodings is a must.

I think that you're confusing me with someone else. Most people who
have advocated a separate ByteVector have been unable to articulate
exactly what this would buy us, and most have also advocated an
internal Unicode representation of Strings. I have been one of the
ones who have advocated transparent conversions all along. Frankly,
with coercion, it would be possible to upconvert to a compatible
encoding from any encoding.

>> Ruby does not need a separate byte vector. An unencoded string can be
>> treated as a byte vector with no problems. ; if it is determined to
>> have textual meaning, it can be tagged with an encoding very simply
> It can be, but it is not and will not be. Do you read emails? The
> problem is that people do not do things like that. And then other
> people have problems. If all the code you run is yours, then you are
> right. For many people that is not true.

"Is not" is a useless term. OF COURSE IT ISN'T -- right now. In the
future, with the m17n Strings, it could be -- and would be. And yes, I
have read every single one of these emails about Unicode. Most of them
have been ignorant of anything but their own narrow needs and clueless
about good API design.

>> There are times when the encoding is *not* best treated in Unicode,
>> especially if there are potential conversion errors.
> Why do you keep on about this?
>
> Once again - WE DO NOT CARE WHAT ENCODING IS THERE. We just want the
> string operations to work without any extra programming work when
> operands have compatible encodings.

I suggest you look through the Unicode threads again. You'll find
your statement is untrue. There are a lot of people who (foolishly) want
Unicode to be the only internal representation of Strings in Ruby.

> As written very well by Lugovoi Nikolai:
>> What I, as a Ruby user, wish from Unicode/M17N support:
>> 1) reliability and consistency:
>>  a) String should be an abstraction for a character sequence;
>>  b) String methods shouldn't allow me to garble the internal
>>     representation;
>>  c) treating a String as a byte sequence is handy, but must be
>>     explicitly stated.

An unencoded -- raw -- String would be *only* interpretable as a byte
sequence unless "recoded." Aside from that, everything said above would
be true.

>> 2) coding comfort:
>>  a) no need to care what encodings strings have while working with
>>     them;
>>  b) no need to care what encodings the strings returned from
>>     third-party code have;
>>  c) using explicitly stated conversion options for external IO.

You'll always need to care, even if you're using Unicode. You can't
*not* care and claim to be doing Unicode or m17n work. We can *reduce*
those concerns, but you *CANNOT* be ignorant of this at any time.

>> 3) on Unicode and i18n: at least a set of classes for
>> Unicode-specific tasks (collation, normalization, string search,
>> locale-aware formatting etc.) that would efficiently work with Ruby
>> strings.
> Me too, please.

That would be useful.

-austin
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-26 03:14
(Received via mailing list)
On 6/25/06, Phillip Hutchings <sitharus@sitharus.com> wrote:
>> Here you contradict yourself. Regexes are string (character)
>> operations, and you want them on byte arrays. So the concepts aren't
>> really separate. Similarly, when you read part of a file and use it
>> to determine what kind of file it is, you do not want to convert
>> that part into another class or re-read it because somebody decided
>> String and ByteVector are separate.
> Why not? When I read CGI params I get them as strings, but if I want
> to add them together I need to convert them to integers, because
> someone decided that "1" != 1. This is a good thing, so you don't get
> "5 purple elephants"+"3 monkeys" = 7, like you do in PHP.

Sorry, but "reading" CGI params is a red herring. You may get it as one
thing and then convert it to something else.

> Likewise, when you read from a file/socket/whatever you might not be
> getting a real string, you might be getting a byte array. They are
> fundamentally different things, a byte array may happen to contain
> text at some point, but some time later it may be just a stream of
> data. Conversely a String _always_ contains human-readble text in
> whatever encoding you want.

Okay. What class should I get here?

  data = File.open("file.txt", "rb") { |f| f.read }

Under the people who want separate ByteVector and String class, I'll
need *two* APIs:

  st = File.open("file.txt", "rb") { |f| f.read_string }
  bv = File.open("file.txt", "rb") { |f| f.read_bytes }

Stupid, stupid, stupid, stupid. If I have guessed wrong about the
contents of file.txt, I have to rewind and read it again. Better to
*always* read as bytes and then say, "this is actually UTF-8". This
would be as stupid in C++, Java, or C#:

  class File
  {
      bool read(string& st);
      bool read(byte_vector& bv);
  };

Yes, I can't actually read into the item, but have to call an accessor.
Moronic design, mostly because I can't do:

  class File
  {
      string read(void);
      byte_vector read(void);
  };

That would help in static languages, but they can't do that -- and Ruby
can't do it either, since variables are just labels.

> As someone who has to work with Unicode in PHP, I'd say it's important
> to separate the types. If you want to display something to a user you
> have to know what it is, but when you're reading a file you don't
> care, unless you know what's in it.

The problem here is not unification. The problem here is that PHP is
stupid. It is generally recognised that Ruby's API decisions are much
smarter than those of most other languages, and this is a good example
of where that would show.

> A Unicode String could be a subclass of the byte array with some
> niceties for dealing with multibyte characters. Just a thought.

Unnecessary and overcomplex.

-austin
B1102f65359ee629df508c7857f03b1c?d=identicon&s=25 Phillip Hutchings (Guest)
on 2006-06-26 03:23
(Received via mailing list)
> Sorry, but "reading" CGI params is a red herring. You may get it as one
> thing and then convert it to something else.

Exactly.

> > Likewise, when you read from a file/socket/whatever you might not be
> > getting a real string, you might be getting a byte array. They are
> > fundamentally different things, a byte array may happen to contain
> > text at some point, but some time later it may be just a stream of
> > data. Conversely a String _always_ contains human-readble text in
> > whatever encoding you want.
>
> Okay. What class should I get here?
>
>   data = File.open("file.txt", "rb") { |f| f.read }

A byte vector. Unknown input, so you just get a stream of bytes.

> Under the people who want separate ByteVector and String class, I'll
> need *two* APIs:
>
>   st = File.open("file.txt", "rb") { |f| f.read_string }
>   bv = File.open("file.txt", "rb") { |f| f.read_bytes }

Why? This looks needlessly complex.

string = File.open('file.txt', 'r') { |f| f.read.to_s(:"utf-8") }

Or possibly
string = File.open('file.txt', 'r') { |f| f.read(:utf8) }
bytes = File.open('file.txt', 'r') { |f| f.read(:bytearray) }

with no argument assuming it's a default encoding. But with this
approach the same class could be used for both, which takes us full
circle ;)

> > As someone who has to work with Unicode in PHP, I'd say it's important
> > to separate the types. If you want to display something to a user you
> > have to know what it is, but when you're reading a file you don't
> > care, unless you know what's in it.
>
> The problem here is not unification. The problem here is that PHP is
> stupid. It is generally recognised that Ruby's API decisions are much
> smarter than most other languages, and this is a good example of where
> this would happen.

Hence why I'm using Ruby, but I'm paid for PHP. Ruby is by far the
nicer language.

The best approach, to my untrained eye, would be some sort of global
setting for all libraries to operate on, where the developer has to
ensure that all data is read in that encoding. Hopefully that will
make dealing with legacy data easier. The ideal situation would be for
everything to be in one encoding, but that just doesn't happen.
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-26 04:24
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 26 Jun 2006 10:22:15 +0900, "Phillip Hutchings"
<sitharus@sitharus.com> writes:

|>   st = File.open("file.txt", "rb") { |f| f.read_string }
|>   bv = File.open("file.txt", "rb") { |f| f.read_bytes }
|
|Why? This looks needlessly complex.
|
|string = File.open('file.txt', 'r') {f.read.to_s(:utf-8)}
|
|Or possibly
|string = File.open('file.txt', 'r') {f.read(:utf8)}
|bytes = File.open('file.txt', 'r') {f.read(:bytearray)}

They are both more complex than the current design.  If File can
return a String or a ByteArray, why shouldn't a String with "no
encoding" behave as a sequence of bytes, instead of separating the
classes?  Are there any specific operations that should be in
ByteArray but not in String, or vice versa?

							matz.
F1d37642fdaa1662ff46e4c65731e9ab?d=identicon&s=25 Charles O Nutter (Guest)
on 2006-06-26 04:43
(Received via mailing list)
One clarification I'd like to add to this: I'm not saying that a
ByteArray needs to be added, but if you're going to treat String as a
ByteArray, then perhaps there should be another type for character
vectors?

Perhaps through some logic (perhaps the fact that this is the "way it
is" in Ruby 1.8) String does == ByteArray. If I could play devil's
advocate for a moment, maybe the new, fancy m17n String, however it's
implemented, should be a different class?

String == ByteArray in form and function
CharString == a string of characters with some particular encoding,
character logic, and so on

Perhaps even CharString < String, so it retains byte-level read/write
operations.

There's another obvious advantage here... APIs that currently return
a byte array String will continue to do so, as they work in Ruby 1.8.
CharString could also be implemented today for Ruby 1.8, providing an
encoding- and character-aware String implementation for applications
that need it.

My only point about the dichotomy between <byte collection treated as
a string> and <character collection treated as a string> is that at
some level, they imply different behaviors, different APIs, different
interfaces. Perhaps the answer is not to change existing Ruby code to
use an m17n String while trying to retain byte array capabilities at
the same time... but maybe it's worth considering that the new
behavior warrants a separate type?

String.to_cs(:utf8) => CharString
String retains the current interface and semantics
CharString gains [n] => character or single-char string rather than an
int, etc.
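A rough 1.8-era sketch of that shape (hypothetical code; assumes
$KCODE is set so that /./m matches one character):

  class CharString < String
    attr_reader :encoding

    def initialize(str, encoding)
      super(str)
      @encoding = encoding
    end

    def [](n)        # character semantics instead of byte semantics
      scan(/./m)[n]
    end

    def length       # length in characters, not bytes
      scan(/./m).size
    end
  end

  class String
    def to_cs(encoding)
      CharString.new(self, encoding)
    end
  end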

I know you (matz) want to break as much as possible with the 2.0
release, but I still don't see the advantage of marrying the "byte
array string" and "char string" types in the same class when separate
types and behaviors would be more logical and break far less.
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-26 04:43
(Received via mailing list)
On 6/25/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
>| bytes = File.open('file.txt', 'r') {f.read(:bytearray)}
> They are equally more complex than the current design.  If File can
> return String or ByteArray, why shouldn't String with "no encoding"
> behave as sequence of bytes instead of separating?  Are there any
> specific operations that should be in ByteArray but not in String, or
> vise versa?

There are operations for Strings (#each_character, perhaps) that make
less sense for ByteVectors than for character-based Strings. But
everything or nearly everything you would want to do with a ByteVector
you would want to do with a String, and some operations from Strings
make sense on ByteVectors (regexp operations).
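Such an iterator is easy to sketch on top of today's regexps (1.8
semantics; assumes $KCODE is set so that /./m matches one character):

  class String
    def each_character
      scan(/./m) { |c| yield c }
    end
  end

  "日本語".each_character { |c| p c }   # three characters with $KCODE="u"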

I would much rather keep the API -- and the class library -- simple. I
would rather do this:

  st = File.open("file.txt", "rb", :encoding => :utf8) { |f| f.read }

or

  bv = File.open("file.txt", "rb") { |f| f.read }
  st = bv.to_encoding(:utf8)

-austin
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-26 05:02
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 26 Jun 2006 11:37:45 +0900, "Charles O Nutter"
<headius@headius.com> writes:
|I know you (matz) want to break as much as possible with the 2.0 release,
|but I still don't see the advantage of marrying the "byte array string" and
|"char string" types in the same class when separate types and behaviors
|would be more logical and break far less.

I still don't see how separate types and behaviors would be more
logical and break far less.  For example, if I want to check the EXIF
conformance of a jpeg file, I do

  def self.exif_file?(filename)
    exif_header = "\xff\xd8\xff\xe1"
    magic = File.open(filename, "rb") {|f| f.read(4) }
    magic == exif_header
  end

I am not sure what you expect from the separation, but I doubt the
separation would make the above code "be more logical and break far
less".

							matz.
87fe25bf0272d8ad886dda793bdcbbd9?d=identicon&s=25 Tim Bray (Guest)
on 2006-06-26 06:43
(Received via mailing list)
On Jun 25, 2006, at 6:11 PM, Austin Ziegler wrote:

> Under the people who want separate ByteVector and String class, I'll
> need *two* APIs:
>
>  st = File.open("file.txt", "rb") { |f| f.read_string }
>  bv = File.open("file.txt", "rb") { |f| f.read_bytes }

Maybe I'm missing something, but in today's networked heterogeneous
environment, that first call looks deeply dangerous to me.  I don't
see how you can expect to get a String out of a file in the general
case.  Files contain bytes, strings contain characters, and
pretending you can get from one to the other without explicit
encoding specification or inference is unsound.

Pardon me if I'm missing something obvious. -Tim
87fe25bf0272d8ad886dda793bdcbbd9?d=identicon&s=25 Tim Bray (Guest)
on 2006-06-26 06:49
(Received via mailing list)
On Jun 25, 2006, at 7:21 PM, Yukihiro Matsumoto wrote:

> Are there any
> specific operations that should be in ByteArray but not in String, or
> vise versa?

Well, on strings, indexing and substring operations and iterators and
regular expressions should (at least optionally) have character
rather than byte semantics, right?   Another example is encoding-
normalization (combining diacritics, etc) which doesn't apply to byte
arrays. -Tim
2a321daf565791ad30ac5ee945abf59a?d=identicon&s=25 Izidor Jerebic (Guest)
on 2006-06-26 06:53
(Received via mailing list)
On 26.6.2006, at 5:01, Yukihiro Matsumoto wrote:

> I am not sure what you expect from the separation, but I doubt the
> separation would make the above code "be more logical and break far
> less".

The above code assumes all file operations return byte arrays. What is
the code when we want to obtain a String of characters?

What if there is some $KCODE (or equivalent) setting somewhere in the
program before these lines? What would be the effect of that?

The problem is the auto-magic encoding handling which is required to
keep text processing as simple as it is now. You can have either text
processing (which adds encoding handling for us, combines bytes into
characters etc.) or byte processing (which does not). How do we
distinguish between the two modes of operation?

The obvious way is by adding a ByteArray. But maybe there is a better
way...

izidor
E0526a6bf302e77598ef142d91bdd31c?d=identicon&s=25 Daniel DeLorme (Guest)
on 2006-06-26 07:26
(Received via mailing list)
Yukihiro Matsumoto wrote:
> I am not sure what you expect from the separation, but I doubt the
> separation would make the above code "be more logical and break far
> less".

Just jumping into the discussion here, I have to agree with Matz. A
char-vector is simply a higher-level representation of a byte-vector,
not different enough to warrant two entirely separate classes.

I think the real issue is not technical but rather a problem of
perception and education. Ever since C-style strings, programmers have
learned to view a string as an array of chars. So when we need to do
char-string manipulation, we resort to pointer arithmetic when in fact
the "correct" and ruby-native way of manipulating strings is with
regular expressions. Instead of giving in to this old string-as-array
mentality, maybe we should teach people to use regular expressions?
Hmmm, probably impossible.

A string can be interpreted as both a sequence of bytes or a sequence
of characters, but the methods can be confusing. Obviously, upcase and
downcase are operations at the character level, but what is []
supposed to do? From the ruby point of view, str[0..3] gives you the
first 4 bytes and str.scan(/^..../) gives you the first 4 characters.
But for the majority with the string-as-array mentality, [] is
ambiguous; does it give you access to the bytes or to the characters
of the string? In the interest of facilitating education, there needs
to be a clear disambiguation; instead of str[0..3] it should be
str.byte(0..3) and str.char(0..3) -- with maybe the latter one giving
a warning along the lines of "use regular expressions!" ;-)  That way
the ambiguity between byte-vector and char-vector could be resolved.
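The two accessors are easy to prototype (hypothetical method names
taken from the paragraph above; 1.8 semantics, $KCODE = "u"):

  class String
    def byte(range)
      unpack("C*")[range].pack("C*")
    end

    def char(range)
      scan(/./m)[range].join
    end
  end

  $KCODE = "u"
  "résumé".byte(0..1)   # => "r\303" - half a character!
  "résumé".char(0..1)   # => "ré"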

Daniel
Cb75e9a5b18ad023ab1cce64e7cdebab?d=identicon&s=25 Lothar Scholz (Guest)
on 2006-06-26 07:36
(Received via mailing list)
Hello Tim,

TB> Well, on strings, indexing and substring operations and iterators
TB> and regular expressions should (at least optionally) have character
TB> rather than byte semantics, right?

For UTF-8, which hopefully will rule the world soon, the worst
libraries I have seen are trying to do this. But it is not the
intention of the designers, and with an implementation that works on
characters you lose the genius encoding style of UTF-8.

Of course some operations are more difficult, but this is left to the
application programmer for good reasons. Only a few cases of string
manipulation need special (non ASCII) character handling.
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-26 07:51
(Received via mailing list)
On 6/25/06, Charles O Nutter <headius@headius.com> wrote:
> One clarification I'd like to add to this: I'm not saying that a ByteArray
> needs to be added, but if you're going to treat String as a ByteArray, then
> perhaps there should be another type for character vectors?

There's no meaningful distinction between the division of
ByteArray/String and String/CharString. I do *not* believe that this
is a viable option. The *sole* argument in favour is that we could add
a CharString to Ruby 1.8 -- but I believe that this would be
stampeding us in the wrong direction.

Even if CharString < String, there will be problems -- people already
note that there are issues with subclasses of the built-in classes.

> My only point about the dichotomy between <byte collection treated as a
> string> and <character collection treated as a string> is that at some
> level, they imply different behaviors, different APIs, different interfaces.
> Perhaps the answer is not to change existing Ruby code to use a m17n String
> while trying to retain byte array capabilities in the same time...but maybe
> it's worth considering that the new behavior warrants a separate type?

This is where I disagree with you completely. If I have a String that
contains ISO-8859-15 data, it *happens* that s#byte_count and s#length
are the same value. It differs with UTF-8 data, but the interpretation
of a Character is, at best, a *trait* of the data being stored. I have
*really* given this a lot of thought, and I really do think that Matz
is right about this and that the people who want Unicode-native
strings are wrong. This sort of sucks for JRuby because of problems
with Java. But I do not think that Sun made the right decision with
Java. If nothing else, they ended up backing a dead "standard" during
the initial phases, and have had to hack out since then.

> I know you (matz) want to break as much as possible with the 2.0 release,
> but I still don't see the advantage of marrying the "byte array string" and
> "char string" types in the same class when separate types and behaviors
> would be more logical and break far less.

It *isn't* more logical. It doubles the number of required APIs for
IO. It *completely* complicates things from that perspective, with
little value for the people who have to implement character-oriented
data routines.

-austin
E34b5cae57e0dd170114dba444e37852?d=identicon&s=25 Logan Capaldo (Guest)
on 2006-06-26 07:51
(Received via mailing list)
On Jun 26, 2006, at 1:24 AM, Daniel DeLorme wrote:

> I think the real issue is not technical but rather a problem of
> perception and education. Ever since C-style strings, programmers
> have learned to view a string as an array of chars. So when we need
> to do char-string manipulation, we resort to pointer arithmetic
> when it fact the "correct" and ruby-native way of manipulating
> strings is with regular expressions. Instead of giving in to this
> old string-as-array mentality, maybe we should teach people to use
> regular expressions? Hmmm, probably impossible.

Regular expressions are a very powerful tool, but they do not
describe the entire set of operations one would reasonably want to
perform on a string. Or perhaps they do, but in a needlessly complex
way. If I want to get the first letter (character?) of a sentence, in
pure regexp terms I'd do this: str.match(/\A./)[0]. It's needlessly
cryptic. Note that I'm not trying to make a commentary on whether or
not character string/byte string should be separate, just trying to
point out that "use regular expressions" shouldn't always be the
answer.
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-26 08:10
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 26 Jun 2006 13:51:33 +0900, Izidor Jerebic
<ij.rubylist@gmail.com> writes:

|>
|> I am not sure what you expect about separation, but I doubt separation
|> would make above code to "be more logical and break far less".
|
|Above code assumes all file operations return byte arrays. What is
|the code when we want to obtain String of characters?

  line = File.open(filename, "r", "utf8") {|f| f.gets }

|What if there is some $KCODE (or equivalent) setting somewhere in the
|program before these lines? What would be the effect of that?

I think IO#read shall always return a "binary" string, since its
specified length should always be in bytes.  Anyway, when in doubt,
you can explicitly specify the "binary" encoding.

|The problem is the auto-magic encoding handling which is required to
|have text processing be as simple as it is now. You can have either
|text processing (which adds encoding handling for us, combines bytes
|in characters etc.) or byte processing (which does not). How do we
|distinguish between the two modes of operation?

By explicitly setting their encoding to "binary", e.g.

  text = obtain_string_data()
  text.encoding = "binary"
  ...

|The obvious way is by adding a ByteArray. But maybe there is better
|way...

Show me the pseudo code using ByteArray, and I will show you its
counterpart using String with an encoding tag.

							matz.
47b1910084592eb77a032bc7d8d1a84e?d=identicon&s=25 Joel VanderWerf (Guest)
on 2006-06-26 08:16
(Received via mailing list)
Logan Capaldo wrote:
> Regular expressions are a very powerful tool, but they do not describe
> the entire set of operations one would reasonably want to perform on a
> string. Or perhaps they do but in a needlessly complex way. I want to
> get the first letter (character?) of a sentence, in pure regexp terms
> I'd do this: str.match(/\A./)[0] It's needlessly cryptic. Note that I'm
> not trying to make a commentary on whether or not character string/byte
> string should be separate, just trying to point out that "use regular
> expressions" shouldn't always be the answer.
>

irb(main):001:0> "It's needlessly cryptic."[/./]
=> "I"

Not disagreeing, just trying to get more credit for regexes.

irb(main):010:0> "It's needlessly cryptic."[/.{17}(.)/, 1]
=> "r"

That's a bit more cryptic.
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-26 08:16
(Received via mailing list)
On 6/26/06, Tim Bray <tbray@textuality.com> wrote:
> pretending you can get from one to the other without explicit
> encoding specification or inference is unsound.

Um. You're not missing anything -- I'm mocking the API pair that would
be required to make this work as certain advocates have suggested.

> Pardon me if I'm missing something obvious. -Tim

You're not. IO should be done on byte buffers. There's no meaningful
and useful distinction between a byte buffer and a string at the most
basic level. There's an additional interpretation that's possible at a
higher level (giving character-oriented operations), but that in and
of itself does not imply a need for a separation of the two concepts.
(Indeed, I find myself infuriated in C++ when I have to do something
that would work well with std::vector<unsigned char> and I'm actually
working with std::string -- or vice versa.)

-austin
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-26 08:16
(Received via mailing list)
On 6/26/06, Tim Bray <tbray@textuality.com> wrote:
> On Jun 25, 2006, at 7:21 PM, Yukihiro Matsumoto wrote:
> > Are there any
> > specific operations that should be in ByteArray but not in String, or
> > vise versa?
> Well, on strings, indexing and substring operations and iterators and
> regular expressions should (at least optionally) have character
> rather than byte semantics, right?   Another example is encoding-
> normalization (combining diacritics, etc) which doesn't apply to byte
> arrays. -Tim

Those are interpretations of the data underlying the String, though.
Nothing says we can't use these sort of operations still, especially
with Ruby's dynamic objects. But I *firmly* believe that it can be
done in a way so as to not require the separation of a String from a
Byte Array.

-austin
Austin Ziegler (austin)
on 2006-06-26 08:23
(Received via mailing list)
On 6/26/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> >
> > I am not sure what you expect about separation, but I doubt separation
> > would make above code to "be more logical and break far less".
> Above code assumes all file operations return byte arrays. What is
> the code when we want to obtain String of characters?

As Tim Bray pointed out in a response to me, trying to get a String
from a file is a ludicrous operation. I was mocking the API required
(e.g., File#read_string or something equally bozonic). You need to
read your data and *then* mark it as a String with a particular
encoding. And if you *globally* change the interpretation of File#read
to be String, you will be breaking the ability to read truly binary
data.

> The problem is the auto-magic encoding handling which is required to
> have text processing be as simple as it is now. You can have either
> text processing (which adds encoding handling for us, combines bytes
> in characters etc.) or byte processing (which does not). How do we
> distinguish between the two modes of operation?
>
> The obvious way is by adding a ByteArray. But maybe there is better
> way...

Yes. It's to actually read what has been suggested. The m17n String
won't be a magic bullet. But you'll be able to do something like:

  bv = File.open("file.txt", "rb") { |f| f.read }
  sv = bv.with_encoding(:utf8)

Or something like that. And you can still do bv == "\xff\xd8\xff\xe1"
as appropriate.

-austin
Daniel DeLorme (Guest)
on 2006-06-26 08:29
(Received via mailing list)
Logan Capaldo wrote:
>
> Regular expressions are a very powerful tool, but they do not describe
> the entire set of operations one would reasonably want to perform on a
> string. Or perhaps they do but in a needlessly complex way. I want to
> get the first letter (character?) of a sentence, in pure regexp terms
> I'd do this: str.match(/\A./)[0] It's needlessly cryptic. Note that I'm
> not trying to make a commentary on whether or not character string/byte
> string should be separate, just trying to point out that "use regular
> expressions" shouldn't always be the answer.

It's funny, maybe I'm just dumb but I can't think of a single
*real-world* example where you'd want to access particular characters
of a string. Why do you want the first char? In the context of a byte
string there might be something special at position n (e.g. exif
header), but in the context of a human-readable string what is there?
For example, if you want that first char in order to check if it's a
space or not, you should use str =~ /^ /, etc, etc. I honestly can't
think of any real-world examples where regular expressions are less
appropriate than pointer arithmetic. Can you illuminate me with some?

Daniel
Izidor Jerebic (Guest)
on 2006-06-26 09:00
(Received via mailing list)
On 26.6.2006, at 8:08, Yukihiro Matsumoto wrote:

>
> |What if there is some $KCODE (or equivalent) setting somewhere in the
> |program before these lines? What would be the effect of that?
>
> I think IO#read shall always return "binary" string, since its
> specified length should always be in bytes.  Anyway, when in doubt,
> you can explicitly specify "binary" encoding.

Oh, I see. So basically IO always returns a ByteArray, and one needs to
convert it to a String of characters explicitly (or implicitly by
specifying a parameter to IO).

No magic tagging with encoding. Well, this is nice and easy to
understand.

But how will this influence the simplicity of small programs in Ruby
which deal with data in known (single) encoding? I was under the
impression that there would be some magic global setting which will
enable such programs to use Strings in the correct encoding.

Thank you for clarifications. They are most welcome...

izidor
Yukihiro Matsumoto (Guest)
on 2006-06-26 09:34
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 26 Jun 2006 15:58:30 +0900, Izidor Jerebic
<ij.rubylist@gmail.com> writes:

|But how will this influence the simplicity of small programs in Ruby
|which deal with data in known (single) encoding? I was under the
|impression that there would be some magic global setting which will
|enable such programs to use Strings in the correct encoding.

The detail is not fixed yet but it would honor locales for the default
encoding.

							matz.
Dmitrii Dimandt (Guest)
on 2006-06-26 09:38
(Received via mailing list)
On 6/26/06, Daniel DeLorme <dan-ml@dan42.com> wrote:
>
> It's funny, maybe I'm just dumb but I can't think of a single *real-world*
> example where you'd want to access particular characters of a string. Why do you
> want the first char? In the context of a byte string there might be something
> special at position n (e.g. exif header), but in the context of a human-readable
> string what is there? For example, if you want that first char in order to check
> if it's a space or not, you should use str =~ /^ /, etc, etc. I honestly can't
> think of any real-world examples where regular expressions are less appropriate
> than pointer arithmetic. Can you illuminate me with some?

Substrings? Finding the occurrence of a string in another string? Why
shouldn't str[0..3] work on characters (for a string with encoding
set)? Maybe I want to do something like str[0] =
Unicode::upcase(str[0])? :)

Isn't that what Humane Interface Design
(http://www.martinfowler.com/bliki/HumaneInterface.html) is all about
;-)

Regular expressions _are_ cryptic. They are powerful, but do I need a
sledgehammer when I need a paperclip?
Alexey Borzenkov (snaury)
on 2006-06-26 09:45
Yukihiro Matsumoto wrote:
> You said Tcl has Unicode support that works well with you.  So that I
> think treating all of them in UTF-8 is OK for you.

It's actually not about treating everything in UTF-8; it just unifies
everything in Tcl in a way that you can have the full variety of
characters in strings.

> Then how can it
> determine which should be in the current code page, or in Unicode?
> Or using Win32 API ending with W could allow you living in the
> Unicode?

Well, currently (just downloaded the latest cvs sources) ruby uses the
ansi versions of the CreateFile and FindFirstFile/FindNextFile APIs, so
even if I set, for example, $KCODE to UTF-8 (not sure how you can
currently make ruby work with UTF-8), the ansi versions of the APIs are
still called, and that means that

  1) if there are filenames with characters that don't fall in the
range of the current codepage, I will receive '?' in place of them when
I enumerate directory contents.
  2) I receive filenames in the current code page, and not in UTF-8
  3) There is no way for me to open a file with these characters using
standard ruby classes

The same with win32ole extension, I can see a lot of ole_wc2mb/ole_mb2wc
there, which breaks things horribly when interoperating with, for
example, Excel and trying to work with russian/greek/japanese and all
other languages all on the same sheet (after I process the sheet,
modifying all of the cells, it will just strip all languages except
russian from it).

In *nixes you can just change your locale to *.UTF-8 and you're ok with
that, because everything you receive when enumerating directory is
UTF-8, and File.open will expect UTF-8. Unfortunately, for Windows that
is not possible: MS already provides 'wide' versions of APIs for those
who need them, and there is no UTF-8 ANSI codepage you can set as
default (because UTF-8 codepage in Windows is somewhat 'virtual', for
conversion purposes only).

In Tcl you have all of your strings in UTF-8, and when Tcl interoperates
with the rest of the world, it converts strings appropriately (for
example, on Win9x there are mostly no 'wide' APIs, so it converts
strings to the current code page and uses the ansi APIs, but on WinNT it
converts them to unicode and uses the 'wide' APIs). What I was thinking
of is a way of setting a "current codepage" for ruby on win32 (including
the possibility of setting it to UTF-8), so that when ruby talks to the
world it would use the 'wide' APIs where possible, converting to and
from this codepage. Unlike Tcl, where the encoding is hard-coded to be
UTF-8, there would be a possibility to choose -- and there is no other
way for the user to get this on Windows (the user can't set the current
codepage to UTF-8).
Alexey Borzenkov (snaury)
on 2006-06-26 09:50
Snaury Miyoto wrote:
> Yukihiro Matsumoto wrote:
>> Then how can it
>> determine which should be in the current code page, or in Unicode?
>> Or using Win32 API ending with W could allow you living in the
>> Unicode?
> Well, currently (just downloaded latest cvs sources) ruby uses ansi
> versions of CreateFile and FindFirstFile/FindNextFile APIs, so even if I
> set, for example, $KCODE to UTF-8 (not sure how you can currently make
> ruby work with UTF-8) ansi versions of APIs are still called, and that
> means that
> The same with win32ole extension, I can see a lot of ole_wc2mb/ole_mb2wc
> there, which breaks things horribly when interoperating with, for
> example, Excel and trying to work with russian/greek/japanese and all
> other languages all on the same sheet (after I process the sheet,
> modifying all of the cells, it will just strip all languages except
> russian from it).

Ah, well, for ole that's not true -- only now did I realize I can set
the codepage there to UTF-8. But still, a similar thing for win32 file
io (and maybe for other places where the win32 API or C runtime is
used) would be great.
Daniel DeLorme (Guest)
on 2006-06-26 10:08
(Received via mailing list)
Dmitrii Dimandt wrote:
> Substrings? Finding the occurrence of a string in another string?

Those operations are precisely what regexes are best at.

> shouldn't str[0..3] work on characters (for a string with encoding
> set)? Maybe I want to do something like
> str[0] = Unicode::upcase(str[0])? :)

What about
   str.sub!(/^./){ |c| Unicode::upcase(c) }
That hardly seems more cryptic to me.

It's not that I don't understand the attraction; it's just that I think
when handling char-strings it's best to change your mental model to
something further away from char/byte arrays.

BTW, if str[0..3] returns the first 4 characters, then how do I get the
first 4 bytes?

Daniel
Michal Suchanek (Guest)
on 2006-06-26 14:21
(Received via mailing list)
On 6/26/06, Daniel DeLorme <dan-ml@dan42.com> wrote:
>
> It's funny, maybe I'm just dumb but I can't think of a single *real-world*
> example where you'd want to access particular characters of a string. Why do you
> want the first char? In the context of a byte string there might be something
> special at position n (e.g. exif header), but in the context of a human-readable
> string what is there? For example, if you want that first char in order to check
> if it's a space or not, you should use str =~ /^ /, etc, etc. I honestly can't
> think of any real-world examples where regular expressions are less appropriate
> than pointer arithmetic. Can you illuminate me with some?

Have you looked at the "short but unique" ruby quiz?

Also when you are building some search trees or such you want access
to letters one by one.
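For example, something like this (a rough sketch with today's
Regexp-based tools; each_char and trie_insert are made-up helpers):

  $KCODE = 'u'

  # iterate one (possibly multibyte) character at a time
  def each_char(str)
    str.scan(/./m) { |c| yield c }
  end

  # insert a word into a hash-based trie, character by character
  def trie_insert(trie, word)
    node = trie
    each_char(word) { |c| node = (node[c] ||= {}) }
    node[:end] = true
    trie
  end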

Thanks

Michal
Julian 'Julik' Tarkhanov (Guest)
on 2006-06-26 14:51
(Received via mailing list)
On 26-jun-2006, at 3:01, Austin Ziegler wrote:

> I suggest you look through the Unicode threads again. You'll find
> your statement is untrue. There are a lot of people who (foolishly)
> want
> Unicode to be the only internal representation of Strings in Ruby.

Let's say there are people who not-so-foolishly believe that trying
to have strings in all possible
encodings is not technically possible and the aforementioned people
don't understand how
a system can reliably handle them. Especially since the aforementioned
people remember that Strings in Ruby
are mutable and can transition from being Unicode to being "something-
else" in one method call.
Julian 'Julik' Tarkhanov (Guest)
on 2006-06-26 14:55
(Received via mailing list)
On 26-jun-2006, at 3:11, Austin Ziegler wrote:


>
> Stupid, stupid, stupid, stupid. If I have guessed wrong about the
> contents of file.txt, I have to rewind and read it again. Better to
> *always* read as bytes and then say, "this is actually UTF-8". This
> would be as stupid in C++, Java, or C#:

Not so fast, let's say you read from a file:

>  st = File.open("file.txt", "rb") { |f| f.read(4056) }

and you receive a PART of a unicode string (because you cannot know
where to stop reading before you look into the structure).
The only way to make what you read valid now is to slide along the
byte length and try to catch the bytes that you skipped.
Should I continue?
Julian 'Julik' Tarkhanov (Guest)
on 2006-06-26 14:58
(Received via mailing list)
On 26-jun-2006, at 8:27, Daniel DeLorme wrote:

> It's funny, maybe I'm just dumb but I can't think of a single *real-
> world* example where you'd want to access particular characters of
> a string.

Well, think again. You have a truncate(text) helper in Rails which
truncates the text to X characters and appends "dot dot dot". The
easiest example. Or you have excerpts... etc.
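Something like this is what you end up writing today (a sketch,
assuming $KCODE is set to UTF-8 so /./ matches whole characters):

  def truncate(text, length = 30, suffix = "...")
    chars = text.scan(/./m)   # characters, not bytes
    chars.size > length ? chars[0, length].join + suffix : text
  end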
Julian 'Julik' Tarkhanov (Guest)
on 2006-06-26 15:07
(Received via mailing list)
On 26-jun-2006, at 10:07, Daniel DeLorme wrote:

>   str.sub!(/^./){ |c| Unicode::upcase(c) }
> That hardly seems more cryptic to me.
It does seem unnatural and hints that you are working with an
encoding-incapable language, because
people who are lucky to be in ASCII will be able to do

str[0] = str[0].upcase

but people who are not will have to invent silly workarounds.

>
> It's not that I don't understand the attraction; it's just that I
> think when handling char-strings it's best to change your mental
> model to something further away from char/byte arrays.
>
> BTW, if str[0..3] returns the first 4 characters, then how do I get
> the first 4 bytes?

str.bytes[0..3] seems OK to me. That is: for Strings the character-
based routines are the base ones, and the byte routines are secondary
(unlike the "chars" accessor I had to bolt on for now). The problem is
that you have to PROTECT an ignorant programmer from things like
normalization and character unity, and NEVER allow him to cut into a
character of a multibyte string UNLESS he specifically mentions that he
wants it that way.
Michal Suchanek (Guest)
on 2006-06-26 15:29
(Received via mailing list)
On 6/26/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> > need *two* APIs:
>
> >  st = File.open("file.txt", "rb") { |f| f.read(4056) }
>
> and you receive a PART of a unicode string (because you cannot know
> where to stop reading before you look into the structure).
> The only way to make what you read valid now is to slide along the
> byte length and try to catch the bytes that you skipped.
> Should I continue?

Why would you read 4096 bytes in the first place?

If you knew the file is in some weird multibyte encoding you should
have set it for the stream, and read something meaningful.

If it is "ascii compatible" (ISO-8859-*, cp*, utf-8, .. ) you can just
use gets.

Otherwise there is no meaningful string content.

Note that 4096 bytes is always OK for UTF-32 (or similar plain wide
character encodings), and may at worst get you half of a surrogate
character for UTF-16. And strings will have to handle incomplete
characters anyway - they may result from some delays/buffering in
network IO or such.
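For UTF-8 the fix-up is mechanical, though. A sketch of holding back an
incomplete trailing character between reads (split_complete_utf8 is a
made-up helper; assumes 1.8 semantics where str[i] is a byte value):

  # split a buffer into [complete part, incomplete trailing bytes]
  def split_complete_utf8(buf)
    i = buf.length - 1
    i -= 1 while i >= 0 && (buf[i] & 0xC0) == 0x80   # skip continuation bytes
    return [buf, ""] if i < 0
    lead = buf[i]
    expected = if lead < 0x80 then 1                 # single-byte character
               elsif (lead & 0xE0) == 0xC0 then 2
               elsif (lead & 0xF0) == 0xE0 then 3
               elsif (lead & 0xF8) == 0xF0 then 4
               else 1                                # invalid lead byte, pass through
               end
    if buf.length - i >= expected
      [buf, ""]                    # chunk ends on a character boundary
    else
      [buf[0, i], buf[i..-1]]      # hold back the incomplete character
    end
  end

  chunk, carry = split_complete_utf8(carry + (io.read(4096) || ""))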

Thanks

Michal
Julian 'Julik' Tarkhanov (Guest)
on 2006-06-26 15:42
(Received via mailing list)
On 26-jun-2006, at 15:27, Michal Suchanek wrote:
>
> Why would you read 4096 bytes in the first place?
This is a pattern. If a file has no line endings, but just one (very
long) stream of characters - can you really use gets?
>
> If you knew the file is in some weird multibyte encoding you should
> have set it for the stream, and read something meaningful.

Or there should be a facility that prevents you from reading
incomplete strings. But is it implied that if I set IO.encoding = foo
the IO objects will prevent me? Will they go out to the provider
of the io and get the missing remaining bytes?
In the case of Unicode the absolute, rigorous minimum is to NEVER
EVER slice into a codepoint, and it can go anywhere you want in terms
of complexity (because
slicing between codepoints is also not the way).
>
> If it is "ascii compatible" (ISO-8859-*, cp*, utf-8, .. ) you can
> just use gets.
>
> Otherwise there is no meaningful string content.
>
> Note that 4096 bytes is always OK for UTF-32 (or similar plain wide
> character encodings),
Of which UTF-32 is the only one that is relevant for Unicode, and if
you investigated the subject a little
you would know that slicing Unicode strings at codepoint boundaries
is often NOT enough. That way you can cut a part of
a compound character, a modifier codepoint or an RTL override
remarkably easily, which will just give you a different character
altogether (or alter your string
display in a particularly nasty way - that is, _reverse_ your string
display for the remaining output of your program if you remove an RTL
override terminator).

> and may at worst get you half of a surrogate
> character for UTF-16. And strings will have to handle incomplete
> characters anyway - they may result from some delays/buffering in
> network IO or such.

This is exactly why the notion of having strings be both byte buffers
and character vectors seems a little difficult. 90 percent of my use
cases for Ruby need characters, not bytes
- and I would love to hint at it specifically should bytes be needed. The
problem right now is that Ruby does not distinguish these at the moment.
Michal Suchanek (Guest)
on 2006-06-26 16:36
(Received via mailing list)
On 6/26/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
>
> On 26-jun-2006, at 15:27, Michal Suchanek wrote:
> >
> > Why would you read 4096 bytes in the first place?
> This is a pattern. If a file has no line endings, but just one (very
> long) stream of characters - can you really use gets?

But can you work with the file in parts then? If there is no
meaningful internal structure you have to work with the file in its
entirety (or do a block copy  but you should not be concerned with
characters then).
If there is a structure you may use alternate line endings.

> of complexity (because
> slicing between codepoints is also not the way).

At most you can expect it to hold incomplete codepoints until they are
read fully I guess. However, incomplete codepoints are going to exist
anyway so the strings must deal with them in one way or another.

> you would know that slicing Unicode strings at codepoint boundaries
> is often NOT enough. That way you can cut a part of
> a compound character, a modifier codepoint or an RTL override
> remarkably easily, which will just give you a different character
> altogether (or alter your string
> display in a particularly nasty way - that is, _reverse_ your string
> display for the remaining output of your program if you remove an RTL
> override terminator).

If the file has some meaningful structure (like line endings or XML)
you should get the complete parts. If it does not you have to deal
with it. And nobody can do it for you except the one who chose the
format in which the file was saved.

> problem right now is that Ruby does not distinguish these at the moment.
But the problem is you cannot distinguish them, not that you do not
have separate classes for them.

Michal
Austin Ziegler (austin)
on 2006-06-26 16:52
(Received via mailing list)
On 6/26/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> On 26-jun-2006, at 3:11, Austin Ziegler wrote:
> >  st = File.open("file.txt", "rb") { |f| f.read(4056) }
> and you receive a PART of a unicode string (because you cannot know
> where to stop reading before you look into the structure).
> The only way to make what you read valid now is to slide along the
> byte length and try to catch the bytes that you skipped.
> Should I continue?

Sure. It won't make you any more correct. Let's play with your example:

  st = File.open("file.txt", "rb", :encoding => :utf8) { |f| f.read(4096) }

Okay. Am I reading 4096 bytes or 4096 characters? The *correct* and
*least surprising* behaviour is to read the specified number of bytes.
Instead it would be better to expose the minimum amount required to
work with this:

  bv = File.open("file.txt", "rb") { |f| f.read(4096) }
  bv.encoding = :utf8
  bv.encoding_valid? # will return false if the whole string isn't a valid UTF-8 sequence

You're really looking for something that is, in the end, completely
unworkable and unnecessarily complex in doing so. The m17n String --
with byte vector characteristics retained -- maintains a clear, simple
API with few exceptions that would have to be memorised or understood.
Adding another class *doubles* the size of the class hierarchy that
has to be understood, and if there are *any* variances between them
the number of exceptions effectively doubles. If there *aren't* any
variances between the class APIs, then what's the point of separating
them in the first place?

A string is an ordered sequence of characters. A byte vector is an
ordered sequence of bytes. If your string is suitably flexible, then
it can say that a byte vector is a string where each character is one
byte long and that collation (etc.) are determined by the byte value.
We're not talking rocket science here. Stop trying to make it such.
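In sketch form (hypothetical API -- none of these methods exist today):

  blob = File.open("image.jpg", "rb") { |f| f.read }
  blob.encoding   # => :binary -- every "character" is one byte
  blob.length     # byte count and character count coincide
  blob[0]         # the first byte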

-austin
Austin Ziegler (austin)
on 2006-06-26 17:05
(Received via mailing list)
On 6/26/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> On 26-jun-2006, at 15:27, Michal Suchanek wrote:
>> Why would you read 4096 bytes in the first place?
> This is a pattern. If a file has no line endings, but just one (very
> long) stream of characters - can you really use gets?

>> If you knew the file is in some weird multibyte encoding you should
>> have set it for the stream, and read something meaningful.
> Or there should be a facility that preserves you from reading
> incomplete strings. But is it implied that if I set IO.encoding = foo
> the IO objects will prevent me? Will they go out to the provider of
> the io and get the missing remaining bytes? In the case of Unicode the
> absolute, rigorous minimum is to NEVER EVER slice into a codepoint,
> and it can go anywhere you want in terms of complexity (because
> slicing between codepoints is also not the way).

Anyone who wants to set all IO operations to a particular encoding is
making a huge mistake. Individual IO operations or handles could be set
to a particular encoding, but you would have a high probability of
breaking code external to you that did any IO operations if you forced
all IO to use your encodings.

> you can cut a part of a compound character, a modifier codepoint or an
> RTL override remarkably easily, which will just give you a different
> character altogether (or alter your string display in a particularly
> nasty way - that is, _reverse_ your string display for the remaining
> output of your program if you remove an RTL override terminator).

Oh, I understand that very well. At least as well as you do. However,
that is independent of whether IO works on encoded or unencoded values.
It's easy enough to check the validity of your encoding, too. If you're
not checking external input for taintedness, then you're doing silly
things, too. One *cannot* hide too much of the complexity from Unicode,
because to do so will increase the chance that programmers not as smart
as you are will, well, screw the pooch royally.

>> and may at worst get you half of a surrogate character for UTF-16.
>> And strings will have to handle incomplete characters anyway - they
>> may result from some delays/buffering in network IO or such.
> This is exactly why the notion of having strings both as byte buffers
> and character vectors seems a little difficult. 90 percent of my use
> cases for Ruby need characters, not bytes - and I would love to hint
> it specifically shall that be needed. The problem right now is that
> Ruby does not distinguish these at the moment.

Yes, and that's where your opposition to maintaining this is
persistently misguided. Ruby *will* distinguish between a String without
an encoding and a String with an encoding. You're basing your opposition
to tomorrow's behaviour based on today's (known bad) behaviour. Please,
stop doing that.

And while most of your use cases deal with characters, code that I've
written deals with both bytes and characters in equal measures.

-austin
Dmitry Severin (Guest)
on 2006-06-26 17:08
(Received via mailing list)
On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
>
> |- at present time Ruby parser can parse only sources in ASCII compatible
> |encoding.  Would it change?
>
> No.  Ruby would not allow scripts in EBCDIC, nor UTF-16, although it
> allows processing of those encoding.
>
>
And what about the mini-languages incorporated in Ruby: regexp
patterns, sprintf and strftime patterns, etc.?
Regexp syntax uses several metacharacters ( []{}()+-*?.\: ) and Latin
letters, lower- and uppercase.
But there are charsets/encodings which don't have some of them, e.g.:
GB_2312-80 has none of them, JIS_X0201 doesn't have a backslash, and
ebcdic-cp-ar1 doesn't have a backslash, square or curly brackets.
So regexp patterns can't be constructed for these charsets/encodings.
Jim Weirich (weirich)
on 2006-06-26 17:54
I've been following this debate with some interest.  Alas, since my
unicode/m17n experience is quite limited, I don't have a strong opinion
on the matter.

But the following caught my eye:

Austin Ziegler wrote:
> [...] Ruby *will* distinguish between a String without
> an encoding and a String with an encoding. You're basing your opposition
> to tomorrow's behaviour on today's (known bad) behaviour.

Part of the problem is that we are basing our discussions on
descriptions of what will happen in the future, but that makes it
difficult to understand the issues involved without real code.

What I would like to see is prototype implementations of both
approaches, to see the differences in how they affect the code.  I'm
more interested in answering questions like "How do I safely concatenate
strings with potentially different encodings" and "How do I do I/O with
encoded strings" rather than addressing efficiency questions.  In other
words, how do the different approaches affect the way I write code.

I think it would be a great idea to prototype these ideas in real code
to understand the advantages and disadvantages of each.

-- Jim Weirich
Keith Fahlgren (Guest)
on 2006-06-26 18:00
(Received via mailing list)
On Monday 26 June 2006 11:54 am, Jim Weirich wrote:
> I think it would be a great idea to prototype these ideas in real
> code to understand the advantages and disadvantages of each.

+1^2
Logan Capaldo (Guest)
on 2006-06-26 18:49
(Received via mailing list)
On Jun 26, 2006, at 2:13 AM, Joel VanderWerf wrote:

>
It's funny I'm always forgetting you can index by regexp. But this
brings up a good point, this is Ruby, with the new Hash / named
argument syntax we can do:

"It's needlessly cryptic."[byte:2]

This doesn't add anything at all to the conversation, but I think it
looks good, and it's in the "make similar things look similar" vein.

Indexing Strings

s[0]      # The first character
s[/./]    # The first character
s[byte:0] # The first byte (of a string with some non-ascii-compatible encoding)
Austin Ziegler (austin)
on 2006-06-26 19:01
(Received via mailing list)
On 6/26/06, Jim Weirich <jim@weirichhouse.org> wrote:
> descriptions of what will happen in the future, but that makes it
> I think it would be a great idea to prototype these ideas in real code
> to understand the advantages and disadvantages of each.

I mostly agree with you here (about prototyping), Jim. There are a few
things that I think can be done without working code. I often start from
this point in my own programs, anyway. I'll try to address each of your
questions as I understand them. Hopefully, Matz or other participants
will step in and correct me where I'm wrong.

Before I get started, there are two orthogonal divisions here. The first
division is about the internal representation of a String. There is a
camp that very strongly believes that some Unicode encoding is the only
right way to internally represent String data. Sort of like Java's
String without the mistake of char being UCS-2. The other camp strongly
believes that forcing a single universal encoding is a mistake for a
variety of reasons and would rather have an unencoded internal
representation with an interpretive encoding tag available. These two
camps can be referred to as UnicodeString and m17nString. I think that I
can be safely classified as in the m17nString camp -- but there are
caveats to that which I will address in a moment.

The second division is about the suitability of a String as a
ByteVector. Some folks believe that the twain should never meet, others
believe that there's little to meaningfully distinguish them in practice
and that the resulting API would be unnecessarily complex. I can safely
be classified in the latter camp.

There is an open question about the resulting String class about how
well it will work with various arcane features of Unicode such as
combining characters, RTL/LTR marks, etc. and these are good questions.
Ultimately, I believe that the answer is that it should support them as
transparently as possible without (a) hiding *too* much and (b)
compromising support for multiple encodings.

Your first question:

  How do I safely concatenate strings with potentially different
  encodings?

This deals with the first division. Under the UnicodeString camp, you
would *always* be able to safely concatenate strings because they never
have a separate encoding. All incoming data would have to be classified
as binary or character data and the character data would have to be
converted from its incoming code page to the internal representation.

Under the m17nString camp, Matz has promised that compatible encodings
would work transparently. I have gone a little further and suggested
that we have a conversion mechanism similar to #coerce for Number
values. I could then combine text from Win1252 and SJIS to get a
Unicode result. Or, if I knew that my target could *only* handle SJIS, I
would force that to result in an error.
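A rough sketch of what that coercion might look like (every name here
is hypothetical; none of this API exists):

  # find a common encoding both operands convert to losslessly
  def coerce_encodings(a, b)
    return [a, b] if a.encoding == b.encoding
    common = Encoding.common(a.encoding, b.encoding)  # e.g. :utf8
    raise IncompatibleEncoding unless common
    [a.recode(common), b.recode(common)]
  end

  s1, s2 = coerce_encodings(win1252_string, sjis_string)
  result = s1 + s2   # both operands now share the common encoding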

Your second question:

  How do I do I/O with encoded strings?

This also sort of deals with the first, but it also deals with the
second. Note, by the way, that the UnicodeString camp would *require* a
completely separate ByteArray class because you could not then read a
JPEG into a String -- its values would be converted to Unicode
representations, rendering it unusable as a JPEG.

The two class (String/ByteArray) camp would probably require that you
either (1) change all IO operations using a pragma-style setting to
encoded strings, (2) change individual IO operations, (3) use a
separate API, or (4) read a ByteArray and *convert* it to a
UnicodeString. Either way, they seem to want an API where they can say
"read this IO and give me a UnicodeString as output" and conversely
"read this IO and give me a ByteArray as output." (Note: this could
apply whether we have a UnicodeString or an m17nString -- but the
requests have come most often from UnicodeString supporters.)

The one class camp keeps file IO as it is. You can "encourage" a
particular encoding with a variant of #2:

  d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }
  d2 = File.open("file.txt", "rb") { |f|
    f.encoding = :utf8
    f.read
  }

However, whether you use an encoding or not, you still get a String
back. Consider:

  s1 = File.open("file.txt", "rb") { |f| f.read }
  s2 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }

  s1.class == s2.class # true
  s1.encoding == s2.encoding # false

But that doesn't mean I have to keep treating s1 as a raw data byte
array -- or even convert it.

  s1.encoding = :utf8
  s1.encoding == s2.encoding # true

I think that the fundamental difference here is whether you view encoded
strings as fundamentally different objects, or whether you view the
encodings as *lenses* on how to interpret the object data. I prefer the
latter view.

-austin
Austin Ziegler (austin)
on 2006-06-26 19:04
(Received via mailing list)
On 6/26/06, Logan Capaldo <logancapaldo@gmail.com> wrote:
>
> compatible encoding)
I kinda like that.

-austin
Christian Neukirchen (Guest)
on 2006-06-26 19:45
(Received via mailing list)
"Austin Ziegler" <halostatue@gmail.com> writes:

> I would much rather keep the API -- and the class library -- simple. I
> would rather do this:
>
>  st = File.open("file.txt", "rb", :encoding => :utf8) { |f| f.read }
>
> or
>
>  bv = File.open("file.txt", "rb") { |f| f.read }
>  st = bv.to_encoding(:utf8)

Partly off-topic, but important nevertheless: *Then* it's the right
time to drop that damn "rb" by making it default and let the people
stuck in the \r\n-age use :encoding => "win-ansi" or "dos" or whatever.
Austin Ziegler (austin)
on 2006-06-26 19:49
(Received via mailing list)
On 6/26/06, Christian Neukirchen <chneukirchen@gmail.com> wrote:
>
> Partly off-topic, but important nevertheless: *Then* it's the right
> time to drop that damn "rb" by making it default and let the people
> stuck in the \r\n-age use :encoding => "win-ansi" or "dos" or whatever.

Oh, please, yes. I get tired of libraries breaking because people
don't use "rb" and I'm on Windows.

-austin
Michal Suchanek (Guest)
on 2006-06-26 20:39
(Received via mailing list)
On 6/26/06, Austin Ziegler <halostatue@gmail.com> wrote:
> On 6/26/06, Jim Weirich <jim@weirichhouse.org> wrote:

> caveats to that which I will address in a moment.
Note that a fixed-encoding UnicodeString has several caveats:
- you have only one encoding, and while it may be optimal in some
respects it may be suboptimal in others. This leads to a split among
UnicodeString supporters - about which encoding to choose. m17n solves
this neatly by allowing you to choose the encoding per application, at
least.
  - utf-8 - most likely encountered on io (especially network) = fewer
conversions. Space efficient for languages using Latin script
  - utf-16 - sometimes encountered on io (file names on certain
systems). Space efficient for most(?) other languages
  - utf-32 - fast indexing/slicing. Generally easier manipulation (but
only inside the string class)
- you cannot use a non-unicode encoding, or even have both unicode and
non-unicode (with characters outside of unicode) strings, without
changing the interpreter incompatibly

Another subdivision exists among the m17n camp about which strings are
compatible. The behavior in some other languages (which some find
unfortunate) is that strings with different encodings are incompatible
(ie operations on two strings always have to take strings with the
same encoding). In Matz's current proposal the only improvement over
this is allowing 7-bit ascii strings to be added to strings where this
makes sense (ie. to ISO-8859-[12], cp85[02], utf-8).
The other position is to make strings coerce themselves
automatically if a lossless conversion exists (ie cp1251, cp852, and
iso-8859-2 should be the same set of characters ordered differently
iirc, and most character sets can be safely converted to utf-8). I
would count myself in the autoconversion camp.

Yet another subdivision is about the exact meaning of string.encoding
= :utf8. It can either just change the tag, or also check that the
string is indeed a valid utf-8 character sequence. Matz thinks that
without checking, autoconversion would be too unreliable. I think that
checking would be good for debugging or when one wants to be paranoid.
But the ability to turn it off when I think (or find out) that my
application spends lots of time checking needlessly could be handy.
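For what it's worth, the check itself needn't be exotic. Even in
today's Ruby you can write something like:

  # a cheap UTF-8 validity check: unpack('U*') raises on malformed UTF-8
  def valid_utf8?(str)
    str.unpack('U*')
    true
  rescue ArgumentError
    false
  end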

> Ultimately, I believe that the answer is that it should support them as
> have a separate encoding. All incoming data would have to be classified
> as binary or character data and the character data would have to be
> converted from its incoming code page to the internal representation.
>
> Under the m17nString camp, Matz has promised that compatible encodings
> would work transparently. I have gone a little further and suggested
> that we have a conversion mechanism similar to #coerce for Number
> values. I could then combine text from Win1252 and SJIS to get a
> Unicode result. Or, if I knew that my target could *only* handle SJIS, I
> would force that to result in an error.

The answer also depends on what strings are compatible. If most
strings are incompatible, you would convert all strings and other data
structures you get from IO or external libraries to your chosen
encoding, and you will only concatenate strings with the same
encoding.
With autoconversion it will just work most of the time (ie when you
work with string that can be converted to unicode).

Writing to streams that do not support all unicode characters is going
to be a problem most of the time (when you do not work in the output
encoding), unless write attempts the conversion first and only fails
when there are non-convertible characters.

>
> Your second question:
>
>   How do I do I/O with encoded strings?
>
...
> However, whether you use an encoding or not, you still get a String
>
>   s1.encoding = :utf8
>   s1.encoding == s2.encoding # true
>
> I think that the fundamental difference here is whether you view encoded
> strings as fundamentally different objects, or whether you view the
> encodings as *lenses* on how to interpret the object data. I prefer the
> latter view.

If you consider
  s3 = File.open('legacy.txt', 'rb', :iso885915) { |f| f.read }
without autoconversion you would have to immediately do s3.recode :utf8,
otherwise s1 + s3 would not work.

The same for stuff you get from database queries (unless you are sure
you always get the right encoding), text you get from the web, emails,
third party libraries, etc.

Thanks

Michal
Izidor Jerebic (Guest)
on 2006-06-26 21:28
(Received via mailing list)
On 26.6.2006, at 20:37, Michal Suchanek wrote:

>> array -- or even convert it.
>
> If you consider
>   s3 = File.open('legacy.txt', 'rb', :iso885915) { |f| f.read }
> without autoconversion you would have to immediately do s3.recode :utf8,
> otherwise s1 + s3 would not work.

Yes. This shows that if there is no autoconversion, the programmer will
always need to recode to a common app encoding if the application is
to work without problems. And if we always need to recode strings
which we receive from third-party classes/libraries, encoding handling
will either consume half of the program lines or people won't do it
and programs will be full of errors. As can be seen from experience
of other languages (and Ruby), the second option will prevail and we
will be in a mess not much better than today.

Therefore m17n without autoconversion (as is Matz's current proposal)
gains us almost nothing. If we have no autoconversion, my vote goes
to Unicode internal encoding (because it implicitly handles
the autoconversion problems).

On the topic of ByteArray: my concern is that the distinction between
bytes and characters will not be clear, and that therefore we need to
introduce ByteArray to separate bytes from characters, to ensure the
reliability and predictability of code like result = File.open
( "file" ) { |f| f.read 1000 } (now tell me what 'result' is?).

If there will be clear and simple rules, such as "IO always returns
binary strings if not given encoding parameter" then this distinction
will not need to be additionally enforced by separating classes. One
String class will do.

On the other hand, if there is going to be all kinds of automatic
encoding tagging for the convenience of simple-script-writers, then we
need ByteArray to prevent error-prone code with undefined results.

izidor
Austin Ziegler (austin)
on 2006-06-26 21:47
(Received via mailing list)
On 6/26/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> of other languages (and Ruby), the second option will prevail and we
> will be in a mess not much better than today.

I doubt this is in the least bit true. The real problem is that you're
trying to suggest a doomsday scenario based on what currently exists and
on emotion. I'm saying that your cure is far worse than the disease.

> Therefore m17n without autoconversion (as is Matz's current proposal)
> gains us almost nothing. If we have no autoconversion, my vote goes to
> Unicode internal encoding (because it implicitly handles
> the autoconversion problems).

So does the coercion proposal that I've made, without locking ourselves
into Unicode. If I have a thousand files that are Mojikyo-encoded, it
becomes very inefficient for me to work with them in Unicode and far
easier to work with Mojikyo directly.

I couldn't make sense of your last paragraph.

-austin
Izidor Jerebic (Guest)
on 2006-06-26 22:21
(Received via mailing list)
On 26.6.2006, at 21:46, Austin Ziegler wrote:

>> and programs will be full of errors. As can be seen from experience
>> of other languages (and Ruby), the second option will prevail and we
>> will be in a mess not much better than today.
>
> I doubt this is in the least bit true.
> I'm saying that your cure is far worse than the disease.

Basically, I am just advocating getting autoconversion into the
"official" proposal. I am not proposing unicode. But if there is no
autoconversion, unicode is better. This claim is supposed to get
support for autoconversion :-)

BTW, you may have no problems at all. We, on the other hand, have
lots of problems (in Ruby and other languages) which can be traced to
exactly this hope of "all programmers will be doing lots of manual
work to make things safe for others". You are deluded.

In environments which already have this cure (internal unicode),
there are no such enormous problems as we experience in those without
this cure. So the successes and failures I describe are based on real
experience. Unlike your claims, which are just opinions.

I am not saying that unicode encoding is the ideal solution. But it
turned out to be quite a good one, and for sure much better than manual
checking/changing of encodings.

>
>> Therefore m17n without autoconversion (as is Matz's current proposal)
>> gains us almost nothing. If we have no autoconversion, my vote
>> goes to
>> Unicode internal encoding (because it implicitly handles
>> the autoconversion problems).
>
> So does the coercion proposal that I've made, without locking ourselves
> into Unicode.

But that is your proposal (and mine and several others'), not Matz's.
The current "official" proposal will make a mess.

>
> I couldn't make sense of your last paragraph.

Well, tell me what exactly do I get when this code executes:

result = File.open( "file ) { |f| f.read( 1000 ) }

What is 'result'? A binary string under all circumstances? Or maybe
sometimes I get a String and sometimes I get a binary String? Which
one under what circumstances?

This is called error-prone code with undefined results.

We have two equally good options:
1. If we change the API so that IO returns ByteArray, we have no
confusion.
2. If we have clear and simple rules about IO returning Strings, we
also have no confusion.

Therefore, if there is going to be complex auto-magic String tagging
with encoding, I prefer introducing ByteArray, because it will prevent
errors.


izidor
Austin Ziegler (austin)
on 2006-06-26 22:49
(Received via mailing list)
On 6/26/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> On 26.6.2006, at 21:46, Austin Ziegler wrote:
> > I doubt this is in the least bit true.
> > I'm saying that your cure is far worse than disease.
> BTW, you may have no problems at all. We, on the other hand, have
> lots of problems (in Ruby and other languages) which can be traced to
> exactly this hope of "all programmers will be doing lots of manual
> work to make things safe for others". You are deluded.

Um. Not what I'm saying. I want as much clean autoconversion as
possible without being forced into it. But much *more* than that, I
want an API that works reasonably well with all sorts of encodings. I
want String#[] to work equally well with Mojikyo, ASCII, ISO-8859-12,
and UTF-8.

> In environments which already have this cure (internal unicode),
> there are no such enormous problems as we experience in those without
> this cure. So sucessess and failures I describe are based on real
> experience. Unlike your claims, which are just opinions.

No, they're not just opinions. They're experiences that I've had with
real situations where we had a hard time dealing with
autoconversion. Stupid automatic behaviour is worse than manual
behaviour *every time*.

> > I couldn't make sense of your last paragraph.
> Well, tell me what exactly do I get when this code executes:
>
> result = File.open( "file ) { |f| f.read( 1000 ) }

Aside from a syntax error from your missing quote? ;)

This would probably be an unencoded String. If you want an encoded
String, you would specify it on the File object either during
construction or afterwards.

The need for ByteArray is nonexistent.

-austin
Charles O Nutter (Guest)
on 2006-06-26 22:56
(Received via mailing list)
On 6/26/06, Austin Ziegler <halostatue@gmail.com> wrote:
>
> So does the coercion proposal that I've made, without locking ourselves
> into Unicode. If I have a thousand files that are Mojikyo-encoded, it
> becomes very inefficient for me to work with them in Unicode and far
> easier to work with Mojikyo directly.
>

Perhaps this debate should be weighing those encodings that could not
reasonably (or perhaps, easily) be represented in a pure-unicode String
versus those that could. Would it be reasonable to say that if 90% of
Ruby users would never have a pressing need for a non-unicode-encodable
String, then an uber-String that's entirely encoding-agnostic would be
better written as an extension for those special cases? Do we really
need to encumber all of Ruby for the needs of a relative few?
Jim Weirich (weirich)
on 2006-06-26 23:03
Austin Ziegler wrote:
> Um. Not what I'm saying. I want as much clean autoconversion as [...]

Clarification question:  When you say autoconversion, do you mean:

(A) Automatically convert input strings to a given encoding (independent
of the question of a single vs multiple encodings).

(B) When combining strings, autoconvert incompatible encodings into
compatible encodings before combining.

I was thinking you meant (B), but I get the impression that Austin is
replying to (A) (since Austin's coerce suggestion sounds a lot like
(B)).

Thanks.

-- Jim Weirich
Izidor Jerebic (Guest)
on 2006-06-26 23:16
(Received via mailing list)
On 26.6.2006, at 22:46, Austin Ziegler wrote:

> This would probably be an unencoded String. If you want an encoded
> String, you would specify it on the File object either during
> construction or afterwards.

This seems too good to be true :-)

How will e.g. the Japanese (or we non-English Europeans), who now use
the default $KCODE, write their Ruby scripts? Will we need to specify
the encoding in every script for every IO? This can get cumbersome very
fast. Not really Ruby style.

But if there is going to be some default encoding, it will interfere
with said rules about return values. And that may cause errors when I
run a script meant for some other default encoding.

This problem makes me think that the rules won't be as simple as
described now (actually, Matz said that this detail is not fixed yet).

We'll see. I have just voiced my concerns about the separation between
bytes and characters. Must wait for the master to present a solution
(and hope he considers these problems)...


izidor
Jim Weirich (weirich)
on 2006-06-26 23:20
Thanks for the response, Austin.  It seemed to help clarify the issues
(at least for me).

Austin Ziegler wrote:
>   d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }

Question:  Does the encoding parameter specify the encoding of the file,
or the encoding of the strings you get back (my guess is both)?

Related question: In environments that use a lot of different encodings,
are there ways or conventions for specifying the encoding, or do you
just have to "know"?

>   s1.encoding = :utf8

Another Question:  When you set the encoding, are you:

(A) Just changing the encoding specifier without changing the
underlying string.

(B) Re-encoding the string according to the new encoding specifier.

(B) seems to be implied by the attribute notation, but that seems a bit
dangerous in my mind.

Thanks.

-- Jim Weirich
Izidor Jerebic (Guest)
on 2006-06-26 23:22
(Received via mailing list)
On 26.6.2006, at 23:04, Jim Weirich wrote:

> Clarification question:  When you say autoconversion, do you mean:
>
> (A) Automatically convert input strings to a given encoding
> (independent
> of the question of a single vs multiple encodings).
>
> (B) When combining strings, autoconvert incompatible encodings into
> compatible encodings before combining.

Autoconversion (as suggested by many people in this thread) is meant
to convert a string in a *compatible but different* encoding to the
encoding of the other string (or to a common compatible superset
encoding), to facilitate the operation using those two strings.
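I.e. something like this (a sketch; with_encoding as used earlier in
the thread, and the resulting encoding chosen automatically):

  a = "Grüß dich".with_encoding(:iso88591)
  b = "Привет".with_encoding(:koi8r)
  c = a + b          # converted behind the scenes
  c.encoding         # => :utf8, a common compatible superset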

Point A is the great can of worms and source of errors, which I
suggested can be avoided by either:
1. Very simple and strict rules on the String encoding of return values
2. Introduction of ByteArray as the return value


izidor
Izidor Jerebic (Guest)
on 2006-06-26 23:25
(Received via mailing list)
On 26.6.2006, at 22:55, Charles O Nutter wrote:

> reasonably (or perhaps, easily) be represented in a pure-unicode String
> versus those that could. Would it be reasonable to say that if 90% of
> Ruby users would never have a pressing need for a non-unicode-encodable
> String, then an uber-String that's entirely encoding-agnostic would be
> better written as an extension for those special cases?

Ahem, no.
100% of Ruby language creators say that they need something better
than Unicode :-)

And if we get both unicode and other stuff, there is no point in
discussing it, no?

Provided we get autoconversion, of course.


izidor
Charles O Nutter (Guest)
on 2006-06-26 23:44
(Received via mailing list)
On 6/26/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
>
> Ahem, no.
> 100% of Ruby lanuage creators say that they need something better
> than Unicode :-)
>
> And if we get both unicode and other stuff, there is no point in
> discussing it, no?
>
> Provided we get autoconversion, of course.
>

All due respect to matz and company and the wondrous thing they have
wrought, but *nobody* is perfect. Accepting a decision blindly based on
who is making it is a recipe for trouble. My only concern is that while
the proposed m17n implementation may make Ruby more perfect and more
ideal for at least one person, it may (emphasis on 'may') make it
harder for many thousands of others. Does that make sense? I'm sure
there will be those who argue that Ruby is matz's creation and matz's
creation alone, but there's a lot of people with a vested interest in
"the Ruby way". A little critical analysis of the "benevolent
dictator's" decisions is always prudent.

If we get unicode and it's a lot harder than people like, or if it
causes unpleasant compatibility, portability, or interoperability
issues, then we're no better off.

Hey, the uber-string m17n impl might be the most amazing, remarkable
thing ever to come along. It just seems based on a lot of anecdotal
evidence that this approach is very complex and very dangerous, and
arguably has never been done right yet. matz and company are amazing
hackers, but is it a good risk to take? Is it worth it for 10% of Ruby
users or less?

And again, I mean no disrespect by questioning the Ruby elders. It's
just my
way.
Austin Ziegler (austin)
on 2006-06-26 23:47
(Received via mailing list)
On 6/26/06, Charles O Nutter <headius@headius.com> wrote:
> written as an extension for those special cases? Do we really need to
> encumber all of Ruby for the needs of a relative few?

I do not believe that this is a viable argument for "killing". At
best, this is an argument for making sure that Unicode support *rocks*
in Ruby. It doesn't mean we need to make those "special" cases harder
than they need to be.

-austin
Tim Bray (Guest)
on 2006-06-26 23:48
(Received via mailing list)
On Jun 26, 2006, at 2:15 PM, Izidor Jerebic wrote:

> How will e.g. the Japanese (or we non-English Europeans), who now use
> the default $KCODE, write their Ruby scripts? Will we need to specify
> the encoding in every script for every IO? This can get cumbersome very
> fast. Not really Ruby style.

I think that anyone, living in any country, working in any language,
who counts on one global variable to specify the encoding of any file
they might want to read, will very soon have lots of nasty
surprises.   Ten years ago, you could do this; no longer.  -Tim
Austin Ziegler (austin)
on 2006-06-26 23:54
(Received via mailing list)
On 6/26/06, Jim Weirich <jim@weirichhouse.org> wrote:
> Thanks for the response, Austin.  It seemed to help clarify the issues
> (at least for me).
>
> Austin Ziegler wrote:
> >   d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }
> Question:  Does the encoding parameter specify the encoding of the file,
> or the encoding of the strings you get back (my guess is both).

I would assume both, based on what I've seen from Matz.

> Related question: In environments that use a lot of different encodings,
> are there ways or conventions for specifying the encoding, or do you
> just have to "know".

In my experience, you just have to "know" unless you can do some
detection of the encoding. I think that only UTF-16 or UTF-32 is
really amenable to this ;) This is one of the problems that I've seen
with the encoding work that I've done. If I'm reading a list of files
from a NetWare server, what encoding is the data in? I don't
necessarily have a Unicode interface -- and my code page may not match
the server's code page. *Whenever* you're dealing with legacy data,
you have to "agree" or guess and hope you're right.
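About the only reliable self-detection is BOM sniffing, e.g.:

  # order matters: the UTF-32LE BOM starts with the UTF-16LE BOM
  def sniff_bom(bytes)
    case bytes
    when /\A\xFF\xFE\x00\x00/n then :utf32le
    when /\A\x00\x00\xFE\xFF/n then :utf32be
    when /\A\xFF\xFE/n         then :utf16le
    when /\A\xFE\xFF/n         then :utf16be
    when /\A\xEF\xBB\xBF/n     then :utf8
    end
  end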

>>   s1.encoding = :utf8
> Another Question:  When you set the encoding, are you:
>
> (A) Just changing the encoding specifier without changing the
> underlying string.
> (B) Re-encoding the string according to the new encoding specifier.

> (B) seems to be implied by the attribute notation, but that seems a bit
> dangerous in my mind.

I personally consider it to be (A) because I believe that encoding is
a lens. If you want (B) it should be s1.recode(:utf8). But #recode
would not work on an encoding of "binary" (or "raw"); #recode would be
similar to the Iconv steps you would use today.
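I.e. roughly what you have to spell out by hand today (String#recode is
hypothetical; Iconv is in the stdlib):

  require 'iconv'

  # what a hypothetical latin1_string.recode(:utf8) would boil down to:
  utf8_string = Iconv.iconv('UTF-8', 'ISO-8859-1', latin1_string).first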

-austin
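
For a concrete picture of the (A)/(B) split, this is roughly how it
surfaces in the String API that eventually shipped in 1.9 (a sketch
for illustration, not part of the proposal under discussion):

  s = "r\xC3\xA9sum\xC3\xA9".force_encoding("ASCII-8BIT")  # raw bytes
  s.length                   # => 8 (counted in bytes)

  # (A) relabel: swap the lens, bytes untouched
  s.force_encoding("UTF-8")
  s.length                   # => 6 (now counted in characters)
  s.bytesize                 # => 8 (same bytes)

  # (B) transcode: rewrite the bytes, like the Iconv step today
  t = s.encode("ISO-8859-1")
  t.bytesize                 # => 6 (each é becomes one byte)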
2ffac40f8a985a2b2749244b8a1c4161?d=identicon&s=25 Mike Stok (Guest)
on 2006-06-27 01:52
(Received via mailing list)
On 26-Jun-06, at 1:03 PM, Austin Ziegler wrote:

>> argument syntax we can do:
>> s[byte:0] # The first byte (of a string with some non ascii
>> compatible encoding)
>
> I kinda like that.

Presumably this is general arm waving, because s[/./] need not return
the first character of a non-empty string, unless you mean s[/./m] or
some uglier alternative

ratdog:~ mike$ irb --simple-prompt
>> "\nx"[/./]
=> "x"
>> "\nx"[/./m]
=> "\n"

Mike

--

Mike Stok <mike@stok.ca>
http://www.stok.ca/~mike/

The "`Stok' disclaimers" apply.
Bd0203dc8478deb969d72f52e741bd4f?d=identicon&s=25 Daniel Baird (Guest)
on 2006-06-27 01:55
(Received via mailing list)
On 6/27/06, Austin Ziegler <halostatue@gmail.com> wrote:
> construction or afterwards.
>
> The need for ByteArray is nonexistent.


...or, to put that another way, when you see "unencoded String", feel
free to say "ByteArray" in your head.

;D
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-27 03:59
(Received via mailing list)
On 6/26/06, Mike Stok <mike@stok.ca> wrote:
> >> s[0]      # The first character
> >> s[/./]    # The first character
> >> s[byte:0] # The first byte (of a string with some non ascii
> >> compatible encoding)
> > I kinda like that.
> Presumably this is general arm waving, because s[/./] need not return
> the first character of a non-empty string, unless you mean s[/./m] or
> some uglier alternative

I'm referring to s[byte: 0]. It's elegant.

-austin
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-27 03:59
(Received via mailing list)
On 6/26/06, Daniel Baird <danielbaird@gmail.com> wrote:
> > The need for ByteArray is nonexistent.
> ..or, to put that another way, when you see "unencoded String", feel free to
> say "ByteArray" in your head.

There's a point where you're right. But there's a point where you're
wrong. My point is simply that we don't need a separate class for
this, because character encodings are *ways* of interpreting a vector
of bytes.

-austin
E7559e558ececa67c40f452483b9ac8c?d=identicon&s=25 unknown (Guest)
on 2006-06-27 04:18
(Received via mailing list)
On Jun 26, 2006, at 9:57 PM, Austin Ziegler wrote:
> I'm referring to s[byte: 0]. It's elegant.

It seems a bit weighty.  It requires the allocation of a Hash simply
to index a byte vector.

   s.byte(0)

seems just as readable without the overhead.

Gary Wright
E0526a6bf302e77598ef142d91bdd31c?d=identicon&s=25 Daniel DeLorme (Guest)
on 2006-06-27 04:21
(Received via mailing list)
Charles O Nutter wrote:
> Hey, the uber-string m17n impl might be the most amazing, remarkable thing
> ever to come along. It just seems based on a lot of anecdotal evidence that
> this approach is very complex and very dangerous, and arguably has never
> been done right yet. matz and company are amazing hackers, but is it a good
> risk to take? Is it worth it for 10% of Ruby users or less?

I'd like to point out that MySQL has m17n strings, and it rocks.

Daniel
E34b5cae57e0dd170114dba444e37852?d=identicon&s=25 Logan Capaldo (Guest)
on 2006-06-27 06:05
(Received via mailing list)
On Jun 26, 2006, at 10:16 PM, gwtmp01@mac.com wrote:

>
> Gary Wright
>

**Must defend random syntax that I invented ;-)**

It only has to allocate a hash depending on the named-argument interface.

e.g.
# not real ruby syntax, afaik

def [](char_index = nil, byte: nil)
   ...
end
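
(Keyword arguments of more or less this shape did become real Ruby
syntax in 2.0. A runnable sketch of the idea, with a hypothetical #at
accessor standing in for the redefined []:)

  class MyString < String
    # hypothetical accessor in the spirit of the suggestion above
    def at(char_index = nil, byte: nil)
      byte ? getbyte(byte) : self[char_index]
    end
  end

  MyString.new("abc").at(0)        # => "a"
  MyString.new("abc").at(byte: 0)  # => 97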
87fe25bf0272d8ad886dda793bdcbbd9?d=identicon&s=25 Tim Bray (Guest)
on 2006-06-27 08:32
(Received via mailing list)
On Jun 26, 2006, at 7:20 PM, Daniel DeLorme wrote:

> I'd like to point out that MySQL has m17n strings, and it rocks.

I am often unable to get Unicode strings from Perl into MySQL and
back out without breaking them.  Haven't tried the Ruby/MySQL combo;
does it work better? -Tim
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-27 09:46
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Tue, 27 Jun 2006 06:52:14 +0900, "Austin Ziegler"
<halostatue@gmail.com> writes:

|> Austin Ziegler wrote:
|> >   d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }
|> Question:  Does the encoding parameter specify the encoding of the file,
|> or the encoding of the strings you get back (my guess is both).
|
|I would assume both, based on what I've seen from Matz.

I think so.

|> Another Question:  When you set the encoding, are you:
|>
|> (A) Just changing the encoding specifier without changing the
|> underlaying string.
|> (B) Re-encoding the string according to the new encoding specifier.
|
|> (B) seems to be implied by the attribute notation, but that seems a bit
|> dangerous in my mind.
|
|I personally consider it to be (A) because I believe that encoding is
|a lens. If you want (B) it should be s1.recode(:utf8). But #recode
|would not work on an encoding of "binary" (or "raw"); #recode would be
|similar to the Iconv steps you would use today.

str.encoding="ascii" would cause (A).

							matz.
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-27 10:05
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Tue, 27 Jun 2006 00:05:22 +0900, "Dmitry Severin"
<dmitry.severin@gmail.com> writes:

|And what about minilanguages, incorporated in Ruby: regexp patterns,
|sprintf, strftime patterns etc.?

Good point.  Currently they don't support non-ASCII-compatible
encodings (including UTF-16 and UTF-32, but this is not a fundamental
restriction).

							matz.
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-27 10:24
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Tue, 27 Jun 2006 06:43:30 +0900, "Charles O Nutter"
<headius@headius.com> writes:
|All due respect to matz and company and the wondrous thing they have wrought,
|but *nobody* is perfect. Accepting a decision blindly based on who is making
|it is a recipe for trouble. My only concern is that while the proposed m17n
|implementation may make Ruby more perfect and more ideal for at least one
|person, it may (emphasis on 'may') make it harder for many thousands of
|others. Does that make sense? I'm sure there will be those who argue that
|Ruby is matz's creation and matz's creation alone, but there's a lot of
|people with a vested interest in "the Ruby way". A little critical analysis
|of the "benevolent dictator's" decisions is always prudent.

Good point.

|If we get unicode and it's a lot harder than people like, or if it causes
|unpleasant compatibility, portability, or interoperability issues, then
|we're no better off.
|
|Hey, the uber-string m17n impl might be the most amazing, remarkable thing
|ever to come along. It just seems based on a lot of anecdotal evidence that
|this approach is very complex and very dangerous, and arguably has never
|been done right yet. matz and company are amazing hackers, but is it a good
|risk to take? Is it worth it for 10% of Ruby users or less?

But unfortunately, the implementer is living among those "10% or
less".  So it's a risk already taken, choosing a language designed by
such a person. ;-)

Anyway, please give me a chance to be proven wrong (or right).
I will try not to make lives of thousands of others hard.

							matz.
93d566cc26b230c553c197c4cd8ac6e4?d=identicon&s=25 Pit Capitain (Guest)
on 2006-06-27 10:27
(Received via mailing list)
Charles O Nutter schrieb:
> (...)
> Hey, the uber-string m17n impl might be the most amazing, remarkable
> thing ever to come along. It just seems based on a lot of anecdotal
> evidence that this approach is very complex and very dangerous, and
> arguably has never been done right yet. matz and company are amazing
> hackers, but is it a good risk to take? Is it worth it for 10% of
> Ruby users or less?
> (...)

Charles, could it be that "the uber-string m17n implementation" would
make your life as JRuby implementer a lot harder? ;->

Regards,
Pit
F889bf17449ffbf62345d2b2d316a937?d=identicon&s=25 Michal Suchanek (Guest)
on 2006-06-27 15:30
(Received via mailing list)
On 6/26/06, Charles O Nutter <headius@headius.com> wrote:
> versus those that could. Would it be reasonable to say that if 90% of Ruby
> users would never have a pressing need for a non-unicode-encodable String,
> then an uber-String that's entirely encoding-agnostic would be better
> written as an extension for those special cases? Do we really need to
> encumber all of Ruby for the needs of a relative few?

It's been asked already.

Again: How does the possibility to store non-unicode characters in
strings encumber you?

Michal
F1d37642fdaa1662ff46e4c65731e9ab?d=identicon&s=25 Charles O Nutter (Guest)
on 2006-06-27 16:31
(Received via mailing list)
It won't matter much either way to JRuby, since Java's going to
internalize all strings as UTF-16 anyway. Those encodings that can't
be represented in unicode simply won't work, since that's just a
platform limitation we'll probably live with. There's always the
option of building our own uber-string based on what matz creates
(porting to Java wouldn't be impossible, or perhaps even difficult)
but we'll cross that bridge when we come to it.

I'm just trying to play both sides of the fence here since there seem
to be a number of people opposed to or doubtful of the m17n uberstring.
As a Ruby platform implementer of a sort, I'd like to make sure those
concerns are considered.
F1d37642fdaa1662ff46e4c65731e9ab?d=identicon&s=25 Charles O Nutter (Guest)
on 2006-06-27 16:34
(Received via mailing list)
On 6/27/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
>
> But unfortunately, the implementer is living among those "10% or
> less".  So it's a risk already taken, choosing a language designed by
> such a person. ;-)


That's certainly fair to say, and I'm optimistic that whatever the best
decision is, you'll make it right. It's especially heartening that you
are an active participant in this debate; I know certain other language
designers who are less open to comment and criticism.

> Anyway, please give me a chance to be proven wrong (or right).
> I will try not to make lives of thousands of others hard.
>
>                                                         matz.


It seems you're giving yourself the chance to be proven wrong already.
I'll just watch that process as it moves forward and do what I can to
mix things up.
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-27 16:54
(Received via mailing list)
On 6/27/06, Michal Suchanek <hramrach@centrum.cz> wrote:
>> non-unicode-encodable String, then an uber-String that's entirely
>> encoding-agnostic would be better written as an extension for those
>> special cases? Do we really need to encumber all of Ruby for the
>> needs of a relative few?
> Its' been asked already.
>
> Again: How does the possibility to store non-unicode characters in
> strings encumber you?

To be fair to Charles, he would benefit immensely from a Unicode
internal representation because he could then simply *and cleanly* use
Java Strings as Ruby Strings in JRuby.

With an m17n String, he will need to have something else that isn't
compatible with Java Strings, which hurts JRuby's use as a Java glue
language. I think that there are ways around this. Maybe make the JRuby
String class have an internal something like:

  // a sketch; ByteVector is a hypothetical byte-holding type
  class JRubyString {
      private java.lang.String unicode;   // used when the encoding is Unicode
      private ByteVector       m17n;      // raw bytes otherwise
      private java.lang.String encoding;
      private boolean          isUnicode;
  }

That way, if it's a Unicode encoding -- regardless of what's desired --
he could use the unicode member; otherwise internally he uses the
ByteVector. (Strictly speaking, for non-"raw" or "binary" encodings, he
could always use the unicode member and convert as necessary.)

-austin
F889bf17449ffbf62345d2b2d316a937?d=identicon&s=25 Michal Suchanek (Guest)
on 2006-06-27 17:23
(Received via mailing list)
On 6/27/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> But unfortunately, the implementer is living among those "10% or
> less".  So it's a risk already taken, choosing a language designed by
> such a person. ;-)

That also means that the implementor has a much better understanding of
internationalization issues than those who live in the US ;-)

This should give us at least a sound base string class. And since the
class is open in Ruby, automatic this or that can be added.

Thanks

Michal
F1d37642fdaa1662ff46e4c65731e9ab?d=identicon&s=25 Charles O Nutter (Guest)
on 2006-06-27 18:28
(Received via mailing list)
On 6/27/06, Austin Ziegler <halostatue@gmail.com> wrote:
>       private java.lang.String encoding;
>       private boolean          isUnicode;
>   }
>

This would certainly be an option once matz has solved all the hard
problems of an encoding-free String. Some minimal testing of a
byte[]-based UTF-8 Java String replacement has shown that there are
very few general performance issues arising from reimplementing string
with a different data structure (a testament to Java's JIT, since most
Java code runs faster without native bits). When there's something
concrete in the m17n plan, we shouldn't have much difficulty
supporting it. We could also run with pure unicode internally as well,
for folks who didn't need any unicode-incompatible encodings. Without
the m17n code ready for general consumption, it's hard to say what
path will be best.

The other advantage of a byte[] or ByteVector-based JRuby string is for
IO; currently we use Java's StringBuffer for handling mutable string
operations. This works well, but StringBuffer maintains a char[]
internally, so for every byte of IO we waste a byte. We're considering
various options to improve that, and the end result may be closer to
the UberString than to Java's own.

So yes, there's some ulterior motive in my support for pure Unicode and
ByteArray, but any path taken will be implementable in JRuby. However,
I support those because I feel they simplify rather than complicate,
and not because they might be easier to implement in Java.
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-27 19:21
(Received via mailing list)
On 6/27/06, Charles O Nutter <headius@headius.com> wrote:
> So yes, there's some ulterior motive in my support for pure Unicode and
> ByteArray, but any path taken will be implementable in JRuby. However, I
> support those because I feel they simplify rather than complicate, and not
> because they might be easier to implement in Java.

IME, more classes complicates. Sometimes the complexity is necessary
because it is simpler than the alternative, but I don't believe that
this is the case here. As I said, most of my opposition is based on
(1) stupid statically typed languages and (2) an inability to tell
Ruby what type you want back from a method call (this is a good thing,
because it in part prevents #1 ;).

-austin
C5be24289f1471f3da84864a6677af12?d=identicon&s=25 Garance A Drosehn (Guest)
on 2006-06-27 23:56
(Received via mailing list)
On 6/26/06, Daniel DeLorme <dan-ml@dan42.com> wrote:
>
> It's funny, maybe I'm just dumb but I can't think of a single *real-world*
> example where you'd want to access particular characters of a string.

If that is the case, then why doesn't Ruby remove *all* substring
notation?  If everyone is so comfortable with manipulating strings
via regexps, then why does the language bother to support
my_str[a..b], my_str[a...b], and my_str[a,b]?

I don't mean to sound all-worked-up over this, but it does seem
hard to believe that those method calls for String are never used
in real-world code.
Cf6d0868b2b4c69bac3e6f265a32b6a7?d=identicon&s=25 Daniel Martin (Guest)
on 2006-06-28 03:00
(Received via mailing list)
Daniel DeLorme <dan-ml@dan42.com> writes:

> It's funny, maybe I'm just dumb but I can't think of a single
> *real-world* example where you'd want to access particular characters
> of a string.

I'll point you at my solution to ruby quiz #83: (short but unique)

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/...

How would you write the method string_similarity without access to
each character?  (This method computes the length of the longest
common substring)

How would you compute the Levenshtein distance (edit distance) between
two strings without access to each character?

How would you pull strings out of a file with fixed-width fields?
With regular expressions?  Really?  What if you had a hundred fields?
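
To make the character-access point concrete, here is a minimal
Levenshtein sketch in plain Ruby (with $KCODE set to 'u', split(//)
yields characters rather than bytes on 1.8):

  def levenshtein(a, b)
    a, b = a.split(//), b.split(//)
    d = (0..b.size).to_a           # first row of the DP table
    a.each_with_index do |ca, i|
      prev, d[0] = d[0], i + 1     # prev holds the old diagonal value
      b.each_with_index do |cb, j|
        prev, d[j + 1] = d[j + 1],
          [d[j + 1] + 1,                        # deletion
           d[j] + 1,                            # insertion
           prev + (ca == cb ? 0 : 1)].min       # substitution
      end
    end
    d.last
  end

  levenshtein("kitten", "sitting")  # => 3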
E0526a6bf302e77598ef142d91bdd31c?d=identicon&s=25 Daniel DeLorme (Guest)
on 2006-06-28 03:28
(Received via mailing list)
Tim Bray wrote:
> On Jun 26, 2006, at 7:20 PM, Daniel DeLorme wrote:
>
>> I'd like to point out that MySQL has m17n strings, and it rocks.
>
> I am often unable to get Unicode strings from Perl into MySQL and back
> out without breaking them.  Haven't tried the Ruby/MySQL combo; does it
> work better? -Tim

I've never had any problems. You just have to make sure the client
correctly tells the server what encoding it is using. The only
annoyance is that MySQL will silently change inconvertible characters
to '?', but that's part of the MySQL design philosophy rather than
inherent to m17n strings.

Daniel
E0526a6bf302e77598ef142d91bdd31c?d=identicon&s=25 Daniel DeLorme (Guest)
on 2006-06-28 04:03
(Received via mailing list)
Daniel Martin wrote:
> I'll point you at my solution to ruby quiz #83: (short but unique)
>
> http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/...
>
> How would you write the method string_similarity without access to
> each character?  (This method computes the length of the longest
> common substring)
>
> How would you compute the Levenshtein distance (edit distance) between
> two strings without access to each character?

I'll grant that I don't have enough imagination and that there *are*
cases where you want character access. But it seems to me that the
main use case is for something like this:
    str = "cogito <b>ergo</b> sum"
    i = str.index("<b>") + 3
    j = str.index("</b>",i)
    str[i...j]
    => "ergo"
and for that common case, regexes are far more appropriate:
    str.match(/<b>(.*?)<\/b>/)[1]
    => "ergo"

Advocating regexes-only for character manipulation is certainly
extreme. I'm just saying that byte access and character access need to
have different semantics. If you look at the current ruby String API,
bytes are accessed through integer positions and characters are
accessed through regexes. The byte and char APIs are quite distinct,
it's just that everybody is using the byte API and expecting to get
characters as a result.

From what I understand (and please correct me if I'm wrong), ruby2 will
fix that by changing the api so that integer positions represent
characters instead of bytes. For binary strings, those two concepts
map to the same reality so it won't be such a backward-incompatible
change. I just wonder what will be the behavior of str[0]. Will it
return a 0..255 integer in the case of a binary string and a
1-character string in the case of an encoding-set string? Now *that*
would be an API nightmare.
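
(As it happens, 1.9 eventually sidestepped exactly this: str[0] returns
a one-character string for every encoding, binary included, and the
byte-as-integer accessor moved to its own method:

  "abc".force_encoding("BINARY")[0]  # => "a" (a 1-byte string, not 97)
  "abc".getbyte(0)                   # => 97

so the nightmare scenario did not ship.)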

> How would you pull strings out of a file with fixed-width fields?
> With regular expressions?  Really?  What if you had a hundred fields?

Hmm, fixed-width records and fields were created for the purpose of
fast access to data, i.e. seek to position recnum*reclength and
extract reclength bytes; they only make sense in the case of
single-byte characters. So this is more a case of byte access.


Daniel
2a321daf565791ad30ac5ee945abf59a?d=identicon&s=25 Izidor Jerebic (Guest)
on 2006-06-28 07:44
(Received via mailing list)
On 27.6.2006, at 19:19, Austin Ziegler wrote:

> As I said, most of my opposition is based on
> (1) stupid statically typed languages and (2) an inability to tell
> Ruby what type you want back from a method call (this is a good thing,
> because it in part prevents #1 ;).

First, "most of my opposition" is not useful in discussion and is a
straw-man, because we are not counting people here, we try to
evaluate reasons for and against. One person with good reason should
overcome 1000 not-so-good posts. This is not about winning the
argument, it's about having the best solution.

About (2), inability to tell in advance in your program whether you
get bytes or characters from a method in core (or any other) API is
NOT a good thing. This causes innumerable problems and unexpected
behaviour if programmer expects one and code sometimes gets the
other. The API should prevent such errors, either by very simple and
strict rules that enable easy prediction, or by introducing
ByteArray, which makes prediction trivial. This is not about
duck-typing, it's about randomly having semantically different results.

Since the rules are not fixed yet, nobody can say whether one or the
other solution is better. But if the API is not very clear or
requires lots of manual specifying in code, we will be in a mess,
similar to today.


izidor
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-28 18:50
(Received via mailing list)
On 6/28/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> On 27.6.2006, at 19:19, Austin Ziegler wrote:
> > As I said, most of my opposition is based on
> > (1) stupid statically typed languages and (2) an inability to tell
> > Ruby what type you want back from a method call (this is a good thing,
> > because it in part prevents #1 ;).
> First, "most of my opposition" is not useful in discussion and is a
> straw-man, because we are not counting people here, we try to
> evaluate reasons for and against. One person with good reason should
> overcome 1000 not-so-good posts. This is not about winning the
> argument, it's about having the best solution.

You have misread my English. I am not referring to people who oppose
my position; I am referring to my opposition to a separate ByteArray
class. However, I have yet to see even a mediocre reason for a
separate ByteArray.

> About (2), inability to tell in advance in your program whether you
> get bytes or characters from a method in core (or any other) API is
> NOT a good thing. This causes innumerable problems and unexpected
> behaviour if programmer expects one and code sometimes gets the
> other. The API should prevent such errors, either by very simple and
> strict rules that enable easy prediction, or by introducing
> ByteArray, which makes prediction trivial. This is not about duck-
> typing, it's about randomly having semantically different results.

You'll *never* get that without type hinting. And type hinting for
return types would be as bad as anything else for Ruby. Consider this
copy function:

  def copy_file(inf, outf)
    open(inf, "rb") { |fin|
      File.open(outf, "wb") { |fout|
        fout.write fin.read
      }
    }
  end

Why didn't I use File.open? Because I can now do this:

  require 'open-uri'
  copy_file("http://www.ruby-lang.org/en", "ruby-lang-en.html")

I didn't get a "File" object from Kernel#open; I got (in this case) a
Tempfile.

> Since the rules are not fixed yet, nobody can say whether one or the
> other solution is better. But if the API is not very clear or
> requires lots of manual specifying in code, we will be in a mess,
> similar to today.

Quite simply, you're either wrong or you don't understand the
parameters of the problem. I'd rather assume the latter.

However, if you want to ensure a particular class is returned from a
Ruby method, you must have a method which guarantees that it will only
return that class (or nil, perhaps). Therefore, with a separate
ByteArray class, we would *of necessity* see parallel File operations
or a separate IO class hierarchy or (worst of all!) constructors which
tell the File to return String or ByteArray depending on how it was
constructed.

There is *no possible good argument* for separating ByteArray from
String in Ruby. Not with what it would do to the rest of the API, and
I don't think that anyone who wants a ByteArray is thinking beyond
String issues.

-austin
C914fa463a2b1b067586c6432b12a824?d=identicon&s=25 Juergen Strobel (Guest)
on 2006-06-28 19:48
(Received via mailing list)
On Mon, Jun 26, 2006 at 11:21:59AM +0900, Yukihiro Matsumoto wrote:
> |string = File.open('file.txt', 'r') {|f| f.read.to_s(:"utf-8")}
>
> 							matz.

Any additional complexity here should be offset later, when doing
operations on the read data as appropriate for its type.

Of course, the first line should raise an exception if file.txt is not
utf8 encoded; this saves extra complexity down the line, and is a real
difference between the two. I imagine ByteVector would be implemented
with maximum performance and space efficiency in mind, while String is
a higher-level class streamlined for ease of use.

There could be an accessor for the ByteVector to convert it (or parts
of it) to a String, for cases where you really need to read mixed
string/data from somewhere.

string = bytes.to_str(:utf8)
string2 = bytes[1..5].to_str(:utf8)

Or maybe a StrStream-like interface:

bytes.stream_open("r") do |b|
  s = b.read(:utf8)
  ...
end

-Jürgen
2a321daf565791ad30ac5ee945abf59a?d=identicon&s=25 Izidor Jerebic (Guest)
on 2006-06-28 20:42
(Received via mailing list)
On 28.6.2006, at 18:48, Austin Ziegler wrote:

>> About (2), inability to tell in advance in your program whether you
>> get bytes or characters from a method in core (or any other) API is
>> NOT a good thing. This causes innumerable problems and unexpected
>> behaviour if programmer expects one and code sometimes gets the
>> other. The API should prevent such errors, either by very simple and
>> strict rules that enable easy prediction, or by introducing
>> ByteArray, which makes prediction trivial. This is not about
>> duck-typing, it's about randomly having semantically different results.
>
> You'll *never* get that without type hinting.

I think you do not understand what the problem is, because your claim
is so obviously false.

How can I get that? With a very simple rule: all IO#read (and similar)
calls always return binary Strings.

No type hinting in sight, but I always know whether my code receives
Strings or binary Strings. But this simple option is clearly not
possible, because it complicates the text processing in simple
scripts. We'll see how complicated the final rules will be.

An alternative (actually equivalent) to the above is: all IO#readbytes
calls return ByteArray objects, and we need a separate call,
IO#readstring, which always returns Strings with encoding.


izidor
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-28 20:45
(Received via mailing list)
On 6/28/06, Juergen Strobel <strobel@secure.at> wrote:
> Any additional complexity here should be offset later, when doing
> operations on the read data as appropriate for its type.

It won't be. All of the complexity of the m17n String will be inside of
the String, not exposed (by default) to the user. Stop thinking of the
encoding of a String as something that makes the String a unique object;
instead it is a lens that gives meaning to the bytes of the String.

> Of course, the first line should raise an exception if file.txt is not
> utf8 encoded,

The internal format of String is not going to be Unicode by default.
Matz has already said that. I happen to agree with him.

> this saves extra complexity down the line, and is a real difference
> between the two. I imagine Bytevector would be implemented with
> maximum performance and space efficiency in mind, while String is a
> higher level class streamlined for easy of use.

These two items are not mutually exclusive. Think a little more about
humane design and you'll see that two wholly separate classes require
a lot more than what you're assuming and would end up in programmers
making even dumber assumptions than they do today, because they'd
think they're "protected" during IO because they're getting a String.
This is not a safe assumption. Ever.

The separate byte vector class is needlessly complex and solves exactly
nothing that isn't already solved in a better way.

-austin
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-28 20:53
(Received via mailing list)
On 6/28/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> I think you do not understand what the problem is, because your claim
> is so obviously false.

Oh, bollocks. Go ahead, pull the other one.

> IO#readstring, which always returns Strings with encoding.
Nope. Not nearly equivalent and a lot dumber. I've just spent the last
week explaining in simple terms why it's dumb. You want to *at least*
double the complexity of the IO API because you're either unwilling or
incapable of considering anything but your ByteArray concept.

I, for one, am not willing to consider an extensively more complex API
because your imagination is lacking.

-austin
2a321daf565791ad30ac5ee945abf59a?d=identicon&s=25 Izidor Jerebic (Guest)
on 2006-06-28 20:56
(Received via mailing list)
On 28.6.2006, at 18:48, Austin Ziegler wrote:

> There is *no possible good argument* for separating ByteArray from
> String in Ruby. Not with what it would do to the rest of the API, and
> I don't think that anyone who wants a ByteArray is thinking beyond
> String issues.

Oh, really? So it is OK for this code to sometimes receive a binary
String and sometimes a String with encoding:
io = SomeIO.open( .... )
v = io.read( 1000 )

This is the most problematic part of String handling. Because if my
code expects this 'v' to be binary string, v[0..15] is the first 16
bytes (maybe a message header or something). If this is encoded
string (because some setting changed outside of my code), v[0..15]
will be some random amount of data.

This is the error that happens right now and will happen in the
future also, if the rules are not clear.


izidor
2a321daf565791ad30ac5ee945abf59a?d=identicon&s=25 Izidor Jerebic (Guest)
on 2006-06-28 21:06
(Received via mailing list)
On 28.6.2006, at 20:43, Austin Ziegler wrote:

> Think a little more about humane design and you'll see that two
> wholly separate classes require a lot more than what you're assuming
> and would end up in programmers making even dumber assumptions than
> they do today, because they'd think they're "protected" during IO
> because they're getting a String. This is not a safe assumption.
> Ever.

True. That's why most solutions do not offer String IO, but only
ByteArray. But for a language where a large part of usage is text
processing, this brings lots of conversions into code, which, as Matz
said, makes it like Java. But it is the safe way. Just the way you
like it - no automatic conversion :-)

But most of us would not like a language which makes you type all
the conversions manually in code, even for single-line scripts. Which
we would not have any more. Scripts would be at least two lines - one
line for conversion code :-)

izidor
Cb48ca5059faf7409a5ab3745a964696?d=identicon&s=25 unknown (Guest)
on 2006-06-28 21:19
(Received via mailing list)
On Thu, 29 Jun 2006, Austin Ziegler wrote:

> There is *no possible good argument* for separating ByteArray from String in
> Ruby. Not with what it would do to the rest of the API, and I don't think
> that anyone who wants a ByteArray is thinking beyond String issues.

i wouldn't go that far.  i'm wanting a byte array and thinking beyond
string issues about every 1-2 hrs in my job.  for example

   f = open 'grayscale_image.dat'

   n_rows.times do
     row = f.read n_cols

     # now i have to do this
     row = row.split(//).map{|char| char[0]}

     # because here i need to do
     avg_pixel_value = row[31,5].inject(0){|avg,n| avg += n} / 5.0

     if some_range.include? avg_pixel_value
       ...
     end
   end

this may have nothing to do with unicode issues - but i would love to
have 'array of bytes' style io operations, though i've not thought
about api for more than 1 second.

anyhow - we actually want byte arrays more often than strings.

regards.

-a
2a321daf565791ad30ac5ee945abf59a?d=identicon&s=25 Izidor Jerebic (Guest)
on 2006-06-28 21:19
(Received via mailing list)
On 28.6.2006, at 20:51, Austin Ziegler wrote:

> Nope. Not nearly equivalent and a lot dumber. I've just spent the last
> week explaining in simple terms why it's dumb.

Equivalent in their prediction power. This is the problem I discuss -
both give 100% results independent of environment, and the ByteArray
version is maybe even somewhat firmer because there is even a different
class of result, not only encoding.

You have not given any solution to any of the problems in the code
examples I have given, related to the problem of predicting the
class/encoding of the result.

I'd say the solution would prove you know what the problem is.

Except if you say that this (random String encoding in result) is not
a problem.
Then this discussion really can't progress. And we can agree that we
disagree and stop right here.


izidor
2a321daf565791ad30ac5ee945abf59a?d=identicon&s=25 Izidor Jerebic (Guest)
on 2006-06-28 21:47
(Received via mailing list)
On 28.6.2006, at 21:18, Izidor Jerebic wrote:

> I'd say the solution would prove you know what the problem is.
>
> Except if you say that this (random String encoding in result) is
> not a problem.
> Then this discussion really can't progress. And we can agree that
> we disagree and stop right here.

And to clear the air - I am not advocating ByteArray unconditionally.
I have just explained one crucial problem, and ByteArray is a
simplistic solution to the problem. I would much prefer some really
creative and simple String solution. But I do not have it and have
not seen it yet.

Hopefully Matz (or anybody, really) will surprise us with elegant,
balanced solution.


izidor
E34b5cae57e0dd170114dba444e37852?d=identicon&s=25 Logan Capaldo (Guest)
on 2006-06-28 22:19
(Received via mailing list)
On Jun 28, 2006, at 3:16 PM, ara.t.howard@noaa.gov wrote:

>
>     # now i have to this this
>     row = row.split(//).map{|char| char[0]}
>

This is off on a tangent here, but ara, why not just

row = row.to_enum(:each_byte).to_a
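
For what it's worth, String#unpack (already there in 1.8) gets the
same byte array more directly:

  row = "\x10\x20\x30\x40\x50"
  row.unpack("C*")  # => [16, 32, 48, 64, 80]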
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-29 02:07
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Thu, 29 Jun 2006 03:53:52 +0900, Izidor Jerebic
<ij.rubylist@gmail.com> writes:

|Oh, really? So it is OK for this code to sometimes receive binary
|String and sometimes String with encoding:
|io = SomeIO.open( .... )
|v = io.read( 1000 )

No, as I said before, reading with length specified shall always
return binary strings, since it counts in bytes, whereas gets,
readline etc. would return encoded strings.

							matz.
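
In terms of the API that eventually shipped in 1.9, the rule matz
describes looks like this (a sketch, assuming file.txt exists, is
UTF-8, and holds at least one line plus 16 more bytes):

  File.open("file.txt", "r:UTF-8") do |f|
    line  = f.gets      # line-oriented: tagged with the IO's encoding
    chunk = f.read(16)  # length counts bytes: returned as binary
    line.encoding       # => #<Encoding:UTF-8>
    chunk.encoding      # => #<Encoding:ASCII-8BIT>
  end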
F889bf17449ffbf62345d2b2d316a937?d=identicon&s=25 Michal Suchanek (Guest)
on 2006-06-29 07:13
(Received via mailing list)
On 6/28/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> io = SomeIO.open( .... )
> v = io.read( 1000 )
>
> This is the most problematic part of String handling. Because if my
> code expects this 'v' to be binary string, v[0..15] is the first 16
> bytes (maybe a message header or something). If this is encoded
> string (because some setting changed outside of my code), v[0..15]
> will be some random amount of data.
>
> This is the error that happens right now and will happen in the
> future also, if the rules are not clear.

I would think that STD* should use locale (or equivalent) for default
encoding. So should popen. And open should use locale to determine the
encoding of *file names*. This might be different from the encoding of
STD* (ie on Windows).

For file io it might be reasonable to set the default encoding from
locale as well. However, there is no reason why the files should
contain text. So to make things clear the io should be binary by
default for files, network, and anything else (except the pipes
mentioned above).

For short scripts one could change that by assigning some global that
specifies the default encoding. For anything else it is reasonable to
demand that everybody sets the encoding when calling open. Even issue
a warning about that. If you want to know what encoding you get, there
is no other way.
And it is not adding complexity. Today you do not specify the encoding,
but you also do not get anything that deals with it.

Thanks

Michal
C914fa463a2b1b067586c6432b12a824?d=identicon&s=25 Juergen Strobel (Guest)
on 2006-06-29 08:45
(Received via mailing list)
On Thu, Jun 29, 2006 at 03:43:54AM +0900, Austin Ziegler wrote:
> On 6/28/06, Juergen Strobel <strobel@secure.at> wrote:
> >Any additional complexity here should be offset later, when doing
> >operations on the read data as appropriate for its type.
>
> It won't be. All of the complexity of the m17n String will be inside of
> the String, not exposed (by default) to the user. Stop thinking of the
> encoding of a String as something that makes the String a unique object;
> instead it is a lens that gives meaning to the bytes of the String.

Having said lens adds complexity. I'll always have to think of the
data and the lens. You are very absolute in denying this, I wonder
why.

>
> >Of course, the first line should raise an exception if file.txt is not
> >utf8 encoded,
>
> The internal format of String is not going to be Unicode by default.
> Matz has already said that. I happen to agree with him.

Please stop beating this dead horse. No one is disputing Matz's right
to implement as he likes.

And this is not about the String per se, in that line of code clearly
something supposed to be UTF8 is read from a file, and if the file
doesn't contain valid UTF8, I'll expect an exception. Not getting that
exception adds complexity to my code, because I'll have to verify it
later on manually, and it may obscure the source of the error if I
forget. Complexity added in both cases.

Prior point proven.

> not a safe assumption. Ever.
>
> The separate byte vector class is needlessly complex and solves exactly
> nothing that isn't already solved in a better way.

Without a prototype, this is speculation at best. Programmers would be
protected by exceptions from invalid String I/O operations. Human
interface design hinges on a lot more and different things than this
one special detail, I can't imagine it will change a lot, and many
ruby programmers aren't as dumb as you make them out to be. This is a
red herring.

OT, you should watch whom you call dumb, stupid or foolish here, even
by implication.

That said, I am waiting for M17N as Matz has decided on that, and I
suspect no one else is going to implement anything else for now. But
don't tell me it'll be just perfect for everyone, when discussed use
cases already show it won't be. Matz himself said that, in order to
cater to his own special interest group, he is willing to sacrifice
some convenience for others.

-Jürgen
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-29 08:57
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Thu, 29 Jun 2006 15:44:10 +0900, Juergen Strobel
<strobel@secure.at> writes:

|That said, I am waiting for M17N as Matz has decided on that, and I
|suspect noone else is going to implement anything else for now. But
|don't tell me it'll be just perfect for everyone, when discussed use
|cases already show it won't be. Matz himself said, that in order to
|cater to his own special interest group, he is willing to sacrifice
|some convenience for others.

Did I say so?  I am not going to sacrifice anybody.  At least I am
trying not to, even though I cannot promise.

							matz.
C914fa463a2b1b067586c6432b12a824?d=identicon&s=25 Juergen Strobel (Guest)
on 2006-06-29 09:35
(Received via mailing list)
On Thu, Jun 29, 2006 at 03:56:55PM +0900, Yukihiro Matsumoto wrote:
> |some convenience for others.
>
> Did I said so?  I am not going to sacrifice anybody.  At least I am
> trying not to, even though I cannot promise.
>
> 							matz.

I don't think you can possibly cater to everyone here.  Simplicity,
Flexibility, Performance: take any two. My impression is that M17N is
going for maximum flexibility with good performance, but for
e.g. Unicode-only users there'll be some extra complexity to be aware
of. I don't think you'll sacrifice Unicode users totally, but it is
not your top priority either.

And I understood you expressed this yourself in the following quote.

On Tue, Jun 27, 2006 at 05:21:27PM +0900, Yukihiro Matsumoto wrote:
> |ever to come along. It just seems based on a lot of anecdotal evidence that
>
> 							matz.
>

-Jürgen
0ec4920185b657a03edf01fff96b4e9b?d=identicon&s=25 Yukihiro Matsumoto (Guest)
on 2006-06-29 09:44
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Thu, 29 Jun 2006 16:33:19 +0900, Juergen Strobel
<strobel@secure.at> writes:

|I don't think you can possibly cater to everyone here.  Simplicity,
|Flexibility, Performance: take any two. My impression is that M17N is
|going for maximum flexibility with good performance, but for
|e.g. Unicode-only users there'll be some extra complexity to be aware
|of. I don't think you'll sacrifice Unicode users totally, but it is
|not your top priority either.

I can't promise implementation simplicity, because the simplicity
would not be on the inside.  But I am trying to build "pseudo
simplicity", which means simplicity in the appearance.  For example,
text processing code with file I/O in Ruby will keep being much
simpler than in Java.

|And I understood you expressed this yourself in the following quote.

Don't get me wrong without context.  You've said that "this approach
is complex, and worth it for 10% or less of Ruby users".  And I said,
"unfortunately I am one of those 10% or less.  You cannot stop Ruby
being (implementation) complex".  Clear?

							matz.
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2006-06-29 13:13
(Received via mailing list)
On 6/29/06, Juergen Strobel <strobel@secure.at> wrote:
> I don't think you can possibly cater to everyone here.  Simplicissity,
> Flexibility, Performance, take any two. My impression is that M17N is
> going for maximum flexibility with good performance, but for
> e.g. Unicode only users there'll be some extra complexity to be aware
> of. I don't think you'll sacrifice Unicode users totally, but it is
> not your top priority either.

Um. You make the same error, I think, that some others have. There are
two measures of complexity to be measured. The first is implementation
complexity. The second is use complexity. I fully expect that the
implementation of the m17nString is going to be complex. (I think it
will be simpler than most naysayers are suggesting, but it will
certainly be more complex than anything that currently exists.)
However, I believe that the use complexity -- that is, the external
API in both C (for extensions) and Ruby -- is going to be relatively
low. Maybe a little more complex than we have today.

The *actual* complexity in use is going to depend on your needs. If
you're dealing with Unicode and binary data only -- as will likely be
the case -- you will find it much easier to use than someone who has
to deal with multiple encodings at once.

-austin
C914fa463a2b1b067586c6432b12a824?d=identicon&s=25 Juergen Strobel (Guest)
on 2006-06-30 01:06
(Received via mailing list)
On Thu, Jun 29, 2006 at 04:42:49PM +0900, Yukihiro Matsumoto wrote:
> |not your top priority either.
> "unfortunately I am one of those 10% or less.  You cannot stop Ruby
> being (implementation) complex".  Clear?
>
> 							matz.

First, it wasn't me who brought this up, the quote about the 10% is
from "Charles O Nutter". Second, I know a complex implementation
doesn't mean the interface has to be complex, on the contrary.

My fear is that the interface will still be more complex than really
necessary for *me* -- not that I would expect this is reason enough for
you to deviate from your plans. Voicing my own concerns and wishes about
the interface design is a thing I can do though, in the hope that such
feedback will be useful to you, or at least informative to other
readers.

I still think that you won't be able to please everybody, that's just
not possible. No evangelist will ever convince me. But I am eager to
see for myself how close you can come (and where you will compromise).

-Jürgen
F84eb4d8ec7ec9a694d18ca9e54db3f0?d=identicon&s=25 Ivan Mashchenko (ivan)
on 2007-05-31 22:29
Hello, everyone. I am sorry, I was a bit embarrassed by the quantity of
text in this discussion and I may not have read it carefully enough to
figure out the answer, and the discussion itself seems to be a year
old, so I've decided to ask:

Finally, is there convenient support for Unicode in Ruby? Or, if not,
when will there be?

I am going to develop an international website (with pages in some
european languages, including those using non-latin alphabets). I think
it should prove to be a good idea to make such a website totally in
Unicode (probably UTF-16), without using any legacy encodings at all.
The DBMS I am going to use is Oracle 10g (Express edition until it comes
to its limitations).

As well, I would like to ask when the next Ruby release is planned. If
it comes this year, I should probably try nightly builds, as it seems
wise to start a new project targeting an early version of the next
release.

Thanks in advance.
31ab75f7ddda241830659630746cdd3a?d=identicon&s=25 Austin Ziegler (austin)
on 2007-06-01 00:31
(Received via mailing list)
On 5/31/07, Ivan Mashchenko <ivan.mashchenko@gmail.com> wrote:
> Hello, everyone. I am sorry, I was a bit embarassed by the quantity of
> text in this discussion and I may have read it not enough carefully to
> firure out the answer, and it (discussion) itself seems to be a year
> old, so I've decided to ask:

> Finally, is there a convenient support for Unicode in Ruby? Or, if not,
> when will it be?

There are a lot of answers to that question, and I strongly suggest
you search as this is a hotly debated discussion.

Google is more useful for searching this than ruby-forum.com. You will
find out when there will be a new release, and the current state of
Unicode.

-austin
19fdf8bd123216b5056fb856cf1a5771?d=identicon&s=25 _why (Guest)
on 2007-06-01 08:16
(Received via mailing list)
On Fri, Jun 01, 2007 at 05:29:31AM +0900, Ivan Mashchenko wrote:
> Finally, is there a convenient support for Unicode in Ruby? Or, if not,
> when will it be?

Well, Ruby 1.9 (which is due in December) will have some Unicode
support.  (So you'll have a `chars` method on strings, like with
Rails.)  Matz is working on it right now even, as he posted that he
was tooling around with string.c earlier this week on his blog.

That is, nothing's been checked in yet.  Because he wants it to be
good, you see?

_why
95ece3bc20c5dc43685302703a1e99bd?d=identicon&s=25 Erik Hollensbe (Guest)
on 2007-06-01 08:25
(Received via mailing list)
On 2007-05-31 15:30:50 -0700, "Austin Ziegler" <halostatue@gmail.com>
said:

> you search as this is a hotly debated discussion.
>
> Google is more useful for searching this than ruby-forum.com. You will
> find out when there will be a new release, and the current state of
> Unicode.

If it helps any, I've moved ~2000 web pages in an internal work project
that had mixed UTF-8/cp-1252 (in the content, not just between pages)
and ruby handled it very gracefully. I was using 1.8.5-p12 and Hpricot
(but not Hpricot's encoding features, which last I checked are broken)
for the process.

While I'm certainly not an authority on the subject, I've thoroughly
battle-tested this and it works with a high degree of confidence.
Certainly better than perl and libxml2, which was our original
implementation.
18d3c84ca5a017fe3e96490afaea28aa?d=identicon&s=25 Richard Conroy (Guest)
on 2007-06-01 11:51
(Received via mailing list)
On 5/31/07, Ivan Mashchenko <ivan.mashchenko@gmail.com> wrote:
> Finally, is there a convenient support for Unicode in Ruby? Or, if not,
> when will it be?

It depends on your definition of 'convenient'.

The short answer is that unicode applications can be made in Ruby,
particularly Web Apps. It is not especially difficult, but it is not
'for free' or seamless. You generally have to use an encoding-aware
string type, or modify the existing string class to support multi-byte
characters.

A longer answer would contain references to the fact that there are
multiple options here, that web apps (Rails in particular) are ahead
of pure Ruby in terms of Unicode, and that there are actually a lot
of projects to investigate.

The hardest part of Ruby and Unicode is that not all of the libraries
support it, or that some of the meta-hackery to the string class
could break libraries that expect chars.length to equal bytes.length
(there are other examples). Some of the more popular libraries are
like this, or they inherit the encoding from your O/S settings and
cannot be driven from an API.

> I am going to develop an international website (with pages in some
> european languages, including those using non-latin alphabets). I think
> it should prove to be a good idea to make such a website totally in
> Unicode (probably UTF-16), without using any legacy encodings at all.

Well yes, but I would use UTF-8 instead. It's Unicode designed for the
web (and UTF-16 is a bit weird in some ways - there are at least 3 kinds
of UTF-16 that I am aware of).

Rails 1.2 introduced some pretty impressive support for Unicode in the
last release, all of the major i18n plugins should be compatible with
these changes by now.

> As well I would like to ask when the next Ruby release is planned to. If
> it comes this year, I should probably try nightly builds as it seems to
> be wise to start a new project targeting ea version of the next release.

AFAIK there is no release schedule. YARV is basically Ruby 1.9, and it
is scheduled for release around the end of the year. However there is no
firm commitment to make it the next Ruby version. Also Ruby 1.9 is going
to break/deprecate stuff - I wouldn't develop against it, it will be a
rough experience.
Ruby 1.9 is kind of a staging release; migrating from 1.8 -> 1.9 is
going to be tricky, but 1.9 -> 2.0 should be a drop-in; that's the
intention - isolate the biggest changes to the 1.9 release.

If you are moving to Ruby 1.9, do it with a complete working
application. Or better still, develop against Rails versions, not Ruby
versions. Let the Rails team figure out the best Ruby migration
strategy for you.
F84eb4d8ec7ec9a694d18ca9e54db3f0?d=identicon&s=25 Ivan Mashchenko (ivan)
on 2007-06-01 13:28
Richard Conroy wrote:

> It depends on your definition of 'convenient'.

IMHO convenient is as in C#. There I don't have to bother with how
strings are stored in memory; they just work and are international.

> Well yes, but I would use UTF-8 instead.

Won't there be a problem if the data is stored in UTF-16 (as far as I
know Oracle's NVARCHAR uses 16 bits per symbol)?

> Also Ruby 1.9 is going to break/deprecate stuff - I wouldn't develop against it
> migrating from 1.8 -> 1.9 is going to be tricky

So why should anyone develop a new project against 1.8 if it is going to
be deprecated?

> If you are moving to Ruby 1.9, do it with a complete working
> application.

But isn't it going to be tricky, as you've said?

I don't have to be moving for now, as I have no line of Ruby code (I
have only an idea in my head) for today. And no Ruby experience (I am a
C++, C#, Java and T-SQL developer). I've chosen Ruby as it seems almost
good and free.

Have I understood you correctly - you think I should make it Ruby 1.8
and then do a tricky move when it comes?

> Or better still, develop against Rails versions, not Ruby versions.

This advice can prove useful. I'll think about it.
18d3c84ca5a017fe3e96490afaea28aa?d=identicon&s=25 Richard Conroy (Guest)
on 2007-06-01 16:25
(Received via mailing list)
On 6/1/07, Ivan Mashchenko <ivan.mashchenko@gmail.com> wrote:
> Richard Conroy wrote:
>
> > It depends on your definition of 'convenient'.
>
> IMHO convinient is as in C#. There I don't have to bother how are
> strings stored in memroy, they just do work and are international.

It's not *that* convenient. By default Ruby strings are 8-bit byte
strings. You can make them Unicode strings very easily through a
library ($KCODE IIRC), and they will behave as unicode in a way that
you don't have to think about. You don't have to use a different
string type.
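
Roughly what that looks like on 1.8 ($KCODE plus the jcode standard
library, which patches multibyte awareness into String; a sketch,
assuming a UTF-8 source):

  $KCODE = 'u'
  require 'jcode'

  "héllo".length   # => 6 (plain length still counts bytes)
  "héllo".jlength  # => 5 (characters, courtesy of jcode)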

The problem occurs when you use code that you didn't write that expects
strings to be single-byte. So every time you evaluate a Ruby library,
Rails plugin or gem, you have to do more homework than you would in
Unicode-centric Java or C#.

> > Well yes, but I would use UTF-8 instead.
>
> Won't there be a problem if the data is stored in UTF-16 (as far as I
> know Orace, NVARCHAR uses 16-bit per symbol)

Every database worth using lets you specify the encoding of your string
and character types. Check your manuals or the Oracle forums. Anything
that is in any way associated with web development supports UTF-8.

>
> > Also Ruby 1.9 is going to break/deprecate stuff - I wouldn't develop against it
> > migrating from 1.8 -> 1.9 is going to be tricky
>
> So why should anyone develop a new project against 1.8 if it is going to
> be deprecated?

Okay, you misunderstood me. There is a feature roadmap towards Ruby 2.0,
where major changes are coming in; the two biggest that I recall are
Unicode support and native/pre-emptive threads. The only reasonable way
to implement them is by altering the behaviour of core classes and the
standard library.

This will mean that Ruby code of any sophistication written for Ruby
1.8, including many libraries, is likely to break.

Ruby 1.8 is not going away. Ruby is an open language, with a public
source repository. Unlike with .Net, say, where Microsoft distributes
the runtime in binary-only form and can make older versions difficult
to get. You have no obligation to migrate to the most recent version,
and there is no technical reason that multiple runtimes (application
specific) cannot co-exist on the same machine.

Chasing the latest release is really something that you only do with
commercial languages. It's not something that is generally done with
open languages.

>
> > If you are moving to Ruby 1.9, do it with a complete working
> > application.
>
> But isn't it going to be tricky, as you've said?

It would be one hell of a lot easier than developing against a moving
target, not knowing if the issues in your code are your issues or
due to the latest release candidate.

Bleeding-edge software development is for people who can spare a
lot of blood loss.

> I dont have to be moving for now as I have no line of Ruby code (I have
> only an idea in my head) for today. And no Ruby experience (I am C++,
> C#, Java and T-SQL developer). I've chosen Ruby as it seems almost good
> and free.

Yeah, it's a great language. Make a point of checking out the JRuby
project. It's an exceptionally well-developed Ruby runtime; it is
considerably more than an interpreter or language bridge - the JRuby
guys have basically doubled the size of the Java platform (or Ruby
platform, depending on POV). Ruby is strong where Java is weak, and
vice versa.

> Have I understood you correctly - you think I should make it Ruby 1.8
> and then do a tricky move when it comes?

Use Rails, where the most compelling features in Ruby 1.9/2.0 are
already present: Unicode, native concurrency (via processes) and good
performance (via all those <foo>caching mechanisms). When the Rails
guys go Ruby 1.9, you can.

> > Or better still, develop against Rails versions, not Ruby versions.
>
> This advice can prove useful. I'll think about it.

regards,
Richard.
1c0cd550766a3ee3e4a9c495926e4603?d=identicon&s=25 John Joyce (Guest)
on 2007-06-02 01:00
(Received via mailing list)
On Jun 1, 2007, at 9:23 AM, Richard Conroy wrote:

> [...]
Objective-C (through the Cocoa framework) also handles Unicode
superbly. Problem is, it is not cross-platform and is in fact
strictly OS X stuff, but you could indeed use those libraries
(NSString, etc...) through RubyCocoa, but of course that is far from
convenient or optimal for most purposes.

Ideally, if major OS vendors got behind Ruby full force and put their
Unicode know-how into the codebase, things would be smoother. They're
the ones who really have already figured out pretty good ways to
handle that stuff, and all the major scripting languages could
benefit from it.