Ruby Forum Ruby > Unicode roadmap?

Posted by Roman Hausner (rhaus)
on 13.06.2006 23:12
In my opinion, Ruby is practically useless for many applications without 
proper Unicode support. How a modern language can ignore this issue is 
really beyond me.

Is there a plan to get Unicode support into the language anytime soon?
Posted by Yukihiro Matsumoto (Guest)
on 14.06.2006 00:28
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner 
<roman.hausner@gmail.com> writes:
|In my opinion, Ruby is practically useless for many applications without 
|proper Unicode support. How a modern language can ignore this issue is 
|really beyond me.

Define "proper Unicode support" first.

|Is there a plan to get Unicode support into the language anytime soon?

I'm planning to enhance Unicode support in 1.9 in a year or so
(finally).  But I'm not sure that it will conform to your definition of
"proper Unicode support".  Note that 1.8 handles Unicode (UTF-8) if your
string operations are based on Regexp.

							matz.
Posted by Pete (Guest)
on 14.06.2006 00:38
(Received via mailing list)
> Define "proper Unicode support" first.

having an unicode-equivalent for all methods of class String

like size, slice, upcase

E.g. I tried the unicode plugin... but, alas, who wants to write stuff
like 'normalize_KC' etc. if you just want the frickin' substring of a
string?!

you need to read books on unicode just to properly use the plugin...

aargg :-((

Best regards
Peter




Posted by Logan Capaldo (Guest)
on 14.06.2006 00:51
(Received via mailing list)
On Jun 13, 2006, at 6:34 PM, Pete wrote:

>> Define "proper Unicode support" first.
>
> having an unicode-equivalent for all methods of class String
>
> like size, slice, upcase
>
> E.g. I tried the unicode plugin... but, alas, who want's to write  
> stuff like 'normalize_KC' etc. if you just want the frickin'  
> substring of a string?!
>

def substring(str, start, len)
   md = str.match(/\A.{#{start}}(.{#{len}})/)
   md[1]
end


def strlength(str)
   n = 0
   str.gsub(/./m) { n += 1; $& }
   n
end


See! Regexps do everything!

Just you know, set $KCODE and use these methods and you are set!

(I am kidding... btw)
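
For readers on a current interpreter, here is a minimal sketch of the same regexp-based idea. It assumes Ruby 1.9 or later, where a UTF-8 string is matched character by character without setting $KCODE; the sample word is only an illustration.

```ruby
# Sketch only: assumes Ruby 1.9 or later. On 1.8 the advice in this thread
# applies instead: set $KCODE = 'u' before using the regexps.

# Return `len` characters starting at character offset `start`.
def substring(str, start, len)
  md = str.match(/\A.{#{start}}(.{#{len}})/m)
  md && md[1]              # nil when the string is too short
end

# Count characters by scanning with a regexp.
def strlength(str)
  str.scan(/./m).length
end

word = "Größe"                  # 5 characters, 7 bytes in UTF-8
middle = substring(word, 2, 2)  # => "öß"
chars  = strlength(word)        # => 5
bytes  = word.bytesize          # => 7
```

The gap between the character count and the byte count is the reason plain byte-based String#length gave surprising results on 1.8.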
Posted by Pete (Guest)
on 14.06.2006 01:00
(Received via mailing list)
From the theoretical point of view this is quite interesting. Also I
understand the humor :-)

Performance and memory consumption should be breathtaking using regexp
just everywhere...

Also there are a ____few____ methods left :-)

As I am German the 'missing' unicode support is one of the greatest
obstacles for me (and probably all other Germans doing their stuff
seriously)...


Posted by Victor Shepelev (Guest)
on 14.06.2006 01:13
(Received via mailing list)
From: Pete [mailto:pertl@gmx.org]
Sent: Wednesday, June 14, 2006 1:58 AM
> As I am German the 'missing' unicode support is one of the greatest
> obstacles for me (and probably all other Germans doing their stuff
> seriously)...

The same goes for Russians/Ukrainians. In our programming communities the
question "does the programming language support Unicode natively?" has
very high priority.

BTW, this is one of the areas where Python beats Ruby completely.

V.
Posted by James Moore (Guest)
on 14.06.2006 01:59
(Received via mailing list)
I suspect the Japanese posters on this list can answer better than I can,
but my impression is that Unicode is, shall we say, not highly thought of
outside Europe and North America.  The way they dealt with "Chinese"
characters was apparently more than a bit of a hack, and just doesn't work
very well in the real world.  Reading some of the explanations for glyphs
versus characters in Unicode just makes you shake your head.  What were
they thinking?  Sure doesn't pass the smell test, although I'll be the
first to admit I haven't exactly thought deeply about the subject.

There's another problem with Japanese - I've got a friend who's been
dealing with some issues around the fact that Japanese apparently innovates
new characters on a regular basis, and everyone is expected to use the new
characters.  (I believe this is called gaiji).  The concept of a fixed
character set apparently just isn't a good idea to start with.

[Awaiting corrections from people who actually know something about this
topic :-)...]

 - James Moore
Posted by David Balmain (Guest)
on 14.06.2006 02:14
(Received via mailing list)
On 6/14/06, James Moore <banshee@banshee.com> wrote:
> with some issues around the fact that Japanese apparently innovates new
> characters on a regular basis, and everyone is expected to use the new
> characters.  (I believe this is called gaiji).  The concept of a fixed
> character set apparently just isn't a good idea to start with.
>
> [Awaiting corrections from people who actually know something about this
> topic :-)...]

There is a good summary of the han unification controversy on wikipedia;

    http://en.wikipedia.org/wiki/Han_unification
Posted by Mat Schaffer (Guest)
on 14.06.2006 03:16
(Received via mailing list)
On Jun 13, 2006, at 7:56 PM, James Moore wrote:
> topic :-)...]
I have one Japanese person here who's never heard of this gaiji
concept.  But it could be new and behind a generation gap of some
kind.  They do sure like to add symbols where they can, though.
Especially graphical star characters.  I see that a lot.
-Mat
Posted by Yukihiro Matsumoto (Guest)
on 14.06.2006 04:38
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 14 Jun 2006 08:11:49 +0900, "Victor Shepelev" 
<vshepelev@imho.com.ua> writes:

|From: Pete [mailto:pertl@gmx.org]
|Sent: Wednesday, June 14, 2006 1:58 AM
|> As I am German the 'missing' unicode support is one of the greatest
|> obstacles for me (and probably all other Germans doing their stuff
|> seriously)...
|
|The same is for Russians/Ukrainians. In our programming communities question
|"does the programming language supports Unicode as 'native'?" has very high
|priority.

Alright, then what specific features are you (both) missing?  I don't
think it is a method to get number of characters in a string.  It
can't be THAT crucial.  I do want to cover "your missing features" in
the future M17N support in Ruby.

							matz.
Posted by Victor Shepelev (Guest)
on 14.06.2006 07:29
(Received via mailing list)
From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 5:37 AM
> |The same is for Russians/Ukrainians. In our programming communities
> 							matz.
I suppose all we (non-English writers) need is to have all string-related
methods working. For now I am thinking of simply testing each String
method; some other classes may also be affected by Unicode (possibly
regexps and paths). Regexps seem to work fine (in my 1.9), but paths do
not: File.open with Russian letters in the path doesn't find the file.

More generally, it might make sense to have Unicode as the "base" mode,
with non-Unicode remaining as the "old, compatibility" mode.

Something like this.

V.
Posted by Pål Bergström (palb)
on 14.06.2006 07:54
Roman Hausner wrote:
> In my opinion, Ruby is practically useless for many applications without 
> proper Unicode support. How a modern language can ignore this issue is 
> really beyond me.
> 
> Is there a plan to get Unicode support into the language anytime soon?

I also think that this is very important.
Posted by Yukihiro Matsumoto (Guest)
on 14.06.2006 08:37
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 14 Jun 2006 14:26:30 +0900, "Victor Shepelev" 
<vshepelev@imho.com.ua> writes:

|I suppose, all we (non-English-writers) need is to have all string-related
|methods working. Just for now, I think about plain testing each string
|method; 

In that sense, _I_ am one of the non-English-writers, so that I can
suppose I know what we need.  And I have no problem with the current
UTF-8 support.  Maybe that's because Japanese don't have cases in our
characters.  Or maybe I'm missing something.  Can you show us your
concrete problems caused by Ruby's lack of "proper" Unicode support?

|also, some other classes can be affected by Unicode (possibly
|regexps, and pathes). Regexps seems to work fine (in my 1.9), but pathes are
|not: File.open with Russian letters in path don't finds the file.

Strange.  Ruby does not convert encoding, so there should be no
problem opening files if you are using strings in the encoding your OS
expects.  If they differ, you have to specify (and convert) them
properly, no matter how good the Unicode support is.

							matz.
Posted by Victor Shepelev (Guest)
on 14.06.2006 08:56
(Received via mailing list)
From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 9:35 AM
> 
> In that sense, _I_ am one of the non-English-writers, 

Sorry, Matz, I know, of course. But I know too little about Japanese to
see how close our tasks are. Instead of "non-English writers" I perhaps
should have said "European languages" or so, which have common
punctuation, LTR writing, "words" and "whitespace" and so on. I have
almost no knowledge of what Japanese, Korean, Arabic, or Hebrew users need.

> so that I can
> suppose I know what we need.  And I have no problem with the current
> UTF-8 support.  Maybe that's because Japanese don't have cases in our
> characters.  Or maybe I'm missing something.  

Just what I've said above.

> Can you show us your
> concrete problems caused by Ruby's lack of "proper" Unicode support?

As mentioned in this topic, it's String#length, upcase, downcase,
capitalize.

BTW, does String#length work well for you?

Moreover, there seem to be some huge problems with paths containing
Russian letters, but I'm really not convinced that Ruby has to handle
this.

> 
> |also, some other classes can be affected by Unicode (possibly
> |regexps, and pathes). Regexps seems to work fine (in my 1.9), but pathes
> are
> |not: File.open with Russian letters in path don't finds the file.
> 
> Strange.  Ruby does not convert encoding, so that there should be no
> problem opening files, if you are using strings in the encoding your OS
> expect.  If they are differ, you have to specify (and convert) them
> properly, no matter how Unicode support is.

Oh, that's a rather hard topic for me. I know Windows XP must support
Unicode file names; I see my filenames in Russian, but I know too little
about system internals to say whether they are really Unicode.

Leaving those problems aside, only the String problems remain, but those
are such basic core methods!

V.
Posted by Michael Glaesemann (Guest)
on 14.06.2006 09:09
(Received via mailing list)
On Jun 14, 2006, at 15:56 , Victor Shepelev wrote:

> As mentioned in this topic, it's String#length, upcase, downcase,
> capitalize.

Just to chime in, aren't upcase, downcase, and capitalize a locale/
localization issue rather than a Unicode-only issue per se? For
example, different languages will have different rules for
capitalization. Or am I wrong? Does Unicode in and of itself address
these issues?

Granted, proper support for upcase, downcase, and capitalize is
important, but I think it's a separate issue, part of m17n as a whole
rather than support for Unicode in particular.

Michael Glaesemann
grzm seespotcode net
Posted by Vincent Isambart (Guest)
on 14.06.2006 09:15
(Received via mailing list)
Hi,

> As mentioned in this topic, it's String#length, upcase, downcase,
> capitalize.
>
> BTW, does String#length works good for you?

To have the length of a Unicode string, just do str.split(//).length,
or "require 'jcode'" at the beginning of your code.
For the other functions, try looking at the unicode library
http://www.yoshidam.net/Ruby.html#unicode

> Oh, it's a bit hard theme for me. I know Windows XP must support Unicode
> file names; I see my filenames in Russian, but I have low knowledge of
> system internals to say, are they really Unicode?

Windows XP does support Unicode file names, but I'm not sure you can
use them with Ruby (I do not use Ruby much under Windows). Try
converting the file names to your current locale, it should work if
the file names can be converted to it. What I mean is that Russian
file names encoded in the Windows Russian encoding should work on a
Russian PC.

Hope this helps,

Cheers,
Vincent ISAMBART
Posted by Yukihiro Matsumoto (Guest)
on 14.06.2006 09:22
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 14 Jun 2006 15:56:02 +0900, "Victor Shepelev" 
<vshepelev@imho.com.ua> writes:

|> Can you show us your
|> concrete problems caused by Ruby's lack of "proper" Unicode support?
|
|As mentioned in this topic, it's String#length, upcase, downcase,
|capitalize.

OK. Case is the problem.  I understand.

|BTW, does String#length works good for you?

I don't remember the last time I needed length method to count
character numbers.  Actually I don't count string length at all both
in bytes and characters in my string processing.  Maybe this is a
special case.  I am too optimized for Ruby string operations using
Regexp.

|Oh, it's a bit hard theme for me. I know Windows XP must support Unicode
|file names; I see my filenames in Russian, but I have low knowledge of
|system internals to say, are they really Unicode?

Windows 32 path encoding is a nightmare.  Our Win32 maintainers are often
troubled by unexpected OS behavior.  I am sure we _can_ handle Russian
path names, but we need help from Russian people to improve.

							matz.
Posted by Victor Shepelev (Guest)
on 14.06.2006 09:25
(Received via mailing list)
From: Michael Glaesemann [mailto:grzm@seespotcode.net]
Sent: Wednesday, June 14, 2006 10:08 AM
> On Jun 14, 2006, at 15:56 , Victor Shepelev wrote:
> 
> > As mentioned in this topic, it's String#length, upcase, downcase,
> > capitalize.
> 
> Just to chime in, aren't upcase, downcase, and capitalize a locale/
> localization issue rather than a Unicode-only issue per se? For
> example, different languages will have different rules for
> capitalization. 

Really? I know about two cases: European capitalization and no
capitalization.

But, really, you may be right. I suppose Florian Gross can say something
about German-specific capitalization issues.

> Granted, proper support for upcase, downcase, and capitalize is
> important, but I think it's a separate issue, part of m17n as a whole
> rather than support for Unicode in particular.

Maybe. Generally, sometimes I want Unicode, and sometimes (for "quick and
dirty" scripts) I'd prefer that capitalization and regexps "just work"
with Windows-1251 (the one-byte Russian encoding).

V.
Posted by Victor Shepelev (Guest)
on 14.06.2006 09:26
(Received via mailing list)
From: Vincent Isambart [mailto:vincent.isambart@gmail.com]
Sent: Wednesday, June 14, 2006 10:14 AM
> > As mentioned in this topic, it's String#length, upcase, downcase,
> > capitalize.
> >
> > BTW, does String#length works good for you?
> 
> To have the length of a Unicode string, just do str.split(//).length,
> or "require 'jcode'" at the beginning of your code.
> For the other functions, try looking at the unicode library
> http://www.yoshidam.net/Ruby.html#unicode

I know about it. But, theoretically speaking, such "core" methods must be
in the core. No?

> > > properly, no matter how Unicode support is.
> Russian PC.
Yes, they work. But I can't decide: does Ruby's Unicode support need to
include filename operations?

V.
Posted by Victor Shepelev (Guest)
on 14.06.2006 09:32
(Received via mailing list)
From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 10:20 AM
> OK. Case is the problem.  I understand.
> 
> |BTW, does String#length works good for you?
> 
> I don't remember the last time I needed length method to count
> character numbers.  Actually I don't count string length at all both
> in bytes and characters in my string processing.  Maybe this is a
> special case.  I am too optimized for Ruby string operations using
> Regexp.

I can confirm. But I'm afraid that some libraries I rely on use #length
and can break when #length doesn't work.

> |Oh, it's a bit hard theme for me. I know Windows XP must support Unicode
> |file names; I see my filenames in Russian, but I have low knowledge of
> |system internals to say, are they really Unicode?
> 
> Windows 32 path encoding is a nightmare.  Our Win32 maintainers often
> troubled by unexpected OS behavior.  I am sure we _can_ handle Russian
> path names, but we need help from Russian people to improve.

In Russian encoding (Win-1251) and on a Russian PC everything works well.
In Unicode it doesn't, but I'm not convinced it must.

In any case, I'm ready to spend my time helping the Ruby community
(especially with Russian/Ukrainian localization issues), because I really
love the language.

V.
Posted by Marcus Andersson (marcan)
on 14.06.2006 09:45
(Received via mailing list)
Yukihiro Matsumoto skrev:
> Hi,
>
> In message "Re: Unicode roadmap?"
>     on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner <roman.hausner@gmail.com> writes:
> |In my opinion, Ruby is practically useless for many applications without 
> |proper Unicode support. How a modern language can ignore this issue is 
> |really beyond me.
>
> Define "proper Unicode support" first.
>   
I won't define "proper Unicode support" here.

But there must be a problem somewhere since pure-Ruby Ferret doesn't
support UTF-8. You need to use the C extension of Ferret to have it
support UTF-8 (which doesn't work on Windows yet :( ). I don't know if
that is just a sucky impl of Ferret or if it's Ruby that makes it so.

Maybe Dave Balmain can enlighten us why UTF-8 doesn't work in the pure
Ruby version and what is needed of Ruby to make it work (if it's
actually Ruby's fault that is)?

My personal belief is that it should just work in a case like this if
the data in is UTF-8 and the search strings are UTF-8, without the lib
author and/or user having to do anything very special to make it work
(apart from specifying the encoding). Am I wrong in this?

Regards,

Marcus
Posted by Eric Hodel (Guest)
on 14.06.2006 10:23
(Received via mailing list)
On Jun 13, 2006, at 10:26 PM, Victor Shepelev wrote:
> Regexps seems to work fine (in my 1.9), but pathes are
> not: File.open with Russian letters in path don't finds the file.

On OS X multibyte filenames work:

$ cat x.rb
$KCODE = 'u'

puts File.read('Cyrillic_Я.txt')
$ cat Cyrillic_\320\257.txt
test file with Я!
$ ruby x.rb
test file with Я!
$ uname -a
Darwin kaa.jijo.segment7.net 8.6.0 Darwin Kernel Version 8.6.0: Tue
Mar  7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
$ ruby -v
ruby 1.8.4 (2006-05-18) [powerpc-darwin8.6.0]
$

--
Eric Hodel - drbrain@segment7.net - http://blog.segment7.net
This implementation is HODEL-HASH-9600 compliant

http://trackmap.robotcoop.com
Posted by Paul Battley (Guest)
on 14.06.2006 10:55
(Received via mailing list)
On 14/06/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Windows 32 path encoding is a nightmare.  Our Win32 maintainers often
> troubled by unexpected OS behavior.  I am sure we _can_ handle Russian
> path names, but we need help from Russian people to improve.

str.sub!('32 path encoding ', '') # :-)

I don't use Windows much, but as I understand it, Ruby interacts with
most of the Win32 API using the 'legacy code page', which is only a
subset of what the filesystem can handle. (Windows NT and its
successors use Unicode internally, and the filesystem is UTF-16
KC-normalised IIRC). Windows does provide Unicode API functions, but
to use those, a layer of translation between UTF-16 and UTF-8 would be
needed, as Ruby can't do anything useful with UTF-16 at present. I
believe that Austin Ziegler was looking into this; I don't know if
he's made any progress.

Even if a Ruby program uses UTF-8 internally, it should be possible to
access the filesystem by Iconv'ing paths to the appropriate code page
- providing that they don't contain characters not in the code page.
It's far from ideal, though: the real solution is for Ruby to use the
Unicode functions (those suffixed with W) in the API. The upside is
that UTF-8/UTF-16 conversion should be less expensive than the code
page conversion that's inside each of Win32's non-Unicode functions.

On the other hand, plenty of Windows programs don't support Unicode
properly either.

Paul.
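
As a minimal sketch of the code page conversion described above: the example below uses String#encode, which exists in Ruby 1.9 and later, in place of the Iconv call mentioned in the post. The helper name to_code_page and the file names are made up for illustration.

```ruby
# Sketch only: assumes Ruby 1.9 or later. `to_code_page` is a hypothetical
# helper, not part of Ruby or of any library discussed in this thread.

# Convert a UTF-8 path to a legacy Windows code page.
# Returns nil when the path has a character the code page cannot represent.
def to_code_page(path, code_page = "Windows-1251")
  path.encode(code_page)
rescue Encoding::UndefinedConversionError
  nil
end

russian  = to_code_page("Отчёт.txt")   # 9 bytes: one byte per character
japanese = to_code_page("日本.txt")     # => nil, not in the Russian code page

russian.bytesize                        # => 9
"Отчёт.txt".bytesize                    # => 14 in UTF-8
```

The nil case is the limitation noted in the post: a path survives the round trip only if every character exists in the target code page.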
Posted by Paul Battley (Guest)
on 14.06.2006 11:00
(Received via mailing list)
On 14/06/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> I can confirm. But I'm afraid that some libraries I rely on use #length and
> can break when #length doesn't work.

Those libraries should probably be considered broken; they can and
should be patched to do any human-readable-string processing in an
encoding-safe manner (e.g. by using jcode's jlength and each_char
methods).

Paul.
Posted by Peter Ertl (Guest)
on 14.06.2006 11:09
(Received via mailing list)
-------- Original Message --------
Date: Wed, 14 Jun 2006 17:58:41 +0900
From: Paul Battley <pbattley@gmail.com>
To: ruby-talk@ruby-lang.org
Subject: Re: Unicode roadmap?

> Paul.
That will be quite _some_ libraries, I guess...
Posted by Paul Battley (Guest)
on 14.06.2006 11:12
(Received via mailing list)
On 14/06/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> > Just to chime in, aren't upcase, downcase, and capitalize a locale/
> > localization issue rather than a Unicode-only issue per se? For
> > example, different languages will have different rules for
> > capitalization.
>
> Really? I know about two cases: European capitalization and no
> capitalization.

There is variety even within western European languages - Dutch, for
example, differs from English (IJsselmeer).

Paul.
Posted by Victor Shepelev (Guest)
on 14.06.2006 11:16
(Received via mailing list)
From: Paul Battley [mailto:pbattley@gmail.com]
Sent: Wednesday, June 14, 2006 12:10 PM
> example, differs from English (IJsselmeer).
I had already realized that. (I mentioned Florian Gross: the final "ss" of
his surname is normally printed as something like "B", which I can't type
and my Outlook can't show :) AFAIK, it is printed as one letter in
lowercase and two letters in uppercase. So a "single general"
String#upcase or #downcase is totally impossible.

V.
Posted by Michal Suchanek (Guest)
on 14.06.2006 11:25
(Received via mailing list)
On 6/14/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |
> |The same is for Russians/Ukrainians. In our programming communities question
> |"does the programming language supports Unicode as 'native'?" has very high
> |priority.
>
> Alright, then what specific features are you (both) missing?  I don't
> think it is a method to get number of characters in a string.  It
> can't be THAT crucial.  I do want to cover "your missing features" in
> the future M17N support in Ruby.
>

What I want is all methods working seamlessly with unicode strings so
that I do not have to think about the encoding.

Regexps do work with utf-8 strings if KCODE is set to u (but it
defaults to n even when locale uses UTF-8).

String searches should probably work, but they would return the wrong
position.
Things like split should work for UTF-8; the encoding is pretty well
defined.

But one might want to use length and [] to work with strings.
It can be simulated with unicode_string=string.scan(/./). But it is no
longer a string. It is composed of characters only as long as I assign
only characters using []=.
The string functions should do the right thing even for utf-8. But I
guess utf-32 is more useful for working with strings this way.

It might be a good idea to stick encoding information into strings (it
is probably the only way how internationalization can be done and the
sanity of all involved preserved at the same time). The functions for
comparison, etc could use it to do the right thing even if strings
come in several encodings. ie. cp1251 from the system, utf-8 from a
web page, ...

Functions like open could convert the string correctly according to
locale. One should be able to set the encoding information (ie for web
page title when the meta tag for content type is found in a web
page),and remove it to suppress string conversion. It should be also
possible to convert the string (ie to UTF-32 to speed up character
access).

Things like <=>, upcase, downcase, etc make sense only in context of
locale (language). Only the encoding does not define them.
I guess the default <=> is based on the binary representation of the
string. This would mean different sorting of the same strings in
different encodings. Sorting by the unicode code point would be at
least the same for any encoding.

Thanks

Michal
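
The point about sort order depending on the encoding can be sketched as follows. This assumes Ruby 1.9 or later for String#encode; the two Russian words are only sample data.

```ruby
# Sketch only: assumes Ruby 1.9 or later. String#<=> compares bytes, so the
# result of sort depends on how the same text happens to be encoded.
words = ["ёж", "аист"]

# UTF-8 bytes sort in code point order: U+0430 comes before U+0451.
utf8_order = words.sort
# => ["аист", "ёж"]

# In Windows-1251 the first letters are bytes 0xB8 and 0xE0, so the order flips.
cp1251_words = words.map { |w| w.encode("Windows-1251") }
cp1251_order = cp1251_words.sort.map { |w| w.encode("UTF-8") }
# => ["ёж", "аист"]

# Sorting on code points gives one order whatever the storage encoding.
sorted = cp1251_words.sort_by { |w| w.encode("UTF-8").unpack("U*") }
by_codepoint = sorted.map { |w| w.encode("UTF-8") }
# => ["аист", "ёж"]
```

Code point order is stable across encodings, but it is still not a language-aware collation; that remains a locale question.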
Posted by Michal Suchanek (Guest)
on 14.06.2006 11:35
(Received via mailing list)
On 6/14/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> > capitalization.
>
> Really? I know about two cases: European capitalization and no
> capitalization.

Really.

There is no such thing as European capitalization. There is only
<insert your language> capitalization.
The German character ß has no uppercase version. In most languages
using Latin script the uppercase of 'i' is 'I'. But Turkish has both i
and a dotless ı, and the uppercase of 'i' is, of course, İ (I with dot).

Thanks

Michal
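
The German, Turkish, and Dutch cases raised in this part of the thread can be tried directly on a much later interpreter. This is a sketch assuming Ruby 2.4 or later, where String#upcase and #downcase apply Unicode case mappings and accept a :turkic option; none of this was available in the 1.8 and 1.9 versions under discussion.

```ruby
# Sketch only: assumes Ruby 2.4 or later.

german = "straße".upcase             # => "STRASSE", one letter becomes two
plain  = "i".upcase                  # => "I"
turk_u = "i".upcase(:turkic)         # => "İ" (U+0130, capital I with dot)
turk_d = "I".downcase(:turkic)       # => "ı" (U+0131, dotless i)

# There is no Dutch-specific rule: only the first character is capitalized,
# while Dutch spelling wants "IJsselmeer".
dutch  = "ijsselmeer".capitalize     # => "Ijsselmeer"
```

The caller has to ask for the Turkic rules explicitly, which matches the argument above that case mapping depends on the language and not only on the encoding.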
Posted by Paul Battley (Guest)
on 14.06.2006 11:41
(Received via mailing list)
On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> It should be also
> possible to convert the string (ie to UTF-32 to speed up character
> access).

utf8_string.unpack('U*') is pretty close to this, giving an array of
codepoints.

Paul.
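
A short sketch of the unpack('U*') round trip, using a made-up two-character sample string:

```ruby
# Decode a UTF-8 string into an array of code points and back again.
codepoints = "Я!".unpack("U*")             # => [1071, 33], U+042F and U+0021
round_trip = codepoints.pack("U*")         # => "Я!"

# Index-based work becomes simple on the array, for example reversing by
# character instead of by byte.
reversed = codepoints.reverse.pack("U*")   # => "!Я"
```

The array gives constant-time access to each character, but it is no longer a String, so none of the String methods apply to it.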
Posted by Michal Suchanek (Guest)
on 14.06.2006 12:54
(Received via mailing list)
On 6/14/06, Paul Battley <pbattley@gmail.com> wrote:
> On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> > It should be also
> > possible to convert the string (ie to UTF-32 to speed up character
> > access).
>
> utf8_string.unpack('U*') is pretty close to this, giving an array of codepoints.

But I want it to be a string after the conversion, so that I can use
the standard string functions with sane results. I do not want to
think about various encodings myself if my application has to use
them. The runtime should do that.

Thanks

Michal
Posted by Austin Ziegler (Guest)
on 14.06.2006 14:23
(Received via mailing list)
On 6/14/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> Oh, it's a bit hard theme for me. I know Windows XP must support Unicode
> file names; I see my filenames in Russian, but I have low knowledge of
> system internals to say, are they really Unicode?

They are UTF-16 internally. I haven't been paying attention to Ruby
1.9 lately, but when I have time and have noticed that Matz has
checked in support for m17n strings, I will be enhancing support for
Windows files to use Unicode. Currently, Ruby is built using the
non-Unicode form *only*. And no, using -DUNICODE is the *wrong*
answer, thanks. We'd have to start using TCHAR instead of char, and it
would actually mean that we'd be using wchar_t instead of char in this
case.

I've already done a similar (but more complex) project at work.

-austin
Posted by Austin Ziegler (Guest)
on 14.06.2006 14:29
(Received via mailing list)
On 6/14/06, Vincent Isambart <vincent.isambart@gmail.com> wrote:
> Windows XP does support Unicode file names, but I'm not sure you can
> use them with Ruby (I do not use Ruby much under Windows). Try
> converting the file names to your current locale, it should work if
> the file names can be converted to it. What I mean is that Russian
> file names encoded in the Windows Russian encoding should work on a
> Russian PC.

You can't currently use them with Ruby. The file operations in Ruby
are using the likes of CreateFileA instead of CreateFileW (it's not
that explicit; Ruby is compiled without -DUNICODE -- which is the
correct thing to do in Ruby's case -- which means that CreateFile is
CreateFileA).

All files are stored on the filesystem as UTF-16, though, even if you
are using "ANSI" access.

By the way, there are multiple Russian encodings, so ... Unicode is
better for this point. As I said in my previous message, I have
already planned to enhance the Windows filesystem support when Matz
gets the m17n strings in so that I can *always* force the file
routines on Windows to provide either UTF-8 or UTF-16 (probably the
former, since it will also make it easier to work with existing
extensions) and indicate that the strings are such.

-austin
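
To illustrate the UTF-8 to UTF-16 translation that the W functions would need, here is a minimal sketch. It assumes Ruby 1.9 or later for String#encode and only shows the conversion itself; it does not call any Windows API, and the file name is made up.

```ruby
# Sketch only: assumes Ruby 1.9 or later. Shows the byte-level conversion,
# not an actual call to CreateFileW.
utf8_name = "Я.txt"

# Windows wide-character APIs take UTF-16 in little-endian byte order.
wide = utf8_name.encode("UTF-16LE")

wide_size  = wide.bytesize                 # => 10, five 2-byte code units
first_unit = wide.bytes.first(2)           # => [0x2F, 0x04], U+042F low byte first
back       = wide.encode("UTF-8")          # => "Я.txt"
```

Unlike a code page conversion, this round trip cannot fail for valid UTF-8 input, since both encodings cover all of Unicode.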
Posted by Austin Ziegler (Guest)
on 14.06.2006 14:29
(Received via mailing list)
On 6/14/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Windows 32 path encoding is a nightmare.  Our Win32 maintainers often
> troubled by unexpected OS behavior.  I am sure we _can_ handle Russian
> path names, but we need help from Russian people to improve.

It's not that bad, Matz. I started as a Unix developer, but in the
last two years I have learned *quite* a bit about how Windows handles
this stuff and we can adapt what I did for work with no problem.

I just need M17N strings to support this. I should look at what I
can/should do to provide this as an extension, I just have no time. :(

-austin
Posted by Austin Ziegler (Guest)
on 14.06.2006 14:36
(Received via mailing list)
On 6/14/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> What I want is all methods working seamlessly with unicode strings so
> that I do not have to think about the encoding.

That will *never* happen. Even with Unicode, you have to think about
the encoding, because UTF-32 (the closest representation to the
Platonic ideal "Unicode" you'll ever find) is unlikely to be supported
in the general case. Matz's idea of m17n strings is the right one: you
have a "byte stream" and an attribute which indicates how the byte
stream is encoded. This will sort of be like $KCODE but on an
individual string level so that you could meaningfully have Unicode
(probably UTF-8) and ShiftJIS strings in the same data and still
meaningfully call #length on them.

You will *always* have to care about the encoding. As well as,
ultimately, your locale.

-austin
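
The "byte stream plus encoding attribute" model described above is essentially what later shipped. As a sketch, assuming Ruby 1.9 or later and a UTF-8 source file:

```ruby
# Sketch only: assumes Ruby 1.9 or later, where every String carries its
# own encoding alongside its bytes.
utf8 = "日本語"
sjis = utf8.encode("Shift_JIS")

utf8_chars = utf8.length        # => 3
sjis_chars = sjis.length        # => 3, same characters
utf8_bytes = utf8.bytesize      # => 9, three bytes each
sjis_bytes = sjis.bytesize      # => 6, two bytes each

# The two strings hold different bytes in different encodings, so they do
# not compare equal until one is converted.
same_raw       = (utf8 == sjis)                    # => false
same_converted = (sjis.encode("UTF-8") == utf8)    # => true
```

#length is meaningful for both strings, yet the caller still has to convert before comparing them, which is the "you will always have to care about the encoding" point.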
Posted by Randy Kramer (Guest)
on 14.06.2006 23:40
(Received via mailing list)
On Wednesday 14 June 2006 06:52 am, Michal Suchanek wrote:
> On 6/14/06, Paul Battley <pbattley@gmail.com> wrote:
> > On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> > > It should be also
> > > possible to convert the string (ie to UTF-32 to speed up character
> > > access).

(RE my previous post):  Oops, maybe UTF-32 is exactly what I was
alluding to?

Randy Kramer

(Should have waited a little longer before posting.)
Posted by Charles O Nutter (Guest)
on 15.06.2006 02:12
(Received via mailing list)
Every time these unicode discussions come up my head spins like a top.
You should see it.

We JRubyists have headaches from the unicode question too. Since JRuby is
currently 1.8-compatible, we do not have what most call *native* unicode
support. This is primarily because we do not wish to create an
incompatible version of Ruby or build in support for unicode now that
would conflict with Ruby 2.0 in the future. It is, however, embarrassing
to say that although we run on top of Java, which has arguably pretty good
unicode support, we don't support unicode. Perhaps you can see our
conundrum.

I am no unicode expert. I know that Java uses UTF16 strings internally,
converted to/from the current platform's encoding of choice by default. It
also supports converting those UTF16 strings into just about every
encoding out there, just by telling it to do so. Java supports the Unicode
specification version 3.0. So Unicode is not a problem for Java.

We would love to be able to support unicode in JRuby, but there's always
that nagging question of what it should look like and what would mesh well
with the Ruby community at large. With the underlying platform already
rich with unicode support, it would not take much effort to modify JRuby.
So then there's a simple question:

What form would you, the Ruby users, want unicode to take? Is there a
specific library that you feel encompasses a reasonable implementation of
unicode support, e.g. icu4r? Should the support be transparent, e.g. no
longer treat or assume strings are byte vectors? JRuby, because we use
Java's String, is already using UTF16 strings exclusively...however
there's no way to get at them through core Ruby APIs. What would be the
most comfortable way to support unicode now, considering where Ruby may go
in the future?
Posted by Charles O Nutter (Guest)
on 15.06.2006 02:22
(Received via mailing list)
I posted this to ruby-talk, but it occurred to me that you folks
implementing Rails functionality probably have a thing or two to say about
unicode support in Ruby. Therefore, I would love to hear your opinions.
Adding native unicode support is only a matter of time in JRuby; its
usefulness as a JVM-based language depends on it. However, we continue to
wrestle with how best to support unicode without stepping on the Ruby
community's toes in the process. Thoughts?

---------- Forwarded message ----------
From: Charles O Nutter <headius@headius.com>
Date: Jun 14, 2006 7:11 PM
Subject: Re: Unicode roadmap?
To: ruby-talk ML <ruby-talk@ruby-lang.org>

--
Charles Oliver Nutter @ headius.blogspot.com
JRuby Developer @ jruby.sourceforge.net
Application Architect @ www.ventera.com
Posted by Julian 'Julik' Tarkhanov (Guest)
on 15.06.2006 02:40
(Received via mailing list)
On 15-jun-2006, at 2:11, Charles O Nutter wrote:

> with unicode support, it would not take much effort to modify  
> JRuby. So then
> there's a simple question:

Yukihiro Matsumoto wrote:

>
> Define "proper Unicode support" first.
>
> I'm planning enhancing Unicode support in 1.9 in a year or so
> (finally).  But I'm not sure that conforms your definition of "proper
> Unicode support".  Note that 1.8 handles Unicode (UTF-8) if your
> string operations are based on Regexp.
>

Hello everyone, and sorry for chiming so fiercely. Got into some
confusion with the ML controls.

Just joined the list seeing the subject popping up once more. I am
doing Unicode-aware apps in Rails and Ruby right now and it hurts.
I'll try to define  "proper Unicode support" as I (dream of it at
night) see it.

1. All string indexing (length, index, slice, insert) works with
characters instead of bytes, whatever length in bytes the characters
have to be.
String methods (index or =~) should _never_ return offsets that will
damage the string's characters if employed for slicing - you
shouldn't have to manually translate the byte offset of 2 to
character offset of 1 because the second character is multibyte.

Simple example:

     def translate_offset(str, byte_offset)
       chunk = str[0..byte_offset]
       begin
         chunk.unpack("U*").length - 1
       rescue ArgumentError # this offset is just wrong! shift upwards and retry
         chunk = str[0..(byte_offset+=1)]
         retry
       end
     end

I think it's unnecessarily painful for something as easy as string
=~ /pattern/. Yes, you can take the offset you receive from =~, get
the slice of the string up to it and then split that slice with /./mu
to get the same number etc...

2. Case-insensitive regexes actually work. Even in my
Oniguruma-enabled builds of 1.8.2 this was not true (maybe changed
now). At least "Unicode general" collation casefolding (such a thing
exists) should be available built-in on every platform.
4. Locale-aware sorting, including multibyte charsets, if provided by
the OS
5. Preferably separate (and strictly purposed) Bytestring that you
get out of Sockets and use in Servers etc. - or the ability to
"force" all strings received from external resources to be flagged
uniformly as being of a certain encoding in _your_ program, not
somewhere in someone's library. If flags have to be set by libraries,
they won't be set because most developers sadly don't care:

http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
http://thraxil.org/users/anders/posts/2005/11/01/unicodification/

6. Unicode-aware strip dealing with weirdo whitespaces (hair space,
thin space etc.)
7. And no, as I mentioned - 1.8 doesn't handle Unicode properly even
through Regexp, because the /i modifier is broken, and to do without
it you need to downcase BOTH the regexp and the string itself. Closed
circle - you go and get the Unicode gem with tables.

All of this can be controlled either per String (then 99 out of 100
libraries I use will be getting it wrong - see above) or by a global
setting such as $KCODE.
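
To make points 1 and 6 a bit more concrete, here is a minimal sketch. It
assumes UTF-8 data and an encoding-aware Ruby (1.9 or later, which postdates
this post). The helper names char_offset and unicode_strip are hypothetical
and do not come from any library mentioned in this thread.

```ruby
# Hypothetical helpers for illustration only; both assume UTF-8 input.

# Translate a byte offset into a character offset by counting the bytes
# before it that start a character. UTF-8 continuation bytes have the
# bit pattern 10xxxxxx and are skipped.
def char_offset(str, byte_offset)
  str.bytes.first(byte_offset).count { |b| (b & 0xC0) != 0x80 }
end

# Strip leading and trailing whitespace, including Unicode spaces such as
# thin space (U+2009) and hair space (U+200A). The POSIX bracket
# [[:space:]] is used because \s only matches ASCII whitespace.
def unicode_strip(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

text     = "h\u00e9llo"            # "héllo": the "é" occupies two bytes
byte_pos = text.bytes.index(0x6C)  # byte offset of the first "l"
char_pos = char_offset(text, byte_pos)

padded   = "\u2009\u200A caf\u00e9 \u200A"
stripped = unicode_strip(padded)
```

Counting lead bytes avoids the rescue-and-retry loop, but it only makes
sense for UTF-8; other multibyte encodings would need their own rule.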

As an example of something that is ridiculously backwards to do in
Ruby now is this (I spent some time refactoring this today):
http://dev.rubyonrails.org/browser/trunk/actionpack/lib/action_view/helpers/text_helper.rb#L44

Here you have a major problem because the /i flag doesn't do anything
(Ruby is incapable of Unicode-aware casefolding), and using offsets
means that you are always one step from damaging someone's text. It's
just wrong that it has to be so painful.
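
For comparison, a rough sketch of what a highlight helper can look like once
the regexp engine does Unicode-aware case folding. This assumes an
encoding-aware Ruby (1.9 or later, not available at the time of this post)
and UTF-8 strings. The method below is a hypothetical stand-in, not the
Rails text_helper code, and it wraps matches in square brackets purely for
illustration.

```ruby
# Hypothetical stand-in for a highlight helper; assumes UTF-8 strings.
# The match is found case-insensitively and replaced through gsub, so no
# byte offsets are ever handled by hand.
def highlight(text, phrase)
  pattern = Regexp.new(Regexp.escape(phrase), Regexp::IGNORECASE)
  text.gsub(pattern) { |match| "[#{match}]" }
end

sample      = "\u00c9cole et \u00e9cole"          # "École et école"
highlighted = highlight(sample, "\u00e9cole")     # search for "école"
```

Because the replacement goes through gsub, the original casing of each match
is preserved and multibyte characters are never split.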

Python3000, IMO, gets this right (as does Java) - byte array and a
String are completely separate, and String operates with characters
and characters only.

That's what I would expect. Hope this makes sense somewhat :-)
--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl
Posted by Manfred Stienstra (Guest)
on 15.06.2006 02:40
(Received via mailing list)
On Jun 15, 2006, at 2:19 AM, Charles O Nutter wrote:

> I posted this to ruby-talk, but it occurred to me that you folks  
> implementing Rails functionality probably have a thing or two to  
> say about unicode support in Ruby. Therefore, I would love to hear  
> your opinions. Adding native unicode support is only a matter of  
> time in JRuby; its usefulness as a JVM-based language depends on  
> it. However, we continue to wrestle with how best to support  
> unicode without stepping on the Ruby community's toes in the  
> process. Thoughts?

Julik has done a lot of pioneering in that direction for Rails. His
latest suggestion is to use a proxy class on string objects to
perform unicode operations:

@some_unicode_string.u.length
@some_unicode_string.u.reverse

I tend to agree with this solution as it doesn't break any previous
string operations and gives us an easy way to perform unicode aware
operations.
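
A minimal sketch of how such a proxy could work, assuming UTF-8 data. The
class name UnicodeProxy and its three methods are hypothetical and are not
Julik's actual implementation; the sketch only shows the general idea of
leaving String untouched and doing character work on the side.

```ruby
# Hypothetical proxy, assuming the wrapped string holds UTF-8 data.
# It works on an array of code points instead of bytes.
class UnicodeProxy
  def initialize(str)
    @codepoints = str.unpack("U*")
  end

  # Number of code points, not bytes.
  def length
    @codepoints.length
  end

  # Reverse by code point, so multibyte characters stay intact.
  def reverse
    @codepoints.reverse.pack("U*")
  end

  # Character-based slice; returns an empty string when out of range.
  def slice(start, len)
    (@codepoints[start, len] || []).pack("U*")
  end
end

class String
  # Accessor returning the proxy; the String methods themselves are unchanged.
  def u
    UnicodeProxy.new(self)
  end
end

word     = "h\u00e9llo"         # "héllo"
length   = word.u.length
reversed = word.u.reverse
piece    = word.u.slice(1, 2)
```

Note that this reverses code points, not user-perceived characters, so
combining sequences would still come out wrong.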

Manfred
Posted by Charles O Nutter (Guest)
on 15.06.2006 03:52
(Received via mailing list)
I agree it's a very attractive solution. I have two questions related
(perhaps you are out there to answer, Julik):

1. How does performance look with the unicode string add-on versus native
strings?
2. Is this the ideal way to support unicode strings in ruby?

And I explain the second as follows....if we could assume switching from
treating a string as an array of bytes to a list of characters of
arbitrary width, and have all existing string operations work correctly
treating those characters as elements of the string, would that be a better
ideal? Where are the breaking points in such a design? What's to stop the
underlying implementation from actually using a UTF-16 character, passing
UTF-8 to libraries and IO streams but still allowing you to access
everything as UTF-16 or your encoding of choice? (Of course this is somewhat
rhetorical; we do this currently with JRuby since Java's strings are
UTF-16...we just don't have any way to provide access to UTF-16 characters,
and we normalize everything to UTF-8 for Ruby's sake...but what if we didn't
normalize and adjusted string functions to compensate?)
Posted by Julian 'Julik' Tarkhanov (Guest)
on 15.06.2006 04:17
(Received via mailing list)
On 15-jun-2006, at 3:50, Charles O Nutter wrote:

> operations work correctly treating those characters as string,  
> would that be a better ideal? Where are the breaking points in such  
> a design? What's to stop the underlying implementation from  
> actually using a UTF-16 character, passing UTF-8 to libraries and  
> IO streams but still allowing you to access everything as UTF-16 or  
> your encoding of choice? (Of course this is somewhat rhetorical; we  
> do this currently with JRuby since Java's strings are UTF-16...we  
> just don't have any way to provide access to UTF-16 characters, and  
> we normalize everything to UTF-8 for Ruby's sake...but what if we  
> didn't normalize and adjusted string functions to compensate?)

This is more appropriate for ruby-talk

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl
Posted by Charles O Nutter (Guest)
on 15.06.2006 04:24
(Received via mailing list)
I believe that Julik's way of solving the unicode problem (String#u
providing access to a unicode helper) is very attractive. I have two
questions related, for Julik and the rest of the peanut gallery:

1. How does performance look with the unicode string add-on versus native
strings (or as compared to icu4r, which is C-based)?
2. Is this the ideal way to support unicode strings in ruby?

And I explain the second as follows....if we could assume switching from
treating a string as an array of bytes to a list of characters of arbitrary
width, and have all existing string operations work correctly treating those
characters as indexed elements of that string, would that be a better ideal?
Where are the breaking points in such a design? What's to stop the
underlying implementation from actually using a UTF-16 character, passing
UTF-8 to libraries and IO streams but still allowing you to access
everything as UTF-16 or your encoding of choice? Is it simply libraries or
core APIs that explicitly need *byte* counts? (Of course this is somewhat
rhetorical; we do this currently with JRuby since Java's strings are
UTF-16...we just don't have any uniform way to provide access to UTF-16
character strings, and we normalize everything to UTF-8 for Ruby's
sake...but what if we didn't normalize and adjusted string functions to
compensate?)
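
To illustrate the question, here is a small sketch of the same text held in
two encodings. It assumes an encoding-aware Ruby (1.9 or later, which did not
exist when this was written) and says nothing about how JRuby stores strings
internally; the variable names are only for illustration.

```ruby
# The same characters in UTF-8 and UTF-16LE. Character-level answers
# agree; only byte-level answers differ.
utf8  = "h\u00e9llo \u{1F600}"   # 7 characters, the last one outside the BMP
utf16 = utf8.encode("UTF-16LE")

utf8_chars  = utf8.length        # counted in characters
utf16_chars = utf16.length       # a surrogate pair counts as one character
utf8_bytes  = utf8.bytesize
utf16_bytes = utf16.bytesize
round_trip  = utf16.encode("UTF-8")
```

Code that only asks character-level questions cannot tell the two apart;
code that needs byte counts (buffers, sockets, file offsets) can.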
Posted by Charles O Nutter (Guest)
on 15.06.2006 04:28
(Received via mailing list)
Fair enough; redirected. If any other rails-core folks want to chime in,
please do so...I would expect unicode and multibyte are key issues for
worldwide rails deployments.
Posted by Austin Ziegler (Guest)
on 15.06.2006 04:41
(Received via mailing list)
On 6/14/06, Charles O Nutter <headius@headius.com> wrote:
> I believe that Julik's way of solving the unicode problem (String#u
> providing access to a unicode helper) is very attractive. I have two
> questions related, for Julik and the rest of the peanut gallery:

> 1. How does performance look with the unicode string add-on versus native
> strings (or as compared to icu4r, which is C-based)?
> 2. Is this the ideal way to support unicode strings in ruby?

No. In fact, I believe that Matz has the right idea for M17N strings
in Ruby 2.0. The *reality* is that there's a *lot* of data out there
that isn't Unicode.

I would suggest that JRuby could offer a JavaString that acts in every
way like a String except that it provides access to the native UTF-16
implementation.

-austin
Posted by Julian 'Julik' Tarkhanov (Guest)
on 15.06.2006 04:55
(Received via mailing list)
On 15-jun-2006, at 4:40, Austin Ziegler wrote:

> No. In fact, I believe that Matz has the right idea for M17N strings
> in Ruby 2.0. The *reality* is that there's a *lot* of data out there
> that isn't Unicode.

It's very difficult for me to understand the implementation. What if
we concat a Mojikyo string to a UTF8String? UnicodeDecodeError,
ordinal not in range?
I think Python folks proved that it's terrible (it is).
Nothing is ideal.

> I would suggest that JRuby could offer a JavaString that acts in every
> way like a String except that it provides access to the native UTF-16
> implementation.

Just what the ICU4R extension does. It's unusable to the point that
you cannot concat a native string with a UString.
To the point that you have to use special Regexp class for it. You
end up having half of your Ruby script doing typecasting from one to
the other.

There is a lot of data that isn't Unicode, indeed. It gets converted on
input and converted on output if necessary - just as in any other case
when the encoding of your system doesn't match your input or output. I
don't know if it can be possible to have the "internal" encoding of a
system switchable (seems to me this is what Matz wants) - then you can't
safely refer to anything other than bytes. And then you get software
that you can't use, because its authors had a different assumption than
you had as to what encoding the user will be using.
Posted by PJ Hyett (Guest)
on 15.06.2006 05:01
(Received via mailing list)
On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:
> in Ruby 2.0. The *reality* is that there's a *lot* of data out there
> that isn't Unicode.

Yes, we all understand that Ruby 2.0 will be the coolest thing since
sliced bread, but those of us that are currently developing
international websites with Rails don't have the luxury of waiting
until Christmas of 2007.

-PJ Hyett
http://pjhyett.com
Posted by Austin Ziegler (Guest)
on 15.06.2006 05:10
(Received via mailing list)
On 6/14/06, PJ Hyett <pjhyett@gmail.com> wrote:
> > that isn't Unicode.
> Yes, we all understand that Ruby 2.0 will be the coolest thing since
> sliced bread, but those of us that are currently developing
> international websites with Rails don't have the luxury of waiting
> until Christmas of 2007.

*shrug*

As far as I can tell, there will be no implementation of Ruby before
then that has a "native" m17n string.

So whether you have the luxury of waiting or not, Ruby 1.8.x will not
*ever* have a "Unicode string".

Adding a "Unicode string" would *break* behaviour, and no example is
better than the extension that was proposed which would change the
meaning of #size and #length to mean two different things.

So, there's a point where patience is going to be necessary, whether
you "have the luxury" or not.

-austin
Posted by Dmitry Severin (Guest)
on 15.06.2006 10:47
(Received via mailing list)
IIRC, Matz has said that internally String won't change, and I suspect that
a CharString class (or something like it) won't ever be added.

Maybe just introducing a String#encoding flag and adding new methods to String
with prefixes, like char_array, char_slice, char_length, char_index,
char_downcase, char_strcoll, char_strip, etc., would be the way to keep older
code working and enjoy M17N? These would internally look at the encoding flag
and process the bytes of that particular string accordingly, without conversion
(except maybe some hidden one), leaving the old byte-processing methods intact.

Though, as for me, it is still unclear what should happen if one tries to
perform an operation on two strings with different String#encoding...
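
For reference, a sketch of how this question plays out on Ruby releases that
later shipped per-string encodings (1.9 and onward, which postdate this
post). It is only one possible answer to the question, not something that
was decided in this thread.

```ruby
# Two strings with the same characters but different encoding tags.
utf8   = "caf\u00e9"                        # UTF-8, 5 bytes
latin1 = "caf\u00e9".encode("ISO-8859-1")   # ISO-8859-1, 4 bytes

# Concatenating them directly is refused when both contain non-ASCII data.
outcome =
  begin
    utf8 + latin1
    :joined
  rescue Encoding::CompatibilityError
    :incompatible
  end

# Converting one side explicitly makes the operation well defined.
joined = utf8 + latin1.encode("UTF-8")
```

The trade-off is that the caller has to decide on a common encoding
whenever differently tagged strings meet.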
Posted by Michal Suchanek (Guest)
on 15.06.2006 13:02
(Received via mailing list)
On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:
> individual string level so that you could meaningfully have Unicode
> (probably UTF-8) and ShiftJIS strings in the same data and still
> meaningfully call #length on them.
>
> You will *always* have to care about the encoding. As well as,
> ultimately, your locale.

No. Since I have a locale, stdin can be marked with the proper encoding
information, so that all strings originating there carry the proper
encoding information.

The string methods should not just blindly operate on bytes but use
the encoding information to operate on characters rather than bytes.
Sure something like byte_length is needed when the string is stored
somewhere outside Ruby but standard string methods should work with
character offsets and characters, not byte offsets nor bytes.
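
A small sketch of that split between character-level and byte-level access,
assuming an encoding-aware Ruby (1.9 or later, released after this post).
There the byte-oriented method is called bytesize rather than the
byte_length name used above.

```ruby
word = "na\u00efve"              # "naïve": the "ï" takes two bytes in UTF-8

char_count  = word.length        # characters
byte_count  = word.bytesize      # bytes, for storage outside Ruby
third_char  = word[2]            # indexing by character offset
third_byte  = word.getbyte(2)    # explicit byte access when it is needed
first_three = word[0, 3]         # slicing never cuts a character in half
```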

Since my stdout can be also marked with correct encoding the strings
that are output there can be converted to that encoding. Even if it
originates from a source file that happens to be in a different
encoding.
Hmm, perhaps it will be necessary to mark source files with encoding
tags as well. It could be quite tedious to assign the tag manually to
every string in a source file.

When strings are compared, concatenated, .. the encoding is known so
the methods should do the right thing.

I do not have to care about encoding. You may make a string
implementation that forces me to care (such as the current one). But I
do not have to. I can always turn to perl if I get really desperate.

Thanks

Michal
Posted by Michal Suchanek (Guest)
on 15.06.2006 13:22
(Received via mailing list)
On 6/15/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
>

> 5. Preferably separate (and strictly purposed) Bytestring that you
> get out of Sockets and use in Servers etc. - or the ability to
> "force" all strings received from external resources to be flagged
> uniformly as being of a certain encoding in _your_ program, not
> somewhere in someone's library. If flags have to be set by libraries,
> they won't be set because most developers sadly don't care:
>
> http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
> http://thraxil.org/users/anders/posts/2005/11/01/unicodification/

Where else should the strings be flagged? If you get a web page
through http request, and the library parses the response for you, it
should set the encoding on the web page. You would never know since you
only received the page, not the header.

> setting such as $KCODE.
I do not see why libraries should be always wrong. After all, you can
always fix them. And setting the encoding globally is a bad thing. You
cannot have strings encoded in different encodings in one process
then. It looks quite limiting. For one, the web pages that you get
from various servers (and even the same server) can be in various
encodings.

Thanks

Michal
Posted by Julian 'Julik' Tarkhanov (Guest)
on 15.06.2006 13:51
(Received via mailing list)
On 15-jun-2006, at 13:21, Michal Suchanek wrote:

>> http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
>> http://thraxil.org/users/anders/posts/2005/11/01/unicodification/
>
> Where else should the strings be flagged?
They should not be flagged, because some strings will be flagged and
some won't, and exactly in the wrong places at the wrong time. See
_is_utf8_ in Perl to witness the terrible ugliness of this.

> If you get a web page
> through http request, and the library parses the response for you, it
> should set encoding on the web page. You would never know since you
> only received the page, not the header.

That's why you should distinguish between a ByteArray and a String.

>> libraries I use will be getting it wrong - see above) or by a global
>> setting such as $KCODE.
>
> I do not see why libraries should be always wrong. After all, you can
> always fix them. And setting the encoding globally is a bad thing. You
> cannot have strings encoded in different encodings in one process
> then. It looks quite limiting. For one, the web pages that you get
> from various servers (and even the same server) can be in various
> encodings.

Of course they can (and will). When I have to approach this I usually
just sniff the encoding of the strings I received and then feed them
to iconv and friends before doing any processing. A library that
downloads stuff off the Internet should be (IMO) aware of
the charset madness and decode the strings for me.

Trust me, when multibyte/Unicode handling is optional, 80% of
libraries do it wrong. Re-read the links above if you don't believe.

Actually it seems that the solution with an accessor is quite nice,
but that I had to figure out the hard way after breaking the String
class with my hacks and seeing stuff collapse. Apparently the poster of
a parallel thread finds it inspiring to repeat my experiment _in vitro_
just for the academic sake of it.
Posted by Michal Suchanek (Guest)
on 15.06.2006 15:13
(Received via mailing list)
On 6/15/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> >> somewhere in someone's library. If flags have to be set by libraries,
> >> they won't be set because most developers sadly don't care:
> >>
> >> http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
> >> http://thraxil.org/users/anders/posts/2005/11/01/unicodification/
> >
> > Where else should the strings be flagged?
> > They should not be flagged, because some strings will be flagged and
> > some won't and exactly
> > in the wrong places at the wrong time. See _is_utf8_ in Perl to
> > witness the terrible ugliness of this.

You can certainly get things wrong. But if you get a string that
is wrongly flagged you have the choice to fix the code where the
string originates or work around it by flagging it right.
If you have code that gets the encoding wrong, and it tries to
convert the string to some 'universal' encoding you want to use
everywhere in your application, you get a broken string.

>
> > If you get a web page
> > through http request, and the library parses the response for you, it
> > should set encoding on the web page. You would never know since you
> > only received the page, not the header.
>
> That's why you should distinguish between a ByteArray and a String.

How does it help you here?

> >> All of this can be controlled either per String (then 99 out of 100
> Of course they can (and will). When I have to approach this I usually
> just sniff the encoding of the strings I received and then feed them
> to iconv and friends before doing any processing. A library that
> downloads stuff off the Internet should be (IMO) aware of
> the charset madness and decode the strings for me.

If it can decode them, it can flag them. It has to be aware - that's it.

>
> Trust me, when multibyte/Unicode handling is optional, 80% of
> libraries do it wrong. Re-read the links above if you don't believe.

But they get the very foundation wrong. In Python, functions that take
multiple strings can only take them in one encoding. It is impossible
to concatenate differently encoded strings. Of course, this is bound
to fail.
In the other case they use a database with poor support for unicode,
and mysql that does exactly the same thing ruby does right now - works
with strings as arrays of bytes. Of course, this is going to break.

Neither is the case when the strings carry information about their
encoding, and the string functions can handle strings encoded
differently.

The fact that there are libraries and languages with poor unicode
support does not mean it must be always poor.

Thanks

Michal
Posted by Juergen Strobel (Guest)
on 17.06.2006 13:11
(Received via mailing list)
On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal Suchanek wrote:
> >stream is encoded. This will sort of be like $KCODE but on an
> 
> The string methods should not just blindly operate on bytes but use
> the encoding information to operate on characters rather than bytes.
> Sure something like byte_length is needed when the string is stored
> somewhere outside Ruby but standard string methods should work with
> character offsets and characters, not byte offsets nor bytes.

I emphatically agree. I'll even repeat and propose a new Plan for
Unicode Strings in Ruby 2.0 in 10 points:

1. Strings should deal in characters (code points in Unicode) and not
in bytes, and the public interface should reflect this.

2. Strings should neither have an internal encoding tag, nor an
external one via $KCODE. The internal encoding should be encapsulated
by the string class completely, except for a few related classes which
may opt to work with the gory details for performance reasons.
The internal encoding has to be decided, probably between UTF-8,
UTF-16, and UTF-32 by the String class implementor.

3. Whenever Strings are read or written to/from an external source,
their data needs to be converted. The String class encapsulates the
encoding framework, likely with additional helper Modules or Classes
per external encoding. Some methods take an optional encoding
parameter, like #char(index, encoding=:utf8), or
#to_ary(encoding=:utf8), which can be used as helper Class or Module
selector.

4. IO instances are associated with a (modifiable) encoding. For
stdin, stdout this can be derived from the locale settings. String-IO
operations work as expected.

5. Since the String class is quite smart already, it can implement
generally useful and hard (in the domain of Unicode) operations like
case folding, sorting, comparing etc.

6. More exotic operations can easily be provided by additional
libraries because of Ruby's open classes. Those operations may be
coded depending on String's public interface for simplicity, or
work with the internal representation directly for performance.

7. This approach leaves open the possibility of String subclasses
implementing different internal encodings for performance/space
tradeoff reasons which work transparently together (a bit like Fixnum
and Bignum).

8. Because Strings are tightly integrated into the language with the
source reader and are used pervasively, much of this cannot be
provided by add-on libraries, even with open classes. Therefore the
need to have it in Ruby's canonical String class. This will break some
old uses of String, but now is the right time for that.

9. The String class does not worry over character representation
on-screen, the mapping to glyphs must be done by UI frameworks or the
terminal attached to stdout.

10. Be flexible. <placeholder for future idea>
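
Points 3 and 4 can be sketched roughly as follows. This is not the API
proposed above; it uses the per-IO encoding options that encoding-aware Ruby
releases (1.9 and later, after this post) ended up providing, and the
temporary file name is just a placeholder.

```ruby
require "tempfile"

text    = "Gr\u00fc\u00dfe"   # "Grüße", held as UTF-8 in memory
raw     = nil
decoded = nil

Tempfile.create("encoding-demo") do |tmp|
  # Writing: the IO is tagged ISO-8859-1, so the text is converted on output.
  File.open(tmp.path, "w:ISO-8859-1") { |out| out.write(text) }

  # The bytes on disk, read back without any conversion.
  raw = File.binread(tmp.path)

  # Reading: external encoding ISO-8859-1, converted to UTF-8 on input.
  decoded = File.open(tmp.path, "r:ISO-8859-1:UTF-8") { |io| io.read }
end
```

The conversion happens only at the IO boundary; the program itself sees the
same characters before writing and after reading.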


This approach has several advantages and a few disadvantages, and I'll
try to bring in some new angles to this now too:


*Advantages*

-POL, Encapsulation-

All Strings behave exactly the same everywhere, are predictable,
and do the hard work for their users.

-Cross Library Transparency-

No String user needs to worry which Strings to pass to a library, or
worry which Strings he will get from a library. With Web-facing
libraries like rails returning encoding-tagged Strings, you would be
likely to get Strings of all possible encodings otherwise, and is the
String user prepared to deal with this properly?  This is a *big* deal
IMNSHO.

-Limited Conversions-

Encoding conversions are limited to the time Strings are created or
written or explicitly transformed to an external representation.

-Correct String Operations-

Even basic String operations are very hard in the world of Unicode. If
we leave the String users to look at the encoding tags and sort it out
themselves, they are bound to make mistakes because they don't care,
don't know, or have no time. And these mistakes may be _security_
_sensitive_, since most often credentials are represented as Strings
too. There already have been exploits related to Unicode.


*Disadvantages* (with mitigating reasoning of course)

- String users need to learn that #byte_length(encoding=:utf8) >=
#size, but that's not too hard, and applies everywhere. Users do not
need to learn about an encoding tag, which is surely worse to handle
for them.

- Strings cannot be used as simple byte buffers any more. Either use
an array of bytes, or an optimized ByteBuffer class. If you need
regular expression support, Regexp can be extended for ByteBuffers or
even more.

- Some String operations may perform worse than might be expected from
a naive user, in both the time or space domain. But we do this so the
String user doesn't need to himself, and are probably better at it
than the user too.

- For very simple uses of String, there might be unnecessary
conversions. If a String is just to be passed through somewhere,
without inspecting or modifying it at all, in- and outwards conversion
will still take place. You could and should use a ByteBuffer to avoid
this.

- This ties Ruby's String to Unicode. A safe choice IMHO, or would we
really consider something else? Note that we don't commit to a
particular encoding of Unicode strongly.

- More work and time to implement. Some could call it
over-engineered. But it will save a lot of time and troubles when shit
hits the fan and users really do get unexpected foreign characters in
their Strings. I could offer help implementing it, although I have
never looked at ruby's source, C-extensions, or even done a lot of
ruby programming yet.


Close to the start of this discussion Matz asked what the problem with
current strings really was for western users. Somewhere later he
concluded case folding. I think it is more than that: we are lazy and
expect character handling to be always as easy as with 7 bit ASCII, or
as close as possible. Fixed 8-bit codepages worked quite fine most of
the time in this regard, and breakage was limited to special
characters only.

Now let's ask the question in reverse: are eastern programmers so used
to doing elaborate byte-stream to character handling by hand they
don't recognize how hard this is any more? Surely it is a target for
DRY if I ever saw one. Or are there actual problems not solvable this
way? I looked up the mentioned Han-Unification issue, and as far as I
understood this could be handled by future Unicode revisions
allocating more characters, outside of Ruby, but I don't see how it
requires our Strings to stay dumb byte buffers.

Jürgen
Posted by Stefan Lang (Guest)
on 17.06.2006 15:51
(Received via mailing list)
On Saturday 17 June 2006 13:08, Juergen Strobel wrote:
> On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal Suchanek wrote:
[...]
> > The string methods should not just blindly operate on bytes but
> > use the encoding information to operate on characters rather than
> > bytes. Sure something like byte_length is needed when the string
> > is stored somewhere outside Ruby but standard string methods
> > should work with character offsets and characters, not byte
> > offsets nor bytes.
>
> I emphatically agree. I'll even repeat and propose a new Plan for
> Unicode Strings in Ruby 2.0 in 10 points:

Juergen, I agree with most of what you have written. I will
add my thoughts.

> 1. Strings should deal in characters (code points in Unicode) and
> not in bytes, and the public interface should reflect this.
>
> 2. Strings should neither have an internal encoding tag, nor an
> external one via $KCODE. The internal encoding should be
> encapsulated by the string class completely, except for a few
> related classes which may opt to work with the gory details for
> performance reasons. The internal encoding has to be decided,
> probably between UTF-8, UTF-16, and UTF-32 by the String class
> implementor.

Full ACK. Ruby programs shouldn't need to care about the
*internal* string encoding. External string data is treated as
a sequence of bytes and is converted to Ruby strings through
an encoding API.

> 3. Whenever Strings are read or written to/from an external source,
> their data needs to be converted. The String class encapsulates the
> encoding framework, likely with additional helper Modules or
> Classes per external encoding. Some methods take an optional
> encoding parameter, like #char(index, encoding=:utf8), or
> #to_ary(encoding=:utf8), which can be used as helper Class or
> Module selector.

I think the encoding/decoding API should be separated from the
String class. IMO, the most important change is to strictly
differentiate between arbitrary binary data and character
data. Character data is represented by an instance of the
String class.

I propose adding a new core class, maybe call it ByteString
(or ByteBuffer, or Buffer, whatever) to handle strings of
bytes.

Given a specific encoding, the encoding API converts
ByteStrings to Strings and vice versa.

This could look like:

    my_character_str = Encoding::UTF8.encode(my_byte_buffer)
    buffer = Encoding::UTF8.decode(my_character_str)
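
As a rough illustration of the byte/character boundary described above, here is a minimal sketch. `Utf8Codec`, `bytes_to_codepoints` and `codepoints_to_bytes` are hypothetical names invented for this example; they are not part of Ruby or of the proposal. Only `String#unpack` and `Array#pack` with the `U*` directive are real Ruby calls.

```ruby
# Hypothetical codec module: converts between raw UTF-8 bytes and a
# list of Unicode code points. The module and method names are made up
# for illustration; only pack/unpack with "U*" are real Ruby calls.
module Utf8Codec
  # raw bytes (a String used as a byte buffer) -> array of code points
  def self.bytes_to_codepoints(byte_buffer)
    byte_buffer.unpack("U*")
  end

  # array of code points -> String holding the UTF-8 bytes
  def self.codepoints_to_bytes(codepoints)
    codepoints.pack("U*")
  end
end

raw        = "Fran\303\247ais"                      # 9 bytes of UTF-8 data
codepoints = Utf8Codec.bytes_to_codepoints(raw)     # 8 code points
round_trip = Utf8Codec.codepoints_to_bytes(codepoints)
```

The point of the sketch is only that conversion happens at one well-defined place; everything between the two calls works on code points and never sees bytes.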

> 4. IO instances are associated with a (modifiable) encoding. For
> stdin, stdout this can be derived from the locale settings.
> String-IO operations work as expected.

I propose one of:

1) A low level IO API that reads/writes ByteBuffers. String IO
   can be implemented on top of this byte-oriented API.

   The basic binary IO methods could look like:

   binfile = BinaryIO.new("/some/file", "r")
   buffer = binfile.read_buffer(1024) # read 1K of binary data

   binfile = BinaryIO.new("/some/file", "w")
   binfile.write_buffer(buffer) # Write the byte buffer

   The standard File class (or IO module, whatever) has an
   encoding attribute. The default value is set by the
   constructor by querying OS settings (on my Linux system
   this could be $LANG):

   # read strings from /some/file, assuming it is encoded
   # in the system's default encoding.
   text_file = File.new("/some/file", "r")
   contents = text_file.read

   # alternatively one can explicitly set an encoding before
   # the first read/write:
   text_file = File.new("/some/file", "r")
   text_file.encoding = Encoding::UTF8

   The File class (or IO module) will probably use a BinaryIO
   instance internally.

2) The File class/IO module as of current Ruby just gets
   additional methods for binary IO (through ByteBuffers) and
   an encoding attribute. The methods that do binary IO don't
   need to care about the encoding attribute.

I think 1) is cleaner.
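
For comparison, later Ruby versions (1.9 and up) expose both a byte-oriented read and an encoding-aware read on the ordinary File class, which is arguably closer to option 2. A minimal sketch using only real standard-library calls; the temporary file and its contents are made up for the example.

```ruby
require "tempfile"

raw = text = nil

Tempfile.create("m17n-demo") do |tmp|
  tmp.binmode
  tmp.write("Fran\xC3\xA7ais".b)   # write 9 raw bytes
  tmp.flush

  # Byte-oriented read: no encoding interpretation at all.
  raw = File.binread(tmp.path)

  # Text read: the IO is told which encoding the bytes are in.
  text = File.open(tmp.path, "r:UTF-8") { |f| f.read }
end

puts raw.bytesize    # size in bytes
puts text.length     # size in characters
```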

> 5. Since the String class is quite smart already, it can implement
> generally useful and hard (in the domain of Unicode) operations
> like case folding, sorting, comparing etc.

If the strings are represented as a sequence of Unicode
codepoints, it is possible for external libraries to implement
more advanced Unicode operations.

Since IMO a new "character" class would be overkill, I propose
that the String class provides codepoint-wise iteration (and
indexing) by representing a codepoint as a Fixnum. AFAIK a
Fixnum consists of 31 bits on a 32 bit machine, which is
enough to represent the whole range of unicode codepoints.
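
A quick sketch of the arithmetic behind that claim, assuming a Ruby that understands `\u` escapes (1.9 or later). The sample string is arbitrary; the Fixnum limit shown is the 32-bit one mentioned above.

```ruby
str = "Fran\u00e7ais \u{1D11E}"   # last character is U+1D11E, beyond 16 bits

# Each code point comes back as a plain Integer.
codepoints = str.unpack("U*")

max_codepoint = 0x10FFFF          # highest code point Unicode defines
fixnum_max_32 = 2**30 - 1         # largest Fixnum on a 32-bit build
                                  # (31 bits including the sign)

fits = codepoints.all? { |cp| cp <= max_codepoint } &&
       max_codepoint <= fixnum_max_32
```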

> 6. More exotic operations can easily be provided by additional
> libraries because of Ruby's open classes. Those operations may be
> coded depending on String's public interface for simplicity,
> or work with the internal representation directly for performance.
>
> 7. This approach leaves open the possibility of String subclasses
> implementing different internal encodings for performance/space
> tradeoff reasons which work transparently together (a bit like
> FixInt and BigInt).

I think providing different internal String representations
would be too much work, especially for maintenance in the long
run.

> 10. Be flexible. <placeholder for future idea>

The advantages of this proposal over the current situation and
tagging a string with an encoding are:

* There is only one internal string (where string means a
  string of characters) representation. String operations
  don't need to be written for different encodings.

* No need for $KCODE.

* Higher abstraction.

* Separation of concerns. I always found it strange that most
  dynamic languages simply mix handling of character and
  arbitrary binary data (just think of pack/unpack).

* Reading of character data in one encoding and representing
  it in other encoding(s) would be easy.

It seems that the main argument against using Unicode strings
in Ruby is because Unicode doesn't work well for eastern
countries. Perhaps there is another character set that works
better that we could use instead of Unicode. The important
point here is that there is only *one* representation of
character data in Ruby.

If Unicode is chosen as character set, there is the
question which encoding to use internally. UTF-32 would be a
good choice with regards to simplicity in implementation,
since each codepoint takes a fixed number of bytes. Consider
indexing of Strings:

        "some string"[4]

If UTF-32 is used, this operation can internally be
implemented as a simple, constant array lookup. If UTF-16 or
UTF-8 is used, this is not possible to implement as an array
lookup, since any codepoint before the fifth could occupy more
than one (8 bit or 16 bit) unit. Of course there is the
argument against UTF-32 that it takes too much memory. But I
think that most text-processing done in Ruby spends much more
memory on other data structures than in actual character data
(just consider an REXML document), but I haven't measured that
;)
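
The difference between the two lookups can be sketched as follows. Both helper methods are hypothetical and written only for this illustration; they use modern Ruby's `String#encode`, `byteslice` and `getbyte`, and assume well-formed input.

```ruby
text = "na\u00efve caf\u00e9"

# UTF-32: every code point is exactly 4 bytes, so index i lives at byte 4*i.
utf32 = text.encode("UTF-32LE")

def utf32_codepoint_at(buf, i)
  buf.byteslice(4 * i, 4).unpack("V").first
end

# UTF-8: code points are 1 to 4 bytes wide, so we have to walk from the start.
def utf8_codepoint_at(buf, i)
  pos = 0
  i.times do
    lead = buf.getbyte(pos)
    pos += if    lead < 0x80 then 1
           elsif lead < 0xE0 then 2
           elsif lead < 0xF0 then 3
           else                   4
           end
  end
  buf.byteslice(pos..-1).unpack("U").first
end
```

The memory side of the trade-off shows up in the same example: `utf32.bytesize` is 40 for these ten characters, while `text.bytesize` is 12.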

An advantage of using UTF-8 would be that for pure ASCII files
no conversion would be necessary for IO.

Thank you for reading so far. Just in case Matz decides to
implement something similar to this proposal, I am willing to
help with Ruby development (although I don't know much about
Ruby's internals and not too much about Unicode either).

I do not have a CS degree and I'm not a Unicode expert, so
perhaps the proposal is garbage, in this case please tell me
what is wrong about it or why it is not realistic to implement
it.
Posted by Austin Ziegler (austin)
on 17.06.2006 15:54
(Received via mailing list)
On 6/17/06, Juergen Strobel <strobel@secure.at> wrote:
> I emphatically agree. I'll even repeat and propose a new Plan for
> Unicode Strings in Ruby 2.0 in 10 points:
>
> 1. Strings should deal in characters (code points in Unicode) and not
> in bytes, and the public interface should reflect this.

Agree, mostly. Strings should have a way to indicate the buffer size of
the String.

> 2. Strings should neither have an internal encoding tag, nor an
> external one via $KCODE. The internal encoding should be encapsulated
> by the string class completely, except for a few related classes which
> may opt to work with the gory details for performance reasons.
> The internal encoding has to be decided, probably between UTF-8,
> UTF-16, and UTF-32 by the String class implementor.

Completely disagree. Matz has the right choice on this one. You can't
think in just terms of a pure Ruby implementation -- you *must* think
in terms of the Ruby/C interface for extensions as well.

> 3. Whenever Strings are read or written to/from an external source,
> their data needs to be converted. The String class encapsulates the
> encoding framework, likely with additional helper Modules or Classes
> per external encoding. Some methods take an optional encoding
> parameter, like #char(index, encoding=:utf8), or
> #to_ary(encoding=:utf8), which can be used as helper Class or Module
> selector.

Conversion should be possible at any time. An "external source" may be
an extension that your Ruby program can't distinguish. Again, this point
fails because your #2 is unacceptable.

> 4. IO instances are associated with a (modifiable) encoding. For
> stdin, stdout this can be derived from the locale settings. String-IO
> operations work as expected.

Agree, realising that the internal implementation of String must be
completely different than you've suggested. It is also important to
retain *raw* reading; a JPEG should not be interpreted as Unicode.

> 5. Since the String class is quite smart already, it can implement
> generally useful and hard (in the domain of Unicode) operations like
> case folding, sorting, comparing etc.

Agreed, but this would be expected regardless of the actual encoding of
a String.

> 6. More exotic operations can easily be provided by additional
> libraries because of Ruby's open classes. Those operations may be
> coded depending on String's public interface for simplicity, or
> work with the internal representation directly for performance.

Agreed.

> 7. This approach leaves open the possibility of String subclasses
> implementing different internal encodings for performance/space
> tradeoff reasons which work transparently together (a bit like FixInt
> and BigInt).

Um. Disagree. Matz's proposed approach does this; yours does not. Yours,
in fact, makes things *much* harder.

> 8. Because Strings are tightly integrated into the language with the
> source reader and are used pervasively, much of this cannot be
> provided by add-on libraries, even with open classes. Therefore the
> need to have it in Ruby's canonical String class. This will break some
> old uses of String, but now is the right time for that.

"Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1.

> 9. The String class does not worry over character representation
> on-screen, the mapping to glyphs must be done by UI frameworks or the
> terminal attached to stdout.

The String class doesn't worry about that now.

> 10. Be flexible. <placeholder for future idea>

And little is more flexible than Matz's m17n String.

> This approach has several advantages and a few disadvantages, and I'll
> try to bring in some new angles to this now too:
>
> *Advantages*
>
> -POL, Encapsulation-
>
> All Strings behave exactly the same everywhere, are predictable,
> and do the hard work for their users.

Remember: POLS is not an acceptable reason for anything. Matz's m17n
Strings would be predictable, too. a + b would be possible if and only
if a and b are the same encoding or one of them is "raw" (which would
mean that the other is treated as the defined encoding) *or* there is a
built-in conversion for them.
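
Later Ruby versions (1.9 and up) enforce a related, though not identical, rule on concatenation: strings combine if they share an encoding or if one side is ASCII-only, and otherwise an exception is raised rather than a conversion attempted. A small sketch using the real `Encoding.compatible?` and `Encoding::CompatibilityError`; the sample strings are made up.

```ruby
utf8   = "caf\u00e9"                       # UTF-8, contains a non-ASCII character
latin1 = "caf\u00e9".encode("ISO-8859-1")  # same text, different encoding
ascii  = "abc".encode("ISO-8859-1")        # ASCII-only content

same_encoding_ok = (utf8 + utf8).length == 8
ascii_only_ok    = !Encoding.compatible?(utf8, ascii).nil?

# Two non-ASCII strings in different encodings do not mix silently.
mixed_error = begin
  utf8 + latin1
  nil
rescue Encoding::CompatibilityError => e
  e
end

# An explicit conversion makes the concatenation legal again.
converted = utf8 + latin1.encode("UTF-8")
```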

> -Cross Library Transparency-
> No String user needs to worry which Strings to pass to a library, or
> worry which Strings he will get from a library. With Web-facing
> libraries like rails returning encoding-tagged Strings, you would be
> likely to get Strings of all possible encodings otherwise, and is the
> String user prepared to deal with this properly?  This is a *big* deal
> IMNSHO.

This will be true with m17n strings. However, your proposal does *not*
work for Ruby/C interfaced items. Sorry.

> -Limited Conversions-
>
> Encoding conversions are limited to the time Strings are created or
> written or explicitly transformed to an external representation.

This is a mistake. I may need to know the internal representation of a
particular encoding of a String inside of a program. Trust me on this
one: I *have* done some low-level encoding work. Additionally, even
though I might have marked a network object as "UTF-8", I may not know
whether it's *actually* UTF-8 or not until I get HTTP headers -- or
worse, a <meta http-equiv> tag. Assuming UTF-8 reading in today's world
is doomed to failure.
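
A sketch of that situation in modern Ruby terms: the bytes are read raw, and only once the header has been parsed can they be tagged with an encoding. The response text and the UTF-8 fallback are assumptions made up for this example; `String#b`, `force_encoding` and `encode` are real methods in Ruby 2.0 and later.

```ruby
# Bytes as they might arrive from a socket: headers first, body after.
response = "Content-Type: text/plain; charset=ISO-8859-1\r\n\r\ncaf\xE9".b

headers, body = response.split("\r\n\r\n", 2)

# The charset is only known after some of the data has been read.
charset = headers[/charset=([\w-]+)/i, 1] || "UTF-8"   # fallback is an assumption

# Tag the raw body with its real encoding (no bytes change)...
body.force_encoding(charset)
# ...and, if desired, convert it to a common encoding afterwards.
utf8_body = body.encode("UTF-8")
```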

> -Correct String Operations-
> Even basic String operations are very hard in the world of Unicode. If
> we leave the String users to look at the encoding tags and sort it out
> themselves, they are bound to make mistakes because they don't care,
> don't know, or have no time. And these mistakes may be _security_
> _sensitive_, since most often credentials are represented as Strings
> too. There already have been exploits related to Unicode.

This is a misunderstanding on your part. Nothing about Matz's m17n
Strings suggests that String users would have to look at the encoding
tags. Merely that they *could*. I suspect that there will be pragma-
like behaviours to enforce a particular internal representation at all
times.

> *Disadvantages* (with mitigating reasoning of course)
> - String users need to learn that #byte_length(encoding=:utf8) >=
> #size, but that's not too hard, and applies everywhere. Users do not
> need to learn about an encoding tag, which is surely worse to handle
> for them.

True, but the encoding tag is not worse. Anyone who assumes that
developers can ignore encoding at any time simply *doesn't* know about
the level of problems that can be encountered.

> - Strings cannot be used as simple byte buffers any more. Either use
> an array of bytes, or an optimized ByteBuffer class. If you need
> regular expression support, RegExp can be extended for ByteBuffers or
> even more.

I see no reason for this.

> - Some String operations may perform worse than might be expected from
> a naive user, in both the time or space domain. But we do this so the
> String user doesn't need to himself, and are probably better at it
> than the user too.

This is a wash.

> - For very simple uses of String, there might be unnecessary
> conversions. If a String is just to be passed through somewhere,
> without inspecting or modifying it at all, in- and outwards conversion
> will still take place. You could and should use a ByteBuffer to avoid
> this.

This is a wash.

> - This ties Ruby's String to Unicode. A safe choice IMHO, or would we
> really consider something else? Note that we don't commit to a
> particular encoding of Unicode strongly.

This is a wash. I think that it's better to leave the options open.
After all, it *is* a hope of mine to have Ruby running on iSeries
(AS/400) and *that* still uses EBCDIC.

> - More work and time to implement. Some could call it over-engineered.
> But it will save a lot of time and troubles when shit hits the fan and
> users really do get unexpected foreign characters in their Strings. I
> could offer help implementing it, although I have never looked at
> ruby's source, C-extensions, or even done a lot of ruby programming
> yet.

I would call it the amount of work necessary. But the work needs to be
done for a *variety* of encodings, and not just Unicode. *Especially*
because of C extensions.

> Close to the start of this discussion Matz asked what the problem with
> current strings really was for western users. Somewhere later he
> concluded case folding. I think it is more than that: we are lazy and
> expect character handling to be always as easy as with 7 bit ASCII, or
> as close as possible. Fixed 8-bit codepages worked quite fine most of
> the time in this regard, and breakage was limited to special
> characters only.

> Now let's ask the question in reverse: are eastern programmers so used
> to doing elaborate byte-stream to character handling by hand they
> don't recognize how hard this is any more? Surely it is a target for
> DRY if I ever saw one. Or are there actual problems not solvable this
> way? I looked up the mentioned Han-Unification issue, and as far as I
> understood this could be handled by future Unicode revisions
> allocating more characters, outside of Ruby, but I don't see how it
> requires our Strings to stay dumb byte buffers.

No one has ever suggested that Ruby Strings stay byte buffers. However,
blindly choosing Unicode *adds* unnecessary complexity to the situation.

-austin
Posted by Julian 'Julik' Tarkhanov (Guest)
on 17.06.2006 16:16
(Received via mailing list)
On 17-jun-2006, at 15:52, Austin Ziegler wrote:
>> 8. Because Strings are tightly integrated into the language with the
>> source reader and are used pervasively, much of this cannot be
>> provided by add-on libraries, even with open classes. Therefore the
>> need to have it in Ruby's canonical String class. This will break  
>> some
>> old uses of String, but now is the right time for that.
>
> "Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1.

Most probably wise, but I need casefolding and character classes to
work since yesteryear.
Oniguruma is there but even if you compile with it (which is not the
default, still) you don't get char classes (AFAIK)
and you don't get casefolding. Case-insensitive search/replace
quickly becomes bondage.

I am maintaining a gem whose test fails due to different regexps in
Oniguruma, but I would be able to quickly fix it knowing that
Oniguruma is in stable now.
>> 10. Be flexible. <placeholder for future idea>
>
> And little is more flexible than Matz's m17n String.

I couldn't find a proper description of that - as I told already, the
thing I'd least prefer would be

# get a string from the database
p str + my_unicode_chars # Ok, bail out with an ugly exception
# because the author of the DB adaptor didn't care to send me proper
# Strings...

If strings in the system are allowed to have varying encodings, I
don't understand how the engine is going to upgrade/downgrade strings
automatically.
Especially remembering that the receiver is on the left, so I
actually might get different exceptions going as I do

p my_unicode_chars + mojikyo_str # who wins?

or

p mojikyo_str + my_unicode_chars # who wins?

or (especially)

p mojikyo_str +
  bytestring_that_i_just_grabbed_by_http_and_i_know_it_is_mojikyo_but_its_not # who wins?
Posted by Austin Ziegler (austin)
on 17.06.2006 16:19
(Received via mailing list)
On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote:
> Full ACK. Ruby programs shouldn't need to care about the
> *internal* string encoding. External string data is treated as
> a sequence of bytes and is converted to Ruby strings through
> an encoding API.

This is incorrect. *Most* Ruby programs won't need to care about the
internal string encoding. Experience suggests, however, that it is
*most*. Definitely not all.

> Given a specific encoding, the encoding API converts
> ByteStrings to Strings and vice versa.
>
> This could look like:
>
>     my_character_str = Encoding::UTF8.encode(my_byte_buffer)
>     buffer = Encoding::UTF8.decode(my_character_str)

Unnecessarily complex and inflexible. Before you go too much further, I
*really* suggest that you look in the archives and Google to find more
about Matz's m17n String proposal. It's a really good one, as it allows
developers (both pure Ruby and extension) to choose what is appropriate
with the ability to transparently convert as well.

>> 4. IO instances are associated with a (modifiable) encoding. For
>> stdin, stdout this can be derived from the locale settings.
>> String-IO operations work as expected.
>
> I propose one of:
>
> 1) A low level IO API that reads/writes ByteBuffers. String IO
>    can be implemented on top of this byte-oriented API.

[...]

> 2) The File class/IO module as of current Ruby just gets
>    additional methods for binary IO (through ByteBuffers) and
>    an encoding attribute. The methods that do binary IO don't
>    need to care about the encoding attribute.
>
> I think 1) is cleaner.

I think neither is necessary and both would be a mistake. It is, as I
indicated to Juergen, sometimes *impossible* to determine the encoding
to be used for an IO until you have some data from the IO already.

>> 5. Since the String class is quite smart already, it can implement
>> generally useful and hard (in the domain of Unicode) operations like
>> case folding, sorting, comparing etc.
> If the strings are represented as a sequence of Unicode codepoints, it
> is possible for external libraries to implement more advanced Unicode
> operations.

This would be true regardless of the encoding.

> Since IMO a new "character" class would be overkill, I propose that
> the String class provides codepoint-wise iteration (and indexing) by
> representing a codepoint as a Fixnum. AFAIK a Fixnum consists of 31
> bits on a 32 bit machine, which is enough to represent the whole range
> of unicode codepoints.

This does not match what Matz will be doing.

  str = "Fran\303\247ais"
  str[4] # -> "\303\247"

This is better than doing a Fixnum representation. It is character
iteration, but each character is, itself, a String.
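
Ruby 1.9 and later do index this way. A short sketch, with the cedilla sitting at character index 4 (counting from 0); the string is the same example word, written with a `\u` escape.

```ruby
str = "Fran\u00e7ais"

char = str[4]    # the fifth character, returned as a one-character String
char.class       # String, not Integer
char.bytes       # [195, 167], i.e. "\303\247"
str.length       # 8 characters
str.bytesize     # 9 bytes
```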

>> 7. This approach leaves open the possibility of String subclasses
>> implementing different internal encodings for performance/space
>> tradeoff reasons which work transparently together (a bit like
>> FixInt and BigInt).
> I think providing different internal String representations
> would be too much work, especially for maintenance in the long
> run.

If you're depending on classes to do that, especially given that Ruby's
String, Array, and Hash classes don't inherit well, you're right.

> The advantages of this proposal over the current situation and
> tagging a string with an encoding are:

The problem, of course, is that this proposal -- and your take on it --
don't account for the m17n String that Matz has planned. The current
situation is a mess. But the current situation is *not* what is planned.
I've had to do some encoding work for work in the last two years, and
while I *prefer* a UTF-8/UTF-16 internal representation, I also know
that's *impossible* in some situations and you have to be flexible. I
also know that POSIX handles this situation worse than any other
setup.

With the work that I've done on this, Matz is *right* about this, and
the people claiming that Unicode is the Only Way ... are wrong. In an
ideal world, Unicode would be the correct and only way. In the real
world, however, it's a lot messier, and Ruby has to be aware of that.

We can *still* make it as easy as possible for the common case (which
will be UTF-8 encoding data and filenames). But we shouldn't make the
mistake of assuming that the common case is all that Ruby should handle.

> * There is only one internal string (where string means a
>   string of characters) representation. String operations
>   don't need to be written for different encodings.

This is still (mostly) correct under the m17n String proposal.

> * No need for $KCODE.

This is true under the m17n String.

> * Higher abstraction.

This is true under the m17n String.

> * Separation of concerns. I always found it strange that most dynamic
>   languages simply mix handling of character and arbitrary binary data
>   (just think of pack/unpack).

The separation makes things harder most of the time.

> * Reading of character data in one encoding and representing it in
>   other encoding(s) would be easy.

This is true under the m17n String.

> It seems that the main argument against using Unicode strings in Ruby
> is because Unicode doesn't work well for eastern countries. Perhaps
> there is another character set that works better that we could use
> instead of Unicode. The important point here is that there is only
> *one* representation of character data in Ruby.

This is a mistake.

> If Unicode is chosen as character set, there is the question which
> encoding to use internally. UTF-32 would be a good choice with regards
> to simplicity in implementation, since each codepoint takes a fixed
> number of bytes. Consider indexing of Strings:

Yes, but this would be very hard on memory requirements. There are
people who are trying to get Ruby to fit into small-memory environments.
This would destroy any chance of that.

[...]

> Thank you for reading so far. Just in case Matz decides to implement
> something similar to this proposal, I am willing to help with Ruby
> development (although I don't know much about Ruby's internals and not
> too much about Unicode either).

I would suggest that you look for discussions about m17n Strings in
Ruby. Matz has this one right.

> I do not have a CS degree and I'm not a Unicode expert, so perhaps the
> proposal is garbage, in this case please tell me what is wrong about
> it or why it is not realistic to implement it.

I don't have a CS degree either, but I have been in the business for a
*long* time and I've been immersed in Unicode and encoding issues for
the last two years. If everyone used Unicode -- and POSIX weren't stupid
-- your proposal would be much more realistic. I *agree* that Ruby
should encourage the use of Unicode as much as is practical. But it also
shouldn't tie our hands like other programming languages do.

-austin
Posted by Austin Ziegler (austin)
on 17.06.2006 16:26
(Received via mailing list)
On 6/17/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> (AFAIK) and you don't get casefolding. Case-insensitive search/replace
> quickly becomes bondage.

I don't disagree. But you're *not* going to get those features, in all
likelihood, in a Ruby 1.8.x release. It would be a breaking release.
Oniguruma is the default for Ruby 1.9+. If there are things missing,
work with the developer.

> I am maintaining a gem whose test fails due to different regexps in
> Oniguruma, but I would be able to quickly fix it knowing that
> Oniguruma is in stable now.

I don't think that Oniguruma is in stable (1.8.x); I *don't* think it
will be enabled as default in stable. Again, it's a breaking change.

>>> 10. Be flexible. <placeholder for future idea>
>> And little is more flexible than Matz's m17n String.
> I couldn't find a proper description of that - as I told already, the
> thing I'd least prefer would be

> # get a string from the database
> p str + my_unicode_chars # Ok, bail out with an ugly exception
> # because the author of the DB adaptor didn't care to send me proper
> # Strings...

The DB adaptor, of course, will have to look at the encoding that the DB
is using.

> p mojikyo_str + my_unicode_chars # who wins?
>
> or (especially)
>
> p mojikyo_str +
>   bytestring_that_i_just_grabbed_by_http_and_i_know_it_is_mojikyo_but_its_not # who wins?

Consider coercion in Numerics (ri Numeric#coerce). A similar framework
can be built for Strings.
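
To make the analogy concrete, here is a minimal sketch. `Integer#coerce` is real; `coerce_strings` is a hypothetical helper invented for this example, and picking UTF-8 as the common encoding is an assumption, not something the m17n proposal specifies.

```ruby
# Numeric coercion returns both operands converted to a common type.
int_and_float = 1.coerce(2.5)     # [2.5, 1.0]

# Hypothetical analogue for strings: bring both operands to one
# encoding before combining them.
def coerce_strings(a, b, common = Encoding::UTF_8)
  [a.encode(common), b.encode(common)]
end

latin1 = "caf\u00e9".encode("ISO-8859-1")
utf8   = " \u00e0 la carte"

left, right = coerce_strings(latin1, utf8)
joined = left + right
```

This does not settle the "who wins?" question for encodings that have no converter to the common one; in that case `encode` would raise, and some other policy would be needed.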

-austin
Posted by unknown (Guest)
on 17.06.2006 17:00
(Received via mailing list)
On Jun 17, 2006, at 9:50 AM, Stefan Lang wrote:

> *internal* string encoding. External string data is treated as
> a sequence of bytes and is converted to Ruby strings through
> an encoding API.

I don't claim to be a Unicode expert but shouldn't the goal be to
have Ruby work with *any* text encoding on a per-string basis?  Why
would you want to force all strings into Unicode for example in a
context where you aren't using Unicode?  (The internal encoding has
to be....).  And of course even in the Unicode world you have several
different encodings (UTF-8, UTF-16, and so on).  Juergen, when you
say 'internal encoding' are you talking about the text encoding of
Ruby source code?

It seems to me that irrespective of any particular text encoding
scheme you need clean support of a simple byte vector data structure
completely unencumbered with any notion of text encoding or locale.
Right now that is done by the String class, whose name I think
certainly creates much confusion.  If the class had been called
Vector and then had methods like:

	Vector#size		# size in bytes
	Vector#str_size 	# size in characters (encoding and locale considered)

I think this discussion would be clearer because it would be the
behavior of the str* methods that would need to understand text
encodings and/or locale settings while the underlying byte vector
methods remained oblivious.  The #[] method is the most confusing
since sometimes you want to extract bytes and sometimes you want to
extract sub-strings (i.e consider the encoding).  One method, two
interpretations, bad headache.
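
For what it is worth, Ruby 1.9 and later ended up exposing both views on the one String class under separate names. A small sketch; the sample text is arbitrary.

```ruby
s = "\u65e5\u672c\u8a9e"         # three characters, nine bytes in UTF-8

byte_size = s.bytesize           # the "Vector#size" view
char_size = s.length             # the "Vector#str_size" view

first_byte = s.byteslice(0, 1)   # one raw byte, not a valid character by itself
first_char = s[0]                # one whole character
```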

It seems that three distinct behaviors are being shoehorned (with
good reason) into a single class framework (String):

	byte vector
	text encoding (encoded sequence of code points)
	locale	      (cultural interpretations of the encoded sequence of code points)

I'm just suggesting that these distinctions seem to be lost in much
of this discussion, especially for folks (like myself) who have a
practical interest in this but certainly aren't text-encoding gurus.


Gary Wright
Posted by Paul Battley (Guest)
on 17.06.2006 18:04
(Received via mailing list)
On 17/06/06, Austin Ziegler <halostatue@gmail.com> wrote:
> > - This ties Ruby's String to Unicode. A safe choice IMHO, or would we
> > really consider something else? Note that we don't commit to a
> > particular encoding of Unicode strongly.
>
> This is a wash. I think that it's better to leave the options open.
> After all, it *is* a hope of mine to have Ruby running on iSeries
> (AS/400) and *that* still uses EBCDIC.

Not to mention that Matz has explicitly stated in the past that he
wants Ruby to support other encodings (TRON, Mojikyo, etc.) that
aren't compatible with a Unicode internal representation.

Not tying String to Unicode is also the right thing to do: it allows
for future developments. Java's weird encoding system is entirely down
to the fact that it standardised on UCS-2; when codepoints beyond
65535 arrived, they had to be shoehorned in via an ugly hack. As far
as possible, Ruby should avoid that trap.
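
The hack in question is the UTF-16 surrogate pair. A short sketch of what it looks like, using Ruby 1.9+ transcoding; the character chosen is just one example of a code point above 65535.

```ruby
clef  = "\u{1D11E}"                            # MUSICAL SYMBOL G CLEF
units = clef.encode("UTF-16BE").unpack("n*")   # 16-bit code units

clef.length                    # 1 code point
units.length                   # 2 UTF-16 code units
units.map { |u| u.to_s(16) }   # ["d834", "dd1e"], a surrogate pair
```

Code that assumes one 16-bit unit per character sees this single character as two.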

Paul.
Posted by Stefan Lang (Guest)
on 17.06.2006 18:17
(Received via mailing list)
On Saturday 17 June 2006 16:58, gwtmp01@mac.com wrote:
> > Full ACK. Ruby programs shouldn't need to care about the
> when you say 'internal encoding' are you talking about the text
> encoding of Ruby source code?

I'm not Juergen, but since you responded to my message...

First of all Unicode is a character set and UTF-8, UTF-16 etc.
are encodings, that is they specify how a Unicode character is
represented as a series of bits.

At least *I* am not talking about the encoding of Ruby source
code. The main point of the proposal is to use a single
universal character encoding for all Ruby character strings
(instances of the String class). Assuming there is an ideal
character set that is really sufficient to represent any
text in this world, it could be used to construct a String
class that abstracts the underlying representation completely
away.

Consider the "float" data type you will find in most
programming languages: The programmer doesn't think in terms
of the bits that represent a floating point value. He just
uses the operators provided for floats. He can choose between
different serialization strategies if he needs to serialize
floats. But the *operators* on floats the programming language
provides don't care about the different serialization formats,
they all work using the same internal representation.
Conversion is done on IO. Ideally, the same level of
abstraction should be there for character data.
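
The float analogy can be sketched with `Array#pack`, which really does offer several serializations for the same Float. The value is arbitrary.

```ruby
value = 1.5

little_endian = [value].pack("E")   # IEEE 754 double, little-endian bytes
big_endian    = [value].pack("G")   # same value, big-endian bytes

# Two different byte serializations...
different_bytes = little_endian != big_endian

# ...that both decode to the same in-memory Float.
same_value = little_endian.unpack("E").first == big_endian.unpack("G").first
```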

If you have a universal character set (Unicode is an attempt
at this), and an encoding for it, the programming language can
abstract the underlying String representation away. For IO, it
provides methods (i.e. through Encoding objects) that
serialize Strings to a stream of bytes and vice versa.

> It seems to me that irrespective of any particular text encoding
> scheme you need clean support of a simple byte vector data
> structure completely unencumbered with any notion of text encoding
> or locale.

I have proposed that further below as Buffer or ByteString.

> Right now that is done by the String class, whose name I
> think certainly creates much confusion.  If the class had been
> called Vector and then had methods like:
>
> 	Vector#size		# size in bytes
> 	Vector#str_size 	# size in characters (encoding and locale considered)

By providing str_size you are already mixing up the purpose of
your simple byte vector and character strings.
Posted by unknown (Guest)
on 17.06.2006 18:38
(Received via mailing list)
On Jun 17, 2006, at 12:16 PM, Stefan Lang wrote:
> Assuming there is an ideal
> character set that is really sufficient to represent any
> text in this world, it could be used to construct a String
> class that abstracts the underlying representation completely
> away.

So all we need is an ideal character set?  That sounds simple.  :-)

> By providing str_size you are already mixing up the purpose of
> your simple byte vector and character strings.

Yes.  I was pointing out that there were multiple concerns that were
being solved by a single class and I said that there were good
reasons for this.  My point was that even if you choose to handle all
those concerns in a single class it was important to keep the
concerns distinct during discussion.  Something that I thought wasn't
happening in this discussion.

I think this is another example of the Humane Interface discussion
started by Martin Fowler (http://www.martinfowler.com/bliki/
HumaneInterface.html)

In Ruby, arrays have an interface that allows them to be used as pure
arrays, as lists, as queues, as stacks and so on instead of having
lots of additional classes.
Similarly I think it makes sense for all M17N issues to be packaged
up in a single class (String) instead of breaking up those concerns
into a class hierarchy.


Gary Wright
Posted by Stefan Lang (Guest)
on 17.06.2006 19:37
(Received via mailing list)
On Saturday 17 June 2006 16:16, Austin Ziegler wrote:
> On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote:
> > Full ACK. Ruby programs shouldn't need to care about the
> > *internal* string encoding. External string data is treated as
> > a sequence of bytes and is converted to Ruby strings through
> > an encoding API.
>
> This is incorrect. *Most* Ruby programs won't need to care about
> the internal string encoding. Experience suggests, however, that it
> is *most*. Definitely not all.

As long as one treats a character string as a character
string, the internal encoding is irrelevant, and as soon as a
decision for an internal string encoding is made, every
programmer can read in the docs "Ruby internally encodes
strings using the XYZ encoding".

[...]
> Unnecessarily complex and inflexible. Before you go too much
> further, I *really* suggest that you look in the archives and
> Google to find more about Matz's m17n String proposal. It's a
> really good one, as it allows developers (both pure Ruby and
> extension) to choose what is appropriate with the ability to
> transparently convert as well.

I couldn't find much (in English, I don't understand
Japanese), do you have a link at hand?

[...]
> already.
That is easy to handle with the proposed scheme: Read as much
as you need with the binary interface until you know the
encoding and then do the conversion of the byte buffer to
string. For file input, you can close the file when you have
determined the encoding and reopen it using the "normal"
(character oriented) interface.
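
A rough sketch of "read bytes first, convert once the encoding is known", again assuming a later Ruby (2.0 or newer for `String#b`). The helper `decode_bytes`, the `BOMS` table and the ISO-8859-1 fallback are all assumptions made for the example. A byte order mark is used as the signal here only because it is the simplest case; an HTTP header or a meta tag would need real parsing.

```ruby
# Byte order marks and the encodings they announce. UTF-32 is left out,
# so a UTF-32LE stream would be misread as UTF-16LE by this sketch.
BOMS = {
  "\xEF\xBB\xBF".b => 'UTF-8',
  "\xFF\xFE".b     => 'UTF-16LE',
  "\xFE\xFF".b     => 'UTF-16BE'
}.freeze

# Hypothetical helper: take raw bytes, pick an encoding, return UTF-8 text.
def decode_bytes(raw, default = 'ISO-8859-1')
  raw = raw.b                                        # binary copy of the input
  bom, name = BOMS.find { |mark, _| raw.start_with?(mark) }
  body = bom ? raw.byteslice(bom.bytesize..-1) : raw # drop the BOM if present
  body.force_encoding(name || default).encode('UTF-8')
end

puts decode_bytes("\xFF\xFEh\x00i\x00".b)
```

For a file, the raw bytes could come from `File.binread(path)`, so the file does not have to be reopened once the encoding has been determined.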

Or do you mean Ruby should determine the encoding
automatically? IMO, that would be bad magic and error-prone.

[...]
> > If the strings are represented as a sequence of Unicode
> > codepoints, it is possible for external libraries to implement
> > more advanced Unicode operations.
>
> This would be true regardless of the encoding.

But a conversion from [insert arbitrary encoding here] to
unicode codepoints would be needed.

>
> This is better than doing a Fixnum representation. It is character
> iteration, but each character is, itself, a String.

I wouldn't mind additionally having:

    str.codepoint_at(5)     => a Fixnum
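
For illustration, such a method could be sketched as below. `codepoint_at` is the hypothetical name from the line above, written here as a plain function; it assumes a later Ruby (1.9 or newer) where indexing a String returns a one-character String and `String#ord` exists. The result is an Integer (a Fixnum in the Ruby versions of the time), and it is only a Unicode code point if the string is in a Unicode encoding.

```ruby
# Hypothetical codepoint_at: integer code point of the character at index,
# or nil when the index is out of range.
def codepoint_at(str, index)
  char = str[index]
  char && char.ord
end

name = "J\u00FCrgen"
p codepoint_at(name, 1)
p codepoint_at(name, 42)
```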

[...]
> and the people claiming that Unicode is the Only Way ... are wrong.
> >   string of characters) representation. String operations
> >   don't need to be written for different encodings.
>
> This is still (mostly) correct under the m17n String proposal.

How does the regular expression engine work then? And all
String methods that have to combine two or more strings in
some way?

[...]
> > * Separation of concerns. I always found it strange that most
> > dynamic languages simply mix handling of character and arbitrary
> > binary data (just think of pack/unpack).
>
> The separation makes things harder most of the time.

Why? In which cases?

[...]
> > It seems that the main argument against using Unicode strings in
> > Ruby is because Unicode doesn't work well for eastern countries.
> > Perhaps there is another character set that works better that we
> > could use instead of Unicode. The important point here is that
> > there is only *one* representation of character data Ruby.
>
> This is a mistake.

OK, Unicode was enough for me until now, but I see that
Unicode is not enough for everyone.

> > If Unicode is choosen as character set, there is the question
> > which encoding to use internally. UTF-32 would be a good choice
> > with regards to simplicity in implementation, since each
> > codepoint takes a fixed number of bytes. Consider indexing of
> > Strings:
>
> Yes, but this would be very hard on memory requirements. There are
> people who are trying to get Ruby to fit into small-memory
> environments. This would destroy any chance of that.

I can hardly believe that. There is still the binary IO
interface and ByteString that I proposed. And I still think
that the memory used for pure character data is a small
fraction of the overall memory consumption of typical Ruby
programs.
Posted by Juergen Strobel (Guest)
on 17.06.2006 22:34
(Received via mailing list)
On Sun, Jun 18, 2006 at 01:02:39AM +0900, Paul Battley wrote:
> On 17/06/06, Austin Ziegler <halostatue@gmail.com> wrote:
> >> - This ties Ruby's String to Unicode. A safe choice IMHO, or would we
> >> really consider something else? Note that we don't commit to a
> >> particular encoding of Unicode strongly.
> >
> >This is a wash. I think that it's better to leave the options open.
> >After all, it *is* a hope of mine to have Ruby running on iSeries
> >(AS/400) and *that* still uses EBCDIC.

AFAIK, EBCDIC can be losslessly converted to Unicode and back. Right?

On the other hand, do you really trust all ruby library writers to
accept your strings tagged with EBCDIC encoding? Or do you look
forward to a lot of manual conversions?

> Paul.
That's why I explicitly stated it ties Ruby's String class to Unicode
Character Code Points, but not to a particular Unicode encoding or
character class, and *that* was Java's main folly. (UCS-2 is a
strictly 16 bit per character encoding, but new Unicode standards
specify 21 bit characters, so they had to "extend" it).
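
The "extension" is the surrogate pair mechanism of UTF-16. A small sketch, assuming a later Ruby (1.9 or newer) with `String#encode`; `utf16_units` is a made-up helper name.

```ruby
# Number of 16-bit code units a string needs in UTF-16.
def utf16_units(str)
  str.encode('UTF-16BE').bytesize / 2
end

clef = "\u{1D11E}"   # MUSICAL SYMBOL G CLEF, outside the 16-bit range

p clef.size                        # characters
p utf16_units(clef)                # 16-bit units
p clef.encode('UTF-16BE').bytes    # high surrogate, then low surrogate
```

One character outside the Basic Multilingual Plane occupies two 16-bit units, so code that equates "one 16-bit unit" with "one character" miscounts it.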

I am unaware of unsolvable problems with Unicode and Eastern
languages, I asked specifically about it. If you think Unicode is
unfixably flawed in this respect, I guess we all should write off
Unicode now rather than later? Can you detail why Unicode is
unacceptable as a single world wide unifying character set?
Especially, are there character sets which cannot be converted to
Unicode and back, which is the main requirement to have Unicode
Strings in a non-Unicode environment?

Jürgen
Posted by Juergen Strobel (Guest)
on 17.06.2006 22:37
(Received via mailing list)
On Sun, Jun 18, 2006 at 01:16:12AM +0900, Stefan Lang wrote:
> > >
> > several different encodings (UTF-8, UTF-16, and so on).  Juergen,
> code. The main point of the proposal is to use a single
> universal character encoding for all Ruby character strings
> (instances of the String class). Assuming there is an ideal
> character set that is really sufficient to represent any
> text in this world, it could be used to construct a String
> class that abstracts the underlying representation completely
> away.

That's what I meant, yes. And that is the most important point too.

Jürgen
Posted by Paul Battley (Guest)
on 17.06.2006 23:02
(Received via mailing list)
On 17/06/06, Juergen Strobel <strobel@secure.at> wrote:
> I am unaware of unsolveable problems with Unicode and Eastern
> languages, I asked specifically about it. If you think Unicode is
> unfixably flawed in this respect, I guess we all should write off
> Unicode now rather than later? Can you detail why Unicode is
> unacceptable as a single world wide unifying character set?
> Especially, are there character sets which cannot be converted to
> Unicode and back, which is the main requirement to have Unicode
> Strings in a non-Unicode environment?

They aren't so much unsolvable problems as mutually incompatible
approaches. Unicode is concerned with the semantic meaning of a
character, and ignores glyph variations through the 'Han unification'
process. TRON encoding doesn't use Han unification: it encodes the
historically-same Chinese character differently for different
languages/regions where they are written differently today. Mojikyo
encodes each graphically distinct character differently and includes a
very wide range of historical characters, and is therefore
particularly suited to certain linguistic and literary niches.

In spite of this, I think that Unicode is an excellent choice for
everyday usage. Unicode does have a solution to the problem of
character variants, but it's not a universal back end for all
encodings.

Incidentally, it is said that TRON is the world's most widely-used
operating system, so supporting that encoding is not necessarily a
minor concern.

Paul.
Posted by Juergen Strobel (Guest)
on 17.06.2006 23:51
(Received via mailing list)
On Sat, Jun 17, 2006 at 10:52:24PM +0900, Austin Ziegler wrote:
> >2. Strings should neither have an internal encoding tag, nor an
> >external one via $KCODE. The internal encoding should be encapsulated
> >by the string class completely, except for a few related classes which
> >may opt to work with the gory details for performance reasons.
> >The internal encoding has to be decided, probably between UTF-8,
> >UTF-16, and UTF-32 by the String class implementor.
> 
> Completely disagree. Matz has the right choice on this one. You can't
> think in just terms of a pure Ruby implementation -- you *must* think
> in terms of the Ruby/C interface for extensions as well.

I admit I don't know about Ruby's C extensions. Are they unable to
access String's methods? That is all that is needed to work with them.

And since this String class does not have a parametric encoding
attribute, it should be easier to crunch in C even.

> fails because your #2 is unacceptable.
Note that explicit conversion to characters, arrays, etc., is possible
for any supported character set and encoding. I have even given method
examples. "External" is to be seen in the context of the String class.

> >case folding, sorting, comparing etc.
> 
> Agreed, but this would be expected regardless of the actual encoding of
> a String.

I am unaware of Matz's exact plan. Any good english language links?

I was under the impression that users of Matz's String instances need
to look at the encoding tag to implement e.g. #version_sort. If that is
not the case, our proposals are not that much different, only Matz's
is even more complex to implement than mine.

> >tradeoff reasons which work transparently together (a bit like FixInt
> >and BigInt).
> 
> Um. Disagree. Matz's proposed approach does this; yours does not. Yours,
> in fact, makes things *much* harder.

If Matz's approach requires looking at the encoding tag from the
outside, it is not as transparent as mine. If it isn't it just boils
down to a parametric class versus subclass hierarchy design decision,
and I don't see much difference and would be happy with either one.

> 
> >8. Because Strings are tightly integrated into the language with the
> >source reader and are used pervasively, much of this cannot be
> >provided by add-on libraries, even with open classes. Therefore the
> >need to have it in Ruby's canonical String class. This will break some
> >old uses of String, but now is the right time for that.
> 
> "Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1.

My original title, somewhere snipped out, was "A Plan for Unicode
Strings in Ruby 2.0". I don't want to rush things or break 1.8 either.

> 
> >9. The String class does not worry over character representation
> >on-screen, the mapping to glyphs must be done by UI frameworks or the
> >terminal attached to stdout.
> 
> The String class doesn't worry about that now.

I was just playing safe here.

> >10. Be flexible. <placeholder for future idea>
> 
> And little is more flexible than Matz's m17n String.

I've had flexibility with respect to Unicode Standards in mind, to not
fall into traps similar to Java. A simple to use String class,
powerful enough to include every character of the world was my goal,
with the ability to convert to and from other external (from the
String class's point of view) representations.

The flexibility to have parametric String encodings inside the String
class was not what I was going for, rather I would have that
inaccessible or at least unnecessary to access for the common String
user, and I provided a somewhat weaker but maybe still sufficient
technique via subclassing.

> Remember: POLS is not an acceptable reason for anything. Matz's m17n
> Strings would be predictable, too. a + b would be possible if and only
> if a and b are the same encoding or one of them is "raw" (which would
> mean that the other is treated as the defined encoding) *or* there is a
> built-in conversion for them.

Since I probably cannot control which Strings I get from libraries,
and don't want to worry which ones I'll have to provide to them, this
is weaker than my approach in this respect, see my next point.

> work for Ruby/C interfaced items. Sorry.
Please elaborate this or provide pointers. I cannot believe C cannot
crunch at my Strings, which are less parametric than Matz's ones are.

> whether it's *actually* UTF-8 or not until I get HTTP headers -- or
> worse, a <meta http-equiv> tag. Assuming UTF-8 reading in today's world
> is doomed to failure.

Read it as binary, and decide later. These problems should be locally
containable, and methods are still able to return Strings after
determining the encoding.

> tags. Merely that they *could*. I suspect that there will be pragma-
> like behaviours to enforce a particular internal representation at all
> times.

Previously you stated users need to look at the encoding to determine
if simple operations like a + b work.

Can you point to more info? I am interested how this pragma stuff
works, and if not doing it "right" can break things.

> >*Disadvantages* (with mitigating reasoning of course)
> >- String users need to learn that #byte_length(encoding=:utf8) >=
> >#size, but that's not too hard, and applies everywhere. Users do not
> >need to learn about an encoding tag, which is surely worse to handle
> >for them.
> 
> True, but the encoding tag is not worse. Anyone who assumes that
> developers can ignore encoding at any time simply *doesn't* know about
> the level of problems that can be encountered.

For String concatenates, substring access, search, etc, I expect to be
able to ignore encoding totally. Only when interfacing with
non-String-class objects (I/O and/or explicit conversion) would I need
encoding info.
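
To make the `#byte_length(encoding) >= #size` relation concrete, here is a sketch with `byte_length` written as a plain function. The name comes from the proposal quoted above; the implementation is an assumption built on `String#encode` and `String#bytesize` from later Ruby versions (1.9 or newer).

```ruby
# Hypothetical byte_length: bytes needed to serialize the string in a
# given encoding. String#size counts characters instead.
def byte_length(str, encoding = 'UTF-8')
  str.encode(encoding).bytesize
end

name = "J\u00FCrgen"
p name.size
p byte_length(name)
p byte_length(name, 'ISO-8859-1')
p byte_length(name, 'UTF-32LE')
```

Concatenation, slicing and searching only ever need `size` and character indices; `byte_length` matters when the string leaves the String class.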

> >- Strings cannot be used as simple byte buffers any more. Either use
> >an array of bytes, or an optimized ByteBuffer class. If you need
> >regular expresson support, RegExp can be extended for ByteBuffers or
> >even more.
> 
> I see no reason for this.

In my proposal, Unicode Strings cannot represent arbitrary binary data
in their internal representation, since not everything would be valid
characters. In fact, you cannot set the internal representation
directly.

The interface could accept a code point sequence of values
(0..255), but that would be wasteful compared to an array of bytes.
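
A quick way to see the overhead, using `Array#pack`; the helper names are made up for the example, and UTF-8 and UTF-32 are used only as two possible internal representations.

```ruby
# Store the byte values 0..255 as a plain byte buffer.
def as_byte_buffer(values)
  values.pack('C*')          # one byte per value
end

# Store the same values as Unicode code points.
def as_code_points(values)
  values.pack('U*')          # one character per value, UTF-8 encoded
end

values = (0..255).to_a
p as_byte_buffer(values).bytesize
p as_code_points(values).bytesize
p as_code_points(values).encode('UTF-32LE').bytesize
```

Values above 127 need two bytes each in UTF-8, and every value needs four bytes in UTF-32, so a dedicated byte buffer is the more compact choice for binary data.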

> >- Some String operations may perform worse than might be expected from
> >a naive user, in both the time or space domain. But we do this so the
> >String user doesn't need to himself, and are problably better at it
> >than the user too.
> 
> This is a wash.

Only trying to refute weak arguments in advance.

> >- For very simple uses of String, there might be unneccessary
> >conversions. If a String is just to be passed through somewhere,
> >without inspecting or modifying it at all, in- and outwards conversion
> >will still take place. You could and should use a ByteBuffer to avoid
> >this.
> 
> This is a wash.

Not a big problem either, but someone was bound to bring it up.

> >users really do get unexpected foreign characters in their Strings. I
> >concluded case folding. I think it is more than that: we are lazy and
> >understood this could be handled by future Unicode revisions
>               * austin@zieglers.ca
The way I see it we have to choose a character set. I proposed
Unicode, because their official goal is to be the one unifying set,
and if they ain't yet, I hope they'll be sometime.

If that is not enough we will effectively create our own character
set, let's call it RubyCode, which will contain characters from the
union of Unicode and a few other sets. Each String will have a
particular encoding, which will determine which characters of RubyCode
are valid in this particular String instance. Hopefully many
characters will be valid in multiple encodings. But it doesn't sound
like a very clear design to me.

Jürgen
Posted by Michal Suchanek (Guest)
on 17.06.2006 23:57
(Received via mailing list)
On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote:
>
> As long as one treats a character string as a character
> string, the internal encoding is irrelevant, and as soon as a

No, it is not.

First, for reasons of efficiency. If an application is going to perform
lots of slicing and poking on strings it will want some encoding that
is suitable for that, such as UTF-32. If an application runs on a system
with little memory it will want a space-efficient encoding (i.e. UTF-8 or
UTF-16 for Asian languages). And if an application runs on a system that
uses some legacy codepage it can read, write, and process all strings
in that codepage. And in JRuby it will be useful to convert strings to
UTF-16 so that the native Java functions can be used for manipulation.
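
The slicing argument rests on UTF-32 being fixed width: character n always starts at byte 4 * n. A sketch of that arithmetic, assuming a later Ruby (1.9.3 or newer for `String#byteslice`); `utf32_char_at` is a hypothetical helper.

```ruby
# Fetch the character at a given index from UTF-32LE data by pure byte
# arithmetic, then hand it back as UTF-8 for display.
def utf32_char_at(utf32, index)
  unit = utf32.byteslice(index * 4, 4)
  unit && unit.force_encoding('UTF-32LE').encode('UTF-8')
end

name  = "J\u00FCrgen"
utf32 = name.encode('UTF-32LE')

puts utf32_char_at(utf32, 1)
puts utf32.bytesize
```

In UTF-8 the same lookup has to walk the string from the start, because characters occupy one to four bytes; the price of the fixed width is the larger `bytesize`.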

Second, not all characters are equal. If you lived in world where
everything was Unicode you would be fine. But it is not so. Unicode is
suboptimal for encoding CJK characters. So some people might want to
use another encoding for their texts (iirc TRON mentioned earlier is
one of such encodings). In your model you can modify Ruby to use
strings composed of TRON characters instead of Unicode characters. But
how would Unicode Ruby and TRON Ruby exchange strings?
And how would you write an application that handles _both_ TRON and
Unicode? (I suspect TRON would not be much good e.g. for Runic script.)
Such an application has to be written very carefully because neither
character set would be a subset of the other, so it is not possible to
convert strings back and forth without thinking. But in your model
such an application is not possible at all.

> decision for an internal string encoding is made, every
> programmer can read in the docs "Ruby internally encodes
> strings using the XYZ encoding".
>
> [...]

> > I indicated to Juergen, sometimes *impossible* to determine the
> > encoding to be used for an IO until you have some data from the IO
> > already.
>
> That is easy to handle with the proposed scheme: Read as much
> as you need with the binary interface until you know the
> encoding and then do the conversion of the byte buffer to
> string. For file input, you can close the file when you have
> determined the encoding and reopen it using the "normal"
> (character oriented) interface.

Why reopening or converting if you can simply tag a string that you
had to read anyway?

>
> Or do you mean Ruby should determine the encoding
> automatically? IMO, that would be bad magic and error-prone.

No. But if you read part of an HTML/XML document before the encoding was
specified there is no reason why that part has to be converted or
reread. You apparently got it right if you were able to determine the
encoding from what you read.

>
> [...]
> > > If the strings are represented as a sequence of Unicode
> > > codepoints, it is possible for external libraries to implement
> > > more advanced Unicode operations.
> >
> > This would be true regardless of the encoding.
>
> But a conversion from [insert arbitrary encoding here] to
> unicode codepoints would be needed.

That will be needed anyway. You cannot expect all libraries to use the
arbitrary encoding you chose for Ruby strings.

But if you can choose the encoding of your strings there is nothing
stopping you from converting your strings so that they best suit your
library of choice.


> >
> > > * There is only one internal string (where string means a
> > >   string of characters) representation. String operations
> > >   don't need to be written for different encodings.
> >
> > This is still (mostly) correct under the m17n String proposal.
>
> How does the regular expression engine work then? And all
> String methods that have to combine two or more strings in
> some way?

If they are both subset of Unicode I see no problem with converting
both to Unicode. If they are incompatible things may break. But that
is because of real incompatibility, not because of some restriction of
the approach.

>
> [...]
> > > * Separation of concerns. I always found it strange that most
> > > dynamic languages simply mix handling of character and arbitrary
> > > binary data (just think of pack/unpack).
> >
> > The separation makes things harder most of the time.
>
> Why? In which cases?

Such as when you have to read the start of an HTML page as a ByteBuffer
and then convert it to a String once you determine the encoding.
Especially if string operations do not exist on the ByteBuffer to
allow parsing it.

>
> I can hardly believe that. There is still the binary IO
> interface and ByteString that I proposed. And I still think
> that the memory used for pure character data is a small
> fraction of the overall memory consumption of typical Ruby
> programs.

It depends on the program. For programs that do only text processing
the portion of memory taken by text may be large.

Michal
Posted by Austin Ziegler (austin)
on 18.06.2006 00:16
(Received via mailing list)
On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote:
> internal string encoding is made, every programmer can read in the
> docs "Ruby internally encodes strings using the XYZ encoding".

And I'm saying that it's a mistake to do that (standardize on a single
encoding). Every programmer will instead be able to read:

  "Ruby supports encoded strings in a variety of encodings. The
  default behaviour for all strings is XYZ, but this can be
  changed and individual strings may be recoded for performance
  or compatibility reasons."

Language and character encodings are hard. Hiding that fact is a
mistake. That doesn't mean we have to make the APIs difficult, but that
we aren't going to be buzzworded into compliance, either.

> [...]
>> Unnecessarily complex and inflexible. Before you go too much further,
>> I *really* suggest that you look in the archives and Google to find
>> more about Matz's m17n String proposal. It's a really good one, as it
>> allows developers (both pure Ruby and extension) to choose what is
>> appropriate with the ability to transparently convert as well.
> I couldn't find much (in English, I don't understand Japanese), do you
> have a link at hand?

I do not. I've been reading about this, talking about this, and
discussing it with Matz for the last two years or so, and I've been
dealing with Unicode and other character encoding issues extensively at
work. However, the gist of it is that every String is still a byte
vector. Each string will also have an encoding flag. Substrings of a
single character width will always return the String required for the
*character*. The supported encodings will probably start with UTF-8,
UTF-16, various ISO-8859-* encodings, EUC-JP, SJIS, and other Asian
encodings.
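
For readers who want to see what "byte vector plus encoding flag" looks like in practice, here is a sketch using the String API of later Ruby releases (1.9 or newer). Whether it matches the proposal as it stood at the time of this thread in every detail is an assumption; `describe` is a made-up helper.

```ruby
# Summarize a string: its encoding tag, its size in bytes and in
# characters, and its first character (itself a String).
def describe(str)
  { encoding: str.encoding.name, bytes: str.bytesize, chars: str.size, first: str[0] }
end

nihongo = "\u65E5\u672C\u8A9E"          # three kanji, written with escapes
p describe(nihongo)
p describe(nihongo.encode('EUC-JP'))
```

The character count stays the same under both encodings while the byte count changes, and indexing returns a one-character String that carries the same encoding tag as its source.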

>
> Or do you mean Ruby should determine the encoding automatically? IMO,
> that would be bad magic and error-prone.

I mean that what you're suggesting *exposes* problems with encoding
stuff extensively and unnecessarily. I certainly wouldn't want to
program in it if the API involved were as stupid as you're suggesting it
should be.


> [...]
>>> If the strings are represented as a sequence of Unicode codepoints,
>>> it is possible for external libraries to implement more advanced
>>> Unicode operations.
>> This would be true regardless of the encoding.
> But a conversion from [insert arbitrary encoding here] to unicode
> codepoints would be needed.

Why? What if the library that I'm interfacing with requires EUC-JP?
Sorry, but Unicode is *not necessarily* the right answer.

>> This is better than doing a Fixnum representation. It is character
>> iteration, but each character is, itself, a String.
> I wouldn't mind additionally having:
>
>     str.codepoint_at(5)     => a Fixnum

Since Ruby isn't *only* using Unicode, this isn't necessarily going to
be possible or meaningful.

> [...]
>>> * There is only one internal string (where string means a
>>>   string of characters) representation. String operations
>>>   don't need to be written for different encodings.
>> This is still (mostly) correct under the m17n String proposal.
> How does the regular expression engine work then? And all
> String methods that have to combine two or more strings in
> some way?

Matz will have that figured and detailed before he starts writing it.

> [...]
>>> * Separation of concerns. I always found it strange that most
>>> dynamic languages simply mix handling of character and arbitrary
>>> binary data (just think of pack/unpack).
>> The separation makes things harder most of the time.
> Why? In which cases?

In *reality*, the separation is not nearly as clean as people who
advocate such separations would like to pretend. It's less of a problem
in dynamic languages like Ruby, but it's also far less necessary in
dynamic languages like Ruby. I have found it far more useful to not have
to care whether I'm reading a binary or string value. I despise dealing
with C++ and Java where I am forced to care because of stupid API
design.

> [...]
>>> It seems that the main argument against using Unicode strings in
>>> Ruby is because Unicode doesn't work well for eastern countries.
>>> Perhaps there is another character set that works better that we
>>> could use instead of Unicode. The important point here is that there
>>> is only *one* representation of character data Ruby.
>> This is a mistake.
> OK, Unicode was enough for me until now, but I see that Unicode is not
> enough for everyone.

Thank you. Unicode needs to -- will -- work *very* well. I know enough
about Unicode handling to make sure that what I deal with *will*. But I
have come to believe that choosing a single encoding as your String
representation is a mistake, even if it means making your job harder by
defining and implementing rules for mixed-encoding handling.

> consumption of typical Ruby programs.
I can believe it; it's very domain and program specific, but you've just
proposed multiplying the memory usage of that amount of space by four.
(Rails would suffer terribly under your proposal to use UTF-32.)

-austin
Posted by Austin Ziegler (austin)
on 18.06.2006 00:22
(Received via mailing list)
On 6/17/06, Juergen Strobel <strobel@secure.at> wrote:
> On Sun, Jun 18, 2006 at 01:02:39AM +0900, Paul Battley wrote:
>> On 17/06/06, Austin Ziegler <halostatue@gmail.com> wrote:
>>>> - This ties Ruby's String to Unicode. A safe choice IMHO, or would
>>>> we really consider something else? Note that we don't commit to a
>>>> particular encoding of Unicode strongly.
>>> This is a wash. I think that it's better to leave the options open.
>>> After all, it *is* a hope of mine to have Ruby running on iSeries
>>> (AS/400) and *that* still uses EBCDIC.
> AFAIK, EBCDIC can be losslessly converted to Unicode and back. Right?

Which code page? EBCDIC has as many code pages (including a UTF-EBCDIC)
as exist in other 8-bit encodings.

> On the other hand, do you really trust all ruby library writers to
> accept your strings tagged with EBCDIC encoding? Or do you look
> forward to a lot of manual conversions?

It depends on the purpose of the library. Very few libraries end up
using byte vectors for strings or completely treat them as such. I would
expect that some of the libraries that I've written would work without
any problems in EBCDIC.

> Character Code Points, but not to a particular Unicode encoding or
> character class, and *that* was Java's main folly. (UCS-2 is a
> strictly 16 bit per character encoding, but new Unicode standards
> specify 21 bit characters, so they had to "extend" it).

Um. Do you mean UTF-32? Because there's *no* binary representation of
Unicode Character Code Points that isn't an encoding of some sort. If
that's the case, that's unacceptable from a memory representation.

> I am unaware of unsolveable problems with Unicode and Eastern
> languages, I asked specifically about it. If you think Unicode is
> unfixably flawed in this respect, I guess we all should write off
> Unicode now rather than later? Can you detail why Unicode is
> unacceptable as a single world wide unifying character set?
> Especially, are there character sets which cannot be converted to
> Unicode and back, which is the main requirement to have Unicode
> Strings in a non-Unicode environment?

Legacy data and performance.

-austin
Posted by unknown (Guest)
on 18.06.2006 00:25
(Received via mailing list)
On Jun 17, 2006, at 5:48 PM, Juergen Strobel wrote:
> The way I see it we have to choose a character set.

What leads you to this conclusion?  I don't think it can be refuted
that there exists today an almost endless number of character sets
and text encodings in use. I don't understand why the core facilities
of a language should be intimately tied to any one of those
representations.  Once you do that you've decided that all other
representations are second class citizens.  Why not have the language
be agnostic about these things but still provide a coherent framework
for building libraries and applications that can be locale and
encoding-aware?

Gary Wright
Posted by Michal Suchanek (Guest)
on 18.06.2006 00:49
(Received via mailing list)
On 6/17/06, Juergen Strobel <strobel@secure.at> wrote:
> On Sat, Jun 17, 2006 at 10:52:24PM +0900, Austin Ziegler wrote:
> > On 6/17/06, Juergen Strobel <strobel@secure.at> wrote:

> > mean that the other is treated as the defined encoding) *or* there is a
> > built-in conversion for them.
>
> Since I probably cannot control which Strings I get from libraries,
> and dont't want to worry which ones I'll have to provide to them, this
> is weaker than my approach in this respect, see my next point.

It's apparent from the explanation above.
You do not have to look at the string encoding or worry which encoding
they are as long as they are compatible (i.e. iso-8859-1 and utf-8) -
there is a conversion for them. The string methods have to use
(internally) the encoding tag, and you can look if you are interested.
If the strings are incompatible it is a real problem. Not one created
by the implementation but one originating from the fact that the
strings cannot be automatically converted from one encoding to another.
But you can keep all your strings, even if they are in several
incompatible encodings. You are not limited to using just one
encoding.
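
A sketch of how tagged strings combine in later Ruby releases (1.9 or newer). Note one difference from the description above: those releases do not convert between ISO-8859-1 and UTF-8 automatically. Mixing them raises `Encoding::CompatibilityError` unless one side is ASCII-only, and the conversion has to be requested explicitly. `try_concat` is a hypothetical helper.

```ruby
# Concatenate two strings and report the encoding of the result, or
# :incompatible when Ruby refuses to combine them.
def try_concat(a, b)
  (a + b).encoding.name
rescue Encoding::CompatibilityError
  :incompatible
end

utf8   = "J\u00FCrgen"
latin1 = utf8.encode('ISO-8859-1')
ascii  = "abc".encode('US-ASCII')

p try_concat(utf8, ascii)                    # ASCII-only data mixes freely
p try_concat(utf8, latin1)                   # two different non-ASCII encodings
p try_concat(utf8, latin1.encode('UTF-8'))   # explicit conversion first
```

Both strings can still be kept side by side in one program; only operations that must merge their bytes need a common encoding.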


Michal
Posted by Julian 'Julik' Tarkhanov (Guest)
on 18.06.2006 01:17
(Received via mailing list)
On 17-jun-2006, at 23:55, Michal Suchanek wrote:

>
> First for reasons of efficiency. If an application is going to perform
> lots of slicing and poking on strings it will want some encoding that
> is suiatble for that such as UTF-32.
I would much rather prefer UTF-8 in a language such as Ruby which is
often used as glue between
other systems. UTF-8 is used for interchange and it's indisputable.
If you go for UTF-16 or UTF-32, you are most likely
to convert every single character of text files you read (in text
files present in the wild AFAIK UTF-16 and UTF-32 are a minority,
thanks to the BOM and other setbacks).

> If an application runs on system
> with little memory it will want space-efficient encoding (ie UTF-8 or
> UTF-16 for Asian languages). And if an appliaction runs on system that
> uses some legacy codepage it can read, write, and process all strings
> in that codepage. And in JRuby it will be useful to convert strings to
> UTF-16 so that the native Java functions can be used for manipulation.
>
> In your model you can modify Ruby to use
> strings composed of TRON characters instead of Unicode characters. But
> how would Unicode Ruby and TRON Ruby exchange strings?

I think Alan Little summed it up very well. The problem with Unicode
in Ruby is the striving for perfection
(i.e. satisfying the users of every conceivable or needed encoding).
It's very noble, and I personally can't imagine it
(even with the "democratic coerce" approach Austin cited). The only
thing I don't know is whether a system with this type of handling can be
built at all, and how it would interoperate.

Up until now all scripting languages I used somewhat (Perl, Python,
Ruby) allowed all encodings in strings and doing Unicode in them hurts.

Bluntly put, I am selfish and I don't believe in the "saving grace"
of the M17N (because I just can't wrap it around my head and I sure
as hell know it's going to be VERY complex).
It's also what bothers me the most about Ruby's "unicode
discussions" (I've read all of them on this list dating back to 2002
because I need it to work NOW): they
always descend into this kind of religious discussion in the spirit
of "but your encoding is not good enough", "but my bad encoding isn't
that one and I still need it to work" etc.

While for me the greatest thing about Unicode is that it's Just Good
Enough. And it doesn't seem Unicode is indeed THAT useless for CJK
languages either
(although I'm sure Paul can correct me - all 4 languages I have
command of use only 2 writing systems with some odd additions here
and there).

And no, I didn't have a chance to see a TRON system in the wild. If
someone would show me one within 200 km distance I would be glad to
take a look.
Posted by Julian 'Julik' Tarkhanov (Guest)
on 18.06.2006 01:20
(Received via mailing list)
On 18-jun-2006, at 0:21, Austin Ziegler wrote:
> Legacy data and performance.
Yes, you will spend those cycles to count the letters in my language
RIGHT :-)) (evil grin)
It's actually the most common case when apps damage strings in my
language - their authors wanted to be smart
and _conserve_. And yes, normalization etc. is complex and you DO
need to have a case-conversion table in memory. Please do have one
(Ruby doesn't).

No offense, just observation.
Posted by Stefan Lang (Guest)
on 18.06.2006 02:18
(Received via mailing list)
On Saturday 17 June 2006 23:55, Michal Suchanek wrote:
> On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote:
[...]
> And if an application runs on system that uses some legacy codepage
> it can read, write, and process all strings in that codepage. And
> in JRuby it will be useful to convert strings to UTF-16 so that the
> native Java functions can be used for manipulation.

If you really need this level of efficiency, Ruby is probably
the wrong language anyway. Regarding JRuby: Of course each
implementation would be free to choose an internal Unicode
encoding. If somebody has enough time and motivation he can
even implement support for multiple encodings and let the user
choose at build-time.

[...]
> > Or do you mean Ruby should determine the encoding
> > automatically? IMO, that would be bad magic and error-prone.
>
> No. But if you read part of html/xml document before the encoding
> was specified there is no reason why that part has to be converted
> or reread. You apparently got it right if you were able to
> determine the encoding from what you read.

The conversion would be done anyway, iff a single internal
encoding was chosen and iff the encoding of the input doesn't
match the internal encoding.

>
> That will be needed anyway. You cannot expect all libraries to use
> the arbitrary encoding you chose for Ruby strings.

I assume you mean C libraries here.
Posted by Austin Ziegler (austin)
on 18.06.2006 04:38
(Received via mailing list)
On 6/17/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> Yes, you will spend those cycles to count the letters in my language
> RIGHT :-)) (evil grin) It's actually the most common case when apps
> damage strings in my language - their authors wanted to be smart and
> _conserve_. And yes, normalization etc. is complex and you DO need to
> have a case-conversion table in memory. Please do have one (Ruby
> doesn't).

I think you're overthinking the problem. Let's consider the guarantees
that an m17n String would make:

  * #size and #length would return the number of glyphs
  * #[] would return glyphs

Presumably, in Regexen with an m17n String, \w would indicate only
"word" glyphs. Other guarantees *would* be made along that line.

Therefore, if your input data is UTF-8, anything that deals with #size,
#length, and character-based indexing *will just work*. The same will
apply to SJIS or any other encoding. The number of times that people are
dealing with mixed-encoding data is vanishingly small, and even when
a developer must, they will probably use a Unicode encoding to
deal with that. But if you're using SJIS, you're just going to want to use
*that*.
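
As a rough illustration of the guarantee described above, here is how character-based `#size` and `#[]` behave in Ruby 1.9 and newer (an assumption, since that API did not exist when this was written). One caveat: these methods count characters as defined by the encoding, not rendered glyphs. The variable names are illustrative.

```ruby
text_utf8 = "\u65e5\u672c\u8a9e"            # "日本語" tagged as UTF-8
text_sjis = text_utf8.encode("Shift_JIS")   # same text tagged as Shift_JIS

# Character counts agree, byte counts do not.
utf8_chars, utf8_bytes = text_utf8.size, text_utf8.bytesize
sjis_chars, sjis_bytes = text_sjis.size, text_sjis.bytesize

# Indexing returns a whole character in the string's own encoding.
first_sjis_char = text_sjis[0]
```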

That's what the m17n String is about. It's not about dictating a single
encoding, but enabling people to use Strings intelligently.

> No offense, just observation.

I agree -- we *need* full Unicode support. But not at the cost of legacy
code pages in favour of Unicode. It's not always appropriate.

-austin
Posted by Tim Bray (Guest)
on 18.06.2006 06:19
(Received via mailing list)
On Jun 17, 2006, at 4:08 AM, Juergen Strobel wrote:

> 1. Strings should deal in characters (code points in Unicode) and not
> in bytes, and the public interface should reflect this.

Be careful.  People who care about this stuff might want to read
http://www.w3.org/TR/2005/REC-charmod-20050215/ It turns out that
characters do not correspond one-to-one with units of sound, or units
of input, or units of display.  Except for low-level stuff like
regexps, it's very difficult to write any code that goes character-at-
a-time that doesn't contain horrible i18n bugs. For practical
purposes, a String is a more useful basic tool than a character.

> 5. Since the String class is quite smart already, it can implement
> generally useful and hard (in the domain of Unicode) operations like
> case folding, sorting, comparing etc.

Be careful.  Case folding is a horrible can of worms, is rarely
implemented correctly, and when it is (the Java library tries really
hard) is insanely expensive.  The reason is that case conversion is
not only language-sensitive but jurisdiction sensitive (in some
respects different in France & Québec).  Trying to do case-folding on
text that is not known to be ASCII is likely a symptom of a bug.
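
A small sketch of why case conversion is not a simple per-character table lookup. It assumes Ruby 2.4 or newer, where `String#upcase` applies Unicode case mappings and accepts a `:turkic` option; none of this existed at the time of the thread.

```ruby
# One character can become two: the result is longer than the input.
german_upper = "stra\u00dfe".upcase      # "straße" upcased

# The same letter maps differently depending on the language rules chosen.
default_i = "i".upcase                   # plain "I"
turkic_i  = "i".upcase(:turkic)          # dotted capital I, U+0130
```

The caller has to know which language rules apply; the string itself does not carry that information.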

> - This ties Ruby's String to Unicode. A safe choice IMHO, or would we
> really consider something else? Note that we don't commit to a
> particular encoding of Unicode strongly.

For information: The XML view is that Shift-JIS, KOI8-R, EBCDIC, and
many others are all encodings of Unicode and a best effort should be
made to accept and emit all sane encodings on demand.  Most XML
software sticks to a single encoding, internally.

  -Tim
Posted by Tim Bray (Guest)
on 18.06.2006 06:29
(Received via mailing list)
On Jun 17, 2006, at 6:50 AM, Stefan Lang wrote:

> It seems that the main argument against using Unicode strings
> in Ruby is because Unicode doesn't work well for eastern
> countries.

Point of information: there are highly successful word-processing
products selling well in countries whose writing systems include Han
characters, which internally use Unicode.   So while the Han-
unification problems have been much discussed and are regarded as
important by people who are not fools, in fact there is existence
proof that Unicode does work well enough for wide deployment in
commercial software.

> If Unicode is choosen as character set, there is the
> question which encoding to use internally. UTF-32 would be a
> good choice with regards to simplicity in implementation,

UTF-32 has a practical problem in that in C code, you can't use
strcmp() and friends because it's full of null bytes.  Of course if
you're careful to code everything using wchar_t you'll be OK, but lots
of code isn't.  (UTF-8 doesn't have this problem and is much more compact.)
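
To make the null-byte point concrete, here is a sketch that dumps the bytes of the same ASCII word in both encodings. It uses `String#encode` and `String#bytes` from Ruby 1.9 and newer, which is an assumption relative to the Ruby of this thread.

```ruby
# The same four ASCII letters in two encodings.
utf32_bytes = "Ruby".encode("UTF-32BE").bytes
utf8_bytes  = "Ruby".encode("UTF-8").bytes

# Every UTF-32 code unit for an ASCII letter carries three zero bytes,
# which a C function like strcmp() would treat as end-of-string.
zero_bytes_in_utf32 = utf32_bytes.count(0)
zero_bytes_in_utf8  = utf8_bytes.count(0)
```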

> Consider
> indexing of Strings:
>
>         "some string"[4]
>
> If UTF-32 is used, this operation can internally be
> implemented as a simple, constant array lookup. If UTF-16 or
> UTF-8 is used, this is not possible to implement as an array

Correct.  But in practice this seems not to be too huge a problem,
since in practice text is most often accessed sequentially.  The
times that you really need true random access to the N'th character
are rare enough that for some problems, the advantages of UTF-8 are
big enough to compensate for this problem.  Note that in a variable-
length character encoding, there's no trouble whatever with a table
of pointers into text; the *only* problem is when you need to find
the Nth character cheaply.
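
A minimal sketch of the cost being discussed: finding where the Nth character starts in UTF-8 means scanning from the beginning and skipping continuation bytes. The helper name `utf8_char_offset` is hypothetical and the input is assumed to be valid UTF-8.

```ruby
# Returns the byte offset at which the n-th (0-based) code point starts
# in a UTF-8 string, or nil if the string has fewer code points.
def utf8_char_offset(str, n)
  count = 0
  str.each_byte.with_index do |byte, offset|
    # Continuation bytes look like 10xxxxxx; any other byte starts a code point.
    next if (byte & 0xC0) == 0x80
    return offset if count == n
    count += 1
  end
  nil
end

word = "na\u00efve"                      # the third character takes two bytes
offset_of_v = utf8_char_offset(word, 3)  # must walk past the 2-byte character
```

The scan is linear in the offset, which is the price paid for the compact encoding; a stored table of byte offsets avoids repeating it.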

> An advantage of using UTF-8 would be that for pure ASCII files
> no conversion would be necessary for IO.

Be careful.  There are almost no pure ASCII files left.  Café.
Ordoñez. “Smart quotes”

  -Tim
Posted by Tim Bray (Guest)
on 18.06.2006 06:36
(Received via mailing list)
On Jun 17, 2006, at 6:52 AM, Austin Ziegler wrote:

>> The internal encoding has to be decided, probably between UTF-8,
>> UTF-16, and UTF-32 by the String class implementor.
>
> Completely disagree. Matz has the right choice on this one. You can't
> think in just terms of a pure Ruby implementation -- you *must* think
> in terms of the Ruby/C interface for extensions as well.

Point of information: Of all the widely-used methods of encoding
international strings, UTF-8 is by far the easiest to deal with in C.

> Trust me on this
> one: I *have* done some low-level encoding work. Additionally, even
> though I might have marked a network object as "UTF-8", I may not know
> whether it's *actually* UTF-8 or not until

That's an incredibly important point in a networked world.  One of
the reasons XML has had so much success, probably more than it
deserves, is that its encoding is self-descriptive.  To quote Larry
Wall: "An XML document knows what encoding it's in."  Since HTTP
headers are (sigh) known to be wrong on occasion, this is a pretty
big value-add.
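
A sketch of checking a label instead of trusting it, assuming Ruby 1.9 or newer (`String#valid_encoding?` did not exist when this was posted). The fallback to ISO-8859-1 is a hypothetical policy chosen for the example, not something the thread prescribes.

```ruby
# Bytes that arrived labelled as UTF-8 but are really ISO-8859-1.
payload = "caf\xE9".dup.force_encoding(Encoding::UTF_8)

# The label is wrong: 0xE9 is not a complete UTF-8 sequence here.
label_is_honest = payload.valid_encoding?

# Hypothetical fallback: re-tag as ISO-8859-1, then transcode to UTF-8.
repaired = payload.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
```

A validity check only catches some mislabelled data, since many byte sequences are valid in more than one encoding.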

>> - This ties Ruby's String to Unicode. A safe choice IMHO, or would we
>> really consider something else? Note that we don't commit to a
>> particular encoding of Unicode strongly.
>
> This is a wash. I think that it's better to leave the options open.
> After all, it *is* a hope of mine to have Ruby running on iSeries
> (AS/400) and *that* still uses EBCDIC.

EBCDIC is in fact an encoding of Unicode.  Just saying that it's
necessary to be clear both as to what character set is being
supported, and what limitations on encoding are enforced.

-Tim
Posted by Tim Bray (Guest)
on 18.06.2006 06:45
(Received via mailing list)
On Jun 17, 2006, at 10:34 AM, Stefan Lang wrote:

> Or do you mean Ruby should determine the encoding
> automatically? IMO, that would be bad magic and error-prone.

Not possible in the general case.  There are a few data formats
including XML and ASN.1, which make it possible to reliably infer the
encoding from the instance, but a lot of Web processing these days is
best-guess, and often fails.

> How does the regular expression engine work then?

The two sane options are
(a) have a fixed encoding for Strings and compile the regex in such a
way that it runs directly on the encoding.  This has been done for
both UTF-8 and UTF-16 and is insanely efficient, but it locks you
into the fixed encoding.
(b) have an iterator which produces abstract characters from whatever
encoding is in use and run the regex over the characters, not the
bytes of the representation.  The implementation is trickier and
performance is an issue, but you're not locked to an encoding.

-Tim
Posted by Tim Bray (Guest)
on 18.06.2006 06:51
(Received via mailing list)
On Jun 17, 2006, at 2:55 PM, Michal Suchanek wrote:

> First for reasons of efficiency. If an application is going to perform
> lots of slicing and poking on strings it will want some encoding that
> is suitable for that such as UTF-32. If an application runs on system
> with little memory it will want space-efficient encoding (ie UTF-8 or
> UTF-16 for Asian languages).

Um, the practical experience is that the code required to unpack a
UTF-8 stream into a sequence of integer codepoints (and reverse the
process) is easy and very efficient; to the point that for "slicing
and poking", UTF-8 vs UTF-16 vs UTF-32 is pretty well a wash.
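
For what it's worth, the unpack-and-repack round trip mentioned above is a one-liner in Ruby using the `U` template of `String#unpack` and `Array#pack`. The sample string is illustrative.

```ruby
text = "h\u00e9llo"

# UTF-8 bytes to an array of integer code points.
code_points = text.unpack("U*")

# Slice and poke on integers, then pack back to UTF-8.
round_trip = code_points.pack("U*")
first_two  = code_points[0, 2].pack("U*")
```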

  -Tim
Posted by Tim Bray (Guest)
on 18.06.2006 06:57
(Received via mailing list)
On Jun 17, 2006, at 3:15 PM, Austin Ziegler wrote:

> Why? What if the library that I'm interfacing with requires EUC-JP?
> Sorry, but Unicode is *not necessarily* the right answer.

Indeed it's not, but this argument escapes me.  If you try to feed that
library an Arabic string, something will break, because EUC-JP can't
represent Arabic.  So what?  Whatever character set(s) you
standardize on, there is going to be existing software that won't be
able to handle all of it... I'm just not following your argument.
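
A sketch of the "something will break" case, using the transcoding API from Ruby 1.9 and newer (an assumption relative to this thread). The sample strings are illustrative.

```ruby
japanese = "\u65e5\u672c"               # "日本"
arabic   = "\u0633\u0644\u0627\u0645"   # an Arabic word

# Japanese text converts to EUC-JP without trouble.
japanese_bytes = japanese.encode("EUC-JP").bytesize

# Arabic has no representation in EUC-JP, so the conversion raises.
arabic_error = begin
  arabic.encode("EUC-JP")
  nil
rescue Encoding::UndefinedConversionError => e
  e.class
end
```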

  -Tim
Posted by Tim Bray (Guest)
on 18.06.2006 07:00
(Received via mailing list)
On Jun 17, 2006, at 3:22 PM, gwtmp01@mac.com wrote:

> be locale and encoding-aware?
I'm not close enough to Ruby to have a useful opinion, but for many
other software systems, the designers decided that the performance
and interoperability gains achievable by limiting themselves to
Unicode were a compelling enough argument, and so chose.

In particular, these days, both the W3C and the IETF overwhelmingly
specify the use of Unicode characters when text is to be included in
protocols or data delivery formats.  So even if you can handle lots
of non-Unicode stuff, the Net may have difficulty getting it to you.

  -Tim
Posted by Charles O Nutter (Guest)
on 18.06.2006 07:03
(Received via mailing list)
I'll chime back in with my not-so-expert opinion, so it's known where I
stand. Take it for whatever it's worth.

- I almost entirely agree with Juergen's longer post on what unicode
support should look like in 2.0. I won't go into the details of what I
disagree with because I'm a little squishy in those areas.
- I believe that supporting encoding-tagged strings would be a horrible,
horrible mess for both Ruby VM/interpreter implementers and extension
implementers while not adding any serious benefits for Ruby the
language. When it comes down to it, you're going to have string A using
encoding X and string B using encoding Y and in order to work with them
both together you'll have to find some common ground. Settle on common
ground early or you pay the price to do it EVERY time you work with
strings later.
- I have no intention to ever write a C extension for Ruby. I know many
out there do. However, I think the important thing about Ruby is Ruby,
and making the language bend over backwards to make life easier for C
hackers is absurd. Making unicode support needlessly complex in Ruby
(the language) only ends up hurting its usability. I for one would not
want to sacrifice the beauty and simplicity of Ruby solely to appease
the C community. Flame on if you will, but The Ruby Way should rule
here.
- In the end, I should not have to care what encoding strings use
internally unless I absolutely have to know. Every time questions come
up about unicode support in Java, I have to look it up...UTF-8? UTF-16?
UCS-2? I rarely need to know this information, and I rarely remember
it. That's exactly the point. Make the one internal encoding whatever is
deemed most flexible, most performant, and above all *most global*.
Nobody writing Ruby code should have to care.
- I so rarely work with Strings on a character-by-character basis, and
when I do all I should have to say is get_character and know that what I
have represents a full and complete character representation. If you're
dealing with bytes, call it what it is--the aforementioned ByteBuffer.
Ruby needs to support the concepts of Strings and ByteBuffers
independently.

I think it all comes back to a simple question: Which method of
supporting unicode would feel the most "Ruby"? Which one is DRY and KISS
and all the other lovely acronyms this community holds so dear? Figure
that out, and there's your answer. I'd be willing to bet it's not
every-string-can-encode-differently, because I don't see how that would
ever help me write better Ruby code...and improving Ruby is the point of
all this, right?
Posted by Tim Bray (Guest)
on 18.06.2006 07:07
(Received via mailing list)
On Jun 17, 2006, at 4:15 PM, Julian 'Julik' Tarkhanov wrote:

> I would much rather prefer UTF-8 in a language such as Ruby which  
> is often used as glue between
> other systems. UTF-8 is used for interchange and it's indisputable.  
> If you go for UTF-16 or UTF-32, you are most likely
> to convert every single character of text files you read (in text  
> files present in the wild AFAIK UTF-16 and UTF-32 are a minority,  
> thanks to the BOM and other setbacks).

There's a lot of UTF-16 out there.  There's more ISO-8859-* than
that, and more Microsoft code-page-* text than everything else put
together.  Yes, with UTF-16 & -32 you do a lot of byte swapping but
it's pretty cheap and pretty reliable.  (I like UTF-8 too, but it's
not without issues).

  -Tim
Posted by Michal Suchanek (Guest)
on 18.06.2006 12:54
(Received via mailing list)
On 6/18/06, Stefan Lang <langstefan@gmx.at> wrote:
> > encoding that is suiatble for that such as UTF-32. If an
> encoding. If somebody has enough time and motivation he can
> even implement support for multiple encodings and let the user
> choose at build-time.

Why? It can already handle utf-8 strings or arrays of unicode
codepoints. They just do not feel like strings with ruby 1.8. What I
want is a glue in string class that does make them feel so.

> > had to read anyway?
> encoding was choosen and iff the encoding of the input doesn't
> match the internal encoding.

However, if you can choose the encoding there is no need to recode at
all. You just keep the string as is, and there is a good chance the
output encoding will match the input encoding.

And in case you need to recode the string you got the encoding
information, and the recoding can be done automatically, and only when
needed.

Michal
Posted by Michal Suchanek (Guest)
on 18.06.2006 13:09
(Received via mailing list)
On 6/18/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> If you go for UTF-16 or UTF-32, you are most likely
> to convert every single character of text files you read (in text
> files present in the wild AFAIK UTF-16 and UTF-32 are a minority,
> thanks to the BOM and other setbacks).

Here you go. You can have the strings in UTF-8, and I can have them in
UTF-32. That is the flexibility of the solution without a fixed
encoding.

> > how would Unicode Ruby and TRON Ruby exchange strings?
>
> I think Alan Little summed it up very well. The problem with Unicode
> in Ruby is strive for perfection
> (i.e. satisfy the users of every conceivable or needed encoding).
> It's very noble and I personally can't imagine it
> (even with the "democratic coerce" approach Austin cited). The only
> thing I don't know if a system having this type of handling can be
> built at all and how it will interoperate.

But quite a few people here look like they do know. I do not know much
about regexes but I can imagine just about any other string operation.
And the current regexes already do operate on multiple encodings.

>
> Up until now all scripting languages I used somewhat (Perl, Python,
> Ruby) allowed all encodings in strings and doing Unicode in them hurts.

And how does that lead to the conclusion that there should be only one
encoding?

>
> Bluntly put, I am selfish and I don't believe in the "saving grace"
> of the M17N (because I just can't wrap it around my head and I sure
> as hell know it's going to be VERY complex).

That's the point. If it is wrapped into the string class you do not
have to wrap it around your head.

> It's also something that bothers me the most about Ruby's "unicode
> discussions" (I've read all of them on this list dating back to 2002
> because I need it to work NOW) and they
> always transcend into this kind of religious discussion in the spirit
> of "but your encoding is not good enough", "but my bad encoding isn't
> that one and I still need it to work" etc.

And that is exactly why a fixed encoding is bad. If strings can be
encoded in any way there is no point in religious discussions about which
encoding you like the most.

>
> While for me the greatest thing about Unicode is that it's Just Good
> Enough. And it doesn't seem Unicode is indeed THAT useless for CJK
> languages either
> (although I'm sure Paul can correct me - all the 4 languages I am in
> control of use only 2 scripting systems with some odd additions here
> and there).

It is JustGoodEnough for most cases but not for all. It is not useless
for CJK, just suboptimal because of the Han unification. And it also
does not try to include the historic characters.

>
> And no, I didn't have a chance to see a TRON system in the wild. If
> someone would show me one within 200 km distance I would be glad to
> take a look.

I do not care. Some people find that encoding useful. Since the
potential to support any encoding including TRON does not get in the
way when I deal with my text I am fine with that.

Michal
Posted by Juergen Strobel (Guest)
on 18.06.2006 16:18
(Received via mailing list)
On Sun, Jun 18, 2006 at 07:21:25AM +0900, Austin Ziegler wrote:
> 
> Which code page? EBCDIC has as many code pages (including a UTF-EBCDIC)
> as exist in other 8-bit encodings.

Obviously, EBCDIC -> UNICODE -> same EBCDIC Codepage as before.

> >>Not to mention that Matz has explicitly stated in the past that he
> >character class, and *that* was Java's main folly. (UCS-2 is a
> >strictly 16 bit per character encoding, but new Unicode standards
> >specify 21 bit characters, so they had to "extend" it).
> 
> Um. Do you mean UTF-32? Because there's *no* binary representation of
> Unicode Character Code Points that isn't an encoding of some sort. If
> that's the case, that's unacceptable from a memory representation.

Yes, I do mean the String *interface* to be UTF-32, or pure code
points, which is the same but less susceptible to standard changes, if
accessed at character level. If accessed at substring level, a
substring of a String is obviously a String, and you don't need a
bitwise representation at all.

According to my proposal, Strings do not need an encoding from the
String user's point of view when working just with Strings, and users
won't care apart from memory/performance consumption, which I believe
can be made good enough with a totally encapsulated, internal storage
format to be decided later. I will avoid a premature optimization
debate here now.

Of course encoding matters when Strings are read or written somewhere,
or converted to bit-/bytewise representation explicitly. The Encoding
Framework, however it'll look, needs to be able to convert to and from
Unicode code points for these operations only, and not between
arbitrary encodings. (You *may* code this to recode directly from
the internal storage format for performance reasons, but that'll be
transparent to the String user.)
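
A sketch of conversion happening only at the I/O boundary, as proposed above. It borrows the external:internal encoding mode string from Ruby 1.9 and newer, which is an assumption here and not necessarily the Encoding Framework the post has in mind. The temporary file stands in for a legacy data source.

```ruby
require "tempfile"

text = Tempfile.create("legacy") do |file|
  # Write "café" as raw ISO-8859-1 bytes, as a legacy system might.
  file.binmode
  file.write("caf\xE9".b)
  file.flush

  # Read it back, declaring the external encoding and asking for UTF-8
  # internally; the conversion happens during the read.
  File.open(file.path, "r:ISO-8859-1:UTF-8") { |io| io.read }
end
```

After the read, the program only ever sees one kind of String; the legacy encoding exists solely at the file boundary.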

This breaks down for characters not represented in Unicode at all, and
is a nuisance for some characters affected by the Han Unification
issue.  But Unicode set out to prevent exactly this, and if we
believe in Unicode at all, we can only hope they'll fix this in an
upcoming revision. Meanwhile we could map any additional characters
(or sets of) we need to higher, unused Unicode planes, that'll be no
worse than having different, possibly incompatible kinds of Strings.

We'll need an additional class for pure byte vectors, or just use
Array for this kind of work, and I think this is cleaner.

Regarding Java, they switched from UCS-2 to UTF-16 (mostly). UCS-2 is
a pure 16 bit per character encoding and cannot represent code points
above 0xffff. UTF-16 works like UTF-8, but with 16 bit chunks.  But
their abstraction of a single character, the class Char(acter), is
still only 16 bits wide, which leads to confusion, similar to the C
type char, which cannot represent all real characters either. It is
even worse than in C, because C explicitly defines char to be a memory
cell of 8 bits or more, whereas Java really meant Char to be a
character.
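
To illustrate the 16-bit limit, here is a sketch (in Ruby rather than Java, using `String#encode` from Ruby 1.9 and newer as an assumption) showing that a code point above 0xffff needs two 16-bit units in UTF-16.

```ruby
clef = "\u{1D11E}"   # MUSICAL SYMBOL G CLEF, a code point above 0xffff

# One character as far as the string is concerned...
character_count = clef.length

# ...but two 16-bit code units (a surrogate pair) in UTF-16.
utf16_units = clef.encode("UTF-16BE").unpack("n*")
```

A type that holds a single 16-bit unit can therefore carry only half of this character.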

> >I am unaware of unsolveable problems with Unicode and Eastern
> >languages, I asked specifically about it. If you think Unicode is
> >unfixably flawed in this respect, I guess we all should write off
> >Unicode now rather than later? Can you detail why Unicode is
> >unacceptable as a single world wide unifying character set?
> >Especially, are there character sets which cannot be converted to
> >Unicode and back, which is the main requirement to have Unicode
> >Strings in a non-Unicode environment?
> 
> Legacy data and performance.

Map legacy data, that is characters still not in Unicode, to a high
Plane in Unicode. That way all characters can be used together all the
time. When Unicode includes them we can change that to the official
code points. Note there are no files in String's internal storage
format, so we don't have to worry about reencoding them.

I am not worried about performance. I'd code in C if I were, or
Lisp.

For one, Moore's law is at work and my whole proposal was for 2.0. My
proposal only adds a constant factor to String handling, it doesn't
have higher order complexity.

On the other hand, conversions need to be done at other times with my
proposal than for M17N Strings, and it depends on the application if
that is more or less often.  String-String operations never need to do
recoding, as opposed to M17N Strings. I/O always needs conversion, and
may need conversion with M17N too. I have a hunch that allowing
different kinds of Strings around (as in M17N presumably) should
require recoding far more often.

Jürgen
Posted by Juergen Strobel (Guest)
on 18.06.2006 16:49
(Received via mailing list)
On Sun, Jun 18, 2006 at 07:22:34AM +0900, gwtmp01@mac.com wrote:
> be agnostic about these things but still provide a coherent framework  
> for building libraries and applications that can be locale and  
> encoding-aware?
> 
> Gary Wright
> 

Maybe I was unclear. I didn't mean Ruby has to choose an existing
standard, but Ruby has to choose which set of characters to handle in
Strings, in the mathematical sense.

Language implementation, and usage of the String class should be
easier if this set is

- well defined

Unicode code points are pretty good in this respect, better than the
union of all characters in all encodings of possible M17N Strings.
And we may use private extensions to Unicode for legacy characters not
included in Unicode already.

- All characters are equally allowed in all Strings.

M17N fails this one. a[5] = b[3] if their encodings are incompatible?

At best it'll coerce a to an encoding which can handle both, which
would be Unicode 98% of the time anyway, 1% something else, and 1%
totally fail. Don't nail me down on the numbers.

Mathematically, String functions should be defined on the whole set,
not subsets, or their application becomes a chore.

Jürgen
Posted by Julian 'Julik' Tarkhanov (Guest)
on 18.06.2006 16:56
(Received via mailing list)
On 18-jun-2006, at 6:17, Tim Bray wrote:
>
> Be careful.  Case folding is a horrible can of worms, is rarely  
> implemented correctly, and when it is (the Java library tries  
> really hard) is insanely expensive.  The reason is that case  
> conversion is not only language-sensitive but jurisdiction  
> sensitive (in some respects different in France & Québec).  Trying  
> to do case-folding on text that is not known to be ASCII is likely  
> a symptom of a bug.

Let's write a specification.
Posted by Austin Ziegler (austin)
on 18.06.2006 17:32
(Received via mailing list)
On 6/18/06, Juergen Strobel <strobel@secure.at> wrote:
> On Sun, Jun 18, 2006 at 07:21:25AM +0900, Austin Ziegler wrote:
>> Um. Do you mean UTF-32? Because there's *no* binary representation of
>> Unicode Character Code Points that isn't an encoding of some sort. If
>> that's the case, that's unacceptable from a memory representation.
> Yes, I do mean the String *interface* to be UTF-32, or pure code
> points, which is the same but less susceptible to standard changes, if
> accessed at character level. If accessed at substring level, a
> substring of a String is obviously a String, and you don't need a
> bitwise representation at all.

Again, this is completely unacceptable from a memory usage perspective.
I certainly don't want my programs taking up 4x the additional memory
for string handling.

But "pure code points" is a red herring and a mistake in any case. Code
points aren't sufficient. You need glyphs, and some glyphs can be
produced with multiple code points (e.g., LOWERCASE A + COMBINING ACUTE
ACCENT as opposed to A ACUTE). Indeed, some glyphs can *only* be
produced with multiple code points. Dealing with this intelligently
requires a *lot* of smarts, but it's precisely what we should do.
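
A sketch of the code point versus glyph distinction made above, using the two spellings of "a with acute" named in the paragraph. It assumes Ruby 2.5 or newer for `String#grapheme_clusters` and `String#unicode_normalize`, neither of which existed at the time of the thread.

```ruby
precomposed = "\u00e1"    # LATIN SMALL LETTER A WITH ACUTE
combining   = "a\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

# Counting code points gives different answers for the same visible text.
code_point_counts = [precomposed.length, combining.length]

# Counting grapheme clusters gives the same answer for both.
cluster_counts = [precomposed.grapheme_clusters.length,
                  combining.grapheme_clusters.length]

# Raw comparison fails; comparison after normalization succeeds.
equal_raw = (precomposed == combining)
equal_nfc = (precomposed.unicode_normalize(:nfc) ==
             combining.unicode_normalize(:nfc))
```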

> According to my proposal, Strings do not need an encoding from the
> String user's point of view when working just with Strings, and users
> won't care apart from memory/performance consumption, which I believe
> can be made good enough with a totally encapsulated, internal storage
> format to be decided later. I will avoid a premature optimization
> debate here now.

Again, you are incorrect. I *do* care about the encoding of each String
that I deal with, because only that allows me (or String) to deal with
conversions appropriately. Granted, *most* of the time, I won't care.
But I do work with legacy code page stuff from time to time, and
pronouncements that I won't care are just arrogance or ignorance.

> Of course encoding matters when Strings are read or written somewhere,
> or converted to bit-/bytewise representation explicitly. The Encoding
> Framework, however it'll look, needs to be able to convert to and from
> Unicode code points for these operations only, and not between
> arbitrary encodings. (You *may* code this to recode directly from
> the internal storage format for performance reasons, but that'll be
> transparent to the String user.)

I prefer arbitrary encoding conversion capability.

> This breaks down for characters not represented in Unicode at all, and
> is a nuisance for some characters affected by the Han Unification
> issue.  But Unicode set out to prevent exactly this, and if we
> believe in Unicode at all, we can only hope they'll fix this in an
> upcoming revision. Meanwhile we could map any additional characters
> (or sets of) we need to higher, unused Unicode planes, that'll be no
> worse than having different, possibly incompatible kinds of Strings.

Those choices aren't ours to make.

> We'll need an additional class for pure byte vectors, or just use
> Array for this kind of work, and I think this is cleaner.

I don't. Such an additional class adds unnecessary complexity to
interfaces. This is the *main* reason that I oppose the foolish choice
to pick a fixed encoding for Ruby Strings.

>> Legacy data and performance.
> Map legacy data, that is characters still not in Unicode, to a high
> Plane in Unicode. That way all characters can be used together all the
> time. When Unicode includes them we can change that to the official
> code points. Note there are no files in String's internal storage
> format, so we don't have to worry about reencoding them.

Um. This is the statement of someone who is ignoring legacy issues.
Performance *is* a big issue when you're dealing with enough legacy
data. Don't punish people because of your own arrogance about encoding
choices.

Again: Unicode Is Not Always The Right Choice. Anyone who tells you
otherwise is selling you a Unicode toolkit and only has their wallet in
mind. Unicode is *often* the right choice, but it's *not* the only
choice and there are times when having the *flexibility* to work in
other encodings without having to work through Unicode as an
intermediary is the right choice. And from an API perspective,
separating String and "ByteVector" is a mistake.

> On the other hand, conversions needs to be done at other times with my
> proposal than for M17N Strings, and it depends on the application if
> that is more or less often.  String-String operations never need to do
> recoding, as opposed to M17N Strings. I/O always needs conversion, and
> may need conversion with M17N too. I have a hunch that allowing
> different kinds of Strings around (as in M17N presumably) should
> require recoding far more often.

Unlikely. Mixed-encoding data handling is uncommon.

-austin
Posted by Julian 'Julik' Tarkhanov (Guest)
on 18.06.2006 17:32
(Received via mailing list)
On 18-jun-2006, at 13:08, Michal Suchanek wrote:
>
> But quite a few people here look like they do know. I do not know much
> about regexes but I can imagine just about any other string operation.
> And the current regexes already do operate on multiple encodings.
Oh, lord... Have you at least tried that to make such assumptions? In
other words, tell me, can Ruby's regexes cope with the following:

/[а-я]/
/[а-я]/i

or something like this:
http://rubyforge.org/cgi-bin/viewvc.cgi/icu4r/samples/demo_regexp.rb?revision=1.2&root=icu4r&view=markup

>
>
> And how that leads to the conclusion that there should be only one  
> encoding?
Very simply - I use many pieces of software written in many languages
all the time, with non-Latin text.
I know that when they want to get "historically compatible" problems
arise. And the software that settles on Unicode
internally or somehow enforces it on the programmer usually works
best (all Cocoa and all C#. And to a certain extent yes, Java).

>
>>
>> Bluntly put, I am selfish and I don't believe in the "saving grace"
>> of the M17N (because I just can't wrap it around my head and I sure
>> as hell know it's going to be VERY complex).
>
> That's the point. If it is wrapped into the string class you do not
> have to wrap it around your head.

This is rather naive.
>
> And that is exactly why a fixed encoding is bad. If strings can be
> encoded in any way there is no point in religious discussions which
> encoding you like the most.

Yes, it just becomes hard and error prone to process them.
>
> It is JustGoodEnough for most cases but not for all. It is not useless
> for CJK, just suboptimal because of the Han unification. And it also
> does not try to include the historic characters.

I think this thread is going to end the same as the one in 2002 did.
Posted by Yukihiro Matsumoto (Guest)
on 18.06.2006 18:37
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Sun, 18 Jun 2006 23:46:40 +0900, Juergen Strobel 
<strobel@secure.at> writes:

|Language implementation, and usage of the String class should be
|easier if this set is
|
|- well defined 
|- All characters are equally allowed in all Strings.

I understand these attributes might make implementation easier.   But
who cares if I don't care.  And I am not sure how these make usage
easier, really.

Somebody who owns gigabytes of text data in legacy encoding (e.g. me),
wants to avoid encoding conversion back and forth between Unicode and
legacy encoding every time.  Another somebody wants text processing on
historical text whose character set is far bigger than Unicode.  The
"well-defined" simple implementation just prohibits those demands.  On
the contrary, M17N approach does not bother Universal Character Set
solution.  You just need to choose Unicode (UTF-8 or UTF-16) as
internal string representation, and convert encoding on I/O as you
might have done in Unicode centric languages.  Nothing lost.
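
A minimal sketch of the "convert on I/O" arrangement described above, assuming the external:internal mode strings that later Ruby versions accept in File.open (an API that postdates this message). The helper name read_legacy and the sample bytes are made up for the illustration.

```ruby
require "tempfile"

# Hypothetical helper: the file on disk is ISO-8859-1, the program wants UTF-8.
# The "r:EXTERNAL:INTERNAL" mode asks IO to transcode while reading.
def read_legacy(path)
  File.open(path, "r:ISO-8859-1:UTF-8") { |io| io.read }
end

Tempfile.create("legacy") do |f|
  f.binmode
  f.write("caf\xE9".b)     # "cafe" with e-acute, as ISO-8859-1 bytes
  f.flush

  text = read_legacy(f.path)
  puts text.encoding       # UTF-8
  puts text.bytesize       # 5, because e-acute takes two bytes in UTF-8
end
```

A program that prefers to stay in the legacy encoding would simply omit the internal part of the mode string, and no conversion would take place.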

You may worry about implementation difficulty (and performance), but
don't.  It's _my_ concern.  I made a prototype, and am convinced
that I can implement it with acceptable performance.

|Unicode code points are pretty good in this respect, better than the
|union of all characters in all encodings of possible M17N Strings.
|And we may use private extensions to Unicode for legacy characters not
|included in Unicode already.

"private extensions".  No.  It just causes another nightmare.

							matz.
Posted by Tim Bray (Guest)
on 18.06.2006 19:27
(Received via mailing list)
On Jun 18, 2006, at 8:29 AM, Austin Ziegler wrote:

> You need glyphs, and some glyphs can be
> produced with multiple code points (e.g., LOWERCASE A + COMBINING  
> ACUTE
> ACCENT as opposed to A ACUTE).

This is another thing you need your String class to be smart about.
You want an equality test between "más" and "más" to always be true
even if their "á" characters are encoded differently.  The right way to
solve this is called "Early Uniform Normalization" (see
http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization); the idea
is you normalize the composed characters at the time you create the
string, then the internal equality test can be done with strcmp() or
equivalent.
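
A small sketch of the idea, assuming String#unicode_normalize from later Ruby versions (2.2 and up); it was not available when this was written. The helper name normalized is hypothetical and stands in for whatever hook would run at string creation time.

```ruby
composed   = "m\u00E1s"    # a-acute as a single code point (U+00E1)
decomposed = "ma\u0301s"   # a followed by COMBINING ACUTE ACCENT (U+0301)

# Hypothetical creation-time hook: normalize once, compare bytes afterwards.
def normalized(str)
  str.unicode_normalize(:nfc)
end

puts composed == decomposed                          # false
puts normalized(composed) == normalized(decomposed)  # true
```

Once both strings are in the same normalization form, the comparison is a plain byte comparison, which is the point of normalizing early.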

>> Map legacy data, that is characters still not in Unicode, to a high
>> Plane in Unicode. That way all characters can be used together all  
>> the
>> time. When Unicode includes them we can change that to the official
>> code points. Note there are no files in String's internal storage
>> format, so we don't have to worry about reencoding them.
>
> Um. This is the statement of someone who is ignoring legacy issues.
> Performance *is* a big issue when you're dealing with enough legacy
> data.

Note that you don't have to use a high plane.  The Private Use Area
in the Basic Multilingual Plane has 6,400 code points, which is quite
a few.  Even if you did use a high plane, it's not obvious there'd be
a detectable runtime performance penalty.
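
The 6,400 figure can be checked with a couple of lines. The range U+E000..U+F8FF is the BMP Private Use Area as defined by Unicode; the constant and method names below are made up for the sketch.

```ruby
# BMP Private Use Area: U+E000 through U+F8FF inclusive.
BMP_PRIVATE_USE = (0xE000..0xF8FF)

# Hypothetical helper: is this code point available for private assignment?
def private_use?(codepoint)
  BMP_PRIVATE_USE.cover?(codepoint)
end

puts BMP_PRIVATE_USE.size      # 6400
puts private_use?(0xE000)      # true
puts private_use?(0x0041)      # false ("A")
```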

>  Unicode is *often* the right choice, but it's *not* the only
> choice and there are times when having the *flexibility* to work in
> other encodings without having to work through Unicode as an
> intermediary is the right choice.

That may be the case.  You need to do a cost-benefit analysis; you
could buy a lot of simplicity by decreeing all-Unicode-internally;
would the benefits of allowing non-Unicode characters be big enough
to compensate for the loss of simplicity?  I don't know the
answer, but it needs thinking about.

  -Tim
Posted by Christian Neukirchen (Guest)
on 18.06.2006 21:21
(Received via mailing list)
Tim Bray <tbray@textuality.com> writes:

> is you normalize the composed characters at the time you create the
> string, then the internal equality test can be done with strcmp() or
> equivalent.

Does that mean that  binary.to_unicode.to_binary != binary  is possible?
That could turn out pretty bad, no?
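
To make the concern concrete, here is a sketch of such a round trip. The method names to_unicode and to_binary mirror the fictive calls above and are not a real API; the normalization step assumes String#unicode_normalize from later Ruby versions.

```ruby
# Hypothetical: tag bytes as UTF-8 and normalize early (NFC).
def to_unicode(bytes)
  bytes.b.force_encoding(Encoding::UTF_8).unicode_normalize(:nfc)
end

# Hypothetical: hand back the raw bytes of a string.
def to_binary(str)
  str.b
end

binary    = "ma\u0301s".b             # decomposed input, 5 bytes
roundtrip = to_binary(to_unicode(binary))

puts roundtrip == binary              # false
puts binary.bytesize                  # 5
puts roundtrip.bytesize               # 4
```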
Posted by Julian 'Julik' Tarkhanov (Guest)
on 18.06.2006 21:33
(Received via mailing list)
On 18-jun-2006, at 21:17, Christian Neukirchen wrote:

>> solve this is called "Early Uniform Normalization" (see
>> http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization); the idea
>> is you normalize the composed characters at the time you create the
>> string, then the internal equality test can be done with strcmp() or
>> equivalent.
>
> Does that mean that  binary.to_unicode.to_binary != binary  is  
> possible?
> That could turn out pretty bad, no?

And it does as long as you are not careful. One of the things I do is
normalize everything that comes IN
into something that is suitable and predictable.
Posted by Tim Bray (Guest)
on 18.06.2006 22:53
(Received via mailing list)
On Jun 18, 2006, at 12:17 PM, Christian Neukirchen wrote:

> possible?
> That could turn out pretty bad, no?

Yes, but having "más" != "más" is pretty bad too; the alternative is
normalizing at comparison time, which would really hurt for example
in a big sort, so you'd need to cache the normalized form, which
would be a lot more code.

binary.to_unicode looks a little weird to me... can you do that
without knowing what the binary is?  If it's text in a known
encoding, no breakage should occur.  If it's unknown bit patterns,
you can't really expect anything sensible to happen... or am I
missing an obvious scenario?  -Tim
Posted by Juergen Strobel (Guest)
on 18.06.2006 22:53
(Received via mailing list)
On Sat, Jun 17, 2006 at 11:24:45PM +0900, Austin Ziegler wrote:
> On 6/17/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> >On 17-jun-2006, at 15:52, Austin Ziegler wrote:
> >>>8. Because Strings are tightly integrated into the language with the
> >>>source reader and are used pervasively, much of this cannot be
> >>>provided by add-on libraries, even with open classes. Therefore the
> >>>need to have it in Ruby's canonical String class. This will break
> >>>some old uses of String, but now is the right time for that.
> >>"Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1.

My title was "A Plan for Unicode Strings in Ruby 2.0". I don't want to
rush things or break 1.8.

Jürgen
Posted by Juergen Strobel (Guest)
on 18.06.2006 23:15
(Received via mailing list)
On Mon, Jun 19, 2006 at 01:33:54AM +0900, Yukihiro Matsumoto wrote:
> 
> solution.  You just need to choose Unicode (UTF-8 or UTF-16) as
> internal string representation, and convert encoding on I/O as you
> might have done in Unicode centric languages.  Nothing lost.
> 
> You may worry about implementation difficulty (and performance), but
> don't.  It's _my_ concern.  I made a prototype, and am convinced
> that I can implement it with acceptable performance.

I never worried about performance much, that's Austin. :P

Thanks for clarifying that. So far I could not find much info on how
exactly M17N will work, especially on the role of the encoding tag, so
I had to guess a lot.

Given your explanation, it seems our ways are quite similar on the
interface side of things, so far as Unicode is concerned. You chose a
more powerful (and more complex) parametric class design for where I
would have left open only the possibility of transparently usable
subclasses for performance reasons.

I am happy we've worked that out now. And you are right, I am not that
much interested in the implementation, thank you for doing it. My
concern was with the interface of the String class, but several
posters misunderstood me and tried to draw me into implementation
issues.

Jürgen
Posted by Yukihiro Matsumoto (Guest)
on 19.06.2006 01:02
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 19 Jun 2006 00:29:46 +0900, Julian 'Julik' Tarkhanov 
<listbox@julik.nl> writes:

|In other words, tell me, can Ruby's regexes cope with the following:
|
|/[а-я]/
|/[а-я]/i

1.9 Oniguruma regexp engine should handle these, otherwise it's a bug.

							matz.
Posted by Julian 'Julik' Tarkhanov (Guest)
on 19.06.2006 01:11
(Received via mailing list)
On 19-jun-2006, at 1:00, Yukihiro Matsumoto wrote:

>
> 1.9 Oniguruma regexp engine should handle these, otherwise it's a bug.


I'll try to check. Oniguruma on 1.8.4 didn't cope, but maybe it just
wasn't hooked in properly.
Posted by Yukihiro Matsumoto (Guest)
on 19.06.2006 01:57
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 19 Jun 2006 08:09:29 +0900, Julian 'Julik' Tarkhanov 
<listbox@julik.nl> writes:

|> |/[а-я]/
|> |/[а-я]/i
|>
|> 1.9 Oniguruma regexp engine should handle these, otherwise it's a bug.
|
|I'll try to check. Oniguruma on 1.8.4 didn't cope, but maybe it just
|wasn't hooked in properly.

If you have any problem, send us a report with what you expect and
what you get.

							matz.
Posted by Julian 'Julik' Tarkhanov (Guest)
on 19.06.2006 03:34
(Received via mailing list)
On 19-jun-2006, at 1:56, Yukihiro Matsumoto wrote:

> a bug.
> |
> |I'll try to check. Oniguruma on 1.8.4 didn't cope, but maybe it just
> |wasn't hooked in properly.
>
> If you have any problem, send us a report with what you expect and
> what you get.

Well, I tried on the CVS latest (1.9) and I get:

irb(main):011:0> "Н??лагода?Ная" =~ /[а-я]/i
=> 6 (should be zero)

That is - character classes work, casefolding doesn't.
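
For anyone who wants to repeat the check on a current interpreter, here is a sketch. It writes the Cyrillic letters with \u escapes so the result does not depend on the source file's encoding, and the sample string and method name are made up; the output comments reflect released Ruby versions, not the 2006 CVS snapshot discussed here.

```ruby
# U+0410..U+042F are the capital letters А..Я; U+0430..U+044F are а..я.
ALL_CAPS = "\u041D\u0415\u0411\u041E"   # four capital Cyrillic letters

# Hypothetical helper: index of the first match of the class [а-я].
def first_lowercase_index(str, ignore_case:)
  re = ignore_case ? /[\u0430-\u044F]/i : /[\u0430-\u044F]/
  str =~ re
end

p first_lowercase_index(ALL_CAPS, ignore_case: false)  # nil
p first_lowercase_index(ALL_CAPS, ignore_case: true)   # 0
```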
Posted by Yukihiro Matsumoto (Guest)
on 19.06.2006 06:06
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 19 Jun 2006 10:32:08 +0900, Julian 'Julik' Tarkhanov 
<listbox@julik.nl> writes:

|Well, I tried on the CVS latest (1.9) and I get:
|
|irb(main):011:0> "Н??лагода?Ная" =~ /[а-я]/i
|=> 6 (should be zero)
|
|That is - character classes work, casefolding doesn't.

I found out that Oniguruma casefolding works only for characters
within iso8859-*.  Considering the size of the casefolding table it is
a compromise for the time being.  I will fix this in the future.

							matz.
Posted by Julian 'Julik' Tarkhanov (Guest)
on 19.06.2006 07:24
(Received via mailing list)
On 19-jun-2006, at 6:05, Yukihiro Matsumoto wrote:
> |
> |That is - character classes work, casefolding doesn't.
>
> I found out that Oniguruma casefolding works only for characters
> within iso8859-*.  Considering the size of the casefolding table it is
> a compromise for the time being.  I will fix this in the future.

Thanks for the clarification :-)
Posted by Dmitry Severin (Guest)
on 19.06.2006 07:58
(Received via mailing list)
Correct me if I'm wrong, but for Matz's plan on M17N, the summary is:
1. String internally will remain the same: char *ptr, long len - in bytes
2. String instances will have an encoding tag
3. All String/Regexp methods will respect that encoding tag and return
char (glyph) indexes
4. Methods like byte_size, codepoints, each_char, each_codepoint will be
introduced(?)
5. slice will always accept char indices and return substrings

I'd say that WOULD BE GOOD, and with methods like
String#enforce_encoding!(encoding) and String#coerce_encoding!(otherstring)
it won't require developers (for C extensions also) to look at the encoding
tag, just set it when needed.
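
As a rough illustration of points 2 to 5, here is how the distinction between bytes and characters looks in later Ruby versions. The method names are the ones that eventually shipped (bytesize, codepoints, each_char), which differ slightly from the names guessed at above; the helper name describe is made up.

```ruby
# Hypothetical helper that collects the byte-level and character-level views.
def describe(str)
  {
    encoding:   str.encoding.name,
    chars:      str.length,          # counted in characters
    bytes:      str.bytesize,        # counted in bytes
    third_char: str[2],              # indexing is by character
    codepoints: str.codepoints
  }
end

word = "na\u00EFve"                  # "naive" with i-diaeresis
info = describe(word)

puts info[:encoding]                 # UTF-8
puts info[:chars]                    # 5
puts info[:bytes]                    # 6
p    info[:codepoints]               # [110, 97, 239, 118, 101]
```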

But, I can see several implementation issues and possible options, that
should be considered:
- what will happen if one tries to perform str1.operation(str2) on two
strings with different encodings:
  a) raise exception
  b) silently coerce one or both strings to some "compatible"
charset/encoding, update encoding of result, replacing non-convertible chars
using fallback mappings? (ouch, this can be split to set of options)
  c) same as b) but raise exception if non-loss conversion is not possible?
  d) same as b) but warn if non-loss conversion is not possible?
  e) downgrade encoding tag of acceptor to "raw/bytes" and process it?

- what will happen if one changes encoding tag for String instance:
  a) check and raise exception if current bytes don't represent valid
encoding sequence?
  b) just set new tag?
  c) convert byte sequence to given encoding, using fallback mappings?

- what to do with IO:
  a) IO will return strings in "raw/bytes"?
  b) IO can be tagged and will return Strings with given encoding tag?
  c) IO can be tagged and is by default tagged with global encoding tag?
  d) IO can be tagged, but is not tagged by default, although methods
returning strings (such as read, readlines) will use global encoding tag?
  e) if IO is tagged and one tries to write to it a String with different
encoding, what will happen?

- what will be default encoding tag for new Strings:
  a) "raw/bytes"
  b) derived from system properties of host platform
  c) option b) and can be overridden in application (btw, $KCODE, as present,
must definitely go away!!!)

- how to process source code files:
  a) restrict them to ASCII and require all non-ASCII strings to be
externalized?
  b) process them as "raw/bytes"?
  c) introduce some kind of commented pragma for source files allowing to
set encoding,

- at present time Ruby parser can parse only sources in ASCII compatible
encoding.  Would it change?

- what encodings will have Numeric.to_s, Time.to_s etc., or String has to
have/conform for String#to_f, String#to_i?

On Unicode:
- case-independent canonical string matches/searches DO MATTER. And even for
encodings, that code variants of glyphs with different codepoints
"variant-insensitive" search, as for me, is desired. Will there be such
functionality?

- string comparison: will <=> use at least UCA rules for Unicode strings, or
only byte-order comparisons will stay?

- is_digit, is_space, is_alpha, is_foobarbaz etc. could matter, when writing
a custom parser. Will those methods be provided for one-char strings?


Yes, this is a short and incomplete list, but, you should get my point: it's
not that easy -- there are dozens of decisions, with their pros and cons, to
be done and implemented :(
Posted by Yukihiro Matsumoto (Guest)
on 19.06.2006 09:57
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 19 Jun 2006 14:57:22 +0900, "Dmitry Severin" 
<dmitry.severin@gmail.com> writes:

|But, I can see several implementation issues and possible options, that
|should be considered:

Thank you for the ideas.

|- what will happen if one tries to perform str1.operation(str2) on two
|strings with different encodings:
|  a) raise exception
|  b) silently coerce one or both strings to some "compatible"
|charset/encoding, update encoding of result, replacing non-convertible chars
|using fallback mappings? (ouch, this can be split to set of options)
|  c) same as b) but raise exception if non-loss conversion is not possible?
|  d) same as b) but warn if non-loss conversion is not possible?
|  e) downgrade encoding tag of acceptor to "raw/bytes" and process it?

a), unless either of strings is "ascii" and the other is "ascii"
compatible.  This point is arguable.
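
A sketch of how this rule plays out in released Ruby versions, where mixing incompatible encodings raises Encoding::CompatibilityError. That class name comes from the later implementation, not from this message, and the helper name try_concat is made up.

```ruby
# Hypothetical helper: concatenate, or report the error class on failure.
def try_concat(a, b)
  a + b
rescue Encoding::CompatibilityError => e
  e.class
end

utf8   = "caf\u00E9"                               # non-ASCII, UTF-8
latin1 = "caf\xE9".b.force_encoding("ISO-8859-1")  # same word, ISO-8859-1
ascii  = "abc".encode("US-ASCII")                  # pure ASCII

p try_concat(utf8, ascii).encoding    # UTF-8: ASCII-only operand is accepted
p try_concat(utf8, latin1)            # Encoding::CompatibilityError

# Explicit conversion is always available when mixing is intended.
p try_concat(utf8, latin1.encode("UTF-8")) == utf8 * 2   # true
```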

|- what will happen if one changes encoding tag for String instance:
|  a) check and raise exception if current bytes don't represent valid
|encoding sequence?
|  b) just set new tag?
|  c) convert byte sequence to given encoding, using fallback mappings?

b), encoding conformance check shall be done lazily.  I think there's a
need for explicit encoding conformance check method.
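
A sketch of "just set the tag, check later", assuming String#force_encoding and String#valid_encoding? as found in released Ruby versions; these names postdate this message. The helper name retag is hypothetical.

```ruby
# Hypothetical helper: change only the tag, never the bytes.
def retag(bytes, encoding)
  bytes.b.force_encoding(encoding)
end

raw = "caf\xE9".b                       # ISO-8859-1 bytes for "cafe" + e-acute

as_utf8   = retag(raw, "UTF-8")         # no exception, even though it is wrong
as_latin1 = retag(raw, "ISO-8859-1")

puts as_utf8.bytes == raw.bytes         # true: bytes are untouched
puts as_utf8.valid_encoding?            # false: the explicit conformance check
puts as_latin1.valid_encoding?          # true
```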

|- what to do with IO:
|  a) IO will return strings in "raw/bytes"?
|  b) IO can be tagged and will return Strings with given encoding tag?
|  c) IO can be tagged and is by default tagged with global encoding tag?
|  d) IO can be tagged, but is not tagged by default, although methods
|returning strings (such as read, readlines) will use global encoding tag?
|  e) if IO is tagged and one tries to write to it a String with different
|encoding, what will happen?

c), the global default shall be set from locale setting.

|- what will be default encoding tag for new Strings:
|  a) "raw/bytes"
|  b) derived from system properties of host platform
|  c) option b) and can be overridden in application (btw, $KCODE, as present,
|must definitely go away!!!)

Encoding for literal strings is set by pragma.

|- how to process source code files:
|  a) restrict them to ASCII and require all non-ASCII strings to be
|externalized?
|  b) process them as "raw/bytes"?
|  c) introduce some kind of commented pragma for source files allowing to
|set encoding,

1.9 already has encoding pragma a la Python PEP263.
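
For illustration, a source file using such a pragma might start like this. The __ENCODING__ keyword is from released Ruby versions and is an assumption relative to the 1.9 development snapshot discussed here.

```ruby
# -*- coding: utf-8 -*-
# The magic comment only takes effect on the first line of a file
# (or the second, when the first is a shebang line).

# __ENCODING__ reports the encoding the parser uses for this source file.
puts __ENCODING__

# A plain literal without \u escapes is tagged with the source encoding.
greeting = "hello"
puts greeting.encoding == __ENCODING__   # true
```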

|- at present time Ruby parser can parse only sources in ASCII compatible
|encoding.  Would it change?

No.  Ruby would not allow scripts in EBCDIC, nor UTF-16, although it
allows processing of those encodings.

|- what encodings will have Numeric.to_s, Time.to_s etc., or String has to
|have/conform for String#to_f, String#to_i?

Good point.  Currently, I think they should work on ASCII.

|On Unicode:
|- case-independent canonical string matches/searches DO MATTER. And even for
|encodings, that code variants of glyphs with different codepoints
|"variant-insensitive" search, as for me, is desired. Will there be such
|functionality?

Casefold search/match will be provided for Regexp.  "variant
insensitive" search should be accomplished by explicit normalization
or collation.

|- string comparison: will <=> use at least UCA rules for Unicode strings, or
|only byte-order comparisons will stay?

Byte order comparison.  UCA rules or such should be done explicitly
via normalization or collation.
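
A sketch of the difference between the built-in byte-order comparison and an explicit key. The key function below (NFD plus downcase) is only a crude stand-in chosen for the example; it is not the Unicode Collation Algorithm, and String#unicode_normalize is from later Ruby versions.

```ruby
WORDS = ["zebra", "Apple", "\u00E9clair"]   # last word starts with e-acute

# Hypothetical, deliberately crude sort key: decompose accents, fold case.
def crude_sort_key(word)
  word.unicode_normalize(:nfd).downcase
end

# Byte order: capitals first, non-ASCII last.
p WORDS.sort                                 # ["Apple", "zebra", "éclair"]

# Explicit key: closer to what a reader expects.
p WORDS.sort_by { |w| crude_sort_key(w) }    # ["Apple", "éclair", "zebra"]
```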

|- is_digit, is_space, is_alpha, is_foobarbaz etc. could matter, when writing
|a custom parser. Will those methods be provided for one-char strings?

Those functions will be provided via Regexp.  I am not sure if we will
provide character classification methods for strings.
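
A sketch of character classification done through Regexp, assuming the \p{...} properties and POSIX brackets of the regexp engine in released Ruby versions, plus String#match? (2.4 and up). The predicate names are made up to echo the is_digit / is_alpha question.

```ruby
# Hypothetical one-character predicates built on Regexp properties.
def digit?(ch)
  ch.match?(/\A\p{Digit}\z/)
end

def alpha?(ch)
  ch.match?(/\A\p{Alpha}\z/)
end

def space?(ch)
  ch.match?(/\A[[:space:]]\z/)
end

puts digit?("7")         # true
puts digit?("\u0663")    # true: ARABIC-INDIC DIGIT THREE
puts alpha?("\u00E9")    # true: e-acute
puts alpha?("7")         # false
puts space?("\t")        # true
```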

							matz.
Posted by Christian Neukirchen (Guest)
on 19.06.2006 13:17
(Received via mailing list)
Tim Bray <tbray@textuality.com> writes:

>>
> without knowing what the binary is?  If it's text in a known
> encoding, no breakage should occur.  If it's unknown bit patterns,
> you can't really expect anything sensible to happen... or am I
> missing an obvious scenario?  -Tim

Those were just fictive method calls.  But let's say I read from
a pipe and I know it contains UTF-16 with BOM, then .to_unicode
would make perfect sense, no?

In case of binary bit patterns, I sooner or later would expect some
kind of EncodingError, given this API.  (I haven't seen yet drafts of
how the API really will be.)
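
A sketch of the pipe scenario, written against the encoding API of released Ruby versions (an assumption, since no draft existed at the time). The method name decode_utf16_with_bom is made up; it inspects the byte order mark, strips it, and transcodes to UTF-8, and malformed input surfaces as a subclass of EncodingError.

```ruby
# Hypothetical helper: decode BOM-prefixed UTF-16 bytes into a UTF-8 String.
def decode_utf16_with_bom(bytes)
  raw = bytes.b
  encoding =
    if raw.start_with?("\xFF\xFE".b)
      Encoding::UTF_16LE
    elsif raw.start_with?("\xFE\xFF".b)
      Encoding::UTF_16BE
    else
      raise ArgumentError, "no UTF-16 byte order mark found"
    end
  # Drop the two BOM bytes, tag the rest, then transcode.
  raw.byteslice(2..-1).force_encoding(encoding).encode(Encoding::UTF_8)
end

puts decode_utf16_with_bom("\xFF\xFEh\x00i\x00")   # hi (little-endian input)
puts decode_utf16_with_bom("\xFE\xFF\x00h\x00i")   # hi (big-endian input)

begin
  decode_utf16_with_bom("\xFF\xFE\x00\xD8")        # lone high surrogate
rescue EncodingError => e
  puts "rejected: #{e.class}"
end
```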
Posted by Michal Suchanek (Guest)
on 19.06.2006 14:40
(Received via mailing list)
On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |- what will happen if one tries to perform str1.operation(str2) on two
> compatible.  This point is arguable.
What is "ascii"? Specifically I would like string operations to succeed
in cases when both strings are encoded as different subsets of Unicode
(or anything else). ie concatenating an ISO-8859-2 and an ISO-8859-1
string should result in UTF-* string, not an error.

However, this would make the errors from incompatible encodings more
surprising as they would be very infrequent.

I wonder what operations on raw strings (ones without specified
encoding) would do. Or where one of the strings is raw, and the other
is not.


> c), the global default shall be set from locale setting.
>

I am not sure this is good for network IO as well. For diagnostics it
might be useful to set the default to none, and have string raise an
exception when such strings are combined with other strings.

It is only obvious for STDIN and STDOUT that they should follow the
locale setting.

hmm, but it would need to carefully consider which operations should
work on raw strings and which not. Perhaps it is not as nice as it
looks at the first glance.

Thanks

Michal
Posted by Yukihiro Matsumoto (Guest)
on 19.06.2006 15:02
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 19 Jun 2006 21:39:33 +0900, "Michal Suchanek" 
<hramrach@centrum.cz> writes:

|> a), unless either of strings is "ascii" and the other is "ascii"
|> compatible.  This point is arguable.
|
|What is "ascii"? Specifically I would like string operations to succeed
|in cases when both strings are encoded as different subsets of Unicode
|(or anything else). ie concatenating an ISO-8859-2 and an ISO-8859-1
|string should result in UTF-* string, not an error.

Every encoding has an attribute named ascii_compat.  EUC_JP, SJIS,
ISO-8859-* and UTF-8 are declared ascii compatible, where EBCDIC,
UTF-16 and UTF-32 are not.  No other auto conversion shall be done,
since we don't particularly encourage mixed encoding model.
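
In released Ruby versions this attribute is exposed as Encoding#ascii_compatible? (the name ascii_compat above is the prototype's). A short sketch, with a made-up constant name for the list of encodings to inspect:

```ruby
# Encodings to inspect; EBCDIC is left out of this sketch on purpose.
SAMPLE_ENCODINGS = [
  Encoding::EUC_JP,
  Encoding::Shift_JIS,
  Encoding::ISO_8859_1,
  Encoding::UTF_8,
  Encoding::UTF_16LE,
  Encoding::UTF_32LE
]

SAMPLE_ENCODINGS.each do |enc|
  puts format("%-12s ascii compatible: %s", enc.name, enc.ascii_compatible?)
end
```

The first four report true and the two wide encodings report false, matching the split described above.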

|> |- what to do with IO:
|> |  a) IO will return strings in "raw/bytes"?
|> |  b) IO can be tagged and will return Strings with given encoding tag?
|> |  c) IO can be tagged and is by default tagged with global encoding tag?
|> |  d) IO can be tagged, but is not tagged by default, although methods
|> |returning strings (such as read, readlines) will use global encoding tag?
|> |  e) if IO is tagged and one tries to write to it a String with different
|> |encoding, what will happen?
|>
|> c), the global default shall be set from locale setting.
|
|I am not sure this is good for network IO as well. For diagnostics it
|might be useful to set the default to none, and have string raise an
|exception when such strings are combined with other strings.
|
|It is only obvious for STDIN and STDOUT that they should follow the
|locale setting.

Restricting default encoding from locale to STDIO may be a good idea.
There are still open issues, since default encoding from locale is not
covered by the prototype, so we need more experience.

							matz.
Posted by Dmitrii Dimandt (Guest)
on 19.06.2006 15:28
(Received via mailing list)
On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |(or anything else). ie concatenating an ISO-8859-2 and an ISO-8859-1
> |string should result in UTF-* string, not an error.
>
> Every encoding has an attribute named ascii_compat.  EUC_JP, SJIS,
> ISO-8859-* and UTF-8 are declared ascii compatible, where EBCDIC,
> UTF-16 and UTF-32 are not.  No other auto conversion shall be done,
> since we don't particularly encourage mixed encoding model.
>

I wonder. Why cannot Strings throughout Ruby be _always_ represented
as Unicode and why not let ICU handle the conversion between various
encodings for incoming and outgoing data?
(http://www.ibm.com/software/globalization/icu/). I know, it is a
long-standing issue on Unicode's Han unification process, but without
proper Unicode support Ruby is destined to be a toy for
English-speaking and Japanese communities only. (And as I'm gearing up
to prepare a web-site in Russian, Turkish and English, I feel that
using Ruby could prove to be a major pain in the nether regions of my
body :) )
Posted by Austin Ziegler (austin)
on 19.06.2006 15:34
(Received via mailing list)
On 6/19/06, Dmitrii Dimandt <dmitriid@gmail.com> wrote:
> I wonder. Why cannot Strings throughout Ruby be _always_ represented
> as Unicode and why not let ICU handle the conversion between various
> encodings for incoming and outgoing data?
> (http://www.ibm.com/software/globalization/icu/). I know, it is a
> long-standing issue on Unicode's Han unification process, but without
> proper Unicode support Ruby is destined to be a toy for
> English-speaking and Japanese communities only. (And as I'm gearing up
> to prepare a web-site in Russian, Turkish and English, I feel that
> using Ruby could prove to be a major pain in the nether regions of my
> body :) )

This entire discussion is centered around a proposal to do exactly
that. There are many *very good* reasons to avoid doing this. Unicode
Is Not Always The Answer.

It's *usually* the answer, but there are times when it's just easier
to work with data in an established code page.

-austin
Posted by Dmitrii Dimandt (Guest)
on 19.06.2006 15:47
(Received via mailing list)
On 6/19/06, Austin Ziegler <halostatue@gmail.com> wrote:
> > body :) )
>
> This entire discussion is centered around a proposal to do exactly
> that. There are many *very good* reasons to avoid doing this. Unicode
> Is Not Always The Answer.
>
> It's *usually* the answer, but there are times when it's just easier
> to work with data in an established code page.
>

I totally agree with that. IMO, the point lies exactly in this
"*usually* an answer". When was the last time 90% of developers had to
wonder what encoding their data was in ;-) And with the advent of
Unicode (and storage becoming cheaper and cheaper and developers
becoming lazier and lazier) more and more of that data is
going to be Unicode.

So, since Unicode is *usually* the answer, make it as painless as
possible. Make all String methods and any other functions that work
with strings accept Unicode straight out of the box without any
worries on the developer's part. And provide alternatives (or optional
parameters?) that would allow the few more encoding-aware gurus :) do
whatever they want with encodings.

Because otherwise we are at risk of ending up with incompatible
extensions to strings that "simplify" a developer's life (and the
trend's already begun). I wouldn't want a C/C++ scenario with a string
class upon string class upon extension upon extension that aim to do
something String should do from the start.

All is IMHO, of course :)
Posted by Austin Ziegler (austin)
on 19.06.2006 16:35
(Received via mailing list)
On 6/19/06, Dmitrii Dimandt <dmitriid@gmail.com> wrote:
> Because otherwise we are at risk of ending up with incompatible
> extensions to strings that "simplify" a developer's life (and the
> trend's already begun). I wouldn't want a C/C++ scenario with a string
> class upon string class upon extension upon extension that aim to do
> something String should do from the start.

I think that's more likely with (a) what we have now and (b) a
Unicode-internal approach. (Indeed, a Unicode-internal approach
*requires* separating a byte vector from String, which doubles
interface complexity.) I would suggest that you look through the whole
discussion and pay particular attention to Matz's statements.

-austin
Posted by Tim Bray (Guest)
on 19.06.2006 18:09
(Received via mailing list)
On Jun 19, 2006, at 4:16 AM, Christian Neukirchen wrote:

>> without knowing what the binary is?  If it's text in a known
>> encoding, no breakage should occur.  If it's unknown bit patterns,
>> you can't really expect anything sensible to happen... or am I
>> missing an obvious scenario?  -Tim
>
> Those were just fictive method calls.  But let's say I read from
> a pipe and I know it contains UTF-16 with BOM, then .to_unicode
> would make perfect sense, no?

Yep.  And yes, calling to_unicode on it might in fact change the bit
patterns if you adopted Early Uniform Normalization (which would be a
good thing to do).  -Tim
Posted by Michal Suchanek (Guest)
on 19.06.2006 19:22
(Received via mailing list)
On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |(or anything else). ie concatenating an ISO-8859-2 and an ISO-8859-1
> |string should result in UTF-* string, not an error.
>
> Every encoding has an attribute named ascii_compat.  EUC_JP, SJIS,
> ISO-8859-* and UTF-8 are declared ascii compatible, where EBCDIC,
> UTF-16 and UTF-32 are not.  No other auto conversion shall be done,
> since we don't particularly encourage mixed encoding model.

Reading what you said it appears it would be only possible to add
ascii strings to ascii-compatible strings. That does not sound very
useful.
If the intended meaning was rather that operations on two
ascii-compatible strings should always be possible, and that the
result is again ascii-compatible, that would sound better.

But it makes these "ascii" encodings a special case. In particular, it
makes UTF-32 less convenient to use.
I guess that for calculation so complex that it would really benefit
from the fast random access of UTF-32 it is reasonable to create a
wrapper that converts the arguments and results. However, If one wants
to perform several such (different) consecutive calculations there are
going to be several useless conversions. It is certainly possible to
make the input interface clever enough to get it right for both UTF-32
and ascii strings but requiring the user to do the conversion on
results does not look nice.

The compatibility could also be just general value that specifies the
encoding family.

ie " ".compatibility => :ascii

ASCII="".encode(:utf8).compatibility

raise "Incompatible encoding #{str.encoding}" unless str.compatibility == ASCII

But different families could be possible. I am not sure if any other
encoding families of any significance exist, though.
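
For comparison, released Ruby versions ended up with a pairwise check rather than a family attribute: Encoding.compatible? returns the encoding two objects can be combined in, or nil. A sketch of a guard in the spirit of the raise above; the method name check_compatible! is made up.

```ruby
# Hypothetical guard: return the common encoding or raise.
def check_compatible!(a, b)
  Encoding.compatible?(a, b) or
    raise ArgumentError, "Incompatible encodings #{a.encoding} and #{b.encoding}"
end

utf8   = "caf\u00E9"
latin1 = "caf\xE9".b.force_encoding("ISO-8859-1")
ascii  = "abc".encode("US-ASCII")

p check_compatible!(ascii, utf8)       # #<Encoding:UTF-8>
p Encoding.compatible?(utf8, latin1)   # nil

begin
  check_compatible!(utf8, latin1)
rescue ArgumentError => e
  puts e.message
end
```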

Thanks

Michal
Posted by Tim Bray (Guest)
on 19.06.2006 19:48
(Received via mailing list)
On Jun 19, 2006, at 6:31 AM, Austin Ziegler wrote:

> This entire discussion is centered around a proposal to do exactly
> that. There are many *very good* reasons to avoid doing this. Unicode
> Is Not Always The Answer.
>
> It's *usually* the answer, but there are times when it's just easier
> to work with data in an established code page.

To enlighten the ignorant, could you describe one or two scenarios
where a Unicode-based String class would get in the way?  To use your
words, make things less easy?  I would probably not agree that there
are "*many good*" reasons to avoid this, but probably that's just
because I've been fortunate enough to not encounter the problem
scenarios.  This material would have application in a far larger
domain than just Ruby, obviously.  -Tim
Posted by Austin Ziegler (austin)
on 19.06.2006 20:35
(Received via mailing list)
On 6/19/06, Tim Bray <tbray@textuality.com> wrote:
> are "*many good*" reasons to avoid this, but probably that's just
> because I've been fortunate enough to not encounter the problem
> scenarios.  This material would have application in a far larger
> domain than just Ruby, obviously.  -Tim

I've found that a Unicode-based string class gets in the way when it
forces you to work around it. For most text-processing purposes, it
*isn't* an issue. But when you've got text whose origin encoding you
don't *know* (and you're probably working in a different code page),
a Unicode-based string class usually guesses wrong.

Transparent Unicode conversion only works when it is guaranteed that the
starting code page and the ending code page are identical. It's
*definitely* a legacy data issue, and doesn't affect most people, but it
has affected me in dealing with (in a non-Ruby context) NetWare.
Additionally, the overhead of converting to Unicode if your entire data
set is in ISO-8859-1 is unnecessary; again, this is a specialized case.

More problematic, from the Ruby perspective, is that a Unicode-based
string class would require that there be a wholly separate byte vector
class; I am not sure that is necessary or wise. The first time I read a
JPG into a String, I was delighted -- the interface presented was so
clean and nice as opposed to having to muck around in languages that
force multiple interfaces because of such a presentation.
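
As an illustration of a String doubling as a byte vector, here is a small sketch assuming the binary (ASCII-8BIT) encoding tag and `String#b` found in later Ruby versions. The `jpeg?` helper is a hypothetical name, and it only checks the three-byte JPEG start marker:

```ruby
# Hypothetical helper: checks whether raw data begins with the JPEG
# start-of-image marker (FF D8 FF), treating the String purely as bytes.
def jpeg?(data)
  data.b.start_with?("\xFF\xD8\xFF".b)
end

# Returns the first n bytes as integers, with no character decoding involved.
def leading_bytes(data, n)
  data.b.bytes.first(n)
end
```

The same String interface serves both text and binary data; only the encoding tag differs.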

Like I said, I'm not anti-Unicode, and I want Ruby's Unicode support to
be the best, bar none. I'm not willing to compromise on API or
flexibility to gain that, though.

-austin
Posted by Yukihiro Matsumoto (Guest)
on 20.06.2006 01:40
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Tue, 20 Jun 2006 02:20:10 +0900, "Michal Suchanek" 
<hramrach@centrum.cz> writes:

|Reading what you said it appears it would be only possible to add
|ascii strings to ascii-compatible strings. That does not sound very
|useful.

You will have all your strings in the encoding you choose as an
internal encoding in the usual case, so that you will have few
compatibility problems.  Only if you want to handle multiple encodings
at a time do you need explicit code conversion for mixed-encoding
operations.

|I guess that for calculation so complex that it would really benefit
|from the fast random access of UTF-32 it is reasonable to create a
|wrapper that converts the arguments and results. However, If one wants
|to perform several such (different) consecutive calculations there are
|going to be several useless conversions.

I am not sure what you mean.  I feel like that my plan does not have
anything against UTF-32 in this regard.  Perhaps, I am missing
something.  What is going to cause useless conversions?

							matz.
Posted by Michal Suchanek (Guest)
on 20.06.2006 14:13
(Received via mailing list)
On 6/20/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> internal encoding in the usual case, so that you will have a few
> compatibility problem.  Only if you want to handle multiple encodings
> at a time, you need explicit code conversion for mix encoding
> operations.

If I read pieces of text from web pages they can be in different
encodings. I do not see any reason why such pieces of text could not
be automatically concatenated as long as they are all subset of
unicode.

It was the complaint of one of the people here that in Python strings
with different encodings exist but the operations on them fail. And it
makes the life of anybody working with such strings unnecessarily
hard. They have to be converted explicitly.

>
> |I guess that for calculation so complex that it would really benefit
> |from the fast random access of UTF-32 it is reasonable to create a
> |wrapper that converts the arguments and results. However, If one wants
> |to perform several such (different) consecutive calculations there are
> |going to be several useless conversions.
>
> I am not sure what you mean.  I feel like that my plan does not have
> anything against UTF-32 in this regard.  Perhaps, I am missing
> something.  What is going to cause useless conversions?

If automatic conversions aren't implemented at all, utf-32 does not
really stand out in this regard.

Thanks

Michal
Posted by Timothy Bennett (Guest)
on 20.06.2006 15:57
(Received via mailing list)
On 6/20/06, Michal Suchanek <hramrach@centrum.cz> wrote:
>
>
> If I read pieces of text from web pages they can be in different
> encodings. I do not see any reason why such pieces of text could not
> be automatically concatenated as long as they are all subset of
> unicode.


Having different encodings on one web page is a good way to make sure that
the page won't display correctly, since all the browsers I know of display
all text on a page using just one encoding.  Granted, if the encoding is a
subset of unicode, it may still manage to work out, but personally I keep
running into pages that display some of the characters as garbage no matter
what encoding I instruct my browser to use.  So, no, I don't think it should
be valid to concatenate strings with different encodings.
Posted by Matthew Smillie (notmatt)
on 20.06.2006 16:34
(Received via mailing list)
On Jun 20, 2006, at 14:54, Timothy Bennett wrote:

> sure that
> be valid to concatenate strings with different encodings.
So we shouldn't do it because it doesn't work in web browsers?

Hopefully we don't apply that criterion globally, or we'd never get
anything done.
Posted by Michal Suchanek (Guest)
on 20.06.2006 16:41
(Received via mailing list)
On 6/20/06, Timothy Bennett <timothy.s.bennett@gmail.com> wrote:
> the page won't display correctly, since all the browsers I know of display
> all text on a page using just one encoding.  Granted, if the encoding is a
> subset of unicode, it may still manage to work out, but personally I keep
> running in to pages that display some of the characters as garbage no matter
> what encoding I instruct my browser to use.  So, no, I don't think it should
> be valid to concatenate strings with different encodings.

No, I meant that the strings are, of course, converted to a common
encoding such as utf-8 before they are concatenated.
The point is that you do not have to care in which encoding you
obtained the pieces and convert them manually to a common encoding if
the string class can do it automatically for you.

Thanks

Michal
Posted by unknown (Guest)
on 20.06.2006 17:46
(Received via mailing list)
On Jun 20, 2006, at 8:09 AM, Michal Suchanek wrote:
> If I read pieces of text from web pages they can be in different
> encodings. I do not see any reason why such pieces of text could not
> be automatically concatenated as long as they are all subset of
> unicode.

I'm not sure I understand what 'subset of unicode' means.

Do you mean two different encodings of Unicode code points?
As in 'UTF-8 and UTF-16 are subsets of Unicode'?

That usage seems unusual to me.  Are you using 'subset' and 'encoding'
as synonyms or am I missing a subtle difference?



Gary Wright
Posted by Tim Bray (Guest)
on 20.06.2006 18:08
(Received via mailing list)
On Jun 20, 2006, at 6:54 AM, Timothy Bennett wrote:

> Having different encodings on one web page is a good way to make  
> sure that
> the page won't display correctly
...
>   So, no, I don't think it should
> be valid to concatenate strings with different encodings.

Well, unless you had a String class that took care of the encoding
details and, when you were ready to output, allowed you to say "Give
me that in ISO-8859 or UTF-8 or whatever". -Tim
Posted by Yukihiro Matsumoto (Guest)
on 20.06.2006 18:20
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Tue, 20 Jun 2006 23:33:43 +0900, "Michal Suchanek" 
<hramrach@centrum.cz> writes:

|No, I meant that the strings are, of course, converted to a common
|encoding such as utf-8 before they are concatenated.
|The point is that you do not have to care in which encoding you
|obtained the pieces and convert them manually to a common encoding if
|the string class can do it automatically for you.

If you choose to convert all input text data into Unicode (and convert
them back at output), there's no need for unreliable automatic
conversion.

							matz.
Posted by Michal Suchanek (Guest)
on 20.06.2006 19:52
(Received via mailing list)
On 6/20/06, gwtmp01@mac.com <gwtmp01@mac.com> wrote:
> As in 'UTF-8 and UTF-16 are subsets of Unicode'?
>
> That usage seems unusual to me.  Are you using 'subset' and 'encoding'
> as synonyms or am I missing subtle difference?
>
I mean that iso-8859-1 and iso-8859-2 encodings (as well as many
other) encode a subset of characters available in Unicode, and any of
its utf-* encodings. Thus any string that is encoded using such
encoding can be losslessly and automatically converted to an encoding
of full unicode such as utf-8, and operations on several such
converted strings make sense even if the strings were encoded using
different encodings before the conversion.

The automatic conversion would simplify things if you get strings in
different encodings from outside sources such as various web pages,
databases, etc.
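
A minimal sketch of that lossless conversion, assuming the `String#encode` and `String#force_encoding` methods that Ruby 1.9 and later ship. The `join_as_utf8` helper is a hypothetical name; here the conversion is explicit rather than automatic:

```ruby
# Hypothetical helper: converts each piece to UTF-8 (a superset of both
# ISO-8859-1 and ISO-8859-2) and only then joins them.
def join_as_utf8(*pieces)
  pieces.map { |s| s.encode(Encoding::UTF_8) }.join(" ")
end

# Builds a string from raw bytes and tags it with the given encoding,
# standing in for text fetched from a web page or database.
def tagged(bytes, encoding)
  bytes.b.force_encoding(encoding)
end
```

Joining `tagged("caf\xE9", "ISO-8859-1")` and `tagged("\xB3\xF3\x64\xBC", "ISO-8859-2")` this way gives one UTF-8 string, whereas adding the two tagged strings directly raises `Encoding::CompatibilityError` in current Ruby.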

Thanks

Michal
Posted by Michal Suchanek (Guest)
on 21.06.2006 13:46
(Received via mailing list)
On 6/20/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
>
> If you choose to convert all input text data into Unicode (and convert
> them back at output), there's no need for unreliable automatic
> conversion.

Well, it's actually you who chose the conversion on input for me.
Since the strings aren't automatically converted I have to ensure that
I have always strings encoded using the same encoding. And the only
reasonable way I can think of is to convert any string that enters my
application (or class) to an arbitrary encoding I choose in advance.

This is no more reliable than automatic conversion. The reliability or
(un)reliability of the conversion is based on the (un)reliability with
which the actual encoding of the string is determined when it is
obtained. If the encoding tag is wrong the string will be converted
incorrectly. It is the only cause for incorrect conversion whether it
happens manually or automatically.

If conversion was done automatically by the string class it could be
performed lazily. The strings are kept in the encoding in which they
were obtained, and only converted when it is needed because they are
combined with a string in a different encoding. And users of the
strings still have the choice to convert them explicitly when they see
fit.

When such automatic conversion is not available it makes interfacing
with libraries that fetch external data more difficult.

a) I could instruct the library that fetches data from a database or
the web to return them always in the encoding I chose for
representing strings in my application, regardless of the encoding
the data was originally obtained in.
The disadvantage is that if the encoding was determined incorrectly on
input to the library the data is already garbled.

b) I could get the data from the library in the original encoding in
which it was obtained. Either because I would like to check that the
encoding is correct before converting the data or because the library
does not implement the interface for (a).
The disadvantage is that I have to traverse a potentially complex data
structure and convert all strings so that they work with the other
strings inside my application.

c) Every time I perform a string operation I should first check
(manually) that the two strings are compatible (or catch the exception
very near the operation so that I can convert the arguments and retry).
I do not think this is a reasonable option for the common case that
should be made as simple as possible: the strings can be represented
in Unicode. This may be necessary to some extent in applications
dealing with encodings that are incompatible with Unicode but it
should not be required for the common case.

The people with experience from other languages are complaining that
they have to do (b) or (c) because (a) is usually not implemented. And
ensuring any of the three does look like an additional problem that
could be solved elsewhere - in the string class.
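
To make option (b) concrete, here is a rough sketch of the traversal it requires, assuming the `String#encode` method of Ruby 1.9 and later. The `deep_encode` name is hypothetical, and only Arrays and Hashes are handled:

```ruby
# Hypothetical helper: walks a nested structure returned by a library and
# converts every String it finds to the application's chosen encoding.
def deep_encode(obj, target = Encoding::UTF_8)
  case obj
  when String
    obj.encode(target)
  when Array
    obj.map { |item| deep_encode(item, target) }
  when Hash
    obj.each_with_object({}) do |(key, value), result|
      result[deep_encode(key, target)] = deep_encode(value, target)
    end
  else
    obj # numbers, nil, symbols and so on pass through untouched
  end
end
```

Every library result has to pass through something like this before it meets the rest of the application, which is the extra work being complained about.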

Thanks

Michal
Posted by Yukihiro Matsumoto (Guest)
on 21.06.2006 16:04
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 21 Jun 2006 20:45:38 +0900, "Michal Suchanek" 
<hramrach@centrum.cz> writes:

|> If you choose to convert all input text data into Unicode (and convert
|> them back at output), there's no need for unreliable automatic
|> conversion.
|
|Well, it's actually you who chose the conversion on input for me.
|Since the strings aren't automatically converted I have to ensure that
|I have always strings encoded using the same encoding. And the only
|reasonable way I can think of is to convert any string that enters my
|application (or class) to an arbitrary encoding I choose in advance.

Agreed.  It is me.  Perhaps you don't know how terrible code
conversion can be.  In the ideal world, lazy conversion seems
attractive, but reality bites.  Conversions fail so easily.
Characters lost, text broken.  Failures can not be avoided for various
reasons, mostly historical reasons we can't fix anymore.  When errors
happen (often) it's good to detect them as early as possible,
i.e. on input/output.  So I encourage the universal character set model
as far as it is applicable.  You may use UTF-8 or ISO8859-1 for universal
character set.  I may use EUC-JP for it.
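
A small sketch of detecting errors at the input boundary, assuming the `force_encoding`, `valid_encoding?` and `encode` methods that Ruby 1.9 and later provide. The `import_text` name and its parameters are hypothetical:

```ruby
# Hypothetical input-boundary helper: tags incoming bytes with their claimed
# encoding, fails immediately if the bytes are not valid in that encoding,
# and only then converts to the application's internal encoding.
def import_text(bytes, source_encoding, internal = Encoding::UTF_8)
  str = bytes.dup.force_encoding(source_encoding)
  unless str.valid_encoding?
    raise ArgumentError, "input is not valid #{source_encoding}"
  end
  str.encode(internal)
end
```

A wrong encoding tag is then reported where the data enters the program, not at some later string operation.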

Only in rare cases might there be a need to handle multiple encodings
in an application.  I do want to allow it.  But I am not sure how we can
help that kind of application, since they are fundamentally complex.
And we don't have enough experience to design a framework for such
applications.

							matz.
Posted by Dmitry Severin (Guest)
on 21.06.2006 16:59
(Received via mailing list)
On 6/21/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
>
>
> For only rare case, there might be need to handle multiple encoding in
> an application.  I do want to allow it.  But I am not sure how we can
> help that kind of applications, since they are fundamentally complex.
> And we don't have enough experience to design a framework for such
> applications.
>
>

I can see one more problem with setting encoding per file and tagging
string literals in it accordingly.
If operations on strings with different encodings always throw an
exception, problems can arise when one calls such a third-party library
from a script with a different encoding.

Here's small example:

library code in file some_utility.rb:
# -*- coding: EUC-JP -*-
module SomeUtility
  def SomeUtility.fancy_format(str)
    "<text>" + str + "</text>" # these literals are tagged as EUC-JP, right?
  end
end

application code in file my_app.rb:
# -*- coding: UTF-8 -*-
require 'some_utility'
puts SomeUtility.fancy_format("an utf8 string")  # this literal is tagged as UTF8

If the last call will throw some kind of EncodingMismatchError, how to deal
with that?
Posted by Yukihiro Matsumoto (Guest)
on 21.06.2006 17:19
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 21 Jun 2006 23:56:47 +0900, "Dmitry Severin" 
<dmitry.severin@gmail.com> writes:

|I can see one more problem with setting encoding per file and tagging
|accordingly string literals in it.

Indeed.

|Here's small example:
|
|library code in file some_utility.rb:
|# -*- coding: EUC-JP -*-
|module SomeUtility
|  def SomeUtility.fancy_format(str)
|    "<text>" + str + "</text>" # these literals are tagged as EUC-JP, right?
|  end
|end
|
|application code in file my_app.rb:
|# -*- coding: UTF-8 -*-
|require 'some_utility'
|puts SomeUtility.fancy_format("an utf8 string")  # this literal is tagged as
|UTF8
|
|If the last call will throw some kind of EncodingMismatchError, how to deal
|with that?

I recommend using "ascii" encoding, which is default, for library
files, unless you are sure in what encoding your input data are.
For localization, tools like gettext would help dealing with strings
in the native encoding.
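
For illustration, this is roughly how the ASCII recommendation plays out in Ruby as later released (1.9 and up), where ASCII-only strings combine with any ASCII-compatible encoding. The helper names below are hypothetical, and `euc_jp` merely stands in for a literal in a file whose magic comment says EUC-JP:

```ruby
# Stand-in for a string literal written in an EUC-JP source file.
def euc_jp(text)
  text.encode(Encoding::EUC_JP)
end

# Library method whose literals are ASCII-only: safe with any
# ASCII-compatible argument, whatever its encoding tag.
def fancy_format(str)
  euc_jp("<text>") + str + euc_jp("</text>")
end

# Library method with a non-ASCII EUC-JP literal: combining it with
# non-ASCII text in another encoding raises Encoding::CompatibilityError.
def localized_format(str)
  euc_jp("\u540D\u524D: ") + str
end
```

So a library that sticks to ASCII literals does not throw on UTF-8 input; the mismatch only bites once both sides carry non-ASCII characters in different encodings.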

							matz.
Posted by Austin Ziegler (austin)
on 21.06.2006 17:36
(Received via mailing list)
On 6/21/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> I recommend using "ascii" encoding, which is default, for library
> files, unless you are sure in what encoding your input data are.
> For localization, tools like gettext would help dealing with strings
> in the native encoding.

Just a thought. Might it be possible to have a new String literal for
what will be, I think, the most common encoding chosen (UTF-8)? That is,
in addition to:

  # -*- coding: EUC-JP -*-
  "<text>" # tagged as EUC-JP

We allow:

  # -*- coding: EUC-JP -*-
  "<text>" # tagged as EUC-JP
  u"<text>" # tagged as UTF-8

Despite my belief that we should avoid an enforced universal encoding as
the String representation, I *do* plan on making most of my applications
and libraries UTF-8 friendly and aware. It's extremely important that we
be able to work with this cleanly, and if I can simply do either u"foo"
or U"foo" I would find it much easier to deal with in those places where
I need UTF-8/Unicode support.

-austin
Posted by Julian 'Julik' Tarkhanov (Guest)
on 21.06.2006 17:42
(Received via mailing list)
On 21-jun-2006, at 17:18, Yukihiro Matsumoto wrote:
> |If the last call will throw some kind of EncodingMismatchError,  
> how to deal
> |with that?
>
> I recommend using "ascii" encoding, which is default, for library
> files, unless you are sure in what encoding your input data are.
> For localization, tools like gettext would help dealing with strings
> in the native encoding.

Matz, this would be a disaster (if in such a situation a library
throws). It's gonna be like python.
Because it means that 99 percent of the libraries will throw.
Posted by Yukihiro Matsumoto (Guest)
on 21.06.2006 18:21
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Thu, 22 Jun 2006 00:41:02 +0900, Julian 'Julik' Tarkhanov 
<listbox@julik.nl> writes:

|Matz, this would be a disaster (if in such a situation a library  
|throws). It's gonna be like python.
|Because it means that 99 percent of the libraries will throw.

Can you elaborate?  I don't want to see disaster whatever it is.

							matz.
Posted by Yukihiro Matsumoto (Guest)
on 21.06.2006 18:47
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Thu, 22 Jun 2006 00:34:27 +0900, "Austin Ziegler" 
<halostatue@gmail.com> writes:

|Just a thought. Might it be possible to have a new String literal for
|what will be, I think, the most common encoding chosen (UTF-8)? That is,
|in addition to:
|
|  # -*- coding: EUC-JP -*-
|  "<text>" # tagged as EUC-JP
|
|We allow:
|
|  # -*- coding: EUC-JP -*-
|  "<text>" # tagged as EUC-JP
|  u"<text>" # tagged as UTF-8

I am not sure whether this is a good idea or not (yet).  If your "u" text
contains only ASCII characters, I see no need to tag it "UTF-8", and
if it's not, how do we prepare them?  I think, for example,

   u"\346\235\276\346\234\254" => my family name in Kanji

is too ugly.
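
For comparison, a sketch using the `\u` escapes that Ruby 1.9 and later accept inside ordinary double-quoted strings (no `u` prefix exists). The two method names are hypothetical wrappers around the same six bytes:

```ruby
# The octal form from the message above, explicitly tagged as UTF-8.
def name_from_octal_bytes
  "\346\235\276\346\234\254".b.force_encoding(Encoding::UTF_8)
end

# The same two characters written as Unicode code point escapes.
def name_from_unicode_escapes
  "\u677E\u672C"
end
```

Both produce the same two-character string; the escape form at least names code points instead of raw bytes.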

							matz.
Posted by Dmitry Severin (Guest)
on 21.06.2006 19:18
(Received via mailing list)
On 6/21/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
>
> Can you elaborate?  I don't want to see disaster whatever it is.
>
>                                                         matz.
>
>

Single scripts and small self-contained applications almost always
are written in the same codepage. Usually text data processing also
is done for the same codepage, that simplifies life a lot even with
current String as byte vector. So recoding is an overhead here, and
external data is only recoded on input/output in a relatively small number
of well-defined places, using a known subset of source and target encodings.
In this case when you know what to expect from your file/network IO, things
are OK.

It is also OK, when part of script is extracted and evolves to a library,
as long as you use it in the same environment.

But let's view a case when several third-party libraries are used, all
returning strings with different encodings. gettext for libraries won't
solve everything, as even externalized strings will have some particular
encoding.
E.g. localization libraries can't fit in only ASCII.

And now calls to methods will behave like some kind of IO in respect to
encoding of passed parameters.
Number of i/o points grows drastically.

How can it be solved in a consistent and reliable manner?
a) just simply declare in documentation: "Methods in these classes *require*
strings to be in UTF16, you've been warned!!!"

  So users of that code will have to remember those constraints and enforce
  encoding of their data before calling those methods. With dynamic nature
  of Ruby things will break in unexpected places. No, I dislike the idea of
  writing:

     str.enforce_encoding!(BooClass::INTERNAL_ENCODING)
     b = BooClass.new(str)

b) take care in called methods to enforce encoding
     def process_formatting(str)
        str.enforce_encoding!(MY_INTERNAL_ENCODING)
        # now it is compatible with rest of my code
        # and i can do something with it
     end

 This is also too error-prone :(

And what about processing results of calls? To take care about it in caller
code?
       res_str = SomeUtil.fancy_format( str )
       res_str.enforce_encoding!(MY_INTERNAL_ENCODING)

On input parameters and returned results which represent complex structures
with some String fields things will go even worse.

Who will ever cope with these issues?
Probably this is what Julik meant  by "disaster"?

Things shouldn't be that complicated.
Posted by Julian 'Julik' Tarkhanov (Guest)
on 21.06.2006 21:20
(Received via mailing list)
On 21-jun-2006, at 18:20, Yukihiro Matsumoto wrote:

> Can you elaborate?  I don't want to see disaster whatever it is.
I imagine that in the case mentioned the encoding assumed for a
library will depend on the pragma in the source.

For instance, I am writing a program that needs to work with UTF8
data, but one of the libraries I am using has ASCII in the pragma.
What is going to happen if I ship this library UTF8 strings? Python
libraries just throw, because they do all kinds of non-Unicode-aware
operations on strings
or request Unicode strings explicitly. So anytime you want to ship
something to a library (or get something from STDIN) you have to
decode and encode.
As soon as you forget to, you get exceptions everywhere.
Posted by Julian 'Julik' Tarkhanov (Guest)
on 21.06.2006 21:23
(Received via mailing list)
On 21-jun-2006, at 19:17, Dmitry Severin wrote:
> .
>
> Who will ever cope with this issues?
> Probably this is what Julik meant  by "disaster"?
>
> Things shouldn't be that complicated.

What I meant is the description of how you get a Python program welded
from different libraries to be Unicode-aware.
If Ruby works like that I won't be happy. Basically, some libraries
accept Unicode in Python's 16bit form, some accept utf-8 bytestrings
and some can only grok ASCII and will
throw up anyway. These are not going to work on Python 3000 as I
understand.
Posted by Michal Suchanek (Guest)
on 22.06.2006 01:47
(Received via mailing list)
On 6/21/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |Since the strings aren't automatically converted I have to ensure that
> i.e. on input/output.  So I encourage universal character set model as
> far as it is applicable.  You may use UTF-8 or ISO8859-1 for universal
> character set.  I may use EUC-JP for it.

I do not see how converting the strings on input will make the
situation better than converting them later. The exact place where the
text is garbled because it is converted incorrectly does not change
the fact it is no longer usable, does it?
Well, it may be possible to detect characters that are invalid for
certain encoding either by scanning the string or by attempting a
conversion. But I would rather like optional checks that can be added
when something breaks or is likely to break rather than forced
conversion.
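
Such an optional check could look roughly like the following, assuming the `valid_encoding?` and `encode` methods and the `Encoding` exception classes that Ruby 1.9 and later provide. The `convertible?` name is hypothetical:

```ruby
# Hypothetical optional check: first scans the string for byte sequences
# that are invalid in its tagged encoding, then attempts a trial conversion.
# Nothing is modified; the caller decides what to do with the answer.
def convertible?(str, target)
  return false unless str.valid_encoding?
  str.encode(target)
  true
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  false
end
```

A caller that distrusts a particular data source can run this check there and skip it everywhere else.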

Or to put it another way: If I get a string from somewhere where the
encoding is marked incorrectly it is wrong and it should be expected
to fail. And I can do some checks if I think my source of data is not
reliable in this respect. But if I get string that is marked correctly
and it fails because I did not manually convert it it is frustrating.
And needlessly so.

>
> For only rare case, there might be need to handle multiple encoding in
> an application.  I do want to allow it.  But I am not sure how we can
> help that kind of applications, since they are fundamentally complex.
> And we don't have enough experience to design a framework for such
> applications.

I do not think it is that rare. Most people want new web (or any other)
stuff in utf-8 but there is need to interface legacy databases or
applications. Sometimes converting the data to fit the new application
is not practical. For one, the legacy application may be still used as
well.

Anyway, Ruby being as dynamic as it is I should be able to add support
for automatic recoding myself quite easily. The problem is I would not
be able to use it in libraries (should I ever write some) without
risking a clash with similar feature added by somebody else.

Thanks

Michal
Posted by Yukihiro Matsumoto (Guest)
on 22.06.2006 04:36
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Thu, 22 Jun 2006 02:17:53 +0900, "Dmitry Severin" 
<dmitry.severin@gmail.com> writes:

|Things shouldn't be that complicated.

Agreed in principle.  But it seems to be fundamental complexity of the
world of multiple encoding.  I don't think automatic conversion would
improve the situation.  It would cause conversion error almost
randomly.  Do you have any idea to simplify things?

I am eager to hear.

							matz.
Posted by Yukihiro Matsumoto (Guest)
on 22.06.2006 04:46
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Thu, 22 Jun 2006 08:46:08 +0900, "Michal Suchanek" 
<hramrach@centrum.cz> writes:

|I do not see how converting the strings on input will make the
|situation better than converting them later. The exact place where the
|text is garbled because it is converted incorrectly does not change
|the fact it is no longer usable, does it?

It does.  But if you convert encoding lazily, you will have a hard time
tracking down the source of the error-causing data.  It may be input
data from IO, or from some GUI toolkit, or the result of an operation
with a variety of sources.

|> For only rare case, there might be need to handle multiple encoding in
|> an application.  I do want to allow it.  But I am not sure how we can
|> help that kind of applications, since they are fundamentally complex.
|> And we don't have enough experience to design a framework for such
|> applications.
|
|I do not think it is that rare. Most people want new web (or any other)
|stuff in utf-8 but there is need to interface legacy databases or
|applications. Sometimes converting the data to fit the new application
|is not practical. For one, the legacy application may be still used as
|well.

I understand the challenge, but I don't think it is common to run some
part of your program in legacy encoding (without conversion), and
other part in UTF-8.  You need to convert them into universal encoding
anyway for most of the cases.  That's why I said it is rare.

							matz.
Posted by Lugovoi Nikolai (Guest)
on 22.06.2006 08:58
(Received via mailing list)
2006/6/22, Yukihiro Matsumoto <matz@ruby-lang.org>:
> randomly.  Do you have any idea to simplify things?
>
> I am eager to hear.
>



So what will be the semantics of the encoding tag:
 a) weak suggestion?
 b) strong assertion?

If encoding tag is only weak suggestion (and for now I see it will be
just that), it will imply:
  - performance win (no need to check conformance to told encoding)
  - win in having less complexity (most tasks use source code, text
data input and output all in the same [default host] encoding)
  - portability drawbacks (assumptions made by original coders will be
implicit, but they have to be figured out, when porting to another
environment)
  - reliability drawbacks (weak suggestions are too often ignored, and
you don't know when, where and why they will hit your app, but someday
they will!)

If encoding tag is strong assertion, it will imply:
  - probable performance loss:
     * to assure this string with encoding = "none" (raw) represents
valid encoding sequence of bytes,
       at the same price as String#length
     * need to recode bytes, when changing tag
  - slightly more complexity (developers will have to declare these
assertions explicitly)
  - portability win
  - reliability win

What compromise on these issues would be acceptable?

I'd prefer encoding tag as strong assertion, mostly for reliability reasons.

And for operations on Strings with different encodings, I'd like
implicit automatic encoding coercion:
-------------------------------
#
# NOTES:
#  a) String#recode!(new_encoding) replaces current internal byte
#     representation with new byte sequence, that is recoded current.
#     must raise IncompatibleCharError, if can't convert char to
#     destination encoding
#  b) downgrading string from some stated encoding to "none"  tag must
#     be done only explicitly.
#     it is not an option for implicit conversion
#  c) $APPLICATION_UNIVERSAL_ENCODING is a global var, allowed to be
#     set once and only once per application run.
#     Intent: we want all strings which aren't raw bytes to be in one
#     single predefined encoding,
#     so all operations on string must return string in conformant encoding.
#     Desired encoding is value of $APPLICATION_UNIVERSAL_ENCODING.
#     If $APPLICATION_UNIVERSAL_ENCODING is nil, we go in "democracy
#     mode", see below.
#
def coerce_encodings(str1, str2)
   enc1 = str1.encoding
   enc2 = str2.encoding

   # simple case, same encodings, will return fast in most cases
   return if enc1 == enc2

   # another simple but rare case, totally incompatible encodings, as
   # they represent incompatible charsets
   if fully_incompatible_charsets?(enc1, enc2)
      # raise takes a single message, so build it with format first
      raise IncompatibleCharError,
            format("incompatible charsets %s and %s", enc1, enc2)
   end

   # uncertainty, handling "none" and preset encoding
   if enc1 == "none" || enc2 == "none"
      raise UnknownIntentEncodingError,
            format("can't implicitly coerce encodings %s and %s, use explicit conversion", enc1, enc2)
   end

   # Tyranny mode:
   # we want all strings which aren't raw bytes to be in one single
   # predefined encoding
   if $APPLICATION_UNIVERSAL_ENCODING
      str1.recode!($APPLICATION_UNIVERSAL_ENCODING)
      str2.recode!($APPLICATION_UNIVERSAL_ENCODING)
      return
   end

   # Democracy mode:
   # first try to perform non-loss conversion from one encoding to another:
   # 1) direct conversion, without loss, to another encoding, e.g. UTF8 + UTF16
   if exists_direct_non_loss_conversion?(enc1, enc2)
      if exists_direct_non_loss_conversion?(enc2, enc1)
         # performance hint if both available: recode the shorter string
         if str1.byte_length < str2.byte_length
            str1.recode!(enc2)
         else
            str2.recode!(enc1)
         end
      else
         str1.recode!(enc2)
      end
      return
   end
   if exists_direct_non_loss_conversion?(enc2, enc1)
      str2.recode!(enc1)
      return
   end

   # 2) non-loss conversion to superset
   # (I see no reason to raise exception on KOI8R + CP1251, returning
   # string in Unicode will be OK)
   if (superset_encoding = find_superset_non_loss_conversion?(enc1, enc2))
      str1.recode!(superset_encoding)
      str2.recode!(superset_encoding)
      return
   end

   # A case for incomplete compatibility:
   # Check if subset of enc1 is also subset of enc2,
   # so some strings in enc1 can be safely recoded to enc2,
   # e.g. two pure ASCII strings, whatever ASCII-compatible encoding
   # they have
   if exists_partial_loss_conversion?(enc1, enc2)
      if exists_partial_loss_conversion?(enc2, enc1)
         # performance hint if both available
         if str1.byte_length < str2.byte_length
            str1.recode!(enc2)
         else
            str2.recode!(enc1)
         end
      else
         str1.recode!(enc2)
      end
      return
   end

   # the last thing we can try
   str2.recode!(enc1)
end
---------------------------

So, when an operation involves two Strings, or a String and a Regexp,
with different encodings, automatic coercion should be done as
described above.
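
The superset step of the proposal can be tried out with the String API
that later Ruby releases (1.9 onward) ended up shipping. This is only a
sketch under that assumption: coerce_to_common is a hypothetical helper,
and the real proposal above also covers the "none" and tyranny cases,
which are left out here.

```ruby
# Hypothetical helper: bring two strings into a common encoding.
# Falls back to UTF-8 as the superset (an assumption for this sketch).
def coerce_to_common(str1, str2, fallback = Encoding::UTF_8)
  # Fast path: Ruby already considers the pair compatible
  # (same encoding, or one operand is ASCII-only).
  return [str1, str2] if Encoding.compatible?(str1, str2)

  # Otherwise recode both to the superset. This raises
  # Encoding::UndefinedConversionError if a character has no mapping.
  [str1.encode(fallback), str2.encode(fallback)]
end

koi = "\u043F\u0440\u0438\u0432\u0435\u0442".encode("KOI8-R")  # Russian text, KOI8-R
cp  = "\u043C\u0438\u0440".encode("Windows-1251")              # Russian text, CP1251

a, b = coerce_to_common(koi, cp)
joined = a + " " + b
puts joined.encoding == Encoding::UTF_8   # true
```

Without the helper, koi + cp raises Encoding::CompatibilityError in
those releases, which is the "no implicit coercion" behaviour this post
argues against.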

That will probably solve the coding problems (no need to think about
encodings most of the time), but can have the following impacts:
1) after several operations, when one sends a string to external IO, it
might be internally encoded in a superset of that IO's encoding. One has
to remember that and perform external IO accordingly, i.e. decide
whether to fail on invalid chars or to use replacement chars (like
U+FFFD), but that is unavoidable.
2) some performance hits, which I expect to be rare.
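
The choice in point 1 (fail or substitute) can be illustrated with
String#encode from later Ruby releases, which is an assumption relative
to the 1.8 under discussion. The sample text and target encoding are
made up for the example.

```ruby
# Text that has grown beyond what the external encoding can hold.
internal = "caf\u00E9 \u2192 \u043C\u0438\u0440"   # UTF-8: e-acute, arrow, Cyrillic

# Option 1: fail on characters the target encoding cannot represent.
strict_failed = false
begin
  internal.encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError => e
  strict_failed = true
  puts "cannot write: #{e.message}"
end

# Option 2: substitute a replacement character and carry on.
# ("?" is used here because U+FFFD itself does not exist in ISO-8859-1.)
lossy = internal.encode("ISO-8859-1", undef: :replace, replace: "?")
puts lossy.encode("UTF-8")   # the arrow and the Cyrillic letters became "?"
```

Either way the decision is made explicitly at the IO boundary, not
inside the string operations.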

Besides, there can be another class of problems with automatic
coercion: how to ensure that character ranges in Regexps and in String
methods like [count, delete, squeeze, tr, succ, next, upto] work
consistently when encodings are coerced?

What I, as a Ruby user, wish for Unicode/M17N support:
1) reliability and consistency:
  a) String should be an abstraction for a character sequence;
  b) String methods shouldn't allow me to garble the internal
     representation;
  c) treating String as a byte sequence is handy, but must be
     explicitly stated.
2) coding comfort:
  a) no need to care what encodings strings have while working with
     them;
  b) no need to care what encodings strings returned from third-party
     code have;
  c) using explicitly stated conversion options for external IO.
3) on Unicode and i18n : at least to have a set of classes for
Unicode-specific tasks (collation, normalization, string search,
locale-aware formatting etc.) that would efficiently work with Ruby
strings.
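
As one example of the Unicode-specific tasks in point 3, here is what
normalization looks like with String#unicode_normalize, which only
arrived in much later Ruby releases (2.2 onward). That is an assumption
relative to this thread; nothing like it was built in at the time.

```ruby
composed   = "\u00E9"    # e-acute as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

puts composed == decomposed                           # false: different code points
puts composed == decomposed.unicode_normalize(:nfc)   # true after normalizing
puts composed.length                                  # 1
puts decomposed.length                                # 2
```

Both strings render identically, which is why search and comparison
need normalization as a separate, explicit step.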

And, for all out there, just ask "Which charset/encoding will fit all
the [present and future] needs?". You know the exact answer: "NONE".

> I understand the challenge, but I don't think it is common to run some
> part of your program in legacy encoding (without conversion), and
> other part in UTF-8.  You need to convert them into universal encoding
> anyway for most of the cases.  That's why I said it rare.

uhm, how to convert compiled extension library?
Posted by Yukihiro Matsumoto (Guest)
on 22.06.2006 10:18
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Thu, 22 Jun 2006 15:55:18 +0900, "Lugovoi Nikolai" 
<meadow.nnick@gmail.com> writes:
|> I am eager to hear.
|
|So what will be semantic for encoding tag:
| a) weak suggestion?
| b) strong assertion?

Weak suggestion, if I understand you correctly.

|I'd prefer encoding tag as strong assertion, mostly for reliability reasons.

Hmm, your idea of combination of strong assertion and automatic
conversion seems too complex for me, but it may be worth considering.
Thank you for idea.

|uhm, how to convert compiled extension library?

Every extension that does input/output needs to specify (either
explicitly or implicitly) the encoding it uses anyway.  I will add
an encoding option to rb_tainted_str_new() and its family.  If it's
possible, I'd like to allow extensions to declare their default
encoding in their initialize function (Init_xxx).

							matz.
Posted by Michal Suchanek (Guest)
on 22.06.2006 12:42
(Received via mailing list)
On 6/22/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Weak suggestion, if I understand you correctly.
>
> |I'd prefer encoding tag as strong assertion, mostly for reliability reasons.
>
> Hmm, your idea of combination of strong assertion and automatic
> conversion seems too complex for me, but it may be worth considering.
> Thank you for idea.

What I had in mind was much simpler. If the strings do not match just
try to recode to the default encoding which would be unicode most of
the time. Or just try to find a superset.

>
> |uhm, how to convert compiled extension library?
>
> Every extension that does input/output need to specify (either
> explicitly or implicitly) encoding it uses anyway.  I will add
> an encoding option to rb_tainted_str_new() and its family.  If it's
> possible, I'd like to allow extensions to declare their default
> encoding in their initialize function (Init_xxx).
>

But if recoding is not automatic you still have to recode the strings
manually. Both the input to the extension and the results. That is an
annoyance and means repetitive code everywhere.

Thanks

Michal
Posted by Juergen Strobel (Guest)
on 22.06.2006 17:32
(Received via mailing list)
On Wed, Jun 21, 2006 at 01:04:55AM +0900, Tim Bray wrote:
> details and, when you were ready to output, allowed you to say "Give  
> me that in ISO-8859 or UTF-8 or whatever". -Tim

That's basically what I suggested. The problems seem to be mainly
non-Unicode demands on the one hand, and performance issues on the
other. And it makes Strings useless as byte buffers, since you have to
specify the encoding of the external representation you create the
String from at creation time. To recap:

Private extensions to Unicode are deemed too complex to implement
(Matz).

Transforming legacy or special (non Unicode) data to a ruby-private
internal storage format on I/O is too performance/space intensive
(Matz).

Strings as byte buffers are important to some people, and they don't
want to use another class or array for it, even if RegExp et al would
be extended to handle these too.

While it would be proper OO design, encapsulating the internal String
implementation hampers direct access to the "raw" data for C-hackers,
creating unwanted hurdles, and again performance issues.


I am still not convinced the arguments against this approach really
will hold in the long run, but since I am not the one implementing it
and can't really participate there due to language barriers, I can
only lean back and wait for the first release of M17N. Learning
English was hard enough for me.

-Jürgen
Posted by Alexey Borzenkov (snaury)
on 25.06.2006 16:41
Yukihiro Matsumoto wrote:
> Alright, then what specific features are you (both) missing?  I don't
> think it is a method to get number of characters in a string.  It
> can't be THAT crucial.  I do want to cover "your missing features" in
> the future M17N support in Ruby.

Sorry for butting in, but here are my 5 cents. When I first found out
about Ruby, I practically fell in love with the language.
Unfortunately, after some studying and experimenting I found that it
lacks proper Unicode support on Win32, in particular with file IO and
OLE automation, i.e. in the two cases where I had to interoperate with
the rest of the world.

Win32 really differs from Linux and maybe other Unixes in its API: on
*nix you don't have to worry about Unicode or whatever, because the
whole system depends on your current locale. On Win32 there are two
sets of APIs, ANSI and Unicode. Maybe that was a bad decision by
Microsoft, but that's the reality.

Now, I am Russian, and when I write scripts I have to make sure that
not only Russian characters don't get messed up, but characters of
other languages as well. If I receive, say, an Excel file with a lot of
languages in it and have to process that file somehow, I have to be
sure that no letters will be lost or messed up, so converting it to the
current code page (1251) is not an option for me. The same goes for
filenames: the fact that I'm running Russian WinXP doesn't mean that I
only have filenames that fit in code page 1251. I also have filenames
with European characters (umlauts and such), as well as Japanese ones,
and when I want to write a script that processes these files, I have to
be able to work with them.

At that time this caused me to move to Tcl (it uses UTF-8 everywhere
and converts to the required encoding when interoperating with the
world). Since then I have been waiting for proper Unicode support in
Ruby (read: proper interoperability with the operating system and its
components using the Unicode API versions, the ones ending with W) and
maybe a way to define in which locale (specific code page, UTF-8, etc.)
the current script is running.

Hope that clarifies what is currently missing for me (and maybe others, 
I don't know).
Posted by Yukihiro Matsumoto (Guest)
on 25.06.2006 17:11
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Sun, 25 Jun 2006 23:41:48 +0900, Snaury Miyoto <snaury@gmail.com> 
writes:

|Hope that clarifies what is currently missing for me (and maybe others, 
|I don't know).

Unfortunately, not.  I understand Russian people have problems with
multiple encodings, but I don't know how we can help you.

You said Tcl has Unicode support that works well for you.  So I
think treating all of them in UTF-8 is OK for you.  Then how can it
determine which should be in the current code page, or in Unicode?
Or would using the Win32 APIs ending with W allow you to live in
Unicode?

							matz.
Posted by Izidor Jerebic (Guest)
on 25.06.2006 19:19
(Received via mailing list)
On 22.6.2006, at 10:17, Yukihiro Matsumoto wrote:

>
> |I'd prefer encoding tag as strong assertion, mostly for  
> reliability reasons.
>
> Hmm, your idea of combination of strong assertion and automatic
> conversion seems too complex for me, but it may be worth considering.

Strong assertion + auto conversion is the only solution which will
relieve programmers from manually checking/changing string encodings
in their programs.

Remember, string input/output points in a program are not only system
IO classes, but also all the third party libraries/classes which deal
with strings. That means most of the existing Ruby libraries, plus
other external (e.g. Java) libraries which can be used from Ruby.

The assumption that only system IO is the entry/exit point for string
encoding is very wrong. This assumption holds only for scripts which
use no third party libraries.

So we have two possibilities:
a) every programmer is forced to implement the above solution in
every program (this is starting to happen already, and current
experience tells us that the future in this direction is disaster!)
b) Ruby interpreter implements this solution, and programmers happily
ignore all the complexity.

So, it is true that we move the complexity into Ruby, but this is
(IMHO) much less complicated and much more needed than e.g.
infinitely big integers which we already have.

If Ruby wants to move forward, it needs transparent String support
and hopefully separation of String and ByteArray, since the lack of
separation brought us code which is mostly wrong (currently most
existing Ruby code breaks if string encoding is honoured, as can be
seen from the experience of brave people who modified the String class).

Ruby is my favourite language, and if it would have String support as
suggested, software development would be just pure joy...

Please listen to the people who tell of disastrous experiences in
other languages. And as for good experience: I have developed in Cocoa
on Mac OS X for many, many years, and it has a great String class (ok,
the suggested Ruby class would be even better, but still). Plus it has
separate String and byte array classes. The results are superb. There
are no problems, and nobody ever worries about strings and encodings.
Ever. You can check the mailing lists.


izidor
Posted by Julian 'Julik' Tarkhanov (Guest)
on 25.06.2006 19:28
(Received via mailing list)
On 25-jun-2006, at 19:18, Izidor Jerebic wrote:
>
> Please listen to the people which tell of disastrous experience in  
> other languages. And for good experience, I develop in Cocoa in Mac  
> OS X for many many years, and it has great String class (ok, the  
> suggested Ruby class would be even better, but still). Plus it has  
> separated String and Byte array. The results are superb. There is  
> no problems, and nobody ever worries about strings and encodings.  
> Ever. You can check the mailing lists.

The greatest thing about Cocoa is that I can expect 99 percent
of the programs I use to do The Right Thing when I want to input Russian
text in there, and NOT because the programmer did something special to
make it work. Because if he had to, he wouldn't. In contrast, 70
percent of Carbon applications are not even capable of displaying the
text properly (let alone letting me type it in).
Posted by Austin Ziegler (austin)
on 25.06.2006 21:08
(Received via mailing list)
On 6/25/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Or using Win32 API ending with W could allow you living in the
> Unicode?

Matz,

I've mentioned it before, but I will be happy to make the Windows APIs
work with Unicode once the m17n Strings exist. Yes, I will be making
them use either UTF-8 (conversion required, most likely to be compatible
with existing code) or UTF-16 (no conversion required). It will work
well: I have done a similar implementation for code that I have written
at work.

-austin
Posted by Austin Ziegler (austin)
on 25.06.2006 21:15
(Received via mailing list)
On 6/25/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> If Ruby wants to move forward, it needs transparent String support and
> hopefully separation of String and ByteArray, since this un-
> separation brought us code which is mostly wrong (currently most of
> existing Ruby code breaks if string encoding is honoured, as can be
> seen from experience of brave people who modified String class).

This is an incorrect and unsupportable statement. It is completely
unnecessary to separate unencoded (e.g., binary) String support into
String and ByteArray.

Please don't try to assume that the problem is this completely
unnecessary division. The problem is that existing strings are
completely unencoded and have no way of being flagged with an encoding
that is supported in any way across all of Ruby.

People are making really *stupid* assumptions based on what choices
other development teams have made, and it's irritating.

Ruby does not need a String with an internal representation in Unicode;
Ruby does not need a separate byte vector. An unencoded string can be
treated as a byte vector with no problems; if it is determined to have
textual meaning, it can be tagged with an encoding very simply and from
that point be treated as a meaningful string. There are times when the
encoding is *not* best treated in Unicode, especially if there are
potential conversion errors.
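
To make the "tag it later" idea concrete, here is a sketch using the
force_encoding API of later Ruby releases (an assumption; the m17n
String did not exist yet when this was written). The byte values are
made up for the example.

```ruby
# Six bytes with no declared meaning, held in a binary string.
raw = "\xD0\xBC\xD0\xB8\xD1\x80".b
puts raw.encoding == Encoding::BINARY   # true
puts raw.length                         # 6: one element per byte

# Once the bytes are known to be text, tag a copy with an encoding.
# No bytes are changed, only the interpretation.
text = raw.dup.force_encoding(Encoding::UTF_8)
puts text.valid_encoding?               # true
puts text.length                        # 3: now counted in characters
```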

-austin
Posted by Charles O Nutter (Guest)
on 25.06.2006 22:40
(Received via mailing list)
On 6/25/06, Austin Ziegler <halostatue@gmail.com> wrote:
>
> Ruby does not need a String with an internal representation in Unicode;
> Ruby does not need a separate byte vector. An unencoded string can be
> treated as a byte vector with no problems; if it is determined to have
> textual meaning, it can be tagged with an encoding very simply and from
> that point be treated as a meaningful string. There are times when the
> encoding is *not* best treated in Unicode, especially if there are
> potential conversion errors.
>

When is a ByteArray not a ByteArray? When is a String not a String? Is it
correct to mingle the two concepts perpetually, when they each have fairly
specific definitions? My problem with continuing to treat String as a byte
vector is that it forces two somewhat incompatible concepts on the same
class and the same methods. If you can use a String as both a byte vector
and as a sequence of characters by calling the same methods, then setting or
clearing encoding suddenly has the side-effect of changing how elements of
String are to be treated. If you are providing separate methods for working
with bytes as opposed to working with characters, then you are already
splitting the two concepts.

(As an aside, does it make sense that I read from a binary file into a
String? Can I reliably assume that binary content in a String should be
logically manipulable as text strings are? Should my binary String work
anywhere and everywhere a text-based String does? I would think that binary
content neither walks nor quacks like a String.)

By your definition, a String can be treated as a ByteArray so long as its
internal string does not have an encoding. What do I use if I want to have
an encoding and still use byte vector semantics?
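
For comparison, the String API in later Ruby releases answers this with
a separate family of byte-level methods that ignore the encoding tag.
This is a sketch assuming that later API, not something available in
1.8.

```ruby
s = "\u043C\u0438\u0440"    # three Cyrillic letters, tagged UTF-8

puts s.length               # 3 characters
puts s.bytesize             # 6 bytes
puts s.getbyte(0)           # 208 (0xD0), the first byte
p s.bytes.first(2)          # [208, 188]
p s.byteslice(0, 2) == "\u043C"   # true: the first two bytes form one character
```

Which is arguably the point being made here: separate methods for bytes
and characters already split the two concepts, even inside one class.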

Is it appropriate that a String is no longer usable as a ByteArray as a
result of changing some state? If there exists any state where String cannot
be logically treated as a byte array, then String != ByteArray in the
general case either. The encoding of a String's internal representation
should not dictate the outward behavior of the String.

If, however, you completely separate the two concepts, there's no dichotomy.
In that case, a String deals with characters, and you do not have guarantees
about byte-boundaries or indexed elements. You only have guarantees about
characters, as it should be. Simultaneously, ByteArray would allow you to
always work with a vector (array) of bytes, regardless of what those bytes
contain.

I'll end it off saying this: I think it's a no-brainer that for dealing with
streams of bytes, there should be a non-string byte vector class. If folks
are insistent on keeping them the same class, you can't logically continue
to call it a String and have it fulfill the dual purposes of byte vector and
character vector at the same time. If you plan to provide methods for
supporting both behaviors, you're putting two distinct behaviors into the
same type.

I understand the unwillingness to move away from String as a byte vector,
but with multibyte support coming you really can't have String == ByteArray
without causing problems somewhere. They simply don't have the same
behavior, and trying to pretend they do is asking for trouble.
Posted by Izidor Jerebic (Guest)
on 25.06.2006 22:46
(Received via mailing list)
On 25.6.2006, at 21:12, Austin Ziegler wrote:

> String and ByteArray.
Well, if it is a byte array, it is not a String (an array of
characters), is it?

If Ruby would have RegEx operations on byte arrays, there would be no
need for untyped quasi String. API that has two incompatible things
as one class is just plain ugly and wrong.

Reading jpeg image in a String is totally wrong.  You need bytes. You
get characters, but they aren't really characters, they are bytes.
Until something happens (maybe) and they are characters (maybe), or
they are not (maybe). img_var[5] is what? 6th byte? 9th 2 bytes if
encoding is utf8? What exactly? This is a clear API? There is no need
for bytes masquerading as Strings. None. This practice just confuses
the writer and the reader of the code. You need either bytes or
Strings. Never both in the same variable. They are semantically
totally different. At least they should be (we would not have
problems if people would honour this distinction).
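
The ambiguity of img_var[5] can be shown with later Ruby releases, where
the same bytes answer differently depending on the encoding tag. The
sample data is made up, and the API is an assumption relative to this
thread.

```ruby
data     = "\u65E5\u672C\u8A9E abc"   # three Japanese characters, a space, "abc"
as_text  = data                       # tagged UTF-8
as_bytes = data.b                     # same bytes, tagged binary

p as_text[5]            # "b": the sixth character
p as_bytes.getbyte(5)   # 172 (0xAC): the sixth byte, in the middle of a character
puts as_text.length     # 7
puts as_bytes.length    # 13
```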

>
> Please don't try to assume that the problem is this completely
> unnecessary division. The problem is that existing strings are
> completely unencoded and have no way of being flagged with an encoding
> that is supported in any way across all of Ruby.

The problem is exactly this: the separation between bytes and
characters. This is the general problem we have and discuss right
now. API should help us solve the problem.

And you apparently missed all the attempts to extend String (also
with encodings a la 1.9) that failed because of existing software,
not because of Ruby.
>
> Ruby does not need a String with an internal representation in  
> Unicode;

Nobody says at this point of the conversation that we need an internal
representation in Unicode for all strings. We just want to avoid
thinking about ANY encoding. We have other things to do. So having
transparent conversions between compatible encodings is a must.

> Ruby does not need a separate byte vector. An unencoded string can be
> treated as a byte vector with no problems.
> ; if it is determined to have
> textual meaning, it can be tagged with an encoding very simply

It can be, but it is not and will not be. Do you read emails? The
problem is that people do not do things like that. And then other
people have problems. If all the code you run is yours, then you are
right. For many people that is not true.

> . There are times when the
> encoding is *not* best treated in Unicode, especially if there are
> potential conversion errors.

Why do you keep on about this?

Once again - WE DO NOT CARE WHAT ENCODING IS THERE. We just want the
string operations to work without any extra programming work when
operands have compatible encodings.

As written very well by Lugovoi Nikolai:

>  b) no need to care what encodings have strings returned from
> third-party code;
>  c) using explicit stated conversion options for external IO.
> 3) on Unicode and i18n : at least to have a set of classes for
> Unicode-specific tasks (collation, normalization, string search,
> locale-aware formatting etc.) that would efficiently work with Ruby
> strings.

Me too, please.


izidor
Posted by Michal Suchanek (Guest)
on 26.06.2006 00:16
(Received via mailing list)
On 6/25/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> >
> > This is an incorrect and unsupportable statement. It is completely
> > unnecessary to separate unencoded (e.g., binary) String support into
> > String and ByteArray.
>
> Well, if it is a byte array, it is not a String (an array of
> characters), is it?
>
> If Ruby would have RegEx operations on byte arrays, there would be no
> need for untyped quasi String. API that has two incompatible things
> as one class is just plain ugly and wrong.

Here you contradict yourself. Regexes are string (character)
operations, and you want them on byte arrays. So the concepts aren't
really separate. Similarly, when you read part of a file, and use it
to determine what kind of file it was you do not want to convert that
part into another class or re-read it because somebody decided String
and ByteVector are separate.
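
A sketch of that use case: sniffing a file type from its first bytes
with a byte-oriented regexp. The /n flag and String#b are from later
Ruby releases, and png? is a hypothetical helper written for this
example.

```ruby
# Hypothetical helper: does this chunk start with the PNG signature?
def png?(header)
  # /n makes the pattern a binary regexp; .b views the chunk as bytes.
  !(/\A\x89PNG\r\n\x1A\n/n =~ header.b).nil?
end

png_header = "\x89PNG\r\n\x1A\n\x00\x00\x00\rIHDR"
gif_header = "GIF89a\x01\x00\x01\x00"

puts png?(png_header)   # true
puts png?(gif_header)   # false
```

The chunk stays an ordinary String throughout; no separate byte class
or re-read is needed to run the pattern over it.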

Plus this has been already mentioned here.

Michal
Posted by Phillip Hutchings (Guest)
on 26.06.2006 00:25
(Received via mailing list)
> Here you contradict yourself. Regexes are string (character)
> operations, and you want them on byte arrays. So the concepts aren't
> really separate. Similarly, when you read part of a file, and use it
> to determine what kind of file it was you do not want to convert that
> part into another class or re-read it because somebody decided String
> and ByteVector are separate.

Why not? When I read CGI params I get them as strings, but if I want
to add them together I need to convert them to integers, because
someone decided that "1" != 1. This is a good thing, so you don't get
"5 purple elephants"+"3 monkeys" = 8, like you do in PHP. Likewise,
when you read from a file/socket/whatever you might not be getting a
real string, you might be getting a byte array. They are fundamentally
different things: a byte array may happen to contain text at some
point, but some time later it may be just a stream of data. Conversely,
a String _always_ contains human-readable text in whatever encoding you
want.
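
In Ruby terms the analogy looks like this (a small illustration using
only core methods; the strings are the ones from the paragraph above):

```ruby
a = "5 purple elephants"
b = "3 monkeys"

puts a + b             # String + String concatenates
puts a.to_i + b.to_i   # explicit, lenient conversion: 8

# Strict conversion refuses text that is not purely a number.
begin
  Integer(a)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end

# Mixing the types without converting is an error, not a guess.
begin
  a + 1
rescue TypeError => e
  puts "type error: #{e.message}"
end
```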

As someone who has to work with Unicode in PHP, I'd say it's important
to separate the types. If you want to display something to a user you
have to know what it is, but when you're reading a file you don't
care, unless you know what's in it.

A Unicode String could be a subclass of the byte array with some
niceties for dealing with multibyte characters. Just a thought.
Posted by Tim Bray (Guest)
on 26.06.2006 00:37
(Received via mailing list)
On Jun 25, 2006, at 1:45 PM, Izidor Jerebic wrote:

> Well, if it is a byte array, it is not a String (an array of  
> characters), is it?

+1 to this and to Nutter previously.  Text strings and byte arrays
are different kinds of things and both are useful and I don't see any
benefit from trying to pretend they're the same thing.  But some
apparently-smart people seem to think there is a benefit; perhaps
they could explain it in simple terms for those of us insufficiently-
clued to see it? -Tim
Posted by Yukihiro Matsumoto (Guest)
on 26.06.2006 00:59
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Mon, 26 Jun 2006 05:38:46 +0900, "Charles O Nutter" 
<headius@headius.com> writes:

|When is a ByteArray not a ByteArray? When is a String not a String? Is it
|correct to mingle the two concepts perpetually, when they each have fairly
|specific definitions? My problem with continuing to treat String as a byte
|vector is that it forces two somewhat incompatible concepts on the same
|class and the same methods.

A string is a sequence of data that can be represented by small
integers.  Some may want to treat them as CharacterStrings, others may
want to treat them as ByteStrings.  They are not as different as you
say.  On many platforms, a file can contain text data or binary data.
Is a chunk of data read from an open file text, or binary?  If you
separate ByteArray and (Character) String, you will need to have two
separate IO classes, BinaryIO and TextIO, etc.  Or you will need
explicit conversion from the ByteArray you read to CharacterString.
That makes Ruby programs look a lot like Java programs, which I don't
want them to be.
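
For reference, this is roughly how a single IO class can serve both
roles in later Ruby releases: the open mode decides how the returned
String is tagged. The API shown is an assumption relative to this
thread, and the file contents are made up.

```ruby
require "tempfile"

as_bytes = as_text = nil

Tempfile.create("sample") do |f|
  f.binmode
  f.write("\u043C\u0438\u0440")   # six bytes of UTF-8 text
  f.flush

  # Same File class, same read call; only the mode differs.
  as_bytes = File.open(f.path, "rb")      { |io| io.read }
  as_text  = File.open(f.path, "r:UTF-8") { |io| io.read }
end

puts as_bytes.encoding == Encoding::BINARY   # true
puts as_bytes.length                         # 6
puts as_text.encoding == Encoding::UTF_8     # true
puts as_text.length                          # 3
```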

One of the good properties of the Ruby class library is its small
number of classes.  A class might have multiple roles.  For example, a
Ruby Array can be treated as a Stack, a Queue, etc.  And that is a good
thing, rather than having separate classes for each role.  Why can't
Strings be both sequences of text and binary data?

							matz.