Ruby Forum Ruby > Unicode roadmap?

Posted by Roman Hausner (rhaus)
on 13.06.2006 23:12
In my opinion, Ruby is practically useless for many applications without 
proper Unicode support. How a modern language can ignore this issue is 
really beyond me.

Is there a plan to get Unicode support into the language anytime soon?
Posted by Yukihiro Matsumoto (Guest)
on 14.06.2006 00:28
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner 
<roman.hausner@gmail.com> writes:
|In my opinion, Ruby is practically useless for many applications without 
|proper Unicode support. How a modern language can ignore this issue is 
|really beyond me.

Define "proper Unicode support" first.

|Is there a plan to get Unicode support into the language anytime soon?

I'm planning to enhance Unicode support in 1.9 in a year or so
(finally).  But I'm not sure that will conform to your definition of
"proper Unicode support".  Note that 1.8 handles Unicode (UTF-8) if your
string operations are based on Regexp.

							matz.
Posted by Pete (Guest)
on 14.06.2006 00:38
(Received via mailing list)
> Define "proper Unicode support" first.

having a Unicode equivalent for all methods of class String

like size, slice, upcase

E.g. I tried the unicode plugin... but, alas, who wants to write stuff
like 'normalize_KC' etc. if you just want the frickin' substring of a
string?!

you need to read books on Unicode just to properly use the plugin...

aargg :-((

Best regards
Peter




Posted by Logan Capaldo (Guest)
on 14.06.2006 00:51
(Received via mailing list)
On Jun 13, 2006, at 6:34 PM, Pete wrote:

>> Define "proper Unicode support" first.
>
> having a Unicode equivalent for all methods of class String
>
> like size, slice, upcase
>
> E.g. I tried the unicode plugin... but, alas, who wants to write
> stuff like 'normalize_KC' etc. if you just want the frickin'
> substring of a string?!
>

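# Character-wise substring via Regexp; returns nil if str is too short.
def substring(str, start, len)
   md = str.match(/\A.{#{start}}(.{#{len}})/m)   # /m lets . match newlines too
   md && md[1]
end


# Character count: scan yields one array element per matched character.
def strlength(str)
   str.scan(/./m).length
end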


See! Regexps do everything!

Just so you know, set $KCODE and use these methods and you are set!

(I am kidding... btw)
Posted by Pete (Guest)
on 14.06.2006 01:00
(Received via mailing list)
From the theoretical point of view this is quite interesting. Also I
understand the humor :-)

Performance and memory consumption should be breathtaking using regexp
just everywhere...

Also there are a ____few____ methods left :-)

As I am German the 'missing' unicode support is one of the greatest
obstacles for me (and probably all other Germans doing their stuff
seriously)...


Posted by Victor Shepelev (Guest)
on 14.06.2006 01:13
(Received via mailing list)
From: Pete [mailto:pertl@gmx.org]
Sent: Wednesday, June 14, 2006 1:58 AM
> As I am German the 'missing' unicode support is one of the greatest
> obstacles for me (and probably all other Germans doing their stuff
> seriously)...

The same goes for Russians/Ukrainians. In our programming communities the
question "does the programming language support Unicode as 'native'?" has
very high priority.

/BTW, here is one of the things where Python beats Ruby completely

V.
Posted by James Moore (Guest)
on 14.06.2006 01:59
(Received via mailing list)
I suspect the Japanese posters on this list can answer better than I can,
but my impression is that Unicode is, shall we say, not highly thought of
outside Europe and North America.  The way they dealt with "Chinese"
characters was apparently more than a bit of a hack, and just doesn't work
very well in the real world.  Reading some of the explanations for glyphs
versus characters in Unicode just makes you shake your head.  What were they
thinking?  Sure doesn't pass the smell test, although I'll be the first to
admit I haven't exactly thought deeply about the subject.

There's another problem with Japanese - I've got a friend who's been dealing
with some issues around the fact that Japanese apparently innovates new
characters on a regular basis, and everyone is expected to use the new
characters.  (I believe this is called gaiji).  The concept of a fixed
character set apparently just isn't a good idea to start with.

[Awaiting corrections from people who actually know something about this
topic :-)...]

 - James Moore
Posted by David Balmain (Guest)
on 14.06.2006 02:14
(Received via mailing list)
On 6/14/06, James Moore <banshee@banshee.com> wrote:
> with some issues around the fact that Japanese apparently innovates new
> characters on a regular basis, and everyone is expected to use the new
> characters.  (I believe this is called gaiji).  The concept of a fixed
> character set apparently just isn't a good idea to start with.
>
> [Awaiting corrections from people who actually know something about this
> topic :-)...]

There is a good summary of the han unification controversy on wikipedia;

    http://en.wikipedia.org/wiki/Han_unification
Posted by Mat Schaffer (Guest)
on 14.06.2006 03:16
(Received via mailing list)
On Jun 13, 2006, at 7:56 PM, James Moore wrote:
> topic :-)...]
I have one Japanese person here who's never heard of this gaiji
concept.  But it could be new and behind a generation gap of some
kind.  They do sure like to add symbols where they can, though.
Especially graphical star characters.  I see that a lot.
-Mat
Posted by Yukihiro Matsumoto (Guest)
on 14.06.2006 04:38
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 14 Jun 2006 08:11:49 +0900, "Victor Shepelev" 
<vshepelev@imho.com.ua> writes:

|From: Pete [mailto:pertl@gmx.org]
|Sent: Wednesday, June 14, 2006 1:58 AM
|> As I am German the 'missing' unicode support is one of the greatest
|> obstacles for me (and probably all other Germans doing their stuff
|> seriously)...
|
|The same is for Russians/Ukrainians. In our programming communities question
|"does the programming language supports Unicode as 'native'?" has very high
|priority.

Alright, then what specific features are you (both) missing?  I don't
think it is a method to get number of characters in a string.  It
can't be THAT crucial.  I do want to cover "your missing features" in
the future M17N support in Ruby.

							matz.
Posted by Victor Shepelev (Guest)
on 14.06.2006 07:29
(Received via mailing list)
From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 5:37 AM
> |The same is for Russians/Ukrainians. In our programming communities
> 							matz.
I suppose all we (non-English writers) need is to have all string-related
methods working. For now, I am thinking of plainly testing each string
method; some other classes can also be affected by Unicode (possibly
regexps and paths). Regexps seem to work fine (in my 1.9), but paths do
not: File.open with Russian letters in the path doesn't find the file.

More generally, it can make sense to have Unicode as the "base" mode, with
non-Unicode remaining as the "old, compatibility" mode.

Something like this.

V.
Posted by Pål Bergström (palb)
on 14.06.2006 07:54
Roman Hausner wrote:
> In my opinion, Ruby is practically useless for many applications without 
> proper Unicode support. How a modern language can ignore this issue is 
> really beyond me.
> 
> Is there a plan to get Unicode support into the language anytime soon?

I also think that this is very important.
Posted by Yukihiro Matsumoto (Guest)
on 14.06.2006 08:37
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 14 Jun 2006 14:26:30 +0900, "Victor Shepelev" 
<vshepelev@imho.com.ua> writes:

|I suppose, all we (non-English-writers) need is to have all string-related
|methods working. Just for now, I think about plain testing each string
|method; 

In that sense, _I_ am one of the non-English-writers, so that I can
suppose I know what we need.  And I have no problem with the current
UTF-8 support.  Maybe that's because Japanese don't have cases in our
characters.  Or maybe I'm missing something.  Can you show us your
concrete problems caused by Ruby's lack of "proper" Unicode support?

|also, some other classes can be affected by Unicode (possibly
|regexps, and paths). Regexps seem to work fine (in my 1.9), but paths do
|not: File.open with Russian letters in the path doesn't find the file.

Strange.  Ruby does not convert encoding, so there should be no
problem opening files, if you are using strings in the encoding your OS
expects.  If they differ, you have to specify (and convert) them
properly, regardless of how good the Unicode support is.

							matz.
Posted by Victor Shepelev (Guest)
on 14.06.2006 08:56
(Received via mailing list)
From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 9:35 AM
> 
> In that sense, _I_ am one of the non-English-writers, 

Sorry, Matz, I know, of course. But I know too little about Japanese to see
how close our tasks are. By "non-English writers" I should perhaps have said
"European languages" or so, which have common punctuation, LTR writing,
"words" and "whitespace" and so on. I have almost no knowledge of what
Japanese, Korean, Arabic, or Hebrew speakers need.

> so that I can
> suppose I know what we need.  And I have no problem with the current
> UTF-8 support.  Maybe that's because Japanese don't have cases in our
> characters.  Or maybe I'm missing something.  

Just what I've said above.

> Can you show us your
> concrete problems caused by Ruby's lack of "proper" Unicode support?

As mentioned in this topic, it's String#length, upcase, downcase,
capitalize.

BTW, does String#length work well for you?

Moreover, there seem to be some huge problems with paths containing Russian
letters; but I'm really not convinced that Ruby has to handle this.

> 
> |also, some other classes can be affected by Unicode (possibly
> |regexps, and paths). Regexps seem to work fine (in my 1.9), but paths do
> |not: File.open with Russian letters in the path doesn't find the file.
> 
> Strange.  Ruby does not convert encoding, so that there should be no
> problem opening files, if you are using strings in the encoding your OS
> expect.  If they are differ, you have to specify (and convert) them
> properly, no matter how Unicode support is.

Oh, this is a rather hard topic for me. I know Windows XP must support
Unicode file names; I see my filenames in Russian, but I know too little
about system internals to say whether they are really Unicode.

Leaving those problems aside, only the String problems remain, but these
are such basic core methods!

V.
Posted by Michael Glaesemann (Guest)
on 14.06.2006 09:09
(Received via mailing list)
On Jun 14, 2006, at 15:56 , Victor Shepelev wrote:

> As mentioned in this topic, it's String#length, upcase, downcase,
> capitalize.

Just to chime in, aren't upcase, downcase, and capitalize a locale/
localization issue rather than a Unicode-only issue per se? For
example, different languages will have different rules for
capitalization. Or am I wrong? Does Unicode in and of itself address
these issues?

Granted, proper support for upcase, downcase, and capitalize is
important, but I think it's a separate issue, part of m17n as a whole
rather than support for Unicode in particular.

Michael Glaesemann
grzm seespotcode net
Posted by Vincent Isambart (Guest)
on 14.06.2006 09:15
(Received via mailing list)
Hi,

> As mentioned in this topic, it's String#length, upcase, downcase,
> capitalize.
>
> BTW, does String#length works good for you?

To have the length of a Unicode string, just do str.split(//).length,
or "require 'jcode'" at the beginning of your code.
For the other functions, try looking at the unicode library
http://www.yoshidam.net/Ruby.html#unicode
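
A minimal sketch of the byte count versus character count difference
discussed here. It assumes Ruby 1.9 or later, where strings carry their
encoding; under 1.8 the split(//) count is only right when $KCODE is set
to 'u'. The sample string is hypothetical.

```ruby
# Hypothetical sample: "Я люблю Ruby" written with \u escapes so the
# literal is UTF-8 regardless of the source file's encoding.
str = "\u042F \u043B\u044E\u0431\u043B\u044E Ruby"

byte_count      = str.unpack('C*').length  # one entry per raw byte
char_count      = str.split(//).length     # the idiom suggested above
codepoint_count = str.unpack('U*').length  # one entry per Unicode codepoint

puts "bytes=#{byte_count} chars=#{char_count} codepoints=#{codepoint_count}"
```

The two character-oriented counts agree with each other, while the byte
count is larger because each Cyrillic letter takes two bytes in UTF-8.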

> Oh, it's a bit hard theme for me. I know Windows XP must support Unicode
> file names; I see my filenames in Russian, but I have low knowledge of
> system internals to say, are they really Unicode?

Windows XP does support Unicode file names, but I'm not sure you can
use them with Ruby (I do not use Ruby much under Windows). Try
converting the file names to your current locale, it should work if
the file names can be converted to it. What I mean is that Russian
file names encoded in the Windows Russian encoding should work on a
Russian PC.

Hope this helps,

Cheers,
Vincent ISAMBART
Posted by Yukihiro Matsumoto (Guest)
on 14.06.2006 09:22
(Received via mailing list)
Hi,

In message "Re: Unicode roadmap?"
    on Wed, 14 Jun 2006 15:56:02 +0900, "Victor Shepelev" 
<vshepelev@imho.com.ua> writes:

|> Can you show us your
|> concrete problems caused by Ruby's lack of "proper" Unicode support?
|
|As mentioned in this topic, it's String#length, upcase, downcase,
|capitalize.

OK. Case is the problem.  I understand.

|BTW, does String#length works good for you?

I don't remember the last time I needed length method to count
character numbers.  Actually I don't count string length at all both
in bytes and characters in my string processing.  Maybe this is a
special case.  I am too optimized for Ruby string operations using
Regexp.

|Oh, it's a bit hard theme for me. I know Windows XP must support Unicode
|file names; I see my filenames in Russian, but I have low knowledge of
|system internals to say, are they really Unicode?

Windows 32 path encoding is a nightmare.  Our Win32 maintainers are often
troubled by unexpected OS behavior.  I am sure we _can_ handle Russian
path names, but we need help from Russian people to improve.

							matz.
Posted by Victor Shepelev (Guest)
on 14.06.2006 09:25
(Received via mailing list)
From: Michael Glaesemann [mailto:grzm@seespotcode.net]
Sent: Wednesday, June 14, 2006 10:08 AM
> On Jun 14, 2006, at 15:56 , Victor Shepelev wrote:
> 
> > As mentioned in this topic, it's String#length, upcase, downcase,
> > capitalize.
> 
> Just to chime in, aren't upcase, downcase, and capitalize a locale/
> localization issue rather than a Unicode-only issue per se? For
> example, different languages will have different rules for
> capitalization. 

Really? I know about two cases: European capitalization and no
capitalization.

But, really, you may be right. I suppose Florian Gross can say something
about German-specific capitalization issues.

> Granted, proper support for upcase, downcase, and capitalize is
> important, but I think it's a separate issue, part of m17n as a whole
> rather than support for Unicode in particular.

Maybe. Generally, sometimes I want Unicode, and sometimes (for "quick and
dirty" scripts) I'd prefer that capitalization and regexps "just work" with
Windows-1251 (a one-byte Russian encoding).

V.
Posted by Victor Shepelev (Guest)
on 14.06.2006 09:26
(Received via mailing list)
From: Vincent Isambart [mailto:vincent.isambart@gmail.com]
Sent: Wednesday, June 14, 2006 10:14 AM
> > As mentioned in this topic, it's String#length, upcase, downcase,
> > capitalize.
> >
> > BTW, does String#length works good for you?
> 
> To have the length of a Unicode string, just do str.split(//).length,
> or "require 'jcode'" at the beginning of your code.
> For the other functions, try looking at the unicode library
> http://www.yoshidam.net/Ruby.html#unicode

I know about it. But, theoretically speaking, such "core" methods must be
in the core, no?

> > > properly, no matter how Unicode support is.
> Russian PC.
Yes, they work. But I can't settle the question: does Ruby's Unicode
support need to include filename operations?

V.
Posted by Victor Shepelev (Guest)
on 14.06.2006 09:32
(Received via mailing list)
From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 10:20 AM
> OK. Case is the problem.  I understand.
> 
> |BTW, does String#length works good for you?
> 
> I don't remember the last time I needed length method to count
> character numbers.  Actually I don't count string length at all both
> in bytes and characters in my string processing.  Maybe this is a
> special case.  I am too optimized for Ruby string operations using
> Regexp.

I can confirm. But I'm afraid that some libraries I rely on use #length and
can break when #length doesn't work.

> |Oh, it's a bit hard theme for me. I know Windows XP must support Unicode
> |file names; I see my filenames in Russian, but I have low knowledge of
> |system internals to say, are they really Unicode?
> 
> Windows 32 path encoding is a nightmare.  Our Win32 maintainers often
> troubled by unexpected OS behavior.  I am sure we _can_ handle Russian
> path names, but we need help from Russian people to improve.

With the Russian encoding (Win-1251) on a Russian PC everything works well.
With Unicode it doesn't, but I'm not convinced it must.

In any case, I'm ready to spend my time helping the Ruby community
(especially with Russian/Ukrainian localization issues), because I really
love the language.

V.
Posted by Marcus Andersson (marcan)
on 14.06.2006 09:45
(Received via mailing list)
Yukihiro Matsumoto wrote:
> Hi,
>
> In message "Re: Unicode roadmap?"
>     on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner <roman.hausner@gmail.com> writes:
> |In my opinion, Ruby is practically useless for many applications without 
> |proper Unicode support. How a modern language can ignore this issue is 
> |really beyond me.
>
> Define "proper Unicode support" first.
>   
I won't define "proper Unicode support" here.

But there must be a problem somewhere since pure-ruby Ferret doesn't
support UTF-8. You need to use the c-extension of Ferret to have it
support UTF-8 (which doesn't work on Windows yet :( ). I don't know if
that is just a sucky impl of Ferret or if it's Ruby that makes it so.

Maybe Dave Balmain can enlighten us why UTF-8 doesn't work in the pure
Ruby version and what is needed of Ruby to make it work (if it's
actually Ruby's fault that is)?

My personal belief is that it should just work in a case like this if
the data coming in is UTF-8 and the search strings are UTF-8, without the
lib author and/or user having to do anything very special to make it work
(apart from specifying the encoding). Am I wrong in this?

Regards,

Marcus
Posted by Eric Hodel (Guest)
on 14.06.2006 10:23
(Received via mailing list)
On Jun 13, 2006, at 10:26 PM, Victor Shepelev wrote:
> Regexps seem to work fine (in my 1.9), but paths do
> not: File.open with Russian letters in the path doesn't find the file.

On OS X multibyte filenames work:

$ cat x.rb
$KCODE = 'u'

puts File.read('Cyrillic_Я.txt')
$ cat Cyrillic_\320\257.txt
test file with Я!
$ ruby x.rb
test file with Я!
$ uname -a
Darwin kaa.jijo.segment7.net 8.6.0 Darwin Kernel Version 8.6.0: Tue
Mar  7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
$ ruby -v
ruby 1.8.4 (2006-05-18) [powerpc-darwin8.6.0]
$

--
Eric Hodel - drbrain@segment7.net - http://blog.segment7.net
This implementation is HODEL-HASH-9600 compliant

http://trackmap.robotcoop.com
Posted by Paul Battley (Guest)
on 14.06.2006 10:55
(Received via mailing list)
On 14/06/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Windows 32 path encoding is a nightmare.  Our Win32 maintainers often
> troubled by unexpected OS behavior.  I am sure we _can_ handle Russian
> path names, but we need help from Russian people to improve.

str.sub!('32 path encoding ', '') # :-)

I don't use Windows much, but as I understand it, Ruby interacts with
most of the Win32 API using the 'legacy code page', which is only a
subset of what the filesystem can handle. (Windows NT and its
successors use Unicode internally, and the filesystem is UTF-16
KC-normalised IIRC). Windows does provide Unicode API functions, but
to use those, a layer of translation between UTF-16 and UTF-8 would be
needed, as Ruby can't do anything useful with UTF-16 at present. I
believe that Austin Ziegler was looking into this; I don't know if
he's made any progress.

Even if a Ruby program uses UTF-8 internally, it should be possible to
access the filesystem by Iconv'ing paths to the appropriate code page
- providing that they don't contain characters not in the code page.
It's far from ideal, though: the real solution is for Ruby to use the
Unicode functions (those suffixed with W) in the API. The upside is
that UTF-8/UTF-16 conversion should be less expensive than the code
page conversion that's inside each of Win32's non-Unicode functions.
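
A rough sketch of the code-page workaround described above, under stated
assumptions: it uses String#encode (Ruby 1.9 and later) in place of the
Iconv call the post mentions, and the file name and the convertible?
helper are hypothetical.

```ruby
# Hypothetical file name "Cyrillic_Я.txt", written with a \u escape so the
# literal is UTF-8 regardless of the source file's encoding.
utf8_path = "Cyrillic_\u042F.txt"

# Convert to the Russian Windows code page; each character becomes one byte.
ansi_path = utf8_path.encode("Windows-1251")

# Hypothetical helper: true if every character of path exists in code_page.
def convertible?(path, code_page)
  path.encode(code_page)
  true
rescue Encoding::UndefinedConversionError
  false
end

# A Japanese file name has no representation in the Russian code page.
japanese_ok = convertible?("\u65E5\u672C\u8A9E.txt", "Windows-1251")

puts "UTF-8 bytes: #{utf8_path.bytesize}, code page bytes: #{ansi_path.bytesize}"
puts "Japanese name convertible: #{japanese_ok}"
```

The failing conversion is the limitation noted above: paths with
characters outside the active code page cannot be reached this way.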

On the other hand, plenty of Windows programs don't support Unicode
properly either.

Paul.
Posted by Paul Battley (Guest)
on 14.06.2006 11:00
(Received via mailing list)
On 14/06/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> I can confirm. But I'm afraid that some libraries I rely on use #length and
> can break when #length doesn't work.

Those libraries should probably be considered broken; they can and
should be patched to do any human-readable-string processing in an
encoding-safe manner (e.g. by using jcode's jlength and each_char
methods).

Paul.
Posted by Peter Ertl (Guest)
on 14.06.2006 11:09
(Received via mailing list)
-------- Original Message --------
Date: Wed, 14 Jun 2006 17:58:41 +0900
From: Paul Battley <pbattley@gmail.com>
To: ruby-talk@ruby-lang.org
Subject: Re: Unicode roadmap?

> Paul.
That will be quite _some_ libraries, I guess...
Posted by Paul Battley (Guest)
on 14.06.2006 11:12
(Received via mailing list)
On 14/06/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> > Just to chime in, aren't upcase, downcase, and capitalize a locale/
> > localization issue rather than a Unicode-only issue per se? For
> > example, different languages will have different rules for
> > capitalization.
>
> Really? I know about two cases: European capitalization and no
> capitalization.

There is variety even within western European languages - Dutch, for
example, differs from English (IJsselmeer).

Paul.
Posted by Victor Shepelev (Guest)
on 14.06.2006 11:16
(Received via mailing list)
From: Paul Battley [mailto:pbattley@gmail.com]
Sent: Wednesday, June 14, 2006 12:10 PM
> example, differs from English (IJsselmeer).
I've already realized that. (I mentioned Florian Gross: the final "ss" of
his surname is normally printed as something like "B", which I can't type
and my Outlook can't show :) AFAIK, it is normally printed as one letter in
lowercase and two letters in uppercase. So a "single general"
String#upcase or #downcase is totally impossible.

V.
Posted by Michal Suchanek (Guest)
on 14.06.2006 11:25
(Received via mailing list)
On 6/14/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |
> |The same is for Russians/Ukrainians. In our programming communities question
> |"does the programming language supports Unicode as 'native'?" has very high
> |priority.
>
> Alright, then what specific features are you (both) missing?  I don't
> think it is a method to get number of characters in a string.  It
> can't be THAT crucial.  I do want to cover "your missing features" in
> the future M17N support in Ruby.
>

What I want is all methods working seamlessly with unicode strings so
that I do not have to think about the encoding.

Regexps do work with utf-8 strings if KCODE is set to u (but it
defaults to n even when locale uses UTF-8).

String searches should probably work but they would return the wrong
position. Things like split should work for utf-8, the encoding is pretty
well defined.

But one might want to use length and [] to work with strings.
It can be simulated with unicode_string=string.scan(/./). But it is no
longer a string. It is composed of characters only as long as I assign
only characters using []=.
The string functions should do the right thing even for utf-8. But I
guess utf-32 is more useful for working with strings this way.

It might be a good idea to stick encoding information into strings (it
is probably the only way how internationalization can be done and the
sanity of all involved preserved at the same time). The functions for
comparison, etc could use it to do the right thing even if strings
come in several encodings. ie. cp1251 from the system, utf-8 from a
web page, ...

Functions like open could convert the string correctly according to
locale. One should be able to set the encoding information (ie for web
page title when the meta tag for content type is found in a web
page), and remove it to suppress string conversion. It should be also
possible to convert the string (ie to UTF-32 to speed up character
access).

Things like <=>, upcase, downcase, etc make sense only in context of
locale (language). The encoding alone does not define them.
I guess the default <=> is based on the binary representation of the
string. This would mean different sorting of the same strings in
different encodings. Sorting by the unicode code point would be at
least the same for any encoding.

Thanks

Michal
Posted by Michal Suchanek (Guest)
on 14.06.2006 11:35
(Received via mailing list)
On 6/14/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> > capitalization.
>
> Really? I know about two cases: European capitalization and no
> capitalization.

Really.

There is no such thing as European capitalization. There is only
<insert your language> capitalization.
The German character ß has no uppercase version. In most languages
using Latin script the uppercase of 'i' is 'I'. But Turkish has i and
i without dot, and the uppercase of 'i' is, of course, I with dot.
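
A small sketch of these two cases, assuming Ruby 2.4 or later (much
newer than the 1.8/1.9 discussed in this thread), where String#upcase and
#downcase are Unicode-aware and accept a :turkic option. The sample
words are only illustrations.

```ruby
# "straße" with a \u escape so the literal is UTF-8 regardless of the
# source file's encoding. ß has no single-letter uppercase form here,
# so upcasing lengthens the string.
german_upper = "stra\u00DFe".upcase

# Default mapping versus Turkic mapping for the same ASCII letters.
default_upper = "i".upcase            # plain I
turkic_upper  = "i".upcase(:turkic)   # I with dot above (U+0130)
turkic_lower  = "I".downcase(:turkic) # dotless i (U+0131)

puts german_upper
puts [default_upper, turkic_upper, turkic_lower].join(" ")
```

The same input gives different results depending on the language rules
requested, which is why the encoding alone cannot define case conversion.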

Thanks

Michal
Posted by Paul Battley (Guest)
on 14.06.2006 11:41
(Received via mailing list)
On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> It should be also
> possible to convert the string (ie to UTF-32 to speed up character
> access).

utf8_string.unpack('U*') is pretty close to this, giving an array of
codepoints.
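
A short sketch of that round trip, with a hypothetical sample string:
unpack('U*') gives an array that can be indexed by character, and
pack('U*') turns a slice of it back into a UTF-8 string.

```ruby
# Hypothetical sample "Cyrillic_Я", written with a \u escape so the
# literal is UTF-8 regardless of the source file's encoding.
utf8_string = "Cyrillic_\u042F"

# One integer per character, so indexing is by character, not by byte.
codepoints = utf8_string.unpack('U*')

# Slice by character position, then pack back into UTF-8 strings.
first_eight = codepoints[0, 8].pack('U*')
last_char   = codepoints[9, 1].pack('U*')

puts "#{codepoints.length} codepoints, first eight: #{first_eight}"
```

The cost is the one raised in the reply: between unpack and pack the data
is an Array of integers, so none of the String methods apply to it.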

Paul.
Posted by Michal Suchanek (Guest)
on 14.06.2006 12:54
(Received via mailing list)
On 6/14/06, Paul Battley <pbattley@gmail.com> wrote:
> On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> > It should be also
> > possible to convert the string (ie to UTF-32 to speed up character
> > access).
>
> utf8_string.unpack('U*') is pretty close to this, giving an array of codepoints.

But I want it to be a string after the conversion, so that I can use
the standard string functions with sane results. I do not want to
think about various encodings myself if my application has to use
them. The runtime should do that.

Thanks

Michal
Posted by Austin Ziegler (Guest)
on 14.06.2006 14:23
(Received via mailing list)
On 6/14/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> Oh, it's a bit hard theme for me. I know Windows XP must support Unicode
> file names; I see my filenames in Russian, but I have low knowledge of
> system internals to say, are they really Unicode?

They are UTF-16 internally. I haven't been paying attention to Ruby
1.9 lately, but when I have time and have noticed that Matz has
checked in support for m17n strings, I will be enhancing support for
Windows files to use Unicode. Currently, Ruby is built using the
non-Unicode form *only*. And no, using -DUNICODE is the *wrong*
answer, thanks. We'd have to start using TCHAR instead of char, and it
would actually mean that we'd be using wchar_t instead of char in this
case.

I've already done a similar (but more complex) project at work.

-austin
Posted by Austin Ziegler (Guest)
on 14.06.2006 14:29
(Received via mailing list)
On 6/14/06, Vincent Isambart <vincent.isambart@gmail.com> wrote:
> Windows XP does support Unicode file names, but I'm not sure you can
> use them with Ruby (I do not use Ruby much under Windows). Try
> converting the file names to your current locale, it should work if
> the file names can be converted to it. What I mean is that Russian
> file names encoded in the Windows Russian encoding should work on a
> Russian PC.

You can't currently use them with Ruby. The file operations in Ruby
are using the likes of CreateFileA instead of CreateFileW (it's not
that explicit; Ruby is compiled without -DUNICODE -- which is the
correct thing to do in Ruby's case -- which means that CreateFile is
CreateFileA).

All files are stored on the filesystem as UTF-16, though, even if you
are using "ANSI" access.

By the way, there are multiple Russian encodings, so ... Unicode is
better for this point. As I said in my previous message, I have
already planned to enhance the Windows filesystem support when Matz
gets the m17n strings in so that I can *always* force the file
routines on Windows to provide either UTF-8 or UTF-16 (probably the
former, since it will also make it easier to work with existing
extensions) and indicate that the strings are such.

-austin
Posted by Austin Ziegler (Guest)
on 14.06.2006 14:29
(Received via mailing list)
On 6/14/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Windows 32 path encoding is a nightmare.  Our Win32 maintainers often
> troubled by unexpected OS behavior.  I am sure we _can_ handle Russian
> path names, but we need help from Russian people to improve.

It's not that bad, Matz. I started as a Unix developer, but in the
last two years I have learned *quite* a bit about how Windows handles
this stuff and we can adapt what I did for work with no problem.

I just need M17N strings to support this. I should look at what I
can/should do to provide this as an extension, I just have no time. :(

-austin
Posted by Austin Ziegler (Guest)
on 14.06.2006 14:36
(Received via mailing list)
On 6/14/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> What I want is all methods working seamlessly with unicode strings so
> that I do not have to think about the encoding.

That will *never* happen. Even with Unicode, you have to think about
the encoding, because UTF-32 (the closest representation to the
Platonic ideal "Unicode" you'll ever find) is unlikely to be supported
in the general case. Matz's idea of m17n strings is the right one: you
have a "byte stream" and an attribute which indicates how the byte
stream is encoded. This will sort of be like $KCODE but on an
individual string level so that you could meaningfully have Unicode
(probably UTF-8) and ShiftJIS strings in the same data and still
meaningfully call #length on them.

You will *always* have to care about the encoding. As well as,
ultimately, your locale.
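
A minimal sketch of the "byte stream plus encoding attribute" idea,
assuming Ruby 1.9 or later, where String#encoding, #encode and
#force_encoding exist. The sample text is hypothetical.

```ruby
# "日本語" written with \u escapes so the literal is UTF-8 regardless of
# the source file's encoding.
utf8 = "\u65E5\u672C\u8A9E"

# Same characters, re-encoded: different bytes, different encoding tag.
sjis = utf8.encode("Shift_JIS")

# Same bytes as sjis, but tagged as raw binary: each byte counts as one unit.
raw = sjis.dup.force_encoding("BINARY")

[utf8, sjis, raw].each do |s|
  puts "#{s.encoding}: length=#{s.length} bytesize=#{s.bytesize}"
end
```

Both encoded strings report the same character length even though their
byte sizes differ, while the untagged copy falls back to counting bytes.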

-austin
Posted by Randy Kramer (Guest)
on 14.06.2006 23:40
(Received via mailing list)
On Wednesday 14 June 2006 06:52 am, Michal Suchanek wrote:
> On 6/14/06, Paul Battley <pbattley@gmail.com> wrote:
> > On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> > > It should be also
> > > possible to convert the string (ie to UTF-32 to speed up character
> > > access).

(RE my previous post):  Oops, maybe UTF-32 is exactly what I was alluding
to?

Randy Kramer

(Should have waited a little longer before posting.)
Posted by Charles O Nutter (Guest)
on 15.06.2006 02:12
(Received via mailing list)
Every time these unicode discussions come up my head spins like a top. You
should see it.

We JRubyists have headaches from the unicode question too. Since JRuby is
currently 1.8-compatible, we do not have what most call *native* unicode
support. This is primarily because we do not wish to create an incompatible
version of Ruby or build in support for unicode now that would conflict with
Ruby 2.0 in the future. It is, however, embarrassing to say that although we
run on top of Java, which has arguably pretty good unicode support, we don't
support unicode. Perhaps you can see our conundrum.

I am no unicode expert. I know that Java uses UTF16 strings internally,
converted to/from the current platform's encoding of choice by default. It
also supports converting those UTF16 strings into just about every encoding
out there, just by telling it to do so. Java supports the Unicode
specification version 3.0. So Unicode is not a problem for Java.

We would love to be able to support unicode in JRuby, but there's always
that nagging question of what it should look like and what would mesh well
with the Ruby community at large. With the underlying platform already rich
with unicode support, it would not take much effort to modify JRuby. So then
there's a simple question:

What form would you, the Ruby users, want unicode to take? Is there a
specific library that you feel encompasses a reasonable implementation of
unicode support, e.g. icu4r? Should the support be transparent, e.g. no
longer treat or assume strings are byte vectors? JRuby, because we use
Java's String, is already using UTF16 strings exclusively...however there's
no way to get at them through core Ruby APIs. What would be the most
comfortable way to support unicode now, considering where Ruby may go in the
future?
Posted by Charles O Nutter (Guest)
on 15.06.2006 02:22
(Received via mailing list)
I posted this to ruby-talk, but it occurred to me that you folks
implementing Rails functionality probably have a thing or two to say
about unicode support in Ruby. Therefore, I would love to hear your
opinions. Adding native unicode support is only a matter of time in
JRuby; its usefulness as a JVM-based language depends on it. However, we
continue to wrestle with how best to support unicode without stepping on
the Ruby community's toes in the process. Thoughts?

---------- Forwarded message ----------
From: Charles O Nutter <headius@headius.com>
Date: Jun 14, 2006 7:11 PM
Subject: Re: Unicode roadmap?
To: ruby-talk ML <ruby-talk@ruby-lang.org>

Every time these unicode discussions come up my head spins like a top.
You should see it.

We JRubyists have headaches from the unicode question too. Since JRuby
is currently 1.8-compatible, we do not have what most call *native*
unicode support. This is primarily because we do not wish to create an
incompatible version of Ruby or build in support for unicode now that
would conflict with Ruby 2.0 in the future. It is, however,
embarrassing to say that although we run on top of Java, which has
arguably pretty good unicode support, we don't support unicode. Perhaps
you can see our conundrum.

I am no unicode expert. I know that Java uses UTF16 strings internally,
converted to/from the current platform's encoding of choice by default.
It also supports converting those UTF16 strings into just about every
encoding out there, just by telling it to do so. Java supports the
Unicode specification version 3.0. So Unicode is not a problem for Java.

We would love to be able to support unicode in JRuby, but there's always
that nagging question of what it should look like and what would mesh
well with the Ruby community at large. With the underlying platform
already rich with unicode support, it would not take much effort to
modify JRuby. So then there's a simple question:

What form would you, the Ruby users, want unicode to take? Is there a
specific library that you feel encompasses a reasonable implementation
of unicode support, e.g. icu4r? Should the support be transparent, e.g.
no longer treat or assume strings are byte vectors? JRuby, because we
use Java's String, is already using UTF16 strings exclusively...however
there's no way to get at them through core Ruby APIs. What would be the
most comfortable way to support unicode now, considering where Ruby may
go in the future?

--
Charles Oliver Nutter @ headius.blogspot.com
JRuby Developer @ jruby.sourceforge.net
Application Architect @ www.ventera.com
Posted by Julian 'Julik' Tarkhanov (Guest)
on 15.06.2006 02:40
(Received via mailing list)
On 15-jun-2006, at 2:11, Charles O Nutter wrote:

> with unicode support, it would not take much effort to modify  
> JRuby. So then
> there's a simple question:

Yukihiro Matsumoto wrote:

>
> Define "proper Unicode support" first.
>
> I'm planning enhancing Unicode support in 1.9 in a year or so
> (finally).  But I'm not sure that conforms your definition of "proper
> Unicode support".  Note that 1.8 handles Unicode (UTF-8) if your
> string operations are based on Regexp.
>

Hello everyone, and sorry for chiming so fiercely. Got into some
confusion with the ML controls.

Just joined the list seeing the subject popping up once more. I am
doing Unicode-aware apps in Rails and Ruby right now and it hurts.
I'll try to define  "proper Unicode support" as I (dream of it at
night) see it.

1. All string indexing (length, index, slice, insert) works with
characters instead of bytes, whatever length in bytes the characters
have to be.
String methods (index or =~) should _never_ return offsets that will
damage the string's characters if employed for slicing - you
shouldn't have to manually translate the byte offset of 2 to
character offset of 1 because the second character is multibyte.

Simple example:

     def translate_offset(str, byte_offset)
       chunk = str[0..byte_offset]
       begin
         chunk.unpack("U*").length - 1
       rescue ArgumentError # this offset is just wrong! shift upwards and retry
         chunk = str[0..(byte_offset+=1)]
         retry
       end
     end

I think it's unnecessarily painful for something as easy as string
=~ /pattern/. Yes, you can get that offset you receive from =~ and
then get the slice of the string and then split it again with /./mu
to get the same number etc...

2. Case-insensitive regexes actually work. Even in my Oniguruma-
enabled builds of 1.8.2. it was not true (maybe changed now). At
least "Unicode general" collation casefolding (such a thing exists)
available built-in on every platform.
4. Locale-aware sorting, including multibyte charsets, if provided by
the OS
5. Preferably separate (and strictly purposed) Bytestring that you
get out of Sockets and use in Servers etc. - or the ability to
"force" all strings received from external resources to be flagged
uniformly as being of a certain encoding in _your_ program, not
somewhere in someone's library. If flags have to be set by libraries,
they won't be set because most developers sadly don't care:

http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
http://thraxil.org/users/anders/posts/2005/11/01/unicodification/

6. Unicode-aware strip dealing with weirdo whitespaces (hair space,
thin space etc.)
7. And no, as I mentioned - it doesn't handle it properly because
the /i modifier is broken, and to deal without it you need to
downcase BOTH the regexp and the string itself. Closed circle - you
go and get the Unicode gem with tables.

All of this can be controlled either per String (then 99 out of 100
libraries I use will be getting it wrong - see above) or by a global
setting such as $KCODE.
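
As an illustration of point 1, here is a minimal sketch of the byte
offset versus character offset problem. It assumes Ruby 2.0 or later
(which postdates this thread), where String is character-indexed and
carries an encoding. The helper name char_offset is made up for this
example and is not part of any library.

```ruby
# Sketch only: assumes Ruby 2.0+ and a UTF-8 source file.
# char_offset is a hypothetical helper, not a core method.
def char_offset(str, byte_offset)
  # Take the first byte_offset bytes; the slice keeps the string's encoding.
  prefix = str.byteslice(0, byte_offset)
  # If the cut landed inside a multibyte character, the prefix is invalid.
  unless prefix && prefix.valid_encoding?
    raise ArgumentError, "byte offset #{byte_offset} splits a character"
  end
  prefix.length # length counts characters, not bytes
end

text     = "héllo wörld"
byte_pos = text.b.index("w".b) # offset in bytes (binary view of the string)
char_pos = text.index("w")     # offset in characters
```

Here byte_pos and char_pos differ because "é" occupies two bytes in
UTF-8, and char_offset(text, byte_pos) recovers the character offset
without the rescue/retry loop.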

As an example of something that is ridiculously backwards to do in
Ruby now is this (I spent some time refactoring this today):
http://dev.rubyonrails.org/browser/trunk/actionpack/lib/action_view/
helpers/text_helper.rb#L44

Here you have a major problem because the /i flag doesn't do anything
(Ruby is incapable of Unicode-aware casefolding), and using offsets
means that you are always one step from damaging someone's text. It's
just wrong that it has to be so painful.
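
For comparison, a small sketch of what Unicode-aware case handling looks
like in later Ruby versions. This assumes Ruby 2.4 or later, where
String#downcase and String#casecmp? apply Unicode case mapping and
folding, and where /i on a UTF-8 pattern folds non-ASCII letters. None
of this was available in the 1.8 series discussed here.

```ruby
# Sketch only: assumes Ruby 2.4+ and a UTF-8 source file.
haystack = "Ärger in MÜNCHEN"

# Case-insensitive match across non-ASCII letters; the result is a
# character offset, so it is safe to use for slicing.
match_pos = haystack =~ /münchen/i

# Comparison after Unicode case folding, without touching either string.
folded_equal = "äöü".casecmp?("ÄÖÜ")

# Unicode-aware lowercasing.
lowered = "MÜNCHEN".downcase
```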

Python3000, IMO, gets this right (as does Java) - byte array and a
String are completely separate, and String operates with characters
and characters only.

That's what I would expect. Hope this makes sense somewhat :-)
--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl
Posted by Manfred Stienstra (Guest)
on 15.06.2006 02:40
(Received via mailing list)
On Jun 15, 2006, at 2:19 AM, Charles O Nutter wrote:

> I posted this to ruby-talk, but it occurred to me that you folks  
> implementing Rails functionality probably have a thing or two to  
> say about unicode support in Ruby. Therefore, I would love to hear  
> your opinions. Adding native unicode support is only a matter of  
> time in JRuby; its usefulness as a JVM-based language depends on  
> it. However, we continue to wrestle with how best to support  
> unicode without stepping on the Ruby community's toes in the  
> process. Thoughts?

Julik has done a lot of pioneering in that direction for Rails. His
latest suggestion is to use a proxy class on string objects to
perform unicode operations:

@some_unicode_string.u.length
@some_unicode_string.u.reverse

I tend to agree with this solution as it doesn't break any previous
string operations and gives us an easy way to perform unicode aware
operations.
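
A rough sketch of how such a proxy could work, for readers who have not
seen the idea. This is not Julik's actual implementation; the class name
CodepointProxy and its methods are assumptions made for illustration.
It treats the receiver as raw UTF-8 bytes and decodes them to code
points with unpack("U*"), leaving the byte-oriented String methods
untouched.

```ruby
# Sketch only: CodepointProxy is a hypothetical name, not the real plugin.
class CodepointProxy
  def initialize(bytes)
    # "U*" decodes the bytes as UTF-8 into an array of code points.
    @codepoints = bytes.unpack("U*")
  end

  # Number of code points rather than bytes.
  def length
    @codepoints.length
  end

  # Reverse by code point, then re-encode as UTF-8.
  def reverse
    @codepoints.reverse.pack("U*")
  end

  # Slice by code point offsets; out-of-range slices give an empty string.
  def slice(start, len)
    (@codepoints[start, len] || []).pack("U*")
  end
end

class String
  # Accessor in the spirit of the proposal; existing methods are unchanged.
  def u
    CodepointProxy.new(self)
  end
end

raw = "héllo".b # a byte string, as Ruby 1.8 would have seen it
```

With this in place raw.length still counts bytes, while raw.u.length,
raw.u.reverse and raw.u.slice work on characters. Note that code points
are not the same as user-perceived characters once combining marks are
involved, so this is only an approximation.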

Manfred
Posted by Charles O Nutter (Guest)
on 15.06.2006 03:52
(Received via mailing list)
I agree it's a very attractive solution. I have two questions related
(perhaps you are out there to answer, Julik):

1. How does performance look with the unicode string add-on versus
native strings?
2. Is this the ideal way to support unicode strings in ruby?

And I explain the second as follows....if we could assume that switching
from treating a string as an array of bytes to a list of characters of
arbitrary width, and have all existing string operations work correctly
treating those characters as string, would that be a better ideal? Where
are the breaking points in such a design? What's to stop the underlying
implementation from actually using a UTF-16 character, passing UTF-8 to
libraries and IO streams but still allowing you to access everything as
UTF-16 or your encoding of choice? (Of course this is somewhat
rhetorical; we do this currently with JRuby since Java's strings are
UTF-16...we just don't have any way to provide access to UTF-16
characters, and we normalize everything to UTF-8 for Ruby's sake...but
what if we didn't normalize and adjusted string functions to
compensate?)
Posted by Julian 'Julik' Tarkhanov (Guest)
on 15.06.2006 04:17
(Received via mailing list)
On 15-jun-2006, at 3:50, Charles O Nutter wrote:

> operations work correctly treating those characters as string,  
> would that be a better ideal? Where are the breaking points in such  
> a design? What's to stop the underlying implementation from  
> actually using a UTF-16 character, passing UTF-8 to libraries and  
> IO streams but still allowing you to access everything as UTF-16 or  
> your encoding of choice? (Of course this is somewhat rhetorical; we  
> do this currently with JRuby since Java's strings are UTF-16...we  
> just don't have any way to provide access to UTF-16 characters, and  
> we normalize everything to UTF-8 for Ruby's sake...but what if we  
> didn't normalize and adjusted string functions to compensate?)

This is more appropriate for ruby-talk

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl
Posted by Charles O Nutter (Guest)
on 15.06.2006 04:24
(Received via mailing list)
I believe that Julik's way of solving the unicode problem (String#u
providing access to a unicode helper) is very attractive. I have two
questions related, for Julik and the rest of the peanut gallery:

1. How does performance look with the unicode string add-on versus
native strings (or as compared to icu4r, which is C-based)?
2. Is this the ideal way to support unicode strings in ruby?

And I explain the second as follows....if we could assume switching from
treating a string as an array of bytes to a list of characters of
arbitrary width, and have all existing string operations work correctly
treating those characters as indexed elements of that string, would that
be a better ideal? Where are the breaking points in such a design?
What's to stop the underlying implementation from actually using a
UTF-16 character, passing UTF-8 to libraries and IO streams but still
allowing you to access everything as UTF-16 or your encoding of choice?
Is it simply libraries or core APIs that explicitly need *byte* counts?
(Of course this is somewhat rhetorical; we do this currently with JRuby
since Java's strings are UTF-16...we just don't have any uniform way to
provide access to UTF-16 character strings, and we normalize everything
to UTF-8 for Ruby's sake...but what if we didn't normalize and adjusted
string functions to compensate?)
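
To make the byte-count question concrete, here is a small sketch using
the String#encode API from Ruby 1.9 and later (not available in 1.8 or
in JRuby at the time of this thread). It shows that the character count
is independent of the storage encoding while the byte count is not, and
that UTF-16 code units are not the same as characters outside the Basic
Multilingual Plane.

```ruby
# Sketch only: assumes Ruby 1.9+ and a UTF-8 source file.
utf8       = "héllo"
utf16      = utf8.encode("UTF-16LE") # same text, different storage
round_trip = utf16.encode("UTF-8")

char_counts = [utf8.length, utf16.length]     # characters in each
byte_counts = [utf8.bytesize, utf16.bytesize] # bytes in each

# A character outside the BMP needs a surrogate pair in UTF-16, so an
# API that counts 16-bit code units would report 2 for this one character.
emoji            = "\u{1F600}"
utf16_code_units = emoji.encode("UTF-16LE").bytesize / 2
```

Anything that explicitly needs byte counts (buffer sizes, Content-Length
headers, file offsets) has to ask for bytesize in the encoding actually
written out, not in the internal one.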
Posted by Charles O Nutter (Guest)
on 15.06.2006 04:28
(Received via mailing list)
Fair enough; redirected. If any other rails-core folks want to chime in,
please do so...I would expect unicode and multibyte are key issues for
worldwide rails deployments.
Posted by Austin Ziegler (Guest)
on 15.06.2006 04:41
(Received via mailing list)
On 6/14/06, Charles O Nutter <headius@headius.com> wrote:
> I believe that Julik's way of solving the unicode problem (String#u
> providing access to a unicode helper) is very attractive. I have two
> questions related, for Julik and the rest of the peanut gallery:

> 1. How does performance look with the unicode string add-on versus native
> strings (or as compared to icu4r, which is C-based)?
> 2. Is this the ideal way to support unicode strings in ruby?

No. In fact, I believe that Matz has the right idea for M17N strings
in Ruby 2.0. The *reality* is that there's a *lot* of data out there
that isn't Unicode.

I would suggest that JRuby could offer a JavaString that acts in every
way like a String except that it provides access to the native UTF-16
implementation.

-austin
Posted by Julian 'Julik' Tarkhanov (Guest)
on 15.06.2006 04:55
(Received via mailing list)
On 15-jun-2006, at 4:40, Austin Ziegler wrote:

> No. In fact, I believe that Matz has the right idea for M17N strings
> in Ruby 2.0. The *reality* is that there's a *lot* of data out there
> that isn't Unicode.

It's very difficult for me to understand the implementation. What if
we concat a Mojikyo string to a UTF8String? UnicodeDecodeError,
ordinal not in range?
I think Python folks proved that it's terrible (it is).
Nothing is ideal.

> I would suggest that JRuby could offer a JavaString that acts in every
> way like a String except that it provides access to the native UTF-16
> implementation.

Just what the ICU4R extension does. It's unusable to the point that
you cannot concat a native string with a UString.
To the point that you have to use special Regexp class for it. You
end up having half of your Ruby script doing typecasting from one to
the other.

There is a lot of data that isn't Unicode, indeed. Converted on input
and converted on output if necessary - just as in any
other case when the encoding of your system doesn't match your input
or output. I don't know if it can be possible to have the "internal"
encoding of a system
switchable (seems to me this is what Matz wants) - then you can't
safely refer to anything other than bytes. And then you get software
that you can't use, because they had a different assumption than you had
as to what encoding the user will be using.
Posted by PJ Hyett (Guest)
on 15.06.2006 05:01
(Received via mailing list)
On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:
> in Ruby 2.0. The *reality* is that there's a *lot* of data out there
> that isn't Unicode.

Yes, we all understand that Ruby 2.0 will be the coolest thing since
sliced bread, but those of us that are currently developing
international websites with Rails don't have the luxury of waiting
until Christmas of 2007.

-PJ Hyett
http://pjhyett.com
Posted by Austin Ziegler (Guest)
on 15.06.2006 05:10
(Received via mailing list)
On 6/14/06, PJ Hyett <pjhyett@gmail.com> wrote:
> > that isn't Unicode.
> Yes, we all understand that Ruby 2.0 will be the coolest thing since
> sliced bread, but those of us that are currently developing
> international websites with Rails don't have the luxury of waiting
> until Christmas of 2007.

*shrug*

As far as I can tell, there will be no implementation of Ruby before
then that has a "native" m17n string.

So whether you have the luxury of waiting or not, Ruby 1.8.x will not
*ever* have a "Unicode string".

Adding a "Unicode string" would *break* behaviour, and no example is
better than the extension that was proposed which would change the
meaning of #size and #length to mean two different things.

So, there's a point where patience is going to be necessary, whether
you "have the luxury" or not.

-austin
Posted by Dmitry Severin (Guest)
on 15.06.2006 10:47
(Received via mailing list)
IIRC, Matz has said that internally String won't change, and I suspect
that a CharString class (or something like it) won't ever be added.

Maybe just introducing a String#encoding flag and adding new methods to
String with prefixes, like char_array, char_slice, char_length,
char_index, char_downcase, char_strcoll, char_strip, etc. that will
internally look at the encoding flag and process the bytes in this
particular string accordingly, without conversion (or maybe with some
hidden one), and leaving the old byte-processing methods intact, would
be the way to keep older code working and enjoy M17N?

Though, as for me, it is still unclear what should happen if one tries
to perform an operation on two strings with different String#encoding...
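
For reference, this is how the question was eventually answered in Ruby
1.9 and later, which tag each String with an encoding. The sketch below
assumes such a Ruby; it is not something that could be run on 1.8.
Mixing two strings that both contain non-ASCII data in different
encodings raises an error, and the caller has to transcode explicitly.

```ruby
# Sketch only: assumes Ruby 1.9+ and a UTF-8 source file.
utf8   = "café"
latin1 = "café".encode("ISO-8859-1")

# Concatenating incompatible strings is refused rather than guessed at.
error = begin
  utf8 + latin1
  nil
rescue Encoding::CompatibilityError => e
  e
end

# Explicit transcoding makes the operation well defined.
joined = utf8 + latin1.encode("UTF-8")

# Strings containing only ASCII are compatible regardless of their tag.
ascii_mix = "abc" + "def".encode("ISO-8859-1")
```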
Posted by Michal Suchanek (Guest)
on 15.06.2006 13:02
(Received via mailing list)
On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:
> individual string level so that you could meaningfully have Unicode
> (probably UTF-8) and ShiftJIS strings in the same data and still
> meaningfully call #length on them.
>
> You will *always* have to care about the encoding. As well as,
> ultimately, your locale.

No. Since I have a locale, stdin can be marked with the proper encoding
information so that all strings originating there have the proper
encoding information.

The string methods should not just blindly operate on bytes but use
the encoding information to operate on characters rather than bytes.
Sure something like byte_length is needed when the string is stored
somewhere outside Ruby but standard string methods should work with
character offsets and characters, not byte offsets nor bytes.

Since my stdout can be also marked with correct encoding the strings
that are output there can be converted to that encoding. Even if it
originates from a source file that happens to be in a different
encoding.
Hmm, perhaps it will be necessary to mark source files with encoding
tags as well. It could be quite tedious to assign the tag manually to
every string in a source file.
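
A sketch of what encoding-marked IO looks like in Ruby 1.9 and later,
where an IO object can carry an external encoding (what is on disk) and
an internal encoding (what the program sees). A temporary file stands in
for stdin here, and the file contents are an assumption chosen for the
example.

```ruby
# Sketch only: assumes Ruby 2.1+ (for Tempfile.create with a block).
require "tempfile"

text = nil
Tempfile.create("latin1-sample") do |file|
  file.binmode
  file.write("caf\xE9".b) # "café" as ISO-8859-1 bytes
  file.flush

  # "r:EXTERNAL:INTERNAL" marks the stream and transcodes on read.
  text = File.open(file.path, "r:ISO-8859-1:UTF-8") { |io| io.read }
end
```

After the read, text is a UTF-8 string of four characters even though
the file held four ISO-8859-1 bytes; the program never had to inspect
the bytes itself.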

When strings are compared, concatenated, .. the encoding is known so
the methods should do the right thing.

I do not have to care about encoding. You may make a string
implementation that forces me to care (such as the current one). But I
do not have to. I can always turn to perl if I get really desperate.

Thanks

Michal
Posted by Michal Suchanek (Guest)
on 15.06.2006 13:22
(Received via mailing list)
On 6/15/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
>

> 5. Preferably separate (and strictly purposed) Bytestring that you
> get out of Sockets and use in Servers etc. - or the ability to
> "force" all strings received from external resources to be flagged
> uniformly as being of a certain encoding in _your_ program, not
> somewhere in someone's library. If flags have to be set by libraries,
> they won't be set because most developers sadly don't care:
>
> http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
> http://thraxil.org/users/anders/posts/2005/11/01/unicodification/

Where else should the strings be flagged? If you get a web page
through http request, and the library parses the response for you, it
should set encoding on the web page. You would never know since you
only received the page, not the header.

> setting such as $KCODE.
I do not see why libraries should be always wrong. After all, you can
always fix them. And setting the encoding globally is a bad thing. You
cannot have strings encoded in different encodings in one process
then. It looks quite limiting. For one, the web pages that you get
from various servers (and even the same server) can be in various
encodings.

Thanks

Michal
Posted by Julian 'Julik' Tarkhanov (Guest)
on 15.06.2006 13:51
(Received via mailing list)
On 15-jun-2006, at 13:21, Michal Suchanek wrote:

>> http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
>> http://thraxil.org/users/anders/posts/2005/11/01/unicodification/
>
> Where else should the strings be flagged?
They should not be flagged, because some strings will be flagged and
some won't and exactly
in the wrong places at the wrong time. See _is_uf_8_ in Perl to
witness the terrible ugliness of this.

> If you get a web page
> through http request, and the library parses the response for you, it
> should set encoding on the web page. You would never know since you
> only received the page, not the header.

That's why you should distinguish between a ByteArray and a String.

>> libraries I use will be getting it wrong - see above) or by a global
>> setting such as $KCODE.
>
> I do not see why libraries should be always wrong. After all, you can
> always fix them. And setting the encoding globally is a bad thing. You
> cannot have strings encoded in different encodings in one process
> then. It looks quite limiting. For one, the web pages that you get
> from various servers (and even the same server) can be in various
> encodings.

Of course they can (and will). When I have to approach this I usually
just sniff the encoding of the strings I received and then feed them
to iconv and friends before doing any processing. A library that
downloads stuff off the Internet should be (IMO) aware of
the charset madness and decode the strings for me.

Trust me, when multibyte/Unicode handling is optional, 80% of
libraries do it wrong. Re-read the links above if you don't believe.

Actually it seems that the solution with an accessor is quite nice,
but that I had to figure out the hard way after breaking the String
class
with my hacks and seeing stuff collapse. Apparently the poster of a
parallel thread finds it inspiring to repeat my experiment _in vitro_
just for
the academic sake of it.
Posted by Michal Suchanek (Guest)
on 15.06.2006 15:13
(Received via mailing list)
On 6/15/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> >> somewhere in someone's library. If flags have to be set by libraries,
> >> they won't be set because most developers sadly don't care:
> >>
> >> http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
> >> http://thraxil.org/users/anders/posts/2005/11/01/unicodification/
> >
> > Where else should the strings be flagged?
> They should not be flagged, because some strings will be flagged and
> some won't and exactly
> in the wrong places at the wrong time. See _is_uf_8_ in Perl to
> witness the terrible ugliness of this.

You can certainly get the things wrong. But if you get a string that
is wrongly flagged you have the choice to fix the code where the
string originates or work around it by flagging it right.
If you have code that gets the encoding wrong, and it tries to
convert the string to some 'universal' encoding you want to use
everywhere in your application, you get a broken string.

>
> > If you get a web page
> > through http request, and the library parses the response for you, it
> > should set encoding on the web page. You would never know since you
> > only received the page, not the header.
>
> That's why you should distinguish between a ByteArray and a String.

How does it help you here?

> >> All of this can be controlled either per String (then 99 out of 100
> Of course they can (and will). When I have to approach this I usually
> just sniff the encoding of the strings I received and then feed them
> to iconv and friends before doing any processing. A library that
> downloads stuff off the Internet should be (IMO) aware of
> the charset madness and decode the strings for me.

If it can decode them, it can flag them. It has to be aware - that's it.

>
> Trust me, when multibyte/Unicode handling is optional, 80% of
> libraries do it wrong. Re-read the links above if you don't believe.

But they get the very foundation wrong. In Python functions that take
multiple strings can only take them in one encoding. It is impossible
to concatenate differently encoded strings. Of course, this is bound
to fail.
In the other case they use a database with poor support for unicode,
and mysql that does exactly the same thing ruby does right now - works
with strings as arrays of bytes. Of course, this is going to break.

Neither is the case when the strings carry information about their
encoding, and the string functions can handle strings encoded
differently.

The fact that there are libraries and languages with poor unicode
support does not mean it must be always poor.

Thanks

Michal
Posted by Juergen Strobel (Guest)
on 17.06.2006 13:11
(Received via mailing list)
On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal Suchanek wrote:
> >stream is encoded. This will sort of be like $KCODE but on an
> 
> The string methods should not just blindly operate on bytes but use
> the encoding information to operate on characters rather than bytes.
> Sure something like byte_length is needed when the string is stored
> somewhere outside Ruby but standard string methods should work with
> character offsets and characters, not byte offsets nor bytes.

I emphatically agree. I'll even repeat and propose a new Plan for
Unicode Strings in Ruby 2.0 in 10 points:

1. Strings should deal in characters (code points in Unicode) and not
in bytes, and the public interface should reflect this.

2. Strings should neither have an internal encoding tag, nor an
external one via $KCODE. The internal encoding should be encapsulated
by the string class completely, except for a few related classes which
may opt to work with the gory details for performance reasons.
The internal encoding has to be decided, probably between UTF-8,
UTF-16, and UTF-32 by the String class implementor.

3. Whenever Strings are read or written to/from an external source,
their data needs to be converted. The String class encapsulates the
encoding framework, likely with additional helper Modules or Classes
per external encoding. Some methods take an optional encoding
parameter, like #char(index, encoding=:utf8), or
#to_ary(encoding=:utf8), which can be used as helper Class or Module
selector.

4. IO instances are associated with a (modifiable) encoding. For
stdin, stdout this can be derived from the locale settings. String-IO
operations work as expected.

5. Since the String class is quite smart already, it can implement
generally useful and hard (in the domain of Unicode) operations like
case folding, sorting, comparing etc.

6. More exotic operations can easily be provided by additional
libraries because of Ruby's open classes. Those operations may be
coded depending on String's public interface for simplicity, or
work with the internal representation directly for performance.

7. This approach leaves open the possibility of String subclasses
implementing different internal encodings for performance/space
tradeoff reasons which work transparently together (a bit like Fixnum
and Bignum).

8. Because Strings are tightly integrated into the language with the
source reader and are used pervasively, much of this cannot be
provided by add-on libraries, even with open classes. Therefore the
need to have it in Ruby's canonical String class. This will break some
old uses of String, but now is the right time for that.

9. The String class does not worry over character representation
on-screen, the mapping to glyphs must be done by UI frameworks or the
terminal attached to stdout.

10. Be flexible. <placeholder for future idea>


This approach has several advantages and a few disadvantages, and I'll
try to bring in some new angles to this now too:


*Advantages*

-POL, Encapsulation-

All Strings behave exactly the same everywhere, are predictable,
and do the hard work for their users.

-Cross Library Transparency-

No String user needs to worry which Strings to pass to a library, or
worry which Strings he will get from a library. With Web-facing
libraries like rails returning encoding-tagged Strings, you would be
likely to get Strings of all possible encodings otherwise, and is the
String user prepared to deal with this properly?  This is a *big* deal
IMNSHO.

-Limited Conversions-

Encoding conversions are limited to the time Strings are created or
written or explicitly transformed to an external representation.

-Correct String Operations-

Even basic String operations are very hard in the world of Unicode. If
we leave the String users to look at the encoding tags and sort it out
themselves, they are bound to make mistakes because they don't care,
don't know, or have no time. And these mistakes may be _security_
_sensitive_, since most often credentials are represented as Strings
too. There already have been exploits related to Unicode.
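
One concrete instance of why even equality is hard: the same visible
text can be stored as different code point sequences. The sketch below
assumes Ruby 2.2 or later, which provides String#unicode_normalize, and
uses a made-up value in place of a real credential.

```ruby
# Sketch only: assumes Ruby 2.2+; the strings are illustrative values.
composed   = "caf\u00E9"  # "é" as a single precomposed code point
decomposed = "cafe\u0301" # "e" followed by a combining acute accent

# Both render the same on screen, yet a naive comparison says they differ.
naive_equal = composed == decomposed

# Normalizing both sides to the same form (NFC here) before comparing
# gives the answer a user would expect.
normalized_equal =
  composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)
```

If a login name or password is compared without this step, two inputs
that look identical to the user can be treated as different, or the
reverse if only one side of the system normalizes.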


*Disadvantages* (with mitigating reasoning of course)

- String users need to learn that #byte_length(encoding=:utf8) >=
#size, but that's not too hard, and applies everywhere. Users do not
need to learn about an encoding tag, which is surely worse to handle
for them.

- Strings cannot be used as simple byte buffers any more. Either use
an array of bytes, or an optimized ByteBuffer class. If you need
regular expression support, Regexp can be extended for ByteBuffers or
even more.

- Some String operations may perform worse than might be expected from
a naive user, in both the time or space domain. But we do this so the
String user doesn't need to himself, and are probably better at it
than the user too.

- For very simple uses of String, there might be unnecessary
conversions. If a String is just to be passed through somewhere,
without inspecting or modifying it at all, in- and outwards conversion
will still take place. You could and should use a ByteBuffer to avoid
this.

- This ties Ruby's String to Unicode. A safe choice IMHO, or would we
really consider something else? Note that we don't commit to a
particular encoding of Unicode strongly.

- More work and time to implement. Some could call it
over-engineered. But it will save a lot of time and troubles when shit
hits the fan and users really do get unexpected foreign characters in
their Strings. I could offer help implementing it, although I have
never looked at ruby's source, C-extensions, or even done a lot of
ruby programming yet.


Close to the start of this discussion Matz asked what the problem with
current strings really was for western users. Somewhere later he
concluded case folding. I think it is more than that: we are lazy and
expect character handling to be always as easy as with 7 bit ASCII, or
as close as possible. Fixed 8-bit codepages worked quite fine most of
the time in this regard, and breakage was limited to special
characters only.

Now let's ask the question in reverse: are eastern programmers so used
to doing elaborate byte-stream to character handling by hand they
don't recognize how hard this is any more? Surely it is a target for
DRY if I ever saw one. Or are there actual problems not solvable this
way? I looked up the mentioned Han-Unification issue, and as far as I
understood this could be handled by future Unicode revisions
allocating more characters, outside of Ruby, but I don't see how it
requires our Strings to stay dumb byte buffers.

Jürgen
Posted by Stefan Lang (Guest)
on 17.06.2006 15:51
(Received via mailing list)
On Saturday 17 June 2006 13:08, Juergen Strobel wrote:
> On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal Suchanek wrote:
[...]
> > The string methods should not just blindly operate on bytes but
> > use the encoding information to operate on characters rather than
> > bytes. Sure something like byte_length is needed when the string
> > is stored somewhere outside Ruby but standard string methods
> > should work with character offsets and characters, not byte
> > offsets nor bytes.
>
> I emphatically agree. I'll even repeat and propose a new Plan for
> Unicode Strings in Ruby 2.0 in 10 points:

Juergen, I agree with most of what you have written. I will
add my thoughts.

> 1. Strings should deal in characters (code points in Unicode) and
> not in bytes, and the public interface should reflect this.
>
> 2. Strings should neither have an internal encoding tag, nor an
> external one via $KCODE. The internal encoding should be
> encapsulated by the string class completely, except for a few
> related classes which may opt to work with the gory details for
> performance reasons. The internal encoding has to be decided,
> probably between UTF-8, UTF-16, and UTF-32 by the String class
> implementor.

Full ACK. Ruby programs shouldn't need to care about the
*internal* string encoding. External string data is treated as
a sequence of bytes and is converted to Ruby strings through
an encoding API.

> 3. Whenever Strings are read or written to/from an external source,
> their data needs to be converted. The String class encapsulates the
> encoding framework, likely with additional helper Modules or
> Classes per external encoding. Some methods take an optional
> encoding parameter, like #char(index, encoding=:utf8), or
> #to_ary(encoding=:utf8), which can be used as helper Class or
> Module selector.

I think the encoding/decoding API should be separated from the
String class. IMO, the most important change is to strictly
differentiate between arbitrary binary data and character
data. Character data is represented by an instance of the
String class.

I propose adding a new core class, maybe call it ByteString
(or ByteBuffer, or Buffer, whatever) to handle strings of
bytes.

Given a specific encoding, the encoding API converts
ByteStrings to Strings and vice versa.

This could look like:

    my_character_str = Encoding::UTF8.encode(my_byte_buffer)
    buffer = Encoding::UTF8.decode(my_character_str)
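
A runnable approximation of that separation, using what Ruby 1.9 and
later actually provide rather than the proposed Encoding::UTF8 object.
The module name ByteCodec is hypothetical. Note that this sketch follows
the more common naming, where decoding turns bytes into characters and
encoding turns characters into bytes, which is the reverse of the method
names in the snippet above. Binary-tagged strings stand in for the
proposed ByteString class.

```ruby
# Sketch only: ByteCodec is a hypothetical helper; assumes Ruby 2.0+.
module ByteCodec
  # Bytes in a known encoding -> character string (UTF-8 internally).
  def self.decode(bytes, encoding)
    text = bytes.dup.force_encoding(encoding)
    raise ArgumentError, "invalid #{encoding} data" unless text.valid_encoding?
    text.encode("UTF-8")
  end

  # Character string -> raw bytes in the requested encoding.
  def self.encode(text, encoding)
    text.encode(encoding).b
  end
end

wire = "caf\xC3\xA9".b                     # bytes as read from a socket
text = ByteCodec.decode(wire, "UTF-8")     # character data
back = ByteCodec.encode(text, "ISO-8859-1") # bytes for a Latin-1 consumer
```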

> 4. IO instances are associated with a (modifyable) encoding. For
> stdin