Unicode roadmap?

rhaus · June 13, 2006, 11:12pm

In my opinion, Ruby is practically useless for many applications without
proper Unicode support. How a modern language can ignore this issue is
really beyond me.

Is there a plan to get Unicode support into the language anytime soon?

rhaus · June 14, 2006, 12:28am

Hi,

In message “Re: Unicode roadmap?”
on Wed, 14 Jun 2006 06:13:03 +0900, Roman H.
[email protected] writes:
|In my opinion, Ruby is practically useless for many applications without
|proper Unicode support. How a modern language can ignore this issue is
|really beyond me.

Define “proper Unicode support” first.

|Is there a plan to get Unicode support into the language anytime soon?

I’m planning to enhance Unicode support in 1.9 in a year or so
(finally). But I’m not sure that conforms to your definition of “proper
Unicode support”. Note that 1.8 handles Unicode (UTF-8) if your
string operations are based on Regexp.

						matz.
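
As a rough illustration of the Regexp-based approach mentioned above, here is a minimal sketch, assuming UTF-8 input. The /u flag makes the pattern read the string as UTF-8, so `.` matches one character and not one byte. The helper names and the sample word are made up for illustration. The counts in the comments hold for Ruby 1.9 and later; in 1.8, String#length on the same word would report the byte count.

```ruby
# encoding: utf-8
# Sketch only: helper names are hypothetical, input is assumed to be UTF-8.

# Number of bytes in the string (what String#length reports in Ruby 1.8).
def byte_count(str)
  str.unpack("C*").size
end

# Number of characters, counted by matching one UTF-8 character at a time.
def char_count(str)
  str.scan(/./mu).size
end

# First n characters, taken with a regexp instead of byte offsets.
def first_chars(str, n)
  str[/\A.{0,#{n}}/mu]
end

word = "Grüße"
p byte_count(word)      # 7
p char_count(word)      # 5
p first_chars(word, 3)
```

The point of the sketch is only that character-level work goes through a pattern, so byte offsets never appear in the calling code.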

rhaus · June 14, 2006, 12:51am

On Jun 13, 2006, at 6:34 PM, Pete wrote:

Define “proper Unicode support” first.

having a Unicode equivalent for all methods of class String

like size, slice, upcase

E.g. I tried the unicode plugin… but, alas, who wants to write
stuff like ‘normalize_KC’ etc. if you just want the frickin’
substring of a string?!

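def substring(str, start, len)
  # Skip `start` characters, then capture the next `len` characters.
  md = str.match(/\A.{#{start}}(.{#{len}})/m)
  md && md[1]   # nil when the string is too short
end

def strlength(str)
  # Count characters by visiting each regexp match of a single character.
  n = 0
  str.gsub(/./m) { n += 1; $& }
  n
end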

See! Regexps do everything!

Just so you know, set $KCODE and use these methods and you are set!

(I am kidding… btw)

rhaus · June 14, 2006, 12:38am

Define “proper Unicode support” first.

having a Unicode equivalent for all methods of class String

like size, slice, upcase

E.g. I tried the unicode plugin… but, alas, who wants to write stuff
like ‘normalize_KC’ etc. if you just want the frickin’ substring of a
string?!

you need to read books on unicode just to properly use the plugin…

aargg :-((

Best regards
Peter

Yukihiro M. wrote:

rhaus · June 14, 2006, 1:00am

From the theoretical point of view this is quite interesting. Also I
understand the humor

Performance and memory consumption should be breathtaking using regexp
just everywhere…

Also there are a few methods left

As I am German the ‘missing’ unicode support is one of the greatest
obstacles for me (and probably all other Germans doing their stuff
seriously)…

Logan C. wrote:

rhaus · June 14, 2006, 1:13am

From: Pete [mailto:[email protected]]
Sent: Wednesday, June 14, 2006 1:58 AM

As I am German the ‘missing’ unicode support is one of the greatest
obstacles for me (and probably all other Germans doing their stuff
seriously)…

The same goes for Russians/Ukrainians. In our programming communities,
the question “does the programming language support Unicode as
‘native’?” has very high priority.

BTW, this is one of the areas where Python beats Ruby completely.

V.

rhaus · June 14, 2006, 1:59am

I suspect the Japanese posters on this list can answer better than I
can, but my impression is that Unicode is, shall we say, not highly
thought of outside Europe and North America. The way they dealt with
“Chinese” characters was apparently more than a bit of a hack, and just
doesn’t work very well in the real world. Reading some of the
explanations for glyphs versus characters in Unicode just makes you
shake your head. What were they thinking? Sure doesn’t pass the smell
test, although I’ll be the first to admit I haven’t exactly thought
deeply about the subject.

There’s another problem with Japanese - I’ve got a friend who’s been
dealing with some issues around the fact that Japanese apparently
innovates new characters on a regular basis, and everyone is expected to
use the new characters. (I believe this is called gaiji). The concept of
a fixed character set apparently just isn’t a good idea to start with.

[Awaiting corrections from people who actually know something about this
topic :-)…]

James M.

rhaus · June 14, 2006, 3:16am

On Jun 13, 2006, at 7:56 PM, James M. wrote:

I have one Japanese person here who’s never heard of this gaiji
concept. But it could be new and behind a generation gap of some
kind. They do sure like to add symbols where they can, though.
Especially graphical star characters. I see that a lot.
-Mat

rhaus · June 14, 2006, 2:14am

On 6/14/06, James M. [email protected] wrote:

with some issues around the fact that Japanese apparently innovates new
characters on a regular basis, and everyone is expected to use the new
characters. (I believe this is called gaiji). The concept of a fixed
character set apparently just isn’t a good idea to start with.

[Awaiting corrections from people who actually know something about this
topic :-)…]

There is a good summary of the Han unification controversy on Wikipedia:

http://en.wikipedia.org/wiki/Han_unification

rhaus · June 14, 2006, 4:38am

Hi,

In message “Re: Unicode roadmap?”
on Wed, 14 Jun 2006 08:11:49 +0900, “Victor S.”
[email protected] writes:

|From: Pete [mailto:[email protected]]
|Sent: Wednesday, June 14, 2006 1:58 AM
|> As I am German the ‘missing’ unicode support is one of the greatest
|> obstacles for me (and probably all other Germans doing their stuff
|> seriously)…
|
|The same is for Russians/Ukrainians. In our programming communities question
|“does the programming language supports Unicode as ‘native’?” has very high
|priority.

Alright, then what specific features are you (both) missing? I don’t
think it is a method to get the number of characters in a string. It
can’t be THAT crucial. I do want to cover “your missing features” in
the future M17N support in Ruby.

						matz.

rhaus · June 14, 2006, 7:29am

From: Yukihiro M. [mailto:[email protected]]
Sent: Wednesday, June 14, 2006 5:37 AM

I suppose all we (non-English writers) need is to have all
string-related methods working. For now, I am thinking of simply testing
each string method; also, some other classes may be affected by Unicode
(possibly regexps and paths). Regexps seem to work fine (in my 1.9), but
paths do not: File.open with Russian letters in the path doesn’t find
the file.

More generally, it could make sense to have Unicode as the “base” mode,
with non-Unicode remaining as the “old, compatibility” mode.

Something like this.

V.

rhaus · June 14, 2006, 8:37am

Hi,

In message “Re: Unicode roadmap?”
on Wed, 14 Jun 2006 14:26:30 +0900, “Victor S.”
[email protected] writes:

|I suppose, all we (non-English-writers) need is to have all string-related
|methods working. Just for now, I think about plain testing each string
|method;

In that sense, I am one of the non-English-writers, so that I can
suppose I know what we need. And I have no problem with the current
UTF-8 support. Maybe that’s because Japanese don’t have cases in our
characters. Or maybe I’m missing something. Can you show us your
concrete problems caused by Ruby’s lack of “proper” Unicode support?

|also, some other classes can be affected by Unicode (possibly
|regexps, and pathes). Regexps seems to work fine (in my 1.9), but pathes are
|not: File.open with Russian letters in path don’t finds the file.

Strange. Ruby does not convert encoding, so that there should be no
problem opening files, if you are using strings in the encoding your OS
expects. If they differ, you have to specify (and convert) them
properly, no matter how Unicode support is.

						matz.
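
As a sketch of the conversion step described above: a path written in UTF-8 has to be re-encoded into whatever the OS file API expects before it reaches File.open. The example uses String#encode, which exists in Ruby 1.9 and later (in 1.8 the Iconv library played this role). The helper name, the file name, and the choice of Windows-1251 as the target encoding are assumptions for illustration only.

```ruby
# encoding: utf-8
# Assumes Ruby 1.9+; helper name, file name and target encoding are hypothetical.

# Re-encode a UTF-8 path into the byte encoding the OS is assumed to use.
def path_for_os(utf8_path, os_encoding = "Windows-1251")
  utf8_path.encode(os_encoding)
end

utf8_path = "привет.txt"
os_path   = path_for_os(utf8_path)

p utf8_path.bytesize   # 16: two bytes per Cyrillic letter in UTF-8
p os_path.bytesize     # 10: one byte per letter in Windows-1251
```

The two strings name the same file but are different byte sequences, which is why an unconverted path can fail to match anything on disk.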

rhaus · June 14, 2006, 7:54am

Roman H. wrote:

In my opinion, Ruby is practically useless for many applications without
proper Unicode support. How a modern language can ignore this issue is
really beyond me.

Is there a plan to get Unicode support into the language anytime soon?

I also think that this is very important.

rhaus · June 14, 2006, 9:09am

On Jun 14, 2006, at 15:56 , Victor S. wrote:

As mentioned in this topic, it’s String#length, upcase, downcase,
capitalize.

Just to chime in, aren’t upcase, downcase, and capitalize a locale/
localization issue rather than a Unicode-only issue per se? For
example, different languages will have different rules for
capitalization. Or am I wrong? Does Unicode in and of itself address
these issues?

Granted, proper support for upcase, downcase, and capitalize is
important, but I think it’s a separate issue, part of m17n as a whole
rather than support for Unicode in particular.

Michael G.
grzm seespotcode net
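
To make the locale point concrete, here is a small sketch of the same letter changing case differently under different language rules. It relies on the Unicode-aware case mapping and the `:turkic` option that String#upcase and String#downcase gained in Ruby 2.4, long after this thread, so it illustrates the issue and is not something available in 1.8. The sample words and variable names are made up.

```ruby
# encoding: utf-8
# Assumes Ruby 2.4 or later, where case mapping is Unicode-aware.

# Default (language-neutral) mapping: German sharp s expands to two letters.
german = "straße".upcase            # "STRASSE"

# Turkic rules: i/I pair with dotted İ and dotless ı instead of I/i.
default_i = "i".upcase              # "I"
turkic_i  = "i".upcase(:turkic)     # "İ" (U+0130)
turkic_l  = "I".downcase(:turkic)   # "ı" (U+0131)

p [german, default_i, turkic_i, turkic_l]
```

Unicode supplies the mapping tables, but the caller still has to say which language's rules apply, which is the separation between Unicode support and m17n raised in the post.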

rhaus · June 14, 2006, 8:56am

From: Yukihiro M. [mailto:[email protected]]
Sent: Wednesday, June 14, 2006 9:35 AM

In that sense, I am one of the non-English-writers,

Sorry, Matz, I know, of course. But I know too little about Japanese to
see how close our tasks are. Instead of “non-English-writers” I should
perhaps have said “European languages” or so, which share common
punctuation, LTR writing, “words”, “whitespace” and so on. I have almost
no knowledge of what Japanese, Korean, Arabic, or Hebrew speakers need.

so that I can
suppose I know what we need. And I have no problem with the current
UTF-8 support. Maybe that’s because Japanese don’t have cases in our
characters. Or maybe I’m missing something.

Just what I’ve said above.

Can you show us your
concrete problems caused by Ruby’s lack of “proper” Unicode support?

As mentioned in this topic, it’s String#length, upcase, downcase,
capitalize.

BTW, does String#length work well for you?

Moreover, there seem to be some huge problems with paths containing
Russian letters; but I’m really not convinced that Ruby has to handle
this.

|also, some other classes can be affected by Unicode (possibly
|regexps, and pathes). Regexps seems to work fine (in my 1.9), but pathes are
|not: File.open with Russian letters in path don’t finds the file.

Strange. Ruby does not convert encoding, so that there should be no
problem opening files, if you are using strings in the encoding your OS
expects. If they differ, you have to specify (and convert) them
properly, no matter how Unicode support is.

Oh, that’s a rather difficult topic for me. I know Windows XP must
support Unicode file names; I see my file names in Russian, but I know
too little about system internals to say whether they are really Unicode.

Setting those problems aside, only the String problems remain, but
these are such basic core methods!

V.

rhaus · June 14, 2006, 9:15am

Hi,

As mentioned in this topic, it’s String#length, upcase, downcase,
capitalize.

BTW, does String#length work well for you?

To have the length of a Unicode string, just do str.split(//).length,
or “require ‘jcode’” at the beginning of your code.
For the other functions, try looking at the unicode library
http://www.yoshidam.net/Ruby.html#unicode
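
A quick sketch of the first suggestion, with a made-up helper name and sample string. In Ruby 1.8 the empty regexp splits per character only when $KCODE is set to "u" (or the script runs with -Ku). The `jcode` library was removed in 1.9, where split(//) is encoding-aware without any setup.

```ruby
# encoding: utf-8
# Sketch only; the helper name and sample string are hypothetical.
$KCODE = "u" if RUBY_VERSION < "1.9"   # 1.8 needs this to split per character

# Character count via the split(//) idiom.
def char_length(str)
  str.split(//).length
end

p char_length("Привет")   # 6 characters (12 bytes in UTF-8)
```

This builds a temporary array with one element per character, so it is convenient for short strings but wasteful for large ones.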

Oh, that’s a rather difficult topic for me. I know Windows XP must
support Unicode file names; I see my file names in Russian, but I know
too little about system internals to say whether they are really Unicode.

Windows XP does support Unicode file names, but I’m not sure you can
use them with Ruby (I do not use Ruby much under Windows). Try
converting the file names to your current locale, it should work if
the file names can be converted to it. What I mean is that Russian
file names encoded in the Windows Russian encoding should work on a
Russian PC.

Hope this helps,

Cheers,
Vincent ISAMBART

rhaus · June 14, 2006, 9:22am

Hi,

In message “Re: Unicode roadmap?”
on Wed, 14 Jun 2006 15:56:02 +0900, “Victor S.”
[email protected] writes:

|> Can you show us your
|> concrete problems caused by Ruby’s lack of “proper” Unicode support?
|
|As mentioned in this topic, it’s String#length, upcase, downcase,
|capitalize.

OK. Case is the problem. I understand.

|BTW, does String#length work well for you?

I don’t remember the last time I needed length method to count
character numbers. Actually I don’t count string length at all both
in bytes and characters in my string processing. Maybe this is a
special case. I am too optimized for Ruby string operations using
Regexp.

|Oh, it’s a bit hard theme for me. I know Windows XP must support Unicode
|file names; I see my filenames in Russian, but I have low knowledge of
|system internals to say, are they really Unicode?

Windows 32 path encoding is a nightmare. Our Win32 maintainers are often
troubled by unexpected OS behavior. I am sure we can handle Russian
path names, but we need help from Russian people to improve.

						matz.

rhaus · June 14, 2006, 9:25am

From: Michael G. [mailto:[email protected]]
Sent: Wednesday, June 14, 2006 10:08 AM

On Jun 14, 2006, at 15:56 , Victor S. wrote:

As mentioned in this topic, it’s String#length, upcase, downcase,
capitalize.

Just to chime in, aren’t upcase, downcase, and capitalize a locale/
localization issue rather than a Unicode-only issue per se? For
example, different languages will have different rules for
capitalization.

Really? I know about two cases: European capitalization and no
capitalization.

But, really, you may be right. I suppose Florian G. can say something
about German-specific capitalization issues.

Granted, proper support for upcase, downcase, and capitalize is
important, but I think it’s a separate issue, part of m17n as a whole
rather than support for Unicode in particular.

Maybe. Generally, sometimes I want Unicode, and sometimes (for “quick
and dirty” scripts) I’d prefer that capitalization and regexps “just
work” with Windows-1251 (a one-byte Russian encoding).

V.

rhaus · June 14, 2006, 9:26am

From: Vincent I. [mailto:[email protected]]
Sent: Wednesday, June 14, 2006 10:14 AM

As mentioned in this topic, it’s String#length, upcase, downcase,
capitalize.

BTW, does String#length work well for you?

To have the length of a Unicode string, just do str.split(//).length,
or “require ‘jcode’” at the beginning of your code.
For the other functions, try looking at the unicode library
http://www.yoshidam.net/Ruby.html#unicode

I know about it. But, theoretically speaking, such “core” methods must
be in core. No?

Yes, they work. But I can’t settle the question: should Ruby’s Unicode
support include file name operations?

V.

rhaus · June 14, 2006, 9:45am

Yukihiro M. wrote:

Hi,

In message “Re: Unicode roadmap?”
on Wed, 14 Jun 2006 06:13:03 +0900, Roman H. [email protected] writes:
|In my opinion, Ruby is practically useless for many applications without
|proper Unicode support. How a modern language can ignore this issue is
|really beyond me.

Define “proper Unicode support” first.

I won’t define “proper Unicode support” here.

But there must be a problem somewhere, since pure-Ruby Ferret doesn’t
support UTF-8. You need to use the C extension of Ferret to have it
support UTF-8 (which doesn’t work on Windows yet). I don’t know if
that is just a sucky impl of Ferret or if it’s Ruby that makes it so.

Maybe Dave Balmain can enlighten us why UTF-8 doesn’t work in the pure
Ruby version and what is needed of Ruby to make it work (if it’s
actually Ruby’s fault that is)?

My personal belief is that it should just work in a case like this if
the input data is UTF-8 and the search strings are UTF-8, without the
lib author and/or user having to do anything very special to make it
work (apart from specifying the encoding). Am I wrong in this?

Regards,

Marcus