Unicode roadmap?


#1

In my opinion, Ruby is practically useless for many applications without
proper Unicode support. How a modern language can ignore this issue is
really beyond me.

Is there a plan to get Unicode support into the language anytime soon?


#2

Hi,

In message “Re: Unicode roadmap?”
on Wed, 14 Jun 2006 06:13:03 +0900, Roman H.
removed_email_address@domain.invalid writes:
|In my opinion, Ruby is practically useless for many applications without
|proper Unicode support. How a modern language can ignore this issue is
|really beyond me.

Define “proper Unicode support” first.

|Is there a plan to get Unicode support into the language anytime soon?

I’m planning to enhance Unicode support in 1.9 in a year or so
(finally). But I’m not sure that it will conform to your definition of
“proper Unicode support”. Note that 1.8 handles Unicode (UTF-8) if your
string operations are based on Regexp.

						matz.
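
[Illustrative sketch, not part of the original message.] To make the Regexp remark concrete, here is a minimal example of character-wise string handling done purely through regular expressions. It is written to run on a current Ruby; the \u escapes need 1.9 or later, while on 1.8 the patterns themselves rely on the /u flag or on $KCODE = 'u'. The sample string and variable names are assumptions made for illustration.

```ruby
# Sample UTF-8 text: "Grüße" (5 characters, 7 bytes). Escapes keep the
# source file ASCII-only.
text = "Gr\u00fc\u00dfe"

# /./mu matches one whole UTF-8 character, newlines included.
chars = text.scan(/./mu)

char_count  = chars.length        # characters, not bytes
first_three = text[/\A.{3}/mu]    # leading substring by character count
reversed    = chars.reverse.join  # reverse without splitting a character

puts char_count
puts first_three
puts reversed
```

Every operation here goes through the regexp engine, which is the only part of 1.8 that understands multi-byte characters.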

#3

On Jun 13, 2006, at 6:34 PM, Pete wrote:

Define “proper Unicode support” first.

having a Unicode equivalent for all methods of class String

like size, slice, upcase

E.g. I tried the unicode plugin… but, alas, who wants to write
stuff like ‘normalize_KC’ etc. if you just want the frickin’
substring of a string?!

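# Return len characters of str, starting at character offset start.
# Character-aware only when the regexp engine is in UTF-8 mode ($KCODE = 'u').
def substring(str, start, len)
  md = str.match(/\A.{#{start}}(.{#{len}})/m)
  md && md[1]  # nil when the string is too short
end

# Count characters by letting the regexp engine walk the string.
def strlength(str)
  n = 0
  str.gsub(/./m) { n += 1; $& }
  n
end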

See! Regexps do everything!

Just so you know, set $KCODE and use these methods and you are set!

(I am kidding… btw)


#4

Define “proper Unicode support” first.

having a Unicode equivalent for all methods of class String

like size, slice, upcase

E.g. I tried the unicode plugin… but, alas, who wants to write stuff
like ‘normalize_KC’ etc. if you just want the frickin’ substring of a
string?!

you need to read books on unicode just to properly use the plugin…

aargg :-((

Best regards
Peter

Yukihiro M. schrieb:


#5

From the theoretical point of view this is quite interesting. Also I
understand the humor :-)

Performance and memory consumption should be breathtaking using regexp
just everywhere…

Also there are a few methods left :-)

As I am German the ‘missing’ unicode support is one of the greatest
obstacles for me (and probably all other Germans doing their stuff
seriously)…

Logan C. schrieb:


#6

From: Pete [mailto:removed_email_address@domain.invalid]
Sent: Wednesday, June 14, 2006 1:58 AM

As I am German the ‘missing’ unicode support is one of the greatest
obstacles for me (and probably all other Germans doing their stuff
seriously)…

The same goes for Russians/Ukrainians. In our programming communities the
question “does the programming language support Unicode as ‘native’?” has
very high priority.

BTW, this is one of the areas where Python beats Ruby completely.

V.


#7

I suspect the Japanese posters on this list can answer better than I can,
but my impression is that Unicode is, shall we say, not highly thought of
outside Europe and North America. The way they dealt with “Chinese”
characters was apparently more than a bit of a hack, and just doesn’t work
very well in the real world. Reading some of the explanations for glyphs
versus characters in Unicode just makes you shake your head. What were they
thinking? Sure doesn’t pass the smell test, although I’ll be the first to
admit I haven’t exactly thought deeply about the subject.

There’s another problem with Japanese - I’ve got a friend who’s been dealing
with some issues around the fact that Japanese apparently innovates new
characters on a regular basis, and everyone is expected to use the new
characters. (I believe this is called gaiji). The concept of a fixed
character set apparently just isn’t a good idea to start with.

[Awaiting corrections from people who actually know something about this
topic :-)…]

  • James M.

#8

On Jun 13, 2006, at 7:56 PM, James M. wrote:

topic :-)…]
I have one Japanese person here who’s never heard of this gaiji
concept. But it could be new and behind a generation gap of some
kind. They do sure like to add symbols where they can, though.
Especially graphical star characters. I see that a lot.
-Mat


#9

On 6/14/06, James M. removed_email_address@domain.invalid wrote:

with some issues around the fact that Japanese apparently innovates new
characters on a regular basis, and everyone is expected to use the new
characters. (I believe this is called gaiji). The concept of a fixed
character set apparently just isn’t a good idea to start with.

[Awaiting corrections from people who actually know something about this
topic :-)…]

There is a good summary of the Han unification controversy on Wikipedia:

http://en.wikipedia.org/wiki/Han_unification

#10

Hi,

In message “Re: Unicode roadmap?”
on Wed, 14 Jun 2006 08:11:49 +0900, “Victor S.”
removed_email_address@domain.invalid writes:

|From: Pete [mailto:removed_email_address@domain.invalid]
|Sent: Wednesday, June 14, 2006 1:58 AM
|> As I am German the ‘missing’ unicode support is one of the greatest
|> obstacles for me (and probably all other Germans doing their stuff
|> seriously)…
|
|The same goes for Russians/Ukrainians. In our programming communities the
|question “does the programming language support Unicode as ‘native’?” has
|very high priority.

Alright, then what specific features are you (both) missing? I don’t
think it is a method to get the number of characters in a string. It
can’t be THAT crucial. I do want to cover “your missing features” in
the future M17N support in Ruby.

						matz.

#11

From: Yukihiro M. [mailto:removed_email_address@domain.invalid]
Sent: Wednesday, June 14, 2006 5:37 AM

|The same is for Russians/Ukrainians. In our programming communities
matz.
I suppose all we (non-English-writers) need is to have all string-related
methods working. Just for now, I think about plain testing each string
method; also, some other classes can be affected by Unicode (possibly
regexps and paths). Regexps seem to work fine (in my 1.9), but paths do
not: File.open with Russian letters in the path doesn’t find the file.

More generally, it could make sense to have Unicode as the “base” mode,
with non-Unicode remaining as an “old, compatibility” mode.

Something like this.

V.


#12

Hi,

In message “Re: Unicode roadmap?”
on Wed, 14 Jun 2006 14:26:30 +0900, “Victor S.”
removed_email_address@domain.invalid writes:

|I suppose, all we (non-English-writers) need is to have all string-related
|methods working. Just for now, I think about plain testing each string
|method;

In that sense, I am one of the non-English-writers, so that I can
suppose I know what we need. And I have no problem with the current
UTF-8 support. Maybe that’s because Japanese don’t have cases in our
characters. Or maybe I’m missing something. Can you show us your
concrete problems caused by Ruby’s lack of “proper” Unicode support?

|also, some other classes can be affected by Unicode (possibly
|regexps and paths). Regexps seem to work fine (in my 1.9), but paths do
|not: File.open with Russian letters in the path doesn’t find the file.

Strange. Ruby does not convert encodings, so there should be no
problem opening files if you are using strings in the encoding your OS
expects. If they differ, you have to specify (and convert) them
properly, regardless of how good the Unicode support is.

						matz.
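
[Illustrative sketch, not part of the original message.] The point about matching the OS encoding can be sketched as follows. It uses String#encode, which arrived in Ruby 1.9 (on 1.8 the Iconv library served this purpose); the file name and the choice of Windows-1251 as the target encoding are assumptions made for the example.

```ruby
# Hypothetical file name "файл.txt", held as UTF-8.
utf8_name = "\u0444\u0430\u0439\u043b.txt"

# Re-encode to the single-byte Russian Windows code page (assumed to be
# what the OS expects in this scenario).
cp1251_name = utf8_name.encode("Windows-1251")

puts utf8_name.bytesize    # 12: each Cyrillic letter takes two bytes
puts cp1251_name.bytesize  # 8: one byte per character

# Same text, different bytes: an API that compares raw bytes will not
# find a name passed in the wrong encoding.
same_bytes = utf8_name.bytes == cp1251_name.bytes
round_trip = cp1251_name.encode("UTF-8") == utf8_name
puts same_bytes            # false
puts round_trip            # true
```

Which of the two byte sequences a given file API accepts depends on the platform, so the conversion has to happen before the path reaches File.open.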

#13

Roman H. wrote:

In my opinion, Ruby is practically useless for many applications without
proper Unicode support. How a modern language can ignore this issue is
really beyond me.

Is there a plan to get Unicode support into the language anytime soon?

I also think that this is very important.


#14

On Jun 14, 2006, at 15:56 , Victor S. wrote:

As mentioned in this topic, it’s String#length, upcase, downcase,
capitalize.

Just to chime in, aren’t upcase, downcase, and capitalize a locale/
localization issue rather than a Unicode-only issue per se? For
example, different languages will have different rules for
capitalization. Or am I wrong? Does Unicode in and of itself address
these issues?

Granted, proper support for upcase, downcase, and capitalize is
important, but I think it’s a separate issue, part of m17n as a whole
rather than support for Unicode in particular.
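
[Illustrative sketch, not part of the original message.] A short example of why case conversion is partly a language question. It assumes Ruby 2.4 or later, where String#upcase and String#downcase apply Unicode case mappings and accept a :turkic option; none of this existed when the thread was written.

```ruby
# German sharp s has no single-character uppercase in the default mapping,
# so the string grows when upcased.
german = "stra\u00dfe"               # "straße"
puts german.upcase                   # STRASSE

# Turkish and Azerbaijani distinguish dotted and dotless i, so the same
# letter maps differently depending on the language rules requested.
default_upper = "i".upcase           # "I"
turkic_upper  = "i".upcase(:turkic)  # U+0130, capital I with dot above
turkic_lower  = "I".downcase(:turkic) # U+0131, small dotless i

puts default_upper
puts turkic_upper
puts turkic_lower
```

Unicode supplies the mapping tables, but the caller still has to say which language's rules apply.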

Michael G.
grzm seespotcode net


#15

From: Yukihiro M. [mailto:removed_email_address@domain.invalid]
Sent: Wednesday, June 14, 2006 9:35 AM

In that sense, I am one of the non-English-writers,

Sorry, Matz, I know, of course. But I know too little about Japanese to
see how close our tasks are. By “non-English-writers” I maybe should
have said “European languages” or so, which have common punctuation, LTR
writing, “words” and “whitespace” and so on. I have almost no knowledge
of what Japanese, Korean, Arabic or Hebrew users need.

so that I can
suppose I know what we need. And I have no problem with the current
UTF-8 support. Maybe that’s because Japanese don’t have cases in our
characters. Or maybe I’m missing something.

Just what I’ve said above.

Can you show us your
concrete problems caused by Ruby’s lack of “proper” Unicode support?

As mentioned in this topic, it’s String#length, upcase, downcase,
capitalize.

BTW, does String#length works good for you?

Moreover, there seem to be some huge problems with paths containing
Russian letters, but I’m really not convinced that Ruby has to handle
this.

|also, some other classes can be affected by Unicode (possibly
|regexps and paths). Regexps seem to work fine (in my 1.9), but paths do
|not: File.open with Russian letters in the path doesn’t find the file.

Strange. Ruby does not convert encodings, so there should be no
problem opening files if you are using strings in the encoding your OS
expects. If they differ, you have to specify (and convert) them
properly, regardless of how good the Unicode support is.

Oh, it’s a bit hard theme for me. I know Windows XP must support Unicode
file names; I see my filenames in Russian, but I have low knowledge of
system internals to say, are they really Unicode?

Leaving those problems aside, only the String problems remain, but
those are such basic core methods!

V.


#16

Hi,

As mentioned in this topic, it’s String#length, upcase, downcase,
capitalize.

BTW, does String#length works good for you?

To have the length of a Unicode string, just do str.split(//).length,
or “require ‘jcode’” at the beginning of your code.
For the other functions, try looking at the unicode library
http://www.yoshidam.net/Ruby.html#unicode
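
[Illustrative sketch, not part of the original message.] The split(//) idiom, assuming a UTF-8 string. It runs on a current Ruby, where String#length already counts characters; on 1.8, String#length returned the byte count (what bytesize reports today), which is why a workaround was needed. jcode is not shown because it was dropped from the standard library after 1.8.

```ruby
word = "\u043f\u0440\u0438\u0432\u0435\u0442"  # "привет", 6 Cyrillic letters

by_split = word.split(//).length  # empty pattern splits between characters
by_bytes = word.bytesize          # what 1.8's String#length would report

puts by_split  # 6
puts by_bytes  # 12
```

The idiom builds a temporary array with one string per character, so it costs noticeably more than a built-in counter would.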

Oh, it’s a bit hard theme for me. I know Windows XP must support Unicode
file names; I see my filenames in Russian, but I have low knowledge of
system internals to say, are they really Unicode?

Windows XP does support Unicode file names, but I’m not sure you can
use them with Ruby (I do not use Ruby much under Windows). Try
converting the file names to your current locale, it should work if
the file names can be converted to it. What I mean is that Russian
file names encoded in the Windows Russian encoding should work on a
Russian PC.

Hope this helps,

Cheers,
Vincent ISAMBART


#17

Hi,

In message “Re: Unicode roadmap?”
on Wed, 14 Jun 2006 15:56:02 +0900, “Victor S.”
removed_email_address@domain.invalid writes:

|> Can you show us your
|> concrete problems caused by Ruby’s lack of “proper” Unicode support?
|
|As mentioned in this topic, it’s String#length, upcase, downcase,
|capitalize.

OK. Case is the problem. I understand.

|BTW, does String#length works good for you?

I don’t remember the last time I needed the length method to count
characters. Actually, I don’t count string length at all, in either
bytes or characters, in my string processing. Maybe this is a
special case. I am too optimized for Ruby string operations using
Regexp.

|Oh, it’s a bit hard theme for me. I know Windows XP must support Unicode
|file names; I see my filenames in Russian, but I have low knowledge of
|system internals to say, are they really Unicode?

Windows 32 path encoding is a nightmare. Our Win32 maintainers are often
troubled by unexpected OS behavior. I am sure we can handle Russian
path names, but we need help from Russian people to improve.

						matz.

#18

From: Michael G. [mailto:removed_email_address@domain.invalid]
Sent: Wednesday, June 14, 2006 10:08 AM

On Jun 14, 2006, at 15:56 , Victor S. wrote:

As mentioned in this topic, it’s String#length, upcase, downcase,
capitalize.

Just to chime in, aren’t upcase, downcase, and capitalize a locale/
localization issue rather than a Unicode-only issue per se? For
example, different languages will have different rules for
capitalization.

Really? I know about two cases: European capitalization and no
capitalization.

But, really, you may be right. I suppose Florian G. can say something
about German-specific capitalization issues.

Granted, proper support for upcase, downcase, and capitalize is
important, but I think it’s a separate issue, part of m17n as a whole
rather than support for Unicode in particular.

Maybe. Generally, sometimes I want Unicode, and sometimes (for “quick
and dirty” scripts) I’d prefer that capitalization and regexps “just
work” with Windows-1251 (the one-byte Russian encoding).

V.


#19

From: Vincent I. [mailto:removed_email_address@domain.invalid]
Sent: Wednesday, June 14, 2006 10:14 AM

As mentioned in this topic, it’s String#length, upcase, downcase,
capitalize.

BTW, does String#length works good for you?

To have the length of a Unicode string, just do str.split(//).length,
or “require ‘jcode’” at the beginning of your code.
For the other functions, try looking at the unicode library
http://www.yoshidam.net/Ruby.html#unicode

I know about it. But, theoretically speaking, such “core” methods must
be in core. No?

properly, no matter how Unicode support is.
Russian PC.
Yes, they work. But I can’t settle the question: does Ruby’s Unicode
support need to include filename operations?

V.


#20

Yukihiro M. skrev:

Hi,

In message “Re: Unicode roadmap?”
on Wed, 14 Jun 2006 06:13:03 +0900, Roman H. removed_email_address@domain.invalid writes:
|In my opinion, Ruby is practically useless for many applications without
|proper Unicode support. How a modern language can ignore this issue is
|really beyond me.

Define “proper Unicode support” first.

I won’t define “proper Unicode support” here.

But there must be a problem somewhere since pure-ruby Ferret doesn’t
support UTF-8. You need to use the c-extension of Ferret to have it
support UTF-8 (which doesn’t work on Windows yet :-( ). I don’t know if
that is just a sucky impl of Ferret or if it’s Ruby that makes it so.

Maybe Dave Balmain can enlighten us why UTF-8 doesn’t work in the pure
Ruby version and what is needed of Ruby to make it work (if it’s
actually Ruby’s fault that is)?

My personal belief is that it should just work in a case like this if
the data going in is UTF-8 and the search strings are UTF-8, without the
lib author and/or user having to do anything very special to make it
work (apart from specifying encoding). Am I wrong in this?
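
[Illustrative sketch, not part of the original message.] One reason “UTF-8 in, UTF-8 out” can often work without special handling is that UTF-8 is self-synchronizing: the bytes of one encoded character cannot be mistaken for the middle of another. The example below shows that with a plain substring search. It is not Ferret's code, the strings are made up, and String#b needs Ruby 2.0 or later.

```ruby
haystack = "Gr\u00fc\u00dfe aus K\u00f6ln"  # "Grüße aus Köln"
needle   = "K\u00f6ln"                       # "Köln"

char_offset = haystack.index(needle)      # counted in characters
byte_offset = haystack.b.index(needle.b)  # same search on raw bytes

puts char_offset  # 10
puts byte_offset  # 12

# The byte-level hit lands on a character boundary, so slicing the raw
# bytes there yields the needle again.
hit = haystack.byteslice(byte_offset, needle.bytesize)
puts hit == needle  # true
```

Search is the easy part; tokenizing, case folding and length limits are where byte-oriented code usually needs real character awareness.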

Regards,

Marcus