In my opinion, Ruby is practically useless for many applications without proper Unicode support. How a modern language can ignore this issue is really beyond me. Is there a plan to get Unicode support into the language anytime soon?
on 13.06.2006 23:12
on 14.06.2006 00:28
Hi,
In message "Re: Unicode roadmap?"
on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner
<roman.hausner@gmail.com> writes:
|In my opinion, Ruby is practically useless for many applications without
|proper Unicode support. How a modern language can ignore this issue is
|really beyond me.
Define "proper Unicode support" first.
|Is there a plan to get Unicode support into the language anytime soon?
I'm planning enhancing Unicode support in 1.9 in a year or so
(finally). But I'm not sure that conforms your definition of "proper
Unicode support". Note that 1.8 handles Unicode (UTF-8) if your
string operations are based on Regexp.
matz.
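
A minimal sketch of the Regexp-based style mentioned above, written so that it also runs on current Ruby, where the `u` flag marks a pattern as UTF-8. The helper names `char_count` and `char_slice` are made up for illustration; on 1.8 the same patterns would depend on `$KCODE = 'u'` or the `u` flag.

```ruby
# -*- coding: utf-8 -*-
# Hypothetical helpers: count and slice characters using only Regexp-based calls.

def char_count(str)
  # /./mu matches one UTF-8 character at a time, newlines included.
  str.scan(/./mu).length
end

def char_slice(str, start, len)
  # Skip `start` characters, then capture the next `len` characters.
  str[/\A.{#{start}}(.{#{len}})/mu, 1]
end

word = "Größe"
puts char_count(word)        # number of characters, not bytes
puts char_slice(word, 1, 3)  # prints "röß"
```

Both helpers scan from the start of the string on every call, so this is an illustration of the idea rather than something to use in a hot loop.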
on 14.06.2006 00:38
> Define "proper Unicode support" first.
Having a Unicode equivalent for all methods of class String,
like size, slice, upcase.
E.g. I tried the unicode plugin... but, alas, who wants to write stuff
like 'normalize_KC' etc. if you just want the frickin' substring of a
string?!
You need to read books on Unicode just to properly use the plugin...
aargg :-((
Best regards
Peter
Yukihiro Matsumoto wrote:
on 14.06.2006 00:51
On Jun 13, 2006, at 6:34 PM, Pete wrote:
>> Define "proper Unicode support" first.
>
> having an unicode-equivalent for all methods of class String
>
> like size, slice, upcase
>
> E.g. I tried the unicode plugin... but, alas, who want's to write
> stuff like 'normalize_KC' etc. if you just want the frickin'
> substring of a string?!
>

  def substring(str, start, len)
    md = str.match(/\A.{#{start}}(.{#{len}})/)
    md[1]
  end

  def strlength(str)
    n = 0
    str.gsub(/./m) { n += 1; $& }
    n
  end

See! Regexps do everything! Just you know, set $KCODE and use these methods and you are set! (I am kidding... btw)
on 14.06.2006 01:00
From the theoretical point of view this is quite interesting. Also I understand the humor :-)

Performance and memory consumption should be breathtaking using regexp just everywhere... Also there are a ____few____ methods left :-)

As I am German the 'missing' unicode support is one of the greatest obstacles for me (and probably all other Germans doing their stuff seriously)...

Logan Capaldo wrote:
on 14.06.2006 01:13
From: Pete [mailto:pertl@gmx.org] Sent: Wednesday, June 14, 2006 1:58 AM > As I am German the 'missing' unicode support is one of the greatest > obstacles for me (and probably all other Germans doing their stuff > seriously)... The same is for Russians/Ukrainians. In our programming communities question "does the programming language supports Unicode as 'native'?" has very high priority. /BTW, here is one of the things where Python beats Ruby completely V.
on 14.06.2006 01:59
I suspect the Japanese posters on this list can answer better than I can, but my impression is that Unicode is, shall we say, not highly thought of outside Europe and North America. The way they dealt with "Chinese" characters was apparently more than a bit of a hack, and just doesn't work very well in the real world. Reading some of the explanations for glyphs versus characters in Unicode just makes you shake your head. What were they thinking? Sure doesn't pass the smell test, although I'll be the first to admit I haven't exactly thought deeply about the subject. There's another problem with Japanese - I've got a friend who's been dealing with some issues around the fact that Japanese apparently innovates new characters on a regular basis, and everyone is expected to use the new characters. (I believe this is called gaiji). The concept of a fixed character set apparently just isn't a good idea to start with. [Awaiting corrections from people who actually know something about this topic :-)...] - James Moore
on 14.06.2006 02:14
On 6/14/06, James Moore <banshee@banshee.com> wrote: > with some issues around the fact that Japanese apparently innovates new > characters on a regular basis, and everyone is expected to use the new > characters. (I believe this is called gaiji). The concept of a fixed > character set apparently just isn't a good idea to start with. > > [Awaiting corrections from people who actually know something about this > topic :-)...] There is a good summary of the han unification controversy on wikipedia; http://en.wikipedia.org/wiki/Han_unification
on 14.06.2006 03:16
On Jun 13, 2006, at 7:56 PM, James Moore wrote:
> topic :-)...]
I have one Japanese person here who's never heard of this gaiji
concept. But it could be new and behind a generation gap of some
kind. They do sure like to add symbols where they can, though.
Especially graphical star characters. I see that a lot.
-Mat
on 14.06.2006 04:38
Hi,
In message "Re: Unicode roadmap?"
on Wed, 14 Jun 2006 08:11:49 +0900, "Victor Shepelev"
<vshepelev@imho.com.ua> writes:
|From: Pete [mailto:pertl@gmx.org]
|Sent: Wednesday, June 14, 2006 1:58 AM
|> As I am German the 'missing' unicode support is one of the greatest
|> obstacles for me (and probably all other Germans doing their stuff
|> seriously)...
|
|The same is for Russians/Ukrainians. In our programming communities question
|"does the programming language supports Unicode as 'native'?" has very high
|priority.
Alright, then what specific features are you (both) missing? I don't
think it is a method to get number of characters in a string. It
can't be THAT crucial. I do want to cover "your missing features" in
the future M17N support in Ruby.
matz.
on 14.06.2006 07:29
From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org] Sent: Wednesday, June 14, 2006 5:37 AM > |The same is for Russians/Ukrainians. In our programming communities > matz. I suppose, all we (non-English-writers) need is to have all string-related methods working. Just for now, I think about plain testing each string method; also, some other classes can be affected by Unicode (possibly regexps, and pathes). Regexps seems to work fine (in my 1.9), but pathes are not: File.open with Russian letters in path don't finds the file. More generally, it can make sense to have Unicode as the "base" mode; where non-Unicode to stay "old, compatibility" mode. Something like this. V.
on 14.06.2006 07:54
Roman Hausner wrote: > In my opinion, Ruby is practically useless for many applications without > proper Unicode support. How a modern language can ignore this issue is > really beyond me. > > Is there a plan to get Unicode support into the language anytime soon? I also think that this is very important.
on 14.06.2006 08:37
Hi,
In message "Re: Unicode roadmap?"
on Wed, 14 Jun 2006 14:26:30 +0900, "Victor Shepelev"
<vshepelev@imho.com.ua> writes:
|I suppose, all we (non-English-writers) need is to have all string-related
|methods working. Just for now, I think about plain testing each string
|method;
In that sense, _I_ am one of the non-English-writers, so that I can
suppose I know what we need. And I have no problem with the current
UTF-8 support. Maybe that's because Japanese don't have cases in our
characters. Or maybe I'm missing something. Can you show us your
concrete problems caused by Ruby's lack of "proper" Unicode support?
|also, some other classes can be affected by Unicode (possibly
|regexps, and pathes). Regexps seems to work fine (in my 1.9), but pathes are
|not: File.open with Russian letters in path don't finds the file.
Strange. Ruby does not convert encoding, so that there should be no
problem opening files, if you are using strings in the encoding your OS
expect. If they are differ, you have to specify (and convert) them
properly, no matter how Unicode support is.
matz.
on 14.06.2006 08:56
From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org] Sent: Wednesday, June 14, 2006 9:35 AM > > In that sense, _I_ am one of the non-English-writers, Sorry, Matz, I know, of course. But I know too less about Japanese to see how close our tasks are. Under "non-English-writers" I, maybe, had to say "European languages" or so - which has common punctuations, LTR writing, "words" and "whitespaces" and so on. I have almost no knowledge about Japanese, Korean, Arabic, Hebrew people needs. > so that I can > suppose I know what we need. And I have no problem with the current > UTF-8 support. Maybe that's because Japanese don't have cases in our > characters. Or maybe I'm missing something. Just what I've said above. > Can you show us your > concrete problems caused by Ruby's lack of "proper" Unicode support? As mentioned in this topic, it's String#length, upcase, downcase, capitalize. BTW, does String#length works good for you? Moreover, there seems to be some huge problems with pathes having Russian letters; but I'm really not convinced, if Ruby really has to handle this. > > |also, some other classes can be affected by Unicode (possibly > |regexps, and pathes). Regexps seems to work fine (in my 1.9), but pathes > are > |not: File.open with Russian letters in path don't finds the file. > > Strange. Ruby does not convert encoding, so that there should be no > problem opening files, if you are using strings in the encoding your OS > expect. If they are differ, you have to specify (and convert) them > properly, no matter how Unicode support is. Oh, it's a bit hard theme for me. I know Windows XP must support Unicode file names; I see my filenames in Russian, but I have low knowledge of system internals to say, are they really Unicode? If not take in account those problems, the only String problems remains, but they are so base core methods! V.
on 14.06.2006 09:09
On Jun 14, 2006, at 15:56 , Victor Shepelev wrote: > As mentioned in this topic, it's String#length, upcase, downcase, > capitalize. Just to chime in, aren't upcase, downcase, and capitalize a locale/ localization issue rather than a Unicode-only issue per se? For example, different languages will have different rules for capitalization. Or am I wrong? Does Unicode in and of itself address these issues? Granted, proper support for upcase, downcase, and capitalize is important, but I think it's a separate issue, part of m17n as a whole rather than support for Unicode in particular. Michael Glaesemann grzm seespotcode net
on 14.06.2006 09:15
Hi, > As mentioned in this topic, it's String#length, upcase, downcase, > capitalize. > > BTW, does String#length works good for you? To have the length of a Unicode string, just do str.split(//).length, or "require 'jcode'" at the beginning of your code. For the other functions, try looking at the unicode library http://www.yoshidam.net/Ruby.html#unicode > Oh, it's a bit hard theme for me. I know Windows XP must support Unicode > file names; I see my filenames in Russian, but I have low knowledge of > system internals to say, are they really Unicode? Windows XP does support Unicode file names, but I'm not sure you can use them with Ruby (I do not use Ruby much under Windows). Try converting the file names to your current locale, it should work if the file names can be converted to it. What I mean is that Russian file names encoded in the Windows Russian encoding should work on a Russian PC. Hope this helps, Cheers, Vincent ISAMBART
on 14.06.2006 09:22
Hi,
In message "Re: Unicode roadmap?"
on Wed, 14 Jun 2006 15:56:02 +0900, "Victor Shepelev"
<vshepelev@imho.com.ua> writes:
|> Can you show us your
|> concrete problems caused by Ruby's lack of "proper" Unicode support?
|
|As mentioned in this topic, it's String#length, upcase, downcase,
|capitalize.
OK. Case is the problem. I understand.
|BTW, does String#length works good for you?
I don't remember the last time I needed length method to count
character numbers. Actually I don't count string length at all both
in bytes and characters in my string processing. Maybe this is a
special case. I am too optimized for Ruby string operations using
Regexp.
|Oh, it's a bit hard theme for me. I know Windows XP must support Unicode
|file names; I see my filenames in Russian, but I have low knowledge of
|system internals to say, are they really Unicode?
Windows 32 path encoding is a nightmare. Our Win32 maintainers often
troubled by unexpected OS behavior. I am sure we _can_ handle Russian
path names, but we need help from Russian people to improve.
matz.
on 14.06.2006 09:25
From: Michael Glaesemann [mailto:grzm@seespotcode.net] Sent: Wednesday, June 14, 2006 10:08 AM > On Jun 14, 2006, at 15:56 , Victor Shepelev wrote: > > > As mentioned in this topic, it's String#length, upcase, downcase, > > capitalize. > > Just to chime in, aren't upcase, downcase, and capitalize a locale/ > localization issue rather than a Unicode-only issue per se? For > example, different languages will have different rules for > capitalization. Really? I know about two cases: European capitalization and no capitalization. But, really, you maybe right. I suppose, Florian Gross can say something about German-specific capitalization issues. > Granted, proper support for upcase, downcase, and capitalize is > important, but I think it's a separate issue, part of m17n as a whole > rather than support for Unicode in particular. Maybe. Generally, sometimes I want Unicode, and sometimes (for "quick dirty" scripts) I'll prefer capitalization and regexps "just work" with Windows-1251 (one-byte Russian encoding). V.
on 14.06.2006 09:26
From: Vincent Isambart [mailto:vincent.isambart@gmail.com]
Sent: Wednesday, June 14, 2006 10:14 AM
> > As mentioned in this topic, it's String#length, upcase, downcase,
> > capitalize.
> >
> > BTW, does String#length works good for you?
>
> To have the length of a Unicode string, just do str.split(//).length,
> or "require 'jcode'" at the beginning of your code.
> For the other functions, try looking at the unicode library
> http://www.yoshidam.net/Ruby.html#unicode

I know about it. But, theoretically speaking, such "core" methods must be in core. No?

> > > properly, no matter how Unicode support is.
> Russian PC.

Yes, they work. But I can't settle the question: does Ruby's Unicode support need to include filename operations?

V.
on 14.06.2006 09:32
From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org] Sent: Wednesday, June 14, 2006 10:20 AM > OK. Case is the problem. I understand. > > |BTW, does String#length works good for you? > > I don't remember the last time I needed length method to count > character numbers. Actually I don't count string length at all both > in bytes and characters in my string processing. Maybe this is a > special case. I am too optimized for Ruby string operations using > Regexp. I can confirm. But I'm afraid that some libraries I rely on use #length and can break when #length doesn't work. > |Oh, it's a bit hard theme for me. I know Windows XP must support Unicode > |file names; I see my filenames in Russian, but I have low knowledge of > |system internals to say, are they really Unicode? > > Windows 32 path encoding is a nightmare. Our Win32 maintainers often > troubled by unexpected OS behavior. I am sure we _can_ handle Russian > path names, but we need help from Russian people to improve. In Russian encoding (Win-1251) and on Russian PC all works well. In Unicode it doesn't, but I'm not convinced it must. In any case, I'm ready to spend my time helping Ruby community (especially in Russian/Ukrainian localization issues), because I really love the language. V.
on 14.06.2006 09:45
Yukihiro Matsumoto wrote:
> Hi,
>
> In message "Re: Unicode roadmap?"
> on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner <roman.hausner@gmail.com> writes:
> |In my opinion, Ruby is practically useless for many applications without
> |proper Unicode support. How a modern language can ignore this issue is
> |really beyond me.
>
> Define "proper Unicode support" first.
>

I won't define "proper Unicode support" here. But there must be a problem somewhere, since pure-Ruby Ferret doesn't support UTF-8. You need to use the C extension of Ferret to have it support UTF-8 (which doesn't work on Windows yet :( ). I don't know if that is just a sucky impl of Ferret or if it's Ruby that makes it so.

Maybe Dave Balmain can enlighten us as to why UTF-8 doesn't work in the pure Ruby version and what is needed of Ruby to make it work (if it's actually Ruby's fault, that is)?

My personal belief is that it should just work in a case like this, if the data going in is UTF-8 and the search strings are UTF-8, without the lib author and/or user having to do anything very special to make it work (apart from specifying the encoding). Am I wrong in this?

Regards,

Marcus
on 14.06.2006 10:23
On Jun 13, 2006, at 10:26 PM, Victor Shepelev wrote:
> Regexps seems to work fine (in my 1.9), but pathes are
> not: File.open with Russian letters in path don't finds the file.

On OS X multibyte filenames work:

$ cat x.rb
$KCODE = 'u'
puts File.read('Cyrillic_Я.txt')
$ cat Cyrillic_\320\257.txt
test file with Я!
$ ruby x.rb
test file with Я!
$ uname -a
Darwin kaa.jijo.segment7.net 8.6.0 Darwin Kernel Version 8.6.0: Tue Mar 7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power Macintosh powerpc
$ ruby -v
ruby 1.8.4 (2006-05-18) [powerpc-darwin8.6.0]
$

--
Eric Hodel - drbrain@segment7.net - http://blog.segment7.net
This implementation is HODEL-HASH-9600 compliant
http://trackmap.robotcoop.com
on 14.06.2006 10:55
On 14/06/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote: > Windows 32 path encoding is a nightmare. Our Win32 maintainers often > troubled by unexpected OS behavior. I am sure we _can_ handle Russian > path names, but we need help from Russian people to improve. str.sub!('32 path encoding ', '') # :-) I don't use Windows much, but as I understand it, Ruby interacts with most of the Win32 API using the 'legacy code page', which is only a subset of what the filesystem can handle. (Windows NT and its successors use Unicode internally, and the filesystem is UTF-16 KC-normalised IIRC). Windows does provide Unicode API functions, but to use those, a layer of translation between UTF-16 and UTF-8 would be needed, as Ruby can't do anything useful with UTF-16 at present. I believe that Austin Ziegler was looking into this; I don't know if he's made any progress. Even if a Ruby program uses UTF-8 internally, it should be possible to access the filesystem by Iconv'ing paths to the appropriate code page - providing that they don't contain characters not in the code page. It's far from ideal, though: the real solution is for Ruby to use the Unicode functions (those suffixed with W) in the API. The upside is that UTF-8/UTF-16 conversion should be less expensive than the code page conversion that's inside each of Win32's non-Unicode functions. On the other hand, plenty of Windows programs don't support Unicode properly either. Paul.
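
As a rough illustration of the code page approach described above, here is a sketch using `String#encode`, which only arrived with Ruby 1.9 (at the time of this thread one would have reached for Iconv). The helper `to_code_page` is hypothetical, and Windows-1251 is assumed as the legacy code page.

```ruby
# -*- coding: utf-8 -*-
# Hypothetical helper: convert a UTF-8 path to a legacy code page,
# returning nil when some character has no equivalent there.
def to_code_page(path, code_page = "Windows-1251")
  path.encode(code_page)
rescue Encoding::UndefinedConversionError
  nil
end

russian = to_code_page("Привет.txt")
p russian.bytesize            # 10: one byte per character in Windows-1251
p to_code_page("日本語.txt")   # nil: not representable in Windows-1251
```

The `nil` case is the limitation mentioned above: a path containing characters outside the code page cannot be reached this way at all.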
on 14.06.2006 11:00
On 14/06/06, Victor Shepelev <vshepelev@imho.com.ua> wrote: > I can confirm. But I'm afraid that some libraries I rely on use #length and > can break when #length doesn't work. Those libraries should probably be considered broken; they can and should be patched to do any human-readable-string processing in an encoding-safe manner (e.g. by using jcode's jlength and each_char methods). Paul.
on 14.06.2006 11:09
-------- Original Message --------
Date: Wed, 14 Jun 2006 17:58:41 +0900
From: Paul Battley <pbattley@gmail.com>
To: ruby-talk@ruby-lang.org
Subject: Re: Unicode roadmap?
> Paul.
That will be quite _some_ libraries, I guess...
on 14.06.2006 11:12
On 14/06/06, Victor Shepelev <vshepelev@imho.com.ua> wrote: > > Just to chime in, aren't upcase, downcase, and capitalize a locale/ > > localization issue rather than a Unicode-only issue per se? For > > example, different languages will have different rules for > > capitalization. > > Really? I know about two cases: European capitalization and no > capitalization. There is variety even within western European languages - Dutch, for example, differs from English (IJsselmeer). Paul.
on 14.06.2006 11:16
From: Paul Battley [mailto:pbattley@gmail.com]
Sent: Wednesday, June 14, 2006 12:10 PM
> example, differs from English (IJsselmeer).
I already realized. (I mentioned Florian Gross because the final "ss" of his
surname is normally printed as a single letter that looks something like "B",
which I can't type and my Outlook can't show :) AFAIK, it is normally printed
as one letter in lowercase and two letters in uppercase.) So a "single general"
String#upcase or #downcase is totally impossible.
V.
on 14.06.2006 11:25
On 6/14/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |
> |The same is for Russians/Ukrainians. In our programming communities question
> |"does the programming language supports Unicode as 'native'?" has very high
> |priority.
>
> Alright, then what specific features are you (both) missing? I don't
> think it is a method to get number of characters in a string. It
> can't be THAT crucial. I do want to cover "your missing features" in
> the future M17N support in Ruby.
>

What I want is all methods working seamlessly with Unicode strings so that I do not have to think about the encoding.

Regexps do work with UTF-8 strings if KCODE is set to u (but it defaults to n even when the locale uses UTF-8). String searches should probably work, but they would return the wrong position. Things like split should work for UTF-8; the encoding is pretty well defined.

But one might want to use length and [] to work with strings. It can be simulated with unicode_string=string.scan(/./). But it is no longer a string. It is composed of characters only as long as I assign only characters using []=.

The string functions should do the right thing even for UTF-8. But I guess UTF-32 is more useful for working with strings this way.

It might be a good idea to stick encoding information into strings (it is probably the only way internationalization can be done and the sanity of all involved preserved at the same time). The functions for comparison etc. could use it to do the right thing even if strings come in several encodings, e.g. cp1251 from the system, UTF-8 from a web page, ... Functions like open could convert the string correctly according to the locale.

One should be able to set the encoding information (e.g. for a web page title when the meta tag for content type is found in the page), and remove it to suppress string conversion. It should also be possible to convert the string (e.g. to UTF-32 to speed up character access).

Things like <=>, upcase, downcase, etc. make sense only in the context of a locale (language). The encoding alone does not define them. I guess the default <=> is based on the binary representation of the string. This would mean different sorting of the same strings in different encodings. Sorting by the Unicode code point would at least be the same for any encoding.

Thanks

Michal
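
To make the sorting point concrete, here is a small sketch, assuming a current Ruby with `String#encode` (which postdates this thread). It compares the byte-wise order of the same two Russian letters in UTF-8 and in Windows-1251, then sorts by code point.

```ruby
# -*- coding: utf-8 -*-
words = ["я", "ё"]

# String#<=> compares the underlying bytes, so the result depends on the encoding.
utf8_order   = words.sort
cp1251_order = words.map { |w| w.encode("Windows-1251") }.sort.map { |w| w.encode("UTF-8") }

# Sorting by code point gives the same answer whatever the storage encoding.
by_codepoint = words.sort_by { |w| w.unpack("U*") }

p utf8_order    # ["я", "ё"]  (UTF-8 bytes D1 8F before D1 91)
p cp1251_order  # ["ё", "я"]  (Windows-1251 bytes B8 before FF)
p by_codepoint  # ["я", "ё"]  (U+044F before U+0451)
```

Neither order is the alphabetical order a Russian reader would expect, which is the locale half of the argument.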
on 14.06.2006 11:35
On 6/14/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> > capitalization.
>
> Really? I know about two cases: European capitalization and no
> capitalization.

Really. There is no such thing as European capitalization. There is only <insert your language> capitalization. The German character ß has no uppercase version. In most languages using Latin script the uppercase of 'i' is 'I'. But Turkish has i and ı (i without dot), and the uppercase of 'i' is, of course, İ (I with dot).

Thanks

Michal
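
The cases mentioned in this sub-thread can be tried directly on a recent interpreter. This sketch assumes Ruby 2.4 or later, where `upcase`, `downcase` and `capitalize` became Unicode-aware and accept a `:turkic` option; none of this existed when the thread was written.

```ruby
# -*- coding: utf-8 -*-
# Requires Ruby 2.4 or later for Unicode-aware case mapping.
p "straße".upcase          # "STRASSE": one letter becomes two
p "i".upcase               # "I" under the default mapping
p "i".upcase(:turkic)      # "İ" (dotted capital I) for Turkish and Azeri
p "I".downcase(:turkic)    # "ı" (dotless small i)
p "ijsselmeer".capitalize  # "Ijsselmeer", whereas Dutch writes "IJsselmeer"
```

The last line shows that even a Unicode-aware default still cannot know the language of the text; that has to come from the caller.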
on 14.06.2006 11:41
On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote: > It should be also > possible to convert the string (ie to UTF-32 to speed up character > access). utf8_string.unpack('U*') is pretty close to this, giving an array of codepoints. Paul.
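
A short sketch of the `unpack('U*')` suggestion: the string becomes an array of code points that can be indexed by character, and `pack('U*')` turns a slice back into a UTF-8 string.

```ruby
# -*- coding: utf-8 -*-
text = "Straße"

codepoints = text.unpack("U*")   # one Integer per character
p codepoints.length              # 6 characters (the string is 7 bytes long)
p codepoints[4]                  # 223, i.e. U+00DF

# Slice by character position, then convert back to a UTF-8 string.
middle = codepoints[2, 3].pack("U*")
p middle                         # "raß"
```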
on 14.06.2006 12:54
On 6/14/06, Paul Battley <pbattley@gmail.com> wrote:
> On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> > It should be also
> > possible to convert the string (ie to UTF-32 to speed up character
> > access).
>
> utf8_string.unpack('U*') is pretty close to this, giving an array of codepoints.

But I want it to be a string after the conversion, so that I can use the standard string functions with sane results. I do not want to think about various encodings myself if my application has to use them. The runtime should do that.

Thanks

Michal
on 14.06.2006 14:23
On 6/14/06, Victor Shepelev <vshepelev@imho.com.ua> wrote: > Oh, it's a bit hard theme for me. I know Windows XP must support Unicode > file names; I see my filenames in Russian, but I have low knowledge of > system internals to say, are they really Unicode? They are UTF-16 internally. I haven't been paying attention to Ruby 1.9 lately, but when I have time and have noticed that Matz has checked in support for m17n strings, I will be enhancing support for Windows files to use Unicode. Currently, Ruby is built using the non-Unicode form *only*. And no, using -DUNICODE is the *wrong* answer, thanks. We'd have to start using TCHAR instead of char, and it would actually mean that we'd be using wchar_t instead of char in this case. I've already done a similar (but more complex) project at work. -austin
on 14.06.2006 14:29
On 6/14/06, Vincent Isambart <vincent.isambart@gmail.com> wrote: > Windows XP does support Unicode file names, but I'm not sure you can > use them with Ruby (I do not use Ruby much under Windows). Try > converting the file names to your current locale, it should work if > the file names can be converted to it. What I mean is that Russian > file names encoded in the Windows Russian encoding should work on a > Russian PC. You can't currently use them with Ruby. The file operations in Ruby are using the likes of CreateFileA instead of CreateFileW (it's not that explicit; Ruby is compiled without -DUNICODE -- which is the correct thing to do in Ruby's case -- which means that CreateFile is CreateFileA). All files are stored on the filesystem as UTF-16, though, even if you are using "ANSI" access. By the way, there are multiple Russian encodings, so ... Unicode is better for this point. As I said in my previous message, I have already planned to enhance the Windows filesystem support when Matz gets the m17n strings in so that I can *always* force the file routines on Windows to provide either UTF-8 or UTF-16 (probably the former, since it will also make it easier to work with existing extensions) and indicate that the strings are such. -austin
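
The conversion step that such a layer would need can be sketched in a few lines, assuming a current Ruby with `String#encode`. The helpers `to_wide` and `from_wide` are hypothetical names, and no Windows API is actually called here; this only shows the UTF-8 to UTF-16 round trip.

```ruby
# -*- coding: utf-8 -*-
# Hypothetical helpers for the conversion step only; no Windows API is called.

def to_wide(path)
  # UTF-16LE plus a terminating NUL, the form wide-character APIs expect.
  (path + "\0").encode("UTF-16LE")
end

def from_wide(wide)
  wide.encode("UTF-8").chomp("\0")
end

wide = to_wide("Я.txt")
p wide.bytes.first(2)   # [47, 4], i.e. U+042F in little-endian order
p wide.bytesize         # 12: six UTF-16 code units including the NUL
p from_wide(wide)       # "Я.txt"
```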
on 14.06.2006 14:29
On 6/14/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote: > Windows 32 path encoding is a nightmare. Our Win32 maintainers often > troubled by unexpected OS behavior. I am sure we _can_ handle Russian > path names, but we need help from Russian people to improve. It's not that bad, Matz. I started as a Unix developer, but in the last two years I have learned *quite* a bit about how Windows handles this stuff and we can adapt what I did for work with no problem. I just need M17N strings to support this. I should look at what I can/should do to provide this as an extension, I just have no time. :( -austin
on 14.06.2006 14:36
On 6/14/06, Michal Suchanek <hramrach@centrum.cz> wrote: > What I want is all methods working seamlessly with unicode strings so > that I do not have to think about the encoding. That will *never* happen. Even with Unicode, you have to think about the encoding, because UTF-32 (the closest representation to the Platonic ideal "Unicode" you'll ever find) is unlikely to be supported in the general case. Matz's idea of m17n strings is the right one: you have a "byte stream" and an attribute which indicates how the byte stream is encoded. This will sort of be like $KCODE but on an individual string level so that you could meaningfully have Unicode (probably UTF-8) and ShiftJIS strings in the same data and still meaningfully call #length on them. You will *always* have to care about the encoding. As well as, ultimately, your locale. -austin
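
For readers coming to this later: the "byte stream plus encoding attribute" design can be seen in the String API of Ruby 1.9 and later. The sketch below assumes such an interpreter and shows UTF-8 and Shift_JIS strings side by side, each answering `#length` in characters.

```ruby
# -*- coding: utf-8 -*-
utf8 = "日本語"
sjis = utf8.encode("Shift_JIS")

[utf8, sjis].each do |s|
  puts "#{s.encoding}: #{s.length} characters, #{s.bytesize} bytes"
end

# Reinterpreting the same bytes as binary gives a plain byte count instead.
raw = sjis.dup.force_encoding("ASCII-8BIT")
p raw.length   # 6
```

The caller still had to pick Shift_JIS explicitly, which matches the point above that the encoding never stops mattering.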
on 14.06.2006 23:40
On Wednesday 14 June 2006 06:52 am, Michal Suchanek wrote: > On 6/14/06, Paul Battley <pbattley@gmail.com> wrote: > > On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote: > > > It should be also > > > possible to convert the string (ie to UTF-32 to speed up character > > > access). (RE my previous post): Oops, maybe UTF-32 is exactly what I was alluding to? Randy Kramer (Should have waited a little longer before posting.)
on 15.06.2006 02:12
Every time these unicode discussions come up my head spins like a top. You should see it.

We JRubyists have headaches from the unicode question too. Since JRuby is currently 1.8-compatible, we do not have what most call *native* unicode support. This is primarily because we do not wish to create an incompatible version of Ruby or build in support for unicode now that would conflict with Ruby 2.0 in the future. It is, however, embarrassing to say that although we run on top of Java, which has arguably pretty good unicode support, we don't support unicode. Perhaps you can see our conundrum.

I am no unicode expert. I know that Java uses UTF16 strings internally, converted to/from the current platform's encoding of choice by default. It also supports converting those UTF16 strings into just about every encoding out there, just by telling it to do so. Java supports the Unicode specification version 3.0. So Unicode is not a problem for Java.

We would love to be able to support unicode in JRuby, but there's always that nagging question of what it should look like and what would mesh well with the Ruby community at large. With the underlying platform already rich with unicode support, it would not take much effort to modify JRuby. So then there's a simple question:

What form would you, the Ruby users, want unicode to take? Is there a specific library that you feel encompasses a reasonable implementation of unicode support, e.g. icu4r? Should the support be transparent, e.g. no longer treat or assume strings are byte vectors? JRuby, because we use Java's String, is already using UTF16 strings exclusively...however there's no way to get at them through core Ruby APIs. What would be the most comfortable way to support unicode now, considering where Ruby may go in the future?
on 15.06.2006 02:22
I posted this to ruby-talk, but it occurred to me that you folks implementing Rails functionality probably have a thing or two to say about unicode support in Ruby. Therefore, I would love to hear your opinions. Adding native unicode support is only a matter of time in JRuby; its usefulness as a JVM-based language depends on it. However, we continue to wrestle with how best to support unicode without stepping on the Ruby community's toes in the process. Thoughts?

---------- Forwarded message ----------
From: Charles O Nutter <headius@headius.com>
Date: Jun 14, 2006 7:11 PM
Subject: Re: Unicode roadmap?
To: ruby-talk ML <ruby-talk@ruby-lang.org>

Every time these unicode discussions come up my head spins like a top. You should see it.

We JRubyists have headaches from the unicode question too. Since JRuby is currently 1.8-compatible, we do not have what most call *native* unicode support. This is primarily because we do not wish to create an incompatible version of Ruby or build in support for unicode now that would conflict with Ruby 2.0 in the future. It is, however, embarrassing to say that although we run on top of Java, which has arguably pretty good unicode support, we don't support unicode. Perhaps you can see our conundrum.

I am no unicode expert. I know that Java uses UTF16 strings internally, converted to/from the current platform's encoding of choice by default. It also supports converting those UTF16 strings into just about every encoding out there, just by telling it to do so. Java supports the Unicode specification version 3.0. So Unicode is not a problem for Java.

We would love to be able to support unicode in JRuby, but there's always that nagging question of what it should look like and what would mesh well with the Ruby community at large. With the underlying platform already rich with unicode support, it would not take much effort to modify JRuby. So then there's a simple question:

What form would you, the Ruby users, want unicode to take? Is there a specific library that you feel encompasses a reasonable implementation of unicode support, e.g. icu4r? Should the support be transparent, e.g. no longer treat or assume strings are byte vectors? JRuby, because we use Java's String, is already using UTF16 strings exclusively...however there's no way to get at them through core Ruby APIs. What would be the most comfortable way to support unicode now, considering where Ruby may go in the future?

--
Charles Oliver Nutter @ headius.blogspot.com
JRuby Developer @ jruby.sourceforge.net
Application Architect @ www.ventera.com
on 15.06.2006 02:40
On 15-jun-2006, at 2:11, Charles O Nutter wrote:
> with unicode support, it would not take much effort to modify
> JRuby. So then there's a simple question:

Yukihiro Matsumoto wrote:
> Define "proper Unicode support" first.
>
> I'm planning enhancing Unicode support in 1.9 in a year or so
> (finally). But I'm not sure that conforms your definition of "proper
> Unicode support". Note that 1.8 handles Unicode (UTF-8) if your
> string operations are based on Regexp.

Hello everyone, and sorry for chiming in so fiercely. I got into some confusion with the ML controls, having just joined the list after seeing the subject pop up once more. I am doing Unicode-aware apps in Rails and Ruby right now and it hurts. I'll try to define "proper Unicode support" as I see it (and dream of it at night).

1. All string indexing (length, index, slice, insert) works with characters instead of bytes, however many bytes each character takes. String methods (index or =~) should _never_ return offsets that will damage the string's characters if employed for slicing - you shouldn't have to manually translate a byte offset of 2 to a character offset of 1 because the second character is multibyte. Simple example:

    # Ruby 1.8: String#[] slices by byte, so a chunk may end mid-character.
    def translate_offset(str, byte_offset)
      chunk = str[0..byte_offset]
      begin
        # Count the code points up to and including this byte offset.
        chunk.unpack("U*").length - 1
      rescue ArgumentError # this offset is just wrong! shift upwards and retry
        chunk = str[0..(byte_offset+=1)]
        retry
      end
    end

   I think it's unnecessarily painful for something as easy as string =~ /pattern/. Yes, you can take the offset you receive from =~, then get the slice of the string, then split it again with /./mu to get the same number etc...

2. Case-insensitive regexes actually work. Even in my Oniguruma-enabled builds of 1.8.2 this was not true (maybe changed now).

3. At least "Unicode general" collation casefolding (such a thing exists) available built-in on every platform.

4. Locale-aware sorting, including multibyte charsets, if provided by the OS.

5. Preferably a separate (and strictly purposed) Bytestring that you get out of Sockets and use in Servers etc. - or the ability to "force" all strings received from external resources to be flagged uniformly as being of a certain encoding in _your_ program, not somewhere in someone's library. If flags have to be set by libraries, they won't be set because most developers sadly don't care:

   http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
   http://thraxil.org/users/anders/posts/2005/11/01/unicodification/

6. Unicode-aware strip dealing with weirdo whitespaces (hair space, thin space etc.)

7. And no, as I mentioned - it doesn't handle it properly because the /i modifier is broken, and to deal without it you need to downcase BOTH the regexp and the string itself. Closed circle - you go and get the Unicode gem with tables.

All of this can be controlled either per String (then 99 out of 100 libraries I use will be getting it wrong - see above) or by a global setting such as $KCODE.

An example of something that is ridiculously backwards to do in Ruby now is this (I spent some time refactoring this today):

http://dev.rubyonrails.org/browser/trunk/actionpack/lib/action_view/helpers/text_helper.rb#L44

Here you have a major problem because the /i flag doesn't do anything (Ruby is incapable of Unicode-aware casefolding), and using offsets means that you are always one step from damaging someone's text. It's just wrong that it has to be so painful.

Python3000, IMO, gets this right (as does Java) - a byte array and a String are completely separate, and String operates with characters and characters only.

That's what I would expect. Hope this makes sense somewhat :-)

--
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
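
As an illustration of the "Unicode-aware strip" wish in the list above, here is a minimal sketch. It assumes Ruby 1.9 or later and a UTF-8 string, where the POSIX bracket [[:space:]] also matches Unicode whitespace; the helper name unicode_strip is hypothetical and not part of any library mentioned in this thread.

```ruby
# Hypothetical helper (name is an assumption for illustration).
# Assumes Ruby 1.9+ and a UTF-8 string, where [[:space:]] also covers
# Unicode whitespace such as hair space, thin space and no-break space.
def unicode_strip(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

# Hair space, thin space and a plain space in front;
# a plain space, no-break space and thin space at the end.
padded = "\u200A\u2009 hello world \u00A0\u2009"

puts unicode_strip(padded).inspect
puts padded.strip.inspect   # built-in strip, shown for comparison
```

Interior whitespace is left alone; only the leading and trailing runs are removed.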
on 15.06.2006 02:40
On Jun 15, 2006, at 2:19 AM, Charles O Nutter wrote:
> I posted this to ruby-talk, but it occurred to me that you folks
> implementing Rails functionality probably have a thing or two to
> say about unicode support in Ruby. Therefore, I would love to hear
> your opinions. Adding native unicode support is only a matter of
> time in JRuby; its usefulness as a JVM-based language depends on
> it. However, we continue to wrestle with how best to support
> unicode without stepping on the Ruby community's toes in the
> process. Thoughts?

Julik has done a lot of pioneering in that direction for Rails. His latest suggestion is to use a proxy class on string objects to perform unicode operations:

    @some_unicode_string.u.length
    @some_unicode_string.u.reverse

I tend to agree with this solution as it doesn't break any previous string operations and gives us an easy way to perform unicode aware operations.

Manfred
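
The proxy idea can be sketched roughly as follows. This is not Julik's or Rails' actual code: the class name UnicodeProxy and its three methods are assumptions made for illustration, built only on String#unpack and Array#pack with the "U*" (UTF-8 code point) directive.

```ruby
# Hypothetical sketch of a String#u proxy; class and method names are
# assumptions for illustration, not the actual Rails code.
class UnicodeProxy
  def initialize(str)
    # "U*" decodes the string's bytes as UTF-8 into an array of code points.
    @codepoints = str.unpack("U*")
  end

  # Number of code points, not bytes.
  def length
    @codepoints.length
  end

  # Reverse by code point, so multibyte characters stay intact.
  def reverse
    @codepoints.reverse.pack("U*")
  end

  # Substring addressed by character offset and character count.
  def slice(start, len)
    part = @codepoints[start, len]
    part && part.pack("U*")
  end
end

class String
  # Plain String methods are left untouched; only #u is added.
  def u
    UnicodeProxy.new(self)
  end
end

word = "Grüße"
puts word.u.length
puts word.u.reverse
puts word.u.slice(2, 2)
```

The sketch works on code points, so a base letter followed by a combining mark is still treated as two units; handling that properly would need normalization, which is where the tables of a Unicode library come in.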
on 15.06.2006 03:52
I agree it's a very attractive solution. I have two related questions (perhaps you are out there to answer, Julik):

1. How does performance look with the unicode string add-on versus native strings?
2. Is this the ideal way to support unicode strings in ruby?

To explain the second: if we could switch from treating a string as an array of bytes to treating it as a list of characters of arbitrary width, and have all existing string operations work correctly on those characters, would that be a better ideal? Where are the breaking points in such a design? What's to stop the underlying implementation from actually using a UTF-16 character, passing UTF-8 to libraries and IO streams but still allowing you to access everything as UTF-16 or your encoding of choice? (Of course this is somewhat rhetorical; we do this currently with JRuby since Java's strings are UTF-16...we just don't have any way to provide access to UTF-16 characters, and we normalize everything to UTF-8 for Ruby's sake...but what if we didn't normalize and adjusted string functions to compensate?)
on 15.06.2006 04:17
On 15-jun-2006, at 3:50, Charles O Nutter wrote:
> operations work correctly treating those characters as string,
> would that be a better ideal? Where are the breaking points in such
> a design? What's to stop the underlying implementation from
> actually using a UTF-16 character, passing UTF-8 to libraries and
> IO streams but still allowing you to access everything as UTF-16 or
> your encoding of choice? (Of course this is somewhat rhetorical; we
> do this currently with JRuby since Java's strings are UTF-16...we
> just don't have any way to provide access to UTF-16 characters, and
> we normalize everything to UTF-8 for Ruby's sake...but what if we
> didn't normalize and adjusted string functions to compensate?)

This is more appropriate for ruby-talk

--
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
on 15.06.2006 04:24
I believe that Julik's way of solving the unicode problem (String#u providing access to a unicode helper) is very attractive. I have two questions related, for Julik and the rest of the peanut gallery:

1. How does performance look with the unicode string add-on versus native strings (or as compared to icu4r, which is C-based)?
2. Is this the ideal way to support unicode strings in ruby?

And I explain the second as follows....if we could assume switching from treating a string as an array of bytes to a list of characters of arbitrary width, and have all existing string operations work correctly treating those characters as indexed elements of that string, would that be a better ideal? Where are the breaking points in such a design? What's to stop the underlying implementation from actually using a UTF-16 character, passing UTF-8 to libraries and IO streams but still allowing you to access everything as UTF-16 or your encoding of choice? Is it simply libraries or core APIs that explicitly need *byte* counts?

(Of course this is somewhat rhetorical; we do this currently with JRuby since Java's strings are UTF-16...we just don't have any uniform way to provide access to UTF-16 character strings, and we normalize everything to UTF-8 for Ruby's sake...but what if we didn't normalize and adjusted string functions to compensate?)
on 15.06.2006 04:28
Fair enough; redirected. If any other rails-core folks want to chime in, please do so...I would expect unicode and multibyte are key issues for worldwide rails deployments.
on 15.06.2006 04:41
On 6/14/06, Charles O Nutter <headius@headius.com> wrote:
> I believe that Julik's way of solving the unicode problem (String#u
> providing access to a unicode helper) is very attractive. I have two
> questions related, for Julik and the rest of the peanut gallery:
> 1. How does performance look with the unicode string add-on versus native
> strings (or as compared to icu4r, which is C-based)?
> 2. Is this the ideal way to support unicode strings in ruby?

No. In fact, I believe that Matz has the right idea for M17N strings in Ruby 2.0. The *reality* is that there's a *lot* of data out there that isn't Unicode.

I would suggest that JRuby could offer a JavaString that acts in every way like a String except that it provides access to the native UTF-16 implementation.

-austin
on 15.06.2006 04:55
On 15-jun-2006, at 4:40, Austin Ziegler wrote:
> No. In fact, I believe that Matz has the right idea for M17N strings
> in Ruby 2.0. The *reality* is that there's a *lot* of data out there
> that isn't Unicode.

It's very difficult for me to understand the implementation. What if we concat a Mojikyo string to a UTF8String? UnicodeDecodeError, ordinal not in range? I think Python folks proved that it's terrible (it is). Nothing is ideal.

> I would suggest that JRuby could offer a JavaString that acts in every
> way like a String except that it provides access to the native UTF-16
> implementation.

Just what the ICU4R extension does. It's unusable to the point that you cannot concat a native string with a UString. To the point that you have to use a special Regexp class for it. You end up having half of your Ruby script doing typecasting from one to the other.

There is a lot of data that isn't Unicode, indeed. It gets converted on input and converted on output if necessary - just as in any other case when the encoding of your system doesn't match your input or output. I don't know if it is possible to have the "internal" encoding of a system switchable (seems to me this is what Matz wants) - then you can't safely refer to anything other than bytes. And then you get software that you can't use, because its authors had a different assumption than you had as to what encoding the user will be using.
on 15.06.2006 05:01
On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:
> in Ruby 2.0. The *reality* is that there's a *lot* of data out there
> that isn't Unicode.

Yes, we all understand that Ruby 2.0 will be the coolest thing since sliced bread, but those of us that are currently developing international websites with Rails don't have the luxury of waiting until Christmas of 2007.

-PJ Hyett
http://pjhyett.com
on 15.06.2006 05:10
On 6/14/06, PJ Hyett <pjhyett@gmail.com> wrote:
> > that isn't Unicode.
> Yes, we all understand that Ruby 2.0 will be the coolest thing since
> sliced bread, but those of us that are currently developing
> international websites with Rails don't have the luxury of waiting
> until Christmas of 2007.

*shrug*

As far as I can tell, there will be no implementation of Ruby before then that has a "native" m17n string. So whether you have the luxury of waiting or not, Ruby 1.8.x will not *ever* have a "Unicode string". Adding a "Unicode string" would *break* behaviour, and no example is better than the extension that was proposed which would change the meaning of #size and #length to mean two different things.

So, there's a point where patience is going to be necessary, whether you "have the luxury" or not.

-austin
on 15.06.2006 10:47
IIRC, Matz has said that internally String won't change, and I suspect that a CharString class (or something like it) won't ever be added.

Maybe the way to keep older code working and still enjoy M17N would be to introduce a String#encoding flag and add new prefixed methods to String, like char_array, char_slice, char_length, char_index, char_downcase, char_strcoll, char_strip, etc. These would internally look at the encoding flag and process the bytes of that particular string accordingly, without conversion (or maybe with some hidden conversion), while the old byte-processing methods are left intact.

Though, as for me, it is still unclear what should happen if one tries to perform an operation on two strings with different String#encoding...
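
The open question above (an operation on two strings with different encodings) can be tried out in Ruby 1.9 and later, where every String carries its own encoding. The sketch below assumes such a Ruby and a UTF-8 source file; try_concat is a hypothetical helper written for this illustration, while String#encode, String#encoding and Encoding::CompatibilityError are part of core Ruby.

```ruby
# Assumes Ruby 1.9 or later and a UTF-8 source file.
utf8   = "Grüße"
latin1 = "Grüße".encode("ISO-8859-1")

# Hypothetical helper: reports how a concatenation attempt ends, either
# with the encoding of the result or with the error class that was raised.
def try_concat(a, b)
  (a + b).encoding.name
rescue Encoding::CompatibilityError => e
  "error: #{e.class}"
end

# Two non-ASCII strings in different encodings.
puts try_concat(latin1, utf8)

# ASCII-only content on one side.
puts try_concat("abc".encode("ISO-8859-1"), utf8)

# Explicit conversion to a common encoding first.
puts try_concat(latin1.encode("UTF-8"), utf8)
```

In that design, mixing two non-ASCII strings of incompatible encodings raises an error rather than converting silently, so the caller has to pick the common encoding explicitly.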
on 15.06.2006 13:02
On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:
> individual string level so that you could meaningfully have Unicode
> (probably UTF-8) and ShiftJIS strings in the same data and still
> meaningfully call #length on them.
>
> You will *always* have to care about the encoding. As well as,
> ultimately, your locale.

No. Since I have a locale, stdin can be marked with the proper encoding information so that all strings originating there have the proper encoding information.

The string methods should not just blindly operate on bytes but use the encoding information to operate on characters rather than bytes. Sure, something like byte_length is needed when the string is stored somewhere outside Ruby, but standard string methods should work with character offsets and characters, not byte offsets nor bytes.

Since my stdout can also be marked with the correct encoding, the strings that are output there can be converted to that encoding, even if they originate from a source file that happens to be in a different encoding.

Hmm, perhaps it will be necessary to mark source files with encoding tags as well. It could be quite tedious to assign the tag manually to every string in a source file.

When strings are compared, concatenated, .. the encoding is known so the methods should do the right thing.

I do not have to care about encoding. You may make a string implementation that forces me to care (such as the current one). But I do not have to. I can always turn to perl if I get really desperate.

Thanks

Michal
on 15.06.2006 13:22
On 6/15/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> 5. Preferably separate (and strictly purposed) Bytestring that you
> get out of Sockets and use in Servers etc. - or the ability to
> "force" all strings received from external resources to be flagged
> uniformly as being of a certain encoding in _your_ program, not
> somewhere in someone's library. If flags have to be set by libraries,
> they won't be set because most developers sadly don't care:
>
> http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
> http://thraxil.org/users/anders/posts/2005/11/01/unicodification/

Where else should the strings be flagged? If you get a web page through an http request, and the library parses the response for you, it should set the encoding on the web page. You would never know it otherwise, since you only received the page, not the header.

> setting such as $KCODE.

I do not see why libraries should always be wrong. After all, you can always fix them. And setting the encoding globally is a bad thing. You cannot have strings encoded in different encodings in one process then. It looks quite limiting. For one, the web pages that you get from various servers (and even the same server) can be in various encodings.

Thanks

Michal
on 15.06.2006 13:51
On 15-jun-2006, at 13:21, Michal Suchanek wrote:
>> http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
>> http://thraxil.org/users/anders/posts/2005/11/01/unicodification/
>
> Where else should the strings be flagged?

They should not be flagged, because some strings will be flagged and some won't, and exactly in the wrong places at the wrong time. See _is_utf8_ in Perl to witness the terrible ugliness of this.

> If you get a web page
> through http request, and the library parses the response for you, it
> should set encoding on the web page. You would never know since you
> only received the page, not the header.

That's why you should distinguish between a ByteArray and a String.

>> libraries I use will be getting it wrong - see above) or by a global
>> setting such as $KCODE.
>
> I do not see why libraries should be always wrong. After all, you can
> always fix them. And setting the encoding globally is a bad thing. You
> cannot have strings encoded in different encodings in one process
> then. It looks quite limiting. For one, the web pages that you get
> from various servers (and even the same server) can be in various
> encodings.

Of course they can (and will). When I have to approach this I usually just sniff the encoding of the strings I received and then feed them to iconv and friends before doing any processing. A library that downloads stuff off the Internet should be (IMO) aware of the charset madness and decode the strings for me.

Trust me, when multibyte/Unicode handling is optional, 80% of libraries do it wrong. Re-read the links above if you don't believe me.

Actually it seems that the solution with an accessor is quite nice, but that I had to figure out the hard way after breaking the String class with my hacks and seeing stuff collapse. Apparently the poster of a parallel thread finds it inspiring to repeat my experiment _in vitro_ just for the academic sake of it.
on 15.06.2006 15:13
On 6/15/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> >> somewhere in someone's library. If flags have to be set by libraries,
> >> they won't be set because most developers sadly don't care:
> >>
> >> http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
> >> http://thraxil.org/users/anders/posts/2005/11/01/unicodification/
> >
> > Where else should the strings be flagged?
> They should not be flagged, because some strings will be flagged and
> some won't and exactly
> in the wrong places at the wrong time. See _is_utf8_ in Perl to
> witness the terrible ugliness of this.

You can certainly get things wrong. But if you get a string that is wrongly flagged you have the choice to fix the code where the string originates or work around it by flagging it right.

If you have code that gets the encoding wrong, and it tries to convert the string to some 'universal' encoding you want to use everywhere in your application, you get a broken string.

> > If you get a web page
> > through http request, and the library parses the response for you, it
> > should set encoding on the web page. You would never know since you
> > only received the page, not the header.
>
> That's why you should distinguish between a ByteArray and a String.

How does it help you here?

> >> All of this can be controlled either per String (then 99 out of 100
> Of course they can (and will). When I have to approach this I usually
> just sniff the encoding of the strings I received and then feed them
> to iconv and friends before doing any processing. A library that
> downloads stuff off the Internet should be (IMO) aware of
> the charset madness and decode the strings for me.

If it can decode them, it can flag them. It has to be aware - that's it.

> Trust me, when multibyte/Unicode handling is optional, 80% of
> libraries do it wrong. Re-read the links above if you don't believe.

But they get the very foundation wrong. In Python, functions that take multiple strings can only take them in one encoding. It is impossible to concatenate differently encoded strings. Of course, this is bound to fail.

In the other case they use a database with poor support for unicode, and mysql, which does exactly the same thing ruby does right now - works with strings as arrays of bytes. Of course, this is going to break.

Neither is the case when the strings carry information about their encoding, and the string functions can handle strings encoded differently.

The fact that there are libraries and languages with poor unicode support does not mean it must always be poor.

Thanks

Michal
on 17.06.2006 13:11
On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal Suchanek wrote:
> >stream is encoded. This will sort of be like $KCODE but on an
>
> The string methods should not just blindly operate on bytes but use
> the encoding information to operate on characters rather than bytes.
> Sure something like byte_length is needed when the string is stored
> somewhere outside Ruby but standard string methods should work with
> character offsets and characters, not byte offsets nor bytes.

I emphatically agree. I'll even repeat and propose a new Plan for Unicode Strings in Ruby 2.0 in 10 points:

1. Strings should deal in characters (code points in Unicode) and not in bytes, and the public interface should reflect this.

2. Strings should neither have an internal encoding tag, nor an external one via $KCODE. The internal encoding should be encapsulated by the string class completely, except for a few related classes which may opt to work with the gory details for performance reasons. The internal encoding has to be decided, probably between UTF-8, UTF-16, and UTF-32, by the String class implementor.

3. Whenever Strings are read or written to/from an external source, their data needs to be converted. The String class encapsulates the encoding framework, likely with additional helper Modules or Classes per external encoding. Some methods take an optional encoding parameter, like #char(index, encoding=:utf8), or #to_ary(encoding=:utf8), which can be used as helper Class or Module selector.

4. IO instances are associated with a (modifiable) encoding. For stdin, stdout this can be derived from the locale settings. String-IO operations work as expected.

5. Since the String class is quite smart already, it can implement generally useful and hard (in the domain of Unicode) operations like case folding, sorting, comparing etc.

6. More exotic operations can easily be provided by additional libraries because of Ruby's open classes. Those operations may be coded depending on String's public interface for simplicity, or work with the internal representation directly for performance.

7. This approach leaves open the possibility of String subclasses implementing different internal encodings for performance/space tradeoff reasons which work transparently together (a bit like Fixnum and Bignum).

8. Because Strings are tightly integrated into the language with the source reader and are used pervasively, much of this cannot be provided by add-on libraries, even with open classes. Therefore the need to have it in Ruby's canonical String class. This will break some old uses of String, but now is the right time for that.

9. The String class does not worry over character representation on-screen, the mapping to glyphs must be done by UI frameworks or the terminal attached to stdout.

10. Be flexible. <placeholder for future idea>

This approach has several advantages and a few disadvantages, and I'll try to bring in some new angles to this now too:

*Advantages*

-POL, Encapsulation-
All Strings behave exactly the same everywhere, are predictable, and do the hard work for their users.

-Cross Library Transparency-
No String user needs to worry which Strings to pass to a library, or worry which Strings he will get from a library. With Web-facing libraries like rails returning encoding-tagged Strings, you would be likely to get Strings of all possible encodings otherwise, and is the String user prepared to deal with this properly? This is a *big* deal IMNSHO.

-Limited Conversions-
Encoding conversions are limited to the time Strings are created or written or explicitly transformed to an external representation.

-Correct String Operations-
Even basic String operations are very hard in the world of Unicode. If we leave the String users to look at the encoding tags and sort it out themselves, they are bound to make mistakes because they don't care, don't know, or have no time. And these mistakes may be _security_ _sensitive_, since most often credentials are represented as Strings too. There already have been exploits related to Unicode.

*Disadvantages* (with mitigating reasoning of course)

- String users need to learn that #byte_length(encoding=:utf8) >= #size, but that's not too hard, and applies everywhere. Users do not need to learn about an encoding tag, which is surely worse to handle for them.

- Strings cannot be used as simple byte buffers any more. Either use an array of bytes, or an optimized ByteBuffer class. If you need regular expression support, RegExp can be extended for ByteBuffers or even more.

- Some String operations may perform worse than might be expected from a naive user, in both the time or space domain. But we do this so the String user doesn't need to himself, and are probably better at it than the user too.

- For very simple uses of String, there might be unnecessary conversions. If a String is just to be passed through somewhere, without inspecting or modifying it at all, in- and outwards conversion will still take place. You could and should use a ByteBuffer to avoid this.

- This ties Ruby's String to Unicode. A safe choice IMHO, or would we really consider something else? Note that we don't commit to a particular encoding of Unicode strongly.

- More work and time to implement. Some could call it over-engineered. But it will save a lot of time and troubles when shit hits the fan and users really do get unexpected foreign characters in their Strings. I could offer help implementing it, although I have never looked at ruby's source, C-extensions, or even done a lot of ruby programming yet.

Close to the start of this discussion Matz asked what the problem with current strings really was for western users. Somewhere later he concluded case folding. I think it is more than that: we are lazy and expect character handling to be always as easy as with 7 bit ASCII, or as close as possible. Fixed 8-bit codepages worked quite fine most of the time in this regard, and breakage was limited to special characters only.

Now let's ask the question in reverse: are eastern programmers so used to doing elaborate byte-stream to character handling by hand they don't recognize how hard this is any more? Surely it is a target for DRY if I ever saw one. Or are there actual problems not solvable this way? I looked up the mentioned Han-Unification issue, and as far as I understood this could be handled by future Unicode revisions allocating more characters, outside of Ruby, but I don't see how it requires our Strings to stay dumb byte buffers.

Jürgen
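
Points 3 and 4 of the plan above (convert at the IO boundary, with an encoding attached to each IO instance) and the remark that the byte length is at least the character count can be tried with the IO encodings that Ruby 1.9 and later provide. This is only a sketch under that assumption, not the design proposed in the post; read_as_utf8 is a hypothetical helper, while the "r:EXTERNAL:INTERNAL" mode string, String#length and String#bytesize are core Ruby.

```ruby
require "tempfile"

# Hypothetical helper: read a file stored in the `external` encoding and
# hand back a UTF-8 String. Conversion happens inside the IO layer.
def read_as_utf8(path, external)
  File.open(path, "r:#{external}:UTF-8") { |io| io.read }
end

Tempfile.create("latin1") do |file|
  file.binmode
  # "Grüße" as ISO-8859-1 bytes: one byte per character.
  file.write([0x47, 0x72, 0xFC, 0xDF, 0x65].pack("C*"))
  file.flush

  text = read_as_utf8(file.path, "ISO-8859-1")
  puts text.encoding   # encoding of the String handed to the program
  puts text.length     # number of characters
  puts text.bytesize   # number of bytes in the UTF-8 representation
end
```

The program only ever sees one encoding; the external one is named once, where the file is opened.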
on 17.06.2006 15:51
On Saturday 17 June 2006 13:08, Juergen Strobel wrote:
> On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal Suchanek wrote:
[...]
> > The string methods should not just blindly operate on bytes but
> > use the encoding information to operate on characters rather than
> > bytes. Sure something like byte_length is needed when the string
> > is stored somewhere outside Ruby but standard string methods
> > should work with character offsets and characters, not byte
> > offsets nor bytes.
>
> I emphatically agree. I'll even repeat and propose a new Plan for
> Unicode Strings in Ruby 2.0 in 10 points:

Juergen, I agree with most of what you have written. I will add my thoughts.

> 1. Strings should deal in characters (code points in Unicode) and
> not in bytes, and the public interface should reflect this.
>
> 2. Strings should neither have an internal encoding tag, nor an
> external one via $KCODE. The internal encoding should be
> encapsulated by the string class completely, except for a few
> related classes which may opt to work with the gory details for
> performance reasons. The internal encoding has to be decided,
> probably between UTF-8, UTF-16, and UTF-32 by the String class
> implementor.

Full ACK. Ruby programs shouldn't need to care about the *internal* string encoding. External string data is treated as a sequence of bytes and is converted to Ruby strings through an encoding API.

> 3. Whenever Strings are read or written to/from an external source,
> their data needs to be converted. The String class encapsulates the
> encoding framework, likely with additional helper Modules or
> Classes per external encoding. Some methods take an optional
> encoding parameter, like #char(index, encoding=:utf8), or
> #to_ary(encoding=:utf8), which can be used as helper Class or
> Module selector.

I think the encoding/decoding API should be separated from the String class. IMO, the most important change is to strictly differentiate between arbitrary binary data and character data. Character data is represented by an instance of the String class.

I propose adding a new core class, maybe call it ByteString (or ByteBuffer, or Buffer, whatever) to handle strings of bytes.

Given a specific encoding, the encoding API converts ByteStrings to Strings and vice versa.

This could look like:

    my_character_str = Encoding::UTF8.encode(my_byte_buffer)
    buffer = Encoding::UTF8.decode(my_character_str)

> 4. IO instances are associated with a (modifiable) encoding. For
> stdin