In my opinion, Ruby is practically useless for many applications without proper Unicode support. How a modern language can ignore this issue is really beyond me. Is there a plan to get Unicode support into the language anytime soon?
on 2006-06-13 23:12
on 2006-06-14 00:28

Hi,

In message "Re: Unicode roadmap?"
on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner <roman.hausner@gmail.com> writes:

|In my opinion, Ruby is practically useless for many applications without
|proper Unicode support. How a modern language can ignore this issue is
|really beyond me.

Define "proper Unicode support" first.

|Is there a plan to get Unicode support into the language anytime soon?

I'm planning to enhance Unicode support in 1.9 in a year or so
(finally). But I'm not sure that conforms to your definition of "proper
Unicode support". Note that 1.8 handles Unicode (UTF-8) if your string
operations are based on Regexp.

matz.
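[To make the Regexp-based handling Matz mentions concrete, here is a
minimal sketch for Ruby 1.8, assuming $KCODE is set to 'u'; the sample
string is arbitrary:]

  $KCODE = 'u'

  str = "日本語テキスト"      # 7 characters, 21 bytes in UTF-8
  str.length                  # => 21 -- String#length counts bytes in 1.8
  str.scan(/./).length        # => 7  -- /./ matches whole characters
  str =~ /テキスト/           # works: the regexp engine walks UTF-8 correctly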
on 2006-06-14 00:38

> Define "proper Unicode support" first.
having a Unicode equivalent for all methods of class String,
like size, slice, upcase.

E.g. I tried the unicode plugin... but, alas, who wants to write stuff
like 'normalize_KC' etc. if you just want the frickin' substring of a
string?!

You need to read books on Unicode just to properly use the plugin...
aargg :-((
Best regards
Peter
Yukihiro Matsumoto wrote:
on 2006-06-14 00:51

On Jun 13, 2006, at 6:34 PM, Pete wrote:

>> Define "proper Unicode support" first.
>
> having a Unicode equivalent for all methods of class String,
> like size, slice, upcase.
>
> E.g. I tried the unicode plugin... but, alas, who wants to write
> stuff like 'normalize_KC' etc. if you just want the frickin'
> substring of a string?!

def substring(str, start, len)
  md = str.match(/\A.{#{start}}(.{#{len}})/)
  md[1]
end

def strlength(str)
  n = 0
  str.gsub(/./m) { n += 1; $& }
  n
end

See! Regexps do everything! Just, you know, set $KCODE and use these
methods and you are set! (I am kidding... btw)
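[For what it's worth, a usage sketch of the two tongue-in-cheek methods
above, assuming $KCODE = 'u' so that /./ matches whole UTF-8 characters;
the sample string is arbitrary:]

  $KCODE = 'u'

  s = "привет"          # 6 Cyrillic characters, 12 bytes in UTF-8
  strlength(s)          # => 6, where s.length would report 12
  substring(s, 2, 3)    # => "иве" -- skips 2 characters, takes 3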
on 2006-06-14 01:00

From the theoretical point of view this is quite interesting. Also I
understand the humor :-)

Performance and memory consumption should be breathtaking using regexp
just everywhere... Also there are a ____few____ methods left :-)

As I am German, the 'missing' Unicode support is one of the greatest
obstacles for me (and probably all other Germans doing their stuff
seriously)...

Logan Capaldo wrote:
on 2006-06-14 01:13

From: Pete [mailto:pertl@gmx.org]
Sent: Wednesday, June 14, 2006 1:58 AM

> As I am German, the 'missing' Unicode support is one of the greatest
> obstacles for me (and probably all other Germans doing their stuff
> seriously)...

The same goes for Russians/Ukrainians. In our programming communities
the question "does the programming language support Unicode natively?"
has very high priority.

BTW, this is one of the things where Python beats Ruby completely.

V.
on 2006-06-14 01:59

I suspect the Japanese posters on this list can answer better than I
can, but my impression is that Unicode is, shall we say, not highly
thought of outside Europe and North America. The way they dealt with
"Chinese" characters was apparently more than a bit of a hack, and just
doesn't work very well in the real world. Reading some of the
explanations for glyphs versus characters in Unicode just makes you
shake your head. What were they thinking? Sure doesn't pass the smell
test, although I'll be the first to admit I haven't exactly thought
deeply about the subject.

There's another problem with Japanese - I've got a friend who's been
dealing with some issues around the fact that Japanese apparently
innovates new characters on a regular basis, and everyone is expected
to use the new characters. (I believe this is called gaiji.) The
concept of a fixed character set apparently just isn't a good idea to
start with.

[Awaiting corrections from people who actually know something about
this topic :-)...]

 - James Moore
on 2006-06-14 02:14

On 6/14/06, James Moore <banshee@banshee.com> wrote:
> with some issues around the fact that Japanese apparently innovates new
> characters on a regular basis, and everyone is expected to use the new
> characters. (I believe this is called gaiji.) The concept of a fixed
> character set apparently just isn't a good idea to start with.
>
> [Awaiting corrections from people who actually know something about this
> topic :-)...]

There is a good summary of the Han unification controversy on Wikipedia:
http://en.wikipedia.org/wiki/Han_unification
on 2006-06-14 03:16

On Jun 13, 2006, at 7:56 PM, James Moore wrote:
> topic :-)...]
I have one Japanese person here who's never heard of this gaiji
concept. But it could be new and behind a generation gap of some
kind. They do sure like to add symbols where they can, though.
Especially graphical star characters. I see that a lot.
-Mat
on 2006-06-14 04:38

Hi,

In message "Re: Unicode roadmap?"
on Wed, 14 Jun 2006 08:11:49 +0900, "Victor Shepelev" <vshepelev@imho.com.ua> writes:

|From: Pete [mailto:pertl@gmx.org]
|Sent: Wednesday, June 14, 2006 1:58 AM
|> As I am German, the 'missing' Unicode support is one of the greatest
|> obstacles for me (and probably all other Germans doing their stuff
|> seriously)...
|
|The same goes for Russians/Ukrainians. In our programming communities
|the question "does the programming language support Unicode natively?"
|has very high priority.

Alright, then what specific features are you (both) missing? I don't
think it is a method to get the number of characters in a string. It
can't be THAT crucial. I do want to cover "your missing features" in
the future M17N support in Ruby.

matz.
on 2006-06-14 07:29

From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 5:37 AM

> |The same goes for Russians/Ukrainians. In our programming communities
> matz.

I suppose all we (non-English writers) need is to have all
string-related methods working. Just for now, I think about plainly
testing each string method; also, some other classes can be affected by
Unicode (possibly regexps, and paths). Regexps seem to work fine (in my
1.9), but paths do not: File.open with Russian letters in the path
doesn't find the file.

More generally, it can make sense to have Unicode as the "base" mode,
with non-Unicode remaining an "old, compatibility" mode. Something like
this.

V.
on 2006-06-14 07:54

Roman Hausner wrote:
> In my opinion, Ruby is practically useless for many applications without
> proper Unicode support. How a modern language can ignore this issue is
> really beyond me.
>
> Is there a plan to get Unicode support into the language anytime soon?

I also think that this is very important.
on 2006-06-14 08:37

Hi,

In message "Re: Unicode roadmap?"
on Wed, 14 Jun 2006 14:26:30 +0900, "Victor Shepelev" <vshepelev@imho.com.ua> writes:

|I suppose all we (non-English writers) need is to have all
|string-related methods working. Just for now, I think about plainly
|testing each string method;

In that sense, _I_ am one of the non-English writers, so that I can
suppose I know what we need. And I have no problem with the current
UTF-8 support. Maybe that's because we Japanese don't have case in our
characters. Or maybe I'm missing something. Can you show us your
concrete problems caused by Ruby's lack of "proper" Unicode support?

|also, some other classes can be affected by Unicode (possibly
|regexps, and paths). Regexps seem to work fine (in my 1.9), but paths
|do not: File.open with Russian letters in the path doesn't find the
|file.

Strange. Ruby does not convert encodings, so there should be no
problem opening files if you are using strings in the encoding your OS
expects. If they differ, you have to specify (and convert) them
properly, regardless of Unicode support.

matz.
on 2006-06-14 08:56

From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 9:35 AM

> In that sense, _I_ am one of the non-English writers,

Sorry, Matz, I know that, of course. But I know too little about
Japanese to see how close our tasks are. By "non-English writers" I
maybe should have said "European languages" or so - ones which share
common punctuation, LTR writing, "words" and "whitespace" and so on. I
have almost no knowledge about the needs of Japanese, Korean, Arabic,
Hebrew people.

> so that I can
> suppose I know what we need. And I have no problem with the current
> UTF-8 support. Maybe that's because we Japanese don't have case in our
> characters. Or maybe I'm missing something.

Just what I've said above.

> Can you show us your
> concrete problems caused by Ruby's lack of "proper" Unicode support?

As mentioned in this topic, it's String#length, upcase, downcase,
capitalize.

BTW, does String#length work well for you?

Moreover, there seem to be some huge problems with paths containing
Russian letters; but I'm really not convinced that Ruby has to handle
this.

> |also, some other classes can be affected by Unicode (possibly
> |regexps, and paths). Regexps seem to work fine (in my 1.9), but paths
> |do not: File.open with Russian letters in the path doesn't find the
> |file.
>
> Strange. Ruby does not convert encodings, so there should be no
> problem opening files if you are using strings in the encoding your OS
> expects. If they differ, you have to specify (and convert) them
> properly, regardless of Unicode support.

Oh, it's a bit of a hard theme for me. I know Windows XP must support
Unicode file names; I see my filenames in Russian, but I have too
little knowledge of system internals to say whether they are really
Unicode. If we don't take those problems into account, only the String
problems remain - but those are such basic core methods!

V.
on 2006-06-14 09:09

On Jun 14, 2006, at 15:56 , Victor Shepelev wrote:

> As mentioned in this topic, it's String#length, upcase, downcase,
> capitalize.

Just to chime in, aren't upcase, downcase, and capitalize a
locale/localization issue rather than a Unicode-only issue per se? For
example, different languages will have different rules for
capitalization. Or am I wrong? Does Unicode in and of itself address
these issues?

Granted, proper support for upcase, downcase, and capitalize is
important, but I think it's a separate issue, part of m17n as a whole
rather than support for Unicode in particular.

Michael Glaesemann
grzm seespotcode net
on 2006-06-14 09:15

Hi,

> As mentioned in this topic, it's String#length, upcase, downcase,
> capitalize.
>
> BTW, does String#length work well for you?

To get the length of a Unicode string, just do str.split(//).length,
or "require 'jcode'" at the beginning of your code. For the other
functions, try looking at the unicode library:
http://www.yoshidam.net/Ruby.html#unicode

> Oh, it's a bit of a hard theme for me. I know Windows XP must support
> Unicode file names; I see my filenames in Russian, but I have too
> little knowledge of system internals to say whether they are really
> Unicode.

Windows XP does support Unicode file names, but I'm not sure you can
use them with Ruby (I do not use Ruby much under Windows). Try
converting the file names to your current locale; it should work if
the file names can be converted to it. What I mean is that Russian
file names encoded in the Windows Russian encoding should work on a
Russian PC.

Hope this helps,
Cheers,
Vincent ISAMBART
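[A small sketch of the counting advice above, using the jcode standard
library of Ruby 1.8; the sample string is arbitrary:]

  $KCODE = 'u'
  require 'jcode'       # stdlib: adds character-wise helpers to String

  s = "Straße"          # 6 characters, 7 bytes (ß is two bytes in UTF-8)
  s.length              # => 7 -- bytes
  s.split(//).length    # => 6 -- characters
  s.jlength             # => 6 -- jcode's character count
  s.each_char { |c| puts c }  # iterates character by character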
on 2006-06-14 09:22

Hi,

In message "Re: Unicode roadmap?"
on Wed, 14 Jun 2006 15:56:02 +0900, "Victor Shepelev" <vshepelev@imho.com.ua> writes:

|> Can you show us your
|> concrete problems caused by Ruby's lack of "proper" Unicode support?
|
|As mentioned in this topic, it's String#length, upcase, downcase,
|capitalize.

OK. Case is the problem. I understand.

|BTW, does String#length work well for you?

I don't remember the last time I needed the length method to count
characters. Actually I don't count string length at all, in either
bytes or characters, in my string processing. Maybe this is a special
case. I am too optimized for Ruby string operations using Regexp.

|Oh, it's a bit of a hard theme for me. I know Windows XP must support
|Unicode file names; I see my filenames in Russian, but I have too
|little knowledge of system internals to say whether they are really
|Unicode.

Windows 32 path encoding is a nightmare. Our Win32 maintainers are
often troubled by unexpected OS behavior. I am sure we _can_ handle
Russian path names, but we need help from Russian people to improve.

matz.
on 2006-06-14 09:25

From: Michael Glaesemann [mailto:grzm@seespotcode.net]
Sent: Wednesday, June 14, 2006 10:08 AM

> On Jun 14, 2006, at 15:56 , Victor Shepelev wrote:
>
> > As mentioned in this topic, it's String#length, upcase, downcase,
> > capitalize.
>
> Just to chime in, aren't upcase, downcase, and capitalize a locale/
> localization issue rather than a Unicode-only issue per se? For
> example, different languages will have different rules for
> capitalization.

Really? I know about two cases: European capitalization and no
capitalization. But, really, you may be right. I suppose Florian Gross
can say something about German-specific capitalization issues.

> Granted, proper support for upcase, downcase, and capitalize is
> important, but I think it's a separate issue, part of m17n as a whole
> rather than support for Unicode in particular.

Maybe. Generally, sometimes I want Unicode, and sometimes (for "quick
and dirty" scripts) I'd prefer capitalization and regexps to "just
work" with Windows-1251 (the one-byte Russian encoding).

V.
on 2006-06-14 09:26

From: Vincent Isambart [mailto:vincent.isambart@gmail.com]
Sent: Wednesday, June 14, 2006 10:14 AM

> > As mentioned in this topic, it's String#length, upcase, downcase,
> > capitalize.
> >
> > BTW, does String#length work well for you?
>
> To get the length of a Unicode string, just do str.split(//).length,
> or "require 'jcode'" at the beginning of your code.
> For the other functions, try looking at the unicode library:
> http://www.yoshidam.net/Ruby.html#unicode

I know about it. But, theoretically speaking, such "core" methods must
be in core, no?

> > properly, regardless of Unicode support.
> Russian PC.

Yes, they work. But I can't settle the question: should Ruby's Unicode
support include filename operations?

V.
on 2006-06-14 09:32

From: Yukihiro Matsumoto [mailto:matz@ruby-lang.org]
Sent: Wednesday, June 14, 2006 10:20 AM

> OK. Case is the problem. I understand.
>
> |BTW, does String#length work well for you?
>
> I don't remember the last time I needed the length method to count
> characters. Actually I don't count string length at all, in either
> bytes or characters, in my string processing. Maybe this is a special
> case. I am too optimized for Ruby string operations using Regexp.

I can confirm. But I'm afraid that some libraries I rely on use #length
and can break when #length doesn't work.

> |Oh, it's a bit of a hard theme for me. I know Windows XP must support
> |Unicode file names; I see my filenames in Russian, but I have too
> |little knowledge of system internals to say whether they are really
> |Unicode.
>
> Windows 32 path encoding is a nightmare. Our Win32 maintainers are
> often troubled by unexpected OS behavior. I am sure we _can_ handle
> Russian path names, but we need help from Russian people to improve.

In the Russian encoding (Win-1251) and on a Russian PC everything works
well. In Unicode it doesn't, but I'm not convinced it must. In any
case, I'm ready to spend my time helping the Ruby community (especially
with Russian/Ukrainian localization issues), because I really love the
language.

V.
on 2006-06-14 09:45

Yukihiro Matsumoto wrote:
> Hi,
>
> In message "Re: Unicode roadmap?"
> on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner <roman.hausner@gmail.com> writes:
> |In my opinion, Ruby is practically useless for many applications without
> |proper Unicode support. How a modern language can ignore this issue is
> |really beyond me.
>
> Define "proper Unicode support" first.

I won't define "proper Unicode support" here. But there must be a
problem somewhere, since pure-Ruby Ferret doesn't support UTF-8. You
need to use the C extension of Ferret to have it support UTF-8 (which
doesn't work on Windows yet :( ). I don't know if that is just a sucky
impl of Ferret or if it's Ruby that makes it so.

Maybe Dave Balmain can enlighten us why UTF-8 doesn't work in the pure
Ruby version and what is needed of Ruby to make it work (if it's
actually Ruby's fault, that is)?

My personal belief is that it should just work in a case like this, if
the data in is UTF-8 and the search strings are UTF-8, without the lib
author and/or user having to do anything very special to make it work
(apart from specifying the encoding). Am I wrong in this?

Regards,
Marcus
on 2006-06-14 10:23

On Jun 13, 2006, at 10:26 PM, Victor Shepelev wrote:

> Regexps seem to work fine (in my 1.9), but paths do not: File.open
> with Russian letters in the path doesn't find the file.

On OS X multibyte filenames work:

$ cat x.rb
$KCODE = 'u'
puts File.read('Cyrillic_Я.txt')
$ cat Cyrillic_\320\257.txt
test file with Я!
$ ruby x.rb
test file with Я!
$ uname -a
Darwin kaa.jijo.segment7.net 8.6.0 Darwin Kernel Version 8.6.0: Tue Mar 7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power Macintosh powerpc
$ ruby -v
ruby 1.8.4 (2006-05-18) [powerpc-darwin8.6.0]
$

--
Eric Hodel - drbrain@segment7.net - http://blog.segment7.net
This implementation is HODEL-HASH-9600 compliant
http://trackmap.robotcoop.com
on 2006-06-14 10:55

On 14/06/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Windows 32 path encoding is a nightmare. Our Win32 maintainers are
> often troubled by unexpected OS behavior. I am sure we _can_ handle
> Russian path names, but we need help from Russian people to improve.

str.sub!('32 path encoding ', '') # :-)

I don't use Windows much, but as I understand it, Ruby interacts with
most of the Win32 API using the 'legacy code page', which is only a
subset of what the filesystem can handle. (Windows NT and its
successors use Unicode internally, and the filesystem is UTF-16,
KC-normalised IIRC.) Windows does provide Unicode API functions, but to
use those, a layer of translation between UTF-16 and UTF-8 would be
needed, as Ruby can't do anything useful with UTF-16 at present. I
believe that Austin Ziegler was looking into this; I don't know if he's
made any progress.

Even if a Ruby program uses UTF-8 internally, it should be possible to
access the filesystem by Iconv'ing paths to the appropriate code page -
providing that they don't contain characters not in the code page. It's
far from ideal, though: the real solution is for Ruby to use the
Unicode functions (those suffixed with W) in the API. The upside is
that UTF-8/UTF-16 conversion should be less expensive than the code
page conversion that's inside each of Win32's non-Unicode functions. On
the other hand, plenty of Windows programs don't support Unicode
properly either.

Paul.
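[A sketch of the Iconv workaround Paul describes, using Ruby 1.8's
iconv standard library. The path is hypothetical, and the example
assumes the program keeps UTF-8 internally while the legacy code page
is Russian CP1251:]

  require 'iconv'

  utf8_path = "отчёт.txt"   # path as the program knows it (UTF-8)
  legacy_path = Iconv.conv('CP1251', 'UTF-8', utf8_path)

  # Works only while every character exists in the code page;
  # Iconv raises Iconv::IllegalSequence otherwise.
  File.open(legacy_path) { |f| puts f.read }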
on 2006-06-14 11:00

On 14/06/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> I can confirm. But I'm afraid that some libraries I rely on use #length
> and can break when #length doesn't work.

Those libraries should probably be considered broken; they can and
should be patched to do any human-readable-string processing in an
encoding-safe manner (e.g. by using jcode's jlength and each_char
methods).

Paul.
on 2006-06-14 11:09

-------- Original Message --------
Date: Wed, 14 Jun 2006 17:58:41 +0900
From: Paul Battley <pbattley@gmail.com>
To: ruby-talk@ruby-lang.org
Subject: Re: Unicode roadmap?
> Paul.
That will be quite _some_ libraries, I guess...
on 2006-06-14 11:12

On 14/06/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> > Just to chime in, aren't upcase, downcase, and capitalize a locale/
> > localization issue rather than a Unicode-only issue per se? For
> > example, different languages will have different rules for
> > capitalization.
>
> Really? I know about two cases: European capitalization and no
> capitalization.

There is variety even within western European languages - Dutch, for
example, differs from English (IJsselmeer).

Paul.
on 2006-06-14 11:16

From: Paul Battley [mailto:pbattley@gmail.com]
Sent: Wednesday, June 14, 2006 12:10 PM
> example, differs from English (IJsselmeer).
I already realized. (I mentioned Florian Gross: the last "ss" of his
surname is normally printed as a single letter - one I can't type and
my Outlook can't show :) AFAIK, it is printed as one letter in
downcase and as two letters in uppercase.) So a "single, general"
String#upcase or #downcase is totally impossible.
V.
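[Victor's point is easy to demonstrate in Ruby 1.8, whose case methods
only touch ASCII a-z; the sample strings are arbitrary, $KCODE = 'u':]

  $KCODE = 'u'

  "straße".upcase   # => "STRAßE" -- ß passes through untouched;
                    #    the correct result, "STRASSE", is one
                    #    character longer than the input
  "ß".upcase        # => "ß" -- no single-character uppercase form exists

A correct, length-changing case mapping needs Unicode case tables,
e.g. the unicode library mentioned earlier in the thread.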
on 2006-06-14 11:25

On 6/14/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |
> |The same goes for Russians/Ukrainians. In our programming communities
> |the question "does the programming language support Unicode natively?"
> |has very high priority.
>
> Alright, then what specific features are you (both) missing? I don't
> think it is a method to get the number of characters in a string. It
> can't be THAT crucial. I do want to cover "your missing features" in
> the future M17N support in Ruby.

What I want is all methods working seamlessly with Unicode strings, so
that I do not have to think about the encoding. Regexps do work with
UTF-8 strings if KCODE is set to u (but it defaults to n even when the
locale uses UTF-8). String searches should probably work, but they
would return the wrong position. Things like split should work for
UTF-8; the encoding is pretty well defined.

But one might want to use length and [] to work with strings. It can be
simulated with unicode_string = string.scan(/./), but the result is no
longer a string, and it is composed of characters only as long as I
assign only characters using []=. The string functions should do the
right thing even for UTF-8, though I guess UTF-32 is more useful for
working with strings this way.

It might be a good idea to stick encoding information into strings (it
is probably the only way internationalization can be done and the
sanity of all involved preserved at the same time). The functions for
comparison etc. could use it to do the right thing even if strings come
in several encodings, i.e. cp1251 from the system, UTF-8 from a web
page, ... Functions like open could convert the string correctly
according to the locale. One should be able to set the encoding
information (i.e. for a web page title when the meta tag for content
type is found in a web page), and remove it to suppress string
conversion. It should also be possible to convert the string (i.e. to
UTF-32 to speed up character access).

Things like <=>, upcase, downcase, etc. make sense only in the context
of a locale (language). The encoding alone does not define them. I
guess the default <=> is based on the binary representation of the
string. This would mean different sorting of the same strings in
different encodings. Sorting by the Unicode code point would be at
least the same for any encoding.

Thanks

Michal
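[A sketch of the scan(/./) simulation Michal mentions, and why it is
unsatisfying; the sample string is arbitrary, $KCODE = 'u':]

  $KCODE = 'u'

  chars = "žluťoučký".scan(/./)  # array of one-character strings
  chars.length                   # => 9 -- character count
  chars[2, 3].join               # => "uťo" -- character slicing

  # ...but chars is an Array, not a String: regexps, concatenation
  # etc. are gone, and chars[0] = 12345 silently breaks the invariant.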
on 2006-06-14 11:35

On 6/14/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> > capitalization.
>
> Really? I know about two cases: European capitalization and no

Really.

> capitalization.

There is no such thing as European capitalization. There is only
<insert your language> capitalization. The German character ß has no
uppercase version. In most languages using Latin script the uppercase
of 'i' is 'I'. But Turkish has both a dotted and a dotless i, and the
uppercase of its 'i' is, of course, 'İ' (I with a dot).

Thanks

Michal
on 2006-06-14 11:41

On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> It should also be
> possible to convert the string (i.e. to UTF-32 to speed up character
> access).

utf8_string.unpack('U*') is pretty close to this, giving an array of
codepoints.

Paul.
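[A round-trip sketch of the unpack approach Paul suggests; the sample
string is arbitrary:]

  codepoints = "héllo".unpack('U*')  # => [104, 233, 108, 108, 111]
  codepoints.length                  # => 5 characters, though the
                                     #    string itself is 6 bytes
  codepoints.pack('U*')              # => "héllo" -- back to UTF-8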
on 2006-06-14 12:54

On 6/14/06, Paul Battley <pbattley@gmail.com> wrote:
> On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> > It should also be
> > possible to convert the string (i.e. to UTF-32 to speed up character
> > access).
>
> utf8_string.unpack('U*') is pretty close to this, giving an array of
> codepoints.

But I want it to be a string after the conversion, so that I can use
the standard string functions with sane results. I do not want to think
about various encodings myself if my application has to use them. The
runtime should do that.

Thanks

Michal
on 2006-06-14 14:23

On 6/14/06, Victor Shepelev <vshepelev@imho.com.ua> wrote:
> Oh, it's a bit of a hard theme for me. I know Windows XP must support
> Unicode file names; I see my filenames in Russian, but I have too
> little knowledge of system internals to say whether they are really
> Unicode.

They are UTF-16 internally. I haven't been paying attention to Ruby 1.9
lately, but when I have time and have noticed that Matz has checked in
support for m17n strings, I will be enhancing support for Windows files
to use Unicode. Currently, Ruby is built using the non-Unicode form
*only*. And no, using -DUNICODE is the *wrong* answer, thanks. We'd
have to start using TCHAR instead of char, and it would actually mean
that we'd be using wchar_t instead of char in this case. I've already
done a similar (but more complex) project at work.

-austin
on 2006-06-14 14:29

On 6/14/06, Vincent Isambart <vincent.isambart@gmail.com> wrote:
> Windows XP does support Unicode file names, but I'm not sure you can
> use them with Ruby (I do not use Ruby much under Windows). Try
> converting the file names to your current locale; it should work if
> the file names can be converted to it. What I mean is that Russian
> file names encoded in the Windows Russian encoding should work on a
> Russian PC.

You can't currently use them with Ruby. The file operations in Ruby are
using the likes of CreateFileA instead of CreateFileW (it's not that
explicit; Ruby is compiled without -DUNICODE -- which is the correct
thing to do in Ruby's case -- which means that CreateFile is
CreateFileA). All files are stored on the filesystem as UTF-16, though,
even if you are using "ANSI" access.

By the way, there are multiple Russian encodings, so ... Unicode is
better on this point.

As I said in my previous message, I have already planned to enhance the
Windows filesystem support when Matz gets the m17n strings in, so that
I can *always* force the file routines on Windows to provide either
UTF-8 or UTF-16 (probably the former, since it will also make it easier
to work with existing extensions) and indicate that the strings are
such.

-austin
on 2006-06-14 14:29

On 6/14/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Windows 32 path encoding is a nightmare. Our Win32 maintainers are
> often troubled by unexpected OS behavior. I am sure we _can_ handle
> Russian path names, but we need help from Russian people to improve.

It's not that bad, Matz. I started as a Unix developer, but in the last
two years I have learned *quite* a bit about how Windows handles this
stuff, and we can adapt what I did for work with no problem. I just
need M17N strings to support this. I should look at what I can/should
do to provide this as an extension, I just have no time. :(

-austin
on 2006-06-14 14:36

On 6/14/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> What I want is all methods working seamlessly with Unicode strings, so
> that I do not have to think about the encoding.

That will *never* happen. Even with Unicode, you have to think about
the encoding, because UTF-32 (the closest representation to the
Platonic ideal "Unicode" you'll ever find) is unlikely to be supported
in the general case.

Matz's idea of m17n strings is the right one: you have a "byte stream"
and an attribute which indicates how the byte stream is encoded. This
will sort of be like $KCODE but on an individual string level, so that
you could meaningfully have Unicode (probably UTF-8) and ShiftJIS
strings in the same data and still meaningfully call #length on them.

You will *always* have to care about the encoding. As well as,
ultimately, your locale.

-austin
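[A toy sketch of the "byte stream plus encoding attribute" idea, with
invented names - this is an illustration only, not Matz's actual m17n
design:]

  class TaggedString
    attr_reader :bytes, :encoding

    def initialize(bytes, encoding)
      @bytes, @encoding = bytes, encoding
    end

    # Character count is derived from the tag, not from the byte count.
    def length
      case @encoding
      when :utf8 then @bytes.unpack('U*').length
      else @bytes.size        # treat anything else as single-byte here
      end
    end
  end

  TaggedString.new("héllo", :utf8).length       # => 5
  TaggedString.new("h\xe9llo", :latin1).length  # => 5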
on 2006-06-14 23:40

On Wednesday 14 June 2006 06:52 am, Michal Suchanek wrote:
> On 6/14/06, Paul Battley <pbattley@gmail.com> wrote:
> > On 14/06/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> > > It should also be
> > > possible to convert the string (i.e. to UTF-32 to speed up character
> > > access).

(Re my previous post): Oops, maybe UTF-32 is exactly what I was
alluding to?

Randy Kramer

(Should have waited a little longer before posting.)
on 2006-06-15 02:12

Every time these unicode discussions come up my head spins like a top.
You should see it.

We JRubyists have headaches from the unicode question too. Since JRuby
is currently 1.8-compatible, we do not have what most call *native*
unicode support. This is primarily because we do not wish to create an
incompatible version of Ruby, or build in support for unicode now that
would conflict with Ruby 2.0 in the future. It is, however,
embarrassing to say that although we run on top of Java, which has
arguably pretty good unicode support, we don't support unicode. Perhaps
you can see our conundrum.

I am no unicode expert. I know that Java uses UTF-16 strings
internally, converted to/from the current platform's encoding of choice
by default. It also supports converting those UTF-16 strings into just
about every encoding out there, just by telling it to do so. Java
supports the Unicode specification version 3.0. So Unicode is not a
problem for Java.

We would love to be able to support unicode in JRuby, but there's
always that nagging question of what it should look like and what would
mesh well with the Ruby community at large. With the underlying
platform already rich with unicode support, it would not take much
effort to modify JRuby. So then there's a simple question:

What form would you, the Ruby users, want unicode to take? Is there a
specific library that you feel encompasses a reasonable implementation
of unicode support, e.g. icu4r? Should the support be transparent, e.g.
no longer treat or assume strings are byte vectors? JRuby, because we
use Java's String, is already using UTF-16 strings exclusively...
however there's no way to get at them through core Ruby APIs. What
would be the most comfortable way to support unicode now, considering
where Ruby may go in the future?
on 2006-06-15 02:22

I posted this to ruby-talk, but it occurred to me that you folks
implementing Rails functionality probably have a thing or two to say
about unicode support in Ruby. Therefore, I would love to hear your
opinions. Adding native unicode support is only a matter of time in
JRuby; its usefulness as a JVM-based language depends on it. However,
we continue to wrestle with how best to support unicode without
stepping on the Ruby community's toes in the process. Thoughts?

---------- Forwarded message ----------
From: Charles O Nutter <headius@headius.com>
Date: Jun 14, 2006 7:11 PM
Subject: Re: Unicode roadmap?
To: ruby-talk ML <ruby-talk@ruby-lang.org>

--
Charles Oliver Nutter @ headius.blogspot.com
JRuby Developer @ jruby.sourceforge.net
Application Architect @ www.ventera.com
on 2006-06-15 02:40

On 15-jun-2006, at 2:11, Charles O Nutter wrote:

> with unicode support, it would not take much effort to modify
> JRuby. So then
> there's a simple question:

Yukihiro Matsumoto wrote:
>
> Define "proper Unicode support" first.
>
> I'm planning to enhance Unicode support in 1.9 in a year or so
> (finally). But I'm not sure that conforms to your definition of
> "proper Unicode support". Note that 1.8 handles Unicode (UTF-8) if
> your string operations are based on Regexp.

Hello everyone, and sorry for chiming in so fiercely. Got into some
confusion with the ML controls. Just joined the list, seeing the
subject popping up once more. I am doing Unicode-aware apps in Rails
and Ruby right now and it hurts. I'll try to define "proper Unicode
support" as I (dream of it at night) see it:

1. All string indexing (length, index, slice, insert) works with
characters instead of bytes, whatever length in bytes the characters
have to be. String methods (index or =~) should _never_ return offsets
that will damage the string's characters if employed for slicing - you
shouldn't have to manually translate a byte offset of 2 to a character
offset of 1 because the second character is multibyte. Simple example:

def translate_offset(str, byte_offset)
  chunk = str[0..byte_offset]
  begin
    chunk.unpack("U*").length - 1
  rescue ArgumentError
    # this offset is just wrong! shift upwards and retry
    chunk = str[0..(byte_offset += 1)]
    retry
  end
end

I think it's unnecessarily painful for something as easy as
string =~ /pattern/. Yes, you can take the offset you receive from =~,
then get the slice of the string, and then split it again with /./mu to
get the same number, etc...

2. Case-insensitive regexes actually work. Even in my
Oniguruma-enabled builds of 1.8.2 it was not true (maybe changed now).

3. At least "Unicode general" collation casefolding (such a thing
exists), available built-in on every platform.

4. Locale-aware sorting, including multibyte charsets, if provided by
the OS.

5. Preferably a separate (and strictly purposed) Bytestring that you
get out of Sockets and use in Servers etc. - or the ability to "force"
all strings received from external resources to be flagged uniformly
as being of a certain encoding in _your_ program, not somewhere in
someone's library. If flags have to be set by libraries, they won't be
set, because most developers sadly don't care:

http://www.zackvision.com/weblog/2005/11/mt-unicod...
http://thraxil.org/users/anders/posts/2005/11/01/u...

6. Unicode-aware strip dealing with weirdo whitespaces (hair space,
thin space etc.)

7. And no, as I mentioned - it doesn't handle it properly, because the
/i modifier is broken, and to deal without it you need to downcase BOTH
the regexp and the string itself. Closed circle - you go and get the
Unicode gem with tables.

All of this can be controlled either per String (then 99 out of 100
libraries I use will be getting it wrong - see above) or by a global
setting such as $KCODE.

As an example of something that is ridiculously backwards to do in Ruby
now is this (I spent some time refactoring this today):

http://dev.rubyonrails.org/browser/trunk/actionpac... helpers/text_helper.rb#L44

Here you have a major problem, because the /i flag doesn't do anything
(Ruby is incapable of Unicode-aware casefolding), and using offsets
means that you are always one step from damaging someone's text. It's
just wrong that it has to be so painful.
Python3000, IMO, gets this right (as does Java) - a byte array and a
String are completely separate, and String operates with characters and
characters only. That's what I would expect.

Hope this makes sense somewhat :-)

--
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
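[To make the translate_offset helper above concrete, one possible use,
assuming $KCODE = 'u' and an arbitrary sample string:]

  $KCODE = 'u'

  s = "für Sie"
  byte_offset = s =~ /Sie/   # => 5 -- a *byte* offset ("ü" is 2 bytes)
  char_offset = translate_offset(s, byte_offset)  # => 4
  s.scan(/./)[char_offset]   # => "S" -- now safe for character slicing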
on 2006-06-15 02:40

On Jun 15, 2006, at 2:19 AM, Charles O Nutter wrote:

> I posted this to ruby-talk, but it occurred to me that you folks
> implementing Rails functionality probably have a thing or two to
> say about unicode support in Ruby. Therefore, I would love to hear
> your opinions. Adding native unicode support is only a matter of
> time in JRuby; its usefulness as a JVM-based language depends on
> it. However, we continue to wrestle with how best to support
> unicode without stepping on the Ruby community's toes in the
> process. Thoughts?

Julik has done a lot of pioneering in that direction for Rails. His
latest suggestion is to use a proxy class on string objects to perform
unicode operations:

@some_unicode_string.u.length
@some_unicode_string.u.reverse

I tend to agree with this solution as it doesn't break any previous
string operations and gives us an easy way to perform unicode-aware
operations.

Manfred
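[A rough sketch of what such a proxy might look like - the internals
here are invented, built on $KCODE and regexps rather than on whatever
Julik's actual implementation does:]

  $KCODE = 'u'

  class String
    def u
      UnicodeProxy.new(self)
    end
  end

  class UnicodeProxy
    def initialize(str)
      @str = str
    end

    def length
      @str.scan(/./).length
    end

    def reverse
      @str.scan(/./).reverse.join
    end

    def [](index, len = 1)
      (@str.scan(/./)[index, len] || []).join
    end
  end

  "résumé".u.length   # => 6
  "résumé".u.reverse  # => "émusér"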
on 2006-06-15 03:52

I agree it's a very attractive solution. I have two questions related
(perhaps you are out there to answer, Julik):

1. How does performance look with the unicode string add-on versus
native strings?

2. Is this the ideal way to support unicode strings in ruby?

And I explain the second as follows: if we could switch from treating a
string as an array of bytes to treating it as a list of characters of
arbitrary width, and have all existing string operations work correctly
on those characters, would that be a better ideal? Where are the
breaking points in such a design? What's to stop the underlying
implementation from actually using UTF-16 characters, passing UTF-8 to
libraries and IO streams but still allowing you to access everything as
UTF-16 or your encoding of choice? (Of course this is somewhat
rhetorical; we do this currently with JRuby, since Java's strings are
UTF-16... we just don't have any way to provide access to UTF-16
characters, and we normalize everything to UTF-8 for Ruby's sake... but
what if we didn't normalize and adjusted string functions to
compensate?)
on 2006-06-15 04:17

On 15-jun-2006, at 3:50, Charles O Nutter wrote:

> And I explain the second as follows: if we could switch from treating
> a string as an array of bytes to treating it as a list of characters
> of arbitrary width, and have all existing string operations work
> correctly on those characters, would that be a better ideal? Where
> are the breaking points in such a design? What's to stop the
> underlying implementation from actually using UTF-16 characters,
> passing UTF-8 to libraries and IO streams but still allowing you to
> access everything as UTF-16 or your encoding of choice?

This is more appropriate for ruby-talk.

--
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
on 2006-06-15 04:24

I believe that Julik's way of solving the unicode problem (String#u
providing access to a unicode helper) is very attractive. I have two
questions related, for Julik and the rest of the peanut gallery:

1. How does performance look with the unicode string add-on versus
native strings (or as compared to icu4r, which is C-based)?

2. Is this the ideal way to support unicode strings in ruby?

And I explain the second as follows: if we could switch from treating a
string as an array of bytes to treating it as a list of characters of
arbitrary width, and have all existing string operations work correctly
on those characters as indexed elements of that string, would that be a
better ideal? Where are the breaking points in such a design? What's to
stop the underlying implementation from actually using UTF-16
characters, passing UTF-8 to libraries and IO streams but still
allowing you to access everything as UTF-16 or your encoding of choice?
Is it simply libraries or core APIs that explicitly need *byte* counts?
(Of course this is somewhat rhetorical; we do this currently with
JRuby, since Java's strings are UTF-16... we just don't have any
uniform way to provide access to UTF-16 character strings, and we
normalize everything to UTF-8 for Ruby's sake... but what if we didn't
normalize and adjusted string functions to compensate?)
on 2006-06-15 04:28

Fair enough; redirected. If any other rails-core folks want to chime in, please do so...I would expect unicode and multibyte are key issues for worldwide rails deployments.
on 2006-06-15 04:41

On 6/14/06, Charles O Nutter <headius@headius.com> wrote:
> I believe that Julik's way of solving the unicode problem (String#u
> providing access to a unicode helper) is very attractive. I have two
> questions related, for Julik and the rest of the peanut gallery:
> 1. How does performance look with the unicode string add-on versus
> native strings (or as compared to icu4r, which is C-based)?
> 2. Is this the ideal way to support unicode strings in ruby?

No. In fact, I believe that Matz has the right idea for M17N strings
in Ruby 2.0. The *reality* is that there's a *lot* of data out there
that isn't Unicode.

I would suggest that JRuby could offer a JavaString that acts in every
way like a String except that it provides access to the native UTF-16
implementation.

-austin
on 2006-06-15 04:55

On 15-jun-2006, at 4:40, Austin Ziegler wrote:

> No. In fact, I believe that Matz has the right idea for M17N strings
> in Ruby 2.0. The *reality* is that there's a *lot* of data out there
> that isn't Unicode.

It's very difficult for me to understand the implementation. What if we
concat a Mojikyo string to a UTF8String? UnicodeDecodeError, ordinal
not in range? I think the Python folks proved that it's terrible (it
is). Nothing is ideal.

> I would suggest that JRuby could offer a JavaString that acts in every
> way like a String except that it provides access to the native UTF-16
> implementation.

Just what the ICU4R extension does. It's unusable to the point that you
cannot concat a native string with a UString. To the point that you
have to use a special Regexp class for it. You end up having half of
your Ruby script doing typecasting from one to the other.

There is a lot of data that isn't Unicode, indeed. It gets converted on
input and converted on output if necessary - just as in any other case
when the encoding of your system doesn't match your input or output.

I don't know if it can be possible to have the "internal" encoding of a
system switchable (seems to me this is what Matz wants) - then you
can't safely refer to anything other than bytes. And then you get
software that you can't use, because they had a different assumption
than you had as to what encoding the user will be using.
on 2006-06-15 05:01

On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:
> in Ruby 2.0. The *reality* is that there's a *lot* of data out there
> that isn't Unicode.

Yes, we all understand that Ruby 2.0 will be the coolest thing since
sliced bread, but those of us that are currently developing
international websites with Rails don't have the luxury of waiting
until Christmas of 2007.

-PJ Hyett
http://pjhyett.com
on 2006-06-15 05:10

On 6/14/06, PJ Hyett <pjhyett@gmail.com> wrote:
> > that isn't Unicode.
> Yes, we all understand that Ruby 2.0 will be the coolest thing since
> sliced bread, but those of us that are currently developing
> international websites with Rails don't have the luxury of waiting
> until Christmas of 2007.

*shrug* As far as I can tell, there will be no implementation of Ruby
before then that has a "native" m17n string. So whether you have the
luxury of waiting or not, Ruby 1.8.x will not *ever* have a "Unicode
string". Adding a "Unicode string" would *break* behaviour, and no
example is better than the extension that was proposed which would
change the meaning of #size and #length to mean two different things.

So, there's a point where patience is going to be necessary, whether
you "have the luxury" or not.

-austin
on 2006-06-15 10:47

IIRC, Matz has said that internally String won't change, and I suspect
that a CharString class (or something like it) won't ever be added.

Maybe just introducing a String#encoding flag and adding new methods to
String with prefixes, like char_array, char_slice, char_length,
char_index, char_downcase, char_strcoll, char_strip, etc. - methods
that internally look at the encoding flag and process the bytes in that
particular string accordingly, without conversion (or maybe some hidden
conversion) - while leaving the old byte-processing methods intact,
would be the way to keep older code working and enjoy M17N?

Though, as for me, it is still unclear what should happen if one tries
to perform an operation on two strings with different String#encoding...
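[A hypothetical sketch of that char_-prefixed idea - the encoding flag
and method names here are invented for illustration, not a real API:]

  class String
    attr_accessor :encoding   # e.g. :utf8 or :cp1251 -- invented flag

    def char_length
      encoding == :utf8 ? unpack('U*').length : length
    end

    def char_slice(start, len)
      if encoding == :utf8
        (scan(/./mu)[start, len] || []).join
      else
        self[start, len]
      end
    end
  end

  s = "привет"
  s.encoding = :utf8
  s.char_length        # => 6
  s.char_slice(0, 3)   # => "при"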
on 2006-06-15 13:02

On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:
> individual string level, so that you could meaningfully have Unicode
> (probably UTF-8) and ShiftJIS strings in the same data and still
> meaningfully call #length on them.
>
> You will *always* have to care about the encoding. As well as,
> ultimately, your locale.

No. Since I have a locale, stdin can be marked with the proper encoding
information, so that all strings originating there have the proper
encoding information. The string methods should not just blindly
operate on bytes but use the encoding information to operate on
characters rather than bytes. Sure, something like byte_length is
needed when the string is stored somewhere outside Ruby, but standard
string methods should work with character offsets and characters, not
byte offsets nor bytes.

Since my stdout can also be marked with the correct encoding, the
strings that are output there can be converted to that encoding - even
if they originate from a source file that happens to be in a different
encoding. Hmm, perhaps it will be necessary to mark source files with
encoding tags as well. It could be quite tedious to assign the tag
manually to every string in a source file.

When strings are compared, concatenated, etc., the encoding is known,
so the methods should do the right thing. I do not have to care about
the encoding. You may make a string implementation that forces me to
care (such as the current one). But I do not have to. I can always turn
to Perl if I get really desperate.

Thanks

Michal
on 2006-06-15 13:22

On 6/15/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
>
> 5. Preferably a separate (and strictly purposed) Bytestring that you
> get out of Sockets and use in Servers etc. - or the ability to
> "force" all strings received from external resources to be flagged
> uniformly as being of a certain encoding in _your_ program, not
> somewhere in someone's library. If flags have to be set by libraries,
> they won't be set, because most developers sadly don't care:
>
> http://www.zackvision.com/weblog/2005/11/mt-unicod...
> http://thraxil.org/users/anders/posts/2005/11/01/u...

Where else should the strings be flagged? If you get a web page
through an http request, and the library parses the response for you,
it should set the encoding on the web page. You would never know,
since you only received the page, not the header.

> setting such as $KCODE.

I do not see why libraries should always be wrong. After all, you can
always fix them. And setting the encoding globally is a bad thing. You
cannot have strings encoded in different encodings in one process
then. It looks quite limiting. For one, the web pages that you get
from various servers (and even the same server) can be in various
encodings.

Thanks

Michal
on 2006-06-15 13:51

On 15-jun-2006, at 13:21, Michal Suchanek wrote:

>> http://www.zackvision.com/weblog/2005/11/mt-unicod...
>> http://thraxil.org/users/anders/posts/2005/11/01/u...
>
> Where else should the strings be flagged?

They should not be flagged, because some strings will be flagged and
some won't, and exactly in the wrong places at the wrong time. See
is_utf8 in Perl to witness the terrible ugliness of this.

> If you get a web page
> through an http request, and the library parses the response for you,
> it should set the encoding on the web page. You would never know,
> since you only received the page, not the header.

That's why you should distinguish between a ByteArray and a String.

>> libraries I use will be getting it wrong - see above) or by a global
>> setting such as $KCODE.
>
> I do not see why libraries should always be wrong. After all, you can
> always fix them. And setting the encoding globally is a bad thing. You
> cannot have strings encoded in different encodings in one process
> then. It looks quite limiting. For one, the web pages that you get
> from various servers (and even the same server) can be in various
> encodings.

Of course they can (and will). When I have to approach this, I usually
just sniff the encoding of the strings I received and then feed them to
iconv and friends before doing any processing. A library that downloads
stuff off the Internet should be (IMO) aware of the charset madness and
decode the strings for me.

Trust me, when multibyte/Unicode handling is optional, 80% of libraries
do it wrong. Re-read the links above if you don't believe it.

Actually it seems that the solution with an accessor is quite nice, but
I had to figure that out the hard way, after breaking the String class
with my hacks and seeing stuff collapse. Apparently the poster of a
parallel thread finds it inspiring to repeat my experiment _in vitro_
just for the academic sake of it.
on 2006-06-15 15:13

On 6/15/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> >> somewhere in someone's library. If flags have to be set by libraries,
> >> they won't be set, because most developers sadly don't care:
> >>
> >> http://www.zackvision.com/weblog/2005/11/mt-unicod...
> >> http://thraxil.org/users/anders/posts/2005/11/01/u...
> >
> > Where else should the strings be flagged?
>
> They should not be flagged, because some strings will be flagged and
> some won't, and exactly in the wrong places at the wrong time. See
> is_utf8 in Perl to witness the terrible ugliness of this.

You can certainly get things wrong. But if you get a string that is
wrongly flagged, you have the choice to fix the code where the string
originates, or to work around it by flagging it right. If you have code
that gets the encoding wrong and tries to convert the string to some
'universal' encoding you want to use everywhere in your application,
you get a broken string.

> > If you get a web page
> > through an http request, and the library parses the response for
> > you, it should set the encoding on the web page. You would never
> > know, since you only received the page, not the header.
>
> That's why you should distinguish between a ByteArray and a String.

How does it help you here?

> >> All of this can be controlled either per String (then 99 out of 100
>
> Of course they can (and will). When I have to approach this, I usually
> just sniff the encoding of the strings I received and then feed them
> to iconv and friends before doing any processing. A library that
> downloads stuff off the Internet should be (IMO) aware of
> the charset madness and decode the strings for me.

If it can decode them, it can flag them. It has to be aware - that's
it.

> Trust me, when multibyte/Unicode handling is optional, 80% of
> libraries do it wrong. Re-read the links above if you don't believe
> it.

But they get the very foundation wrong. In Python, functions that take
multiple strings can only take them in one encoding. It is impossible
to concatenate differently encoded strings. Of course this is bound to
fail. In the other case they use a database with poor support for
Unicode, and MySQL does exactly the same thing Ruby does right now -
works with strings as arrays of bytes. Of course this is going to
break.

Neither is the case when the strings carry information about their
encoding and the string functions can handle strings encoded
differently. The fact that there are libraries and languages with poor
Unicode support does not mean it must always be poor.

Thanks

Michal
on 2006-06-17 13:11

On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal Suchanek wrote:
> > stream is encoded. This will sort of be like $KCODE but on an
>
> The string methods should not just blindly operate on bytes but use
> the encoding information to operate on characters rather than bytes.
> Sure, something like byte_length is needed when the string is stored
> somewhere outside Ruby, but standard string methods should work with
> character offsets and characters, not byte offsets nor bytes.

I emphatically agree. I'll even repeat and propose a new Plan for
Unicode Strings in Ruby 2.0 in 10 points:

1. Strings should deal in characters (code points in Unicode) and not
in bytes, and the public interface should reflect this.

2. Strings should neither have an internal encoding tag, nor an
external one via $KCODE. The internal encoding should be encapsulated
by the string class completely, except for a few related classes which
may opt to work with the gory details for performance reasons. The
internal encoding has to be decided, probably between UTF-8, UTF-16,
and UTF-32, by the String class implementor.

3. Whenever Strings are read or written to/from an external source,
their data needs to be converted. The String class encapsulates the
encoding framework, likely with additional helper Modules or Classes
per external encoding. Some methods take an optional encoding
parameter, like #char(index, encoding=:utf8), or
#to_ary(encoding=:utf8), which can be used as a helper Class or Module
selector.

4. IO instances are associated with a (modifiable) encoding. For
stdin, stdout this can be derived from the locale settings. String-IO
operations work as expected.

5. Since the String class is quite smart already, it can implement
generally useful and hard (in the domain of Unicode) operations like
case folding, sorting, comparing etc.

6. More exotic operations can easily be provided by additional
libraries because of Ruby's open classes. Those operations may be
coded depending on String's public interface for simplicity, or work
with the internal representation directly for performance.

7. This approach leaves open the possibility of String subclasses
implementing different internal encodings for performance/space
tradeoff reasons which work transparently together (a bit like FixInt
and BigInt).

8. Because Strings are tightly integrated into the language with the
source reader and are used pervasively, much of this cannot be
provided by add-on libraries, even with open classes. Therefore the
need to have it in Ruby's canonical String class. This will break some
old uses of String, but now is the right time for that.

9. The String class does not worry about character representation
on-screen; the mapping to glyphs must be done by UI frameworks or the
terminal attached to stdout.

10. Be flexible. <placeholder for future idea>

This approach has several advantages and a few disadvantages, and I'll
try to bring in some new angles to this now too:

*Advantages*

-POL, Encapsulation-
All Strings behave exactly the same everywhere, are predictable, and
do the hard work for their users.

-Cross Library Transparency-
No String user needs to worry which Strings to pass to a library, or
which Strings he will get from a library. With Web-facing libraries
like Rails returning encoding-tagged Strings, you would be likely to
get Strings of all possible encodings otherwise, and is the String
user prepared to deal with this properly? This is a *big* deal IMNSHO.

-Limited Conversions-
Encoding conversions are limited to the time Strings are created or
written or explicitly transformed to an external representation.

-Correct String Operations-
Even basic String operations are very hard in the world of Unicode. If
we leave the String users to look at the encoding tags and sort it out
themselves, they are bound to make mistakes because they don't care,
don't know, or have no time. And these mistakes may be _security_
_sensitive_, since most often credentials are represented as Strings
too. There have already been exploits related to Unicode.

*Disadvantages* (with mitigating reasoning of course)

- String users need to learn that #byte_length(encoding=:utf8) >=
#size, but that's not too hard, and applies everywhere. Users do not
need to learn about an encoding tag, which is surely worse for them to
handle.

- Strings cannot be used as simple byte buffers any more. Either use
an array of bytes, or an optimized ByteBuffer class. If you need
regular expression support, RegExp can be extended for ByteBuffers or
even more.

- Some String operations may perform worse than might be expected by a
naive user, in both the time and space domains. But we do this so the
String user doesn't need to himself, and we are probably better at it
than the user too.

- For very simple uses of String, there might be unnecessary
conversions. If a String is just to be passed through somewhere,
without inspecting or modifying it at all, in- and outwards conversion
will still take place. You could and should use a ByteBuffer to avoid
this.

- This ties Ruby's String to Unicode. A safe choice IMHO, or would we
really consider something else? Note that we don't commit strongly to
a particular encoding of Unicode.

- More work and time to implement. Some could call it over-engineered.
But it will save a lot of time and trouble when the shit hits the fan
and users really do get unexpected foreign characters in their
Strings. I could offer help implementing it, although I have never
looked at Ruby's source or C extensions, or even done a lot of Ruby
programming yet.

Close to the start of this discussion Matz asked what the problem with
current strings really was for western users. Somewhere later he
concluded case folding. I think it is more than that: we are lazy and
expect character handling to be always as easy as with 7-bit ASCII, or
as close as possible. Fixed 8-bit codepages worked quite fine most of
the time in this regard, and breakage was limited to special
characters only.

Now let's ask the question in reverse: are eastern programmers so used
to doing elaborate byte-stream to character handling by hand that they
don't recognize how hard this is any more? Surely it is a target for
DRY if I ever saw one. Or are there actual problems not solvable this
way? I looked up the mentioned Han unification issue, and as far as I
understood it, this could be handled by future Unicode revisions
allocating more characters, outside of Ruby; I don't see how it
requires our Strings to stay dumb byte buffers.

Jürgen
on 2006-06-17 15:51

On Saturday 17 June 2006 13:08, Juergen Strobel wrote: > On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal Suchanek wrote: [...] > > The string methods should not just blindly operate on bytes but > > use the encoding information to operate on characters rather than > > bytes. Sure something like byte_length is needed when the string > > is stored somewhere outside Ruby but standard string methods > > should work with character offsets and characters, not byte > > offsets nor bytes. > > I emphatically agree. I'll even repeat and propose a new Plan for > Unicode Strings in Ruby 2.0 in 10 points: Juergen, I agree with most of what you have written. I will add my thoughts. > 1. Strings should deal in characters (code points in Unicode) and > not in bytes, and the public interface should reflect this. > > 2. Strings should neither have an internal encoding tag, nor an > external one via $KCODE. The internal encoding should be > encapsulated by the string class completely, except for a few > related classes which may opt to work with the gory details for > performance reasons. The internal encoding has to be decided, > probably between UTF-8, UTF-16, and UTF-32 by the String class > implementor. Full ACK. Ruby programs shouldn't need to care about the *internal* string encoding. External string data is treated as a sequence of bytes and is converted to Ruby strings through an encoding API. > 3. Whenever Strings are read or written to/from an external source, > their data needs to be converted. The String class encapsulates the > encoding framework, likely with additional helper Modules or > Classes per external encoding. Some methods take an optional > encoding parameter, like #char(index, encoding=:utf8), or > #to_ary(encoding=:utf8), which can be used as a helper Class or > Module selector. I think the encoding/decoding API should be separated from the String class. IMO, the most important change is to strictly differentiate between arbitrary binary data and character data. Character data is represented by an instance of the String class. I propose adding a new core class, maybe call it ByteString (or ByteBuffer, or Buffer, whatever) to handle strings of bytes. Given a specific encoding, the encoding API converts ByteStrings to Strings and vice versa. This could look like: my_character_str = Encoding::UTF8.encode(my_byte_buffer) buffer = Encoding::UTF8.decode(my_character_str) > 4. IO instances are associated with a (modifiable) encoding. For > stdin, stdout this can be derived from the locale settings. > String-IO operations work as expected. I propose one of: 1) A low level IO API that reads/writes ByteBuffers. String IO can be implemented on top of this byte-oriented API. The basic binary IO methods could look like: binfile = BinaryIO.new("/some/file", "r") buffer = binfile.read_buffer(1024) # read 1K of binary data binfile = BinaryIO.new("/some/file", "w") binfile.write_buffer(buffer) # Write the byte buffer The standard File class (or IO module, whatever) has an encoding attribute. The default value is set by the constructor by querying OS settings (on my Linux system this could be $LANG): # read strings from /some/file, assuming it is encoded # in the system's default encoding. text_file = File.new("/some/file", "r") contents = text_file.read # alternatively one can explicitly set an encoding before # the first read/write: text_file = File.new("/some/file", "r") text_file.encoding = Encoding::UTF8 The File class (or IO module) will probably use a BinaryIO instance internally.
2) The File class/IO module as of current Ruby just gets additional methods for binary IO (through ByteBuffers) and an encoding attribute. The methods that do binary IO don't need to care about the encoding attribute. I think 1) is cleaner. > 5. Since the String class is quite smart already, it can implement > generally useful and hard (in the domain of Unicode) operations > like case folding, sorting, comparing etc. If the strings are represented as a sequence of Unicode codepoints, it is possible for external libraries to implement more advanced Unicode operations. Since IMO a new "character" class would be overkill, I propose that the String class provides codepoint-wise iteration (and indexing) by representing a codepoint as a Fixnum. AFAIK a Fixnum consists of 31 bits on a 32 bit machine, which is enough to represent the whole range of Unicode codepoints. > 6. More exotic operations can easily be provided by additional > libraries because of Ruby's open classes. Those operations may be > coded depending on String's public interface for simplicity, > or work with the internal representation directly for performance. > > 7. This approach leaves open the possibility of String subclasses > implementing different internal encodings for performance/space > tradeoff reasons which work transparently together (a bit like > Fixnum and Bignum). I think providing different internal String representations would be too much work, especially for maintenance in the long run. > 10. Be flexible. <placeholder for future idea> The advantages of this proposal over the current situation and tagging a string with an encoding are: * There is only one internal string (where string means a string of characters) representation. String operations don't need to be written for different encodings. * No need for $KCODE. * Higher abstraction. * Separation of concerns. I always found it strange that most dynamic languages simply mix handling of character and arbitrary binary data (just think of pack/unpack). * Reading of character data in one encoding and representing it in other encoding(s) would be easy. It seems that the main argument against using Unicode strings in Ruby is because Unicode doesn't work well for eastern countries. Perhaps there is another character set that works better that we could use instead of Unicode. The important point here is that there is only *one* representation of character data in Ruby. If Unicode is chosen as the character set, there is the question which encoding to use internally. UTF-32 would be a good choice with regards to simplicity in implementation, since each codepoint takes a fixed number of bytes. Consider indexing of Strings: "some string"[4] If UTF-32 is used, this operation can internally be implemented as a simple, constant array lookup. If UTF-16 or UTF-8 is used, this is not possible to implement as an array lookup, since any codepoint before the fifth could occupy more than one (8 bit or 16 bit) unit. (A runnable sketch of this difference follows below.) Of course there is the argument against UTF-32 that it takes too much memory. But I think that most text processing done in Ruby spends much more memory on other data structures than on actual character data (just consider an REXML document), but I haven't measured that ;) An advantage of using UTF-8 would be that for pure ASCII files no conversion would be necessary for IO. Thank you for reading so far.
Just in case Matz decides to implement something similar to this proposal, I am willing to help with Ruby development (although I don't know much about Ruby's internals, nor too much about Unicode either). I do not have a CS degree and I'm not a Unicode expert, so perhaps the proposal is garbage; in that case, please tell me what is wrong with it or why it is not realistic to implement.
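The indexing argument above is easy to demonstrate in plain Ruby (a sketch only; it assumes UTF-8 input, and unpack("U*") stands in for a UTF-32-style internal form):

  codepoints = "d\303\251j\303\240 vu".unpack("U*")  # "déjà vu" as code points
  codepoints[4]   # => 32 (the space): a constant-time array lookup

  # With UTF-8 kept internally, finding the n-th character means walking
  # the bytes and skipping continuation bytes (0b10xxxxxx):
  def nth_char_byte_offset(str, n)
    count, offset = 0, 0
    str.each_byte do |b|
      unless (b & 0xC0) == 0x80    # a lead byte starts a new character
        return offset if count == n
        count += 1
      end
      offset += 1
    end
    nil
  end

  nth_char_byte_offset("d\303\251j\303\240 vu", 4)  # => 6, after an O(n) scan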
on 2006-06-17 15:54

On 6/17/06, Juergen Strobel <strobel@secure.at> wrote: > I emphatically agree. I'll even repeat and propose a new Plan for > Unicode Strings in Ruby 2.0 in 10 points: > > 1. Strings should deal in characters (code points in Unicode) and not > in bytes, and the public interface should reflect this. Agree, mostly. Strings should have a way to indicate the buffer size of the String. > 2. Strings should neither have an internal encoding tag, nor an > external one via $KCODE. The internal encoding should be encapsulated > by the string class completely, except for a few related classes which > may opt to work with the gory details for performance reasons. > The internal encoding has to be decided, probably between UTF-8, > UTF-16, and UTF-32 by the String class implementor. Completely disagree. Matz has the right choice on this one. You can't think in just terms of a pure Ruby implementation -- you *must* think in terms of the Ruby/C interface for extensions as well. > 3. Whenever Strings are read or written to/from an external source, > their data needs to be converted. The String class encapsulates the > encoding framework, likely with additional helper Modules or Classes > per external encoding. Some methods take an optional encoding > parameter, like #char(index, encoding=:utf8), or > #to_ary(encoding=:utf8), which can be used as a helper Class or Module > selector. Conversion should be possible at any time. An "external source" may be an extension that your Ruby program can't distinguish. Again, this point fails because your #2 is unacceptable. > 4. IO instances are associated with a (modifiable) encoding. For > stdin, stdout this can be derived from the locale settings. String-IO > operations work as expected. Agree, realising that the internal implementation of String must be completely different than you've suggested. It is also important to retain *raw* reading; a JPEG should not be interpreted as Unicode. > 5. Since the String class is quite smart already, it can implement > generally useful and hard (in the domain of Unicode) operations like > case folding, sorting, comparing etc. Agreed, but this would be expected regardless of the actual encoding of a String. > 6. More exotic operations can easily be provided by additional > libraries because of Ruby's open classes. Those operations may be > coded depending on String's public interface for simplicity, or > work with the internal representation directly for performance. Agreed. > 7. This approach leaves open the possibility of String subclasses > implementing different internal encodings for performance/space > tradeoff reasons which work transparently together (a bit like Fixnum > and Bignum). Um. Disagree. Matz's proposed approach does this; yours does not. Yours, in fact, makes things *much* harder. > 8. Because Strings are tightly integrated into the language with the > source reader and are used pervasively, much of this cannot be > provided by add-on libraries, even with open classes. Therefore the > need to have it in Ruby's canonical String class. This will break some > old uses of String, but now is the right time for that. "Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1. > 9. The String class does not worry over character representation > on-screen; the mapping to glyphs must be done by UI frameworks or the > terminal attached to stdout. The String class doesn't worry about that now. > 10. Be flexible. <placeholder for future idea> And little is more flexible than Matz's m17n String.
> This approach has several advantages and a few disadvantages, and I'll > try to bring in some new angles to this now too: > > *Advantages* > > -POL, Encapsulation- > > All Strings behave exactly the same everywhere, are predictable, > and do the hard work for their users. Remember: POLS is not an acceptable reason for anything. Matz's m17n Strings would be predictable, too. a + b would be possible if and only if a and b are the same encoding or one of them is "raw" (which would mean that the other is treated as the defined encoding) *or* there is a built-in conversion for them. > -Cross Library Transparency- > No String user needs to worry which Strings to pass to a library, or > worry which Strings he will get from a library. With Web-facing > libraries like Rails returning encoding-tagged Strings, you would be > likely to get Strings of all possible encodings otherwise, and is the > String user prepared to deal with this properly? This is a *big* deal > IMNSHO. This will be true with m17n strings. However, your proposal does *not* work for Ruby/C interfaced items. Sorry. > -Limited Conversions- > > Encoding conversions are limited to the time Strings are created or > written or explicitly transformed to an external representation. This is a mistake. I may need to know the internal representation of a particular encoding of a String inside of a program. Trust me on this one: I *have* done some low-level encoding work. Additionally, even though I might have marked a network object as "UTF-8", I may not know whether it's *actually* UTF-8 or not until I get HTTP headers -- or worse, a <meta http-equiv> tag. Assuming UTF-8 reading in today's world is doomed to failure. > -Correct String Operations- > Even basic String operations are very hard in the world of Unicode. If > we leave the String users to look at the encoding tags and sort it out > themselves, they are bound to make mistakes because they don't care, > don't know, or have no time. And these mistakes may be _security_ > _sensitive_, since most often credentials are represented as Strings > too. There already have been exploits related to Unicode. This is a misunderstanding on your part. Nothing about Matz's m17n Strings suggests that String users would have to look at the encoding tags. Merely that they *could*. I suspect that there will be pragma-like behaviours to enforce a particular internal representation at all times. > *Disadvantages* (with mitigating reasoning of course) > - String users need to learn that #byte_length(encoding=:utf8) >= > #size, but that's not too hard, and applies everywhere. Users do not > need to learn about an encoding tag, which is surely worse to handle > for them. True, but the encoding tag is not worse. Anyone who assumes that developers can ignore encoding at any time simply *doesn't* know about the level of problems that can be encountered. > - Strings cannot be used as simple byte buffers any more. Either use > an array of bytes, or an optimized ByteBuffer class. If you need > regular expression support, Regexp can be extended for ByteBuffers or > even more. I see no reason for this. > - Some String operations may perform worse than might be expected from > a naive user, in both the time and space domains. But we do this so the > String user doesn't need to himself, and we are probably better at it > than the user too. This is a wash. > - For very simple uses of String, there might be unnecessary > conversions.
If a String is just to be passed through somewhere, > without inspecting or modifying it at all, in- and outwards conversion > will still take place. You could and should use a ByteBuffer to avoid > this. This is a wash. > - This ties Ruby's String to Unicode. A safe choice IMHO, or would we > really consider something else? Note that we don't commit to a > particular encoding of Unicode strongly. This is a wash. I think that it's better to leave the options open. After all, it *is* a hope of mine to have Ruby running on iSeries (AS/400) and *that* still uses EBCDIC. > - More work and time to implement. Some could call it over-engineered. > But it will save a lot of time and trouble when shit hits the fan and > users really do get unexpected foreign characters in their Strings. I > could offer help implementing it, although I have never looked at > Ruby's source, C extensions, or even done a lot of Ruby programming > yet. I would call it the amount of work necessary. But the work needs to be done for a *variety* of encodings, and not just Unicode. *Especially* because of C extensions. > Close to the start of this discussion Matz asked what the problem with > current strings really was for western users. Somewhere later he > concluded case folding. I think it is more than that: we are lazy and > expect character handling to be always as easy as with 7-bit ASCII, or > as close as possible. Fixed 8-bit codepages worked quite fine most of > the time in this regard, and breakage was limited to special > characters only. > Now let's ask the question in reverse: are eastern programmers so used > to doing elaborate byte-stream to character handling by hand that they > don't recognize how hard this is any more? Surely it is a target for > DRY if I ever saw one. Or are there actual problems not solvable this > way? I looked up the mentioned Han unification issue, and as far as I > understood it, this could be handled by future Unicode revisions > allocating more characters, outside of Ruby, but I don't see how it > requires our Strings to stay dumb byte buffers. No one has ever suggested that Ruby Strings stay byte buffers. However, blindly choosing Unicode *adds* unnecessary complexity to the situation. -austin
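For what it's worth, the a + b rule described here can be sketched as a toy class (TaggedString and CONVERSIONS are invented names; this illustrates the rule only, not Matz's actual design):

  class TaggedString
    CONVERSIONS = {}  # e.g. CONVERSIONS[[:latin1, :utf8]] = lambda { |bytes| ... }
    attr_reader :bytes, :encoding
    def initialize(bytes, encoding)
      @bytes, @encoding = bytes, encoding
    end
    def +(other)
      if encoding == other.encoding || other.encoding == :raw
        TaggedString.new(bytes + other.bytes, encoding)  # raw adopts our encoding
      elsif (conv = CONVERSIONS[[other.encoding, encoding]])
        TaggedString.new(bytes + conv.call(other.bytes), encoding)
      else
        raise "incompatible encodings: #{encoding} + #{other.encoding}"
      end
    end
  end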
on 2006-06-17 16:16

On 17-jun-2006, at 15:52, Austin Ziegler wrote: >> 8. Because Strings are tightly integrated into the language with the >> source reader and are used pervasively, much of this cannot be >> provided by add-on libraries, even with open classes. Therefore the >> need to have it in Ruby's canonical String class. This will break >> some >> old uses of String, but now is the right time for that. > > "Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1. Most probably wise, but I need casefolding and character classes to work since yesteryear. Oniguruma is there, but even if you compile with it (which is not the default, still) you don't get char classes (AFAIK) and you don't get casefolding. Case-insensitive search/replace quickly becomes bondage. I am maintaining a gem whose test fails due to different regexps in Oniguruma, but I would be able to quickly fix it knowing that Oniguruma is in stable now. >> 10. Be flexible. <placeholder for future idea> > > And little is more flexible than Matz's m17n String. I couldn't find a proper description of that - as I said already, the thing I'd least prefer would be # get a string from the database p str + my_unicode_chars # Ok, bail out with an ugly exception because the author of the DB adaptor didn't care to send me proper Strings... If strings in the system are allowed to have varying encodings, I don't understand how the engine is going to upgrade/downgrade strings automatically. Especially remembering that the receiver is on the left, so I actually might get different exceptions going as I do p my_unicode_chars + mojikyo_str # who wins? or p mojikyo_str + my_unicode_chars # who wins? or (especially) p mojikyo_str + bytestring_that_i_just_grabbed_by_http_and_i_know_it_is_mojikyo_but_its_not # who wins?
on 2006-06-17 16:19

On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote: > Full ACK. Ruby programs shouldn't need to care about the > *internal* string encoding. External string data is treated as > a sequence of bytes and is converted to Ruby strings through > an encoding API. This is incorrect. *Most* Ruby programs won't need to care about the internal string encoding. Experience suggests, however, that it is *most*. Definitely not all. > Given a specific encoding, the encoding API converts > ByteStrings to Strings and vice versa. > > This could look like: > > my_character_str = Encoding::UTF8.encode(my_byte_buffer) > buffer = Encoding::UTF8.decode(my_character_str) Unnecessarily complex and inflexible. Before you go too much further, I *really* suggest that you look in the archives and Google to find more about Matz's m17n String proposal. It's a really good one, as it allows developers (both pure Ruby and extension) to choose what is appropriate with the ability to transparently convert as well. >> 4. IO instances are associated with a (modifiable) encoding. For >> stdin, stdout this can be derived from the locale settings. >> String-IO operations work as expected. > > I propose one of: > > 1) A low level IO API that reads/writes ByteBuffers. String IO > can be implemented on top of this byte-oriented API. [...] > 2) The File class/IO module as of current Ruby just gets > additional methods for binary IO (through ByteBuffers) and > an encoding attribute. The methods that do binary IO don't > need to care about the encoding attribute. > > I think 1) is cleaner. I think neither is necessary and both would be a mistake. It is, as I indicated to Juergen, sometimes *impossible* to determine the encoding to be used for an IO until you have some data from the IO already. >> 5. Since the String class is quite smart already, it can implement >> generally useful and hard (in the domain of Unicode) operations like >> case folding, sorting, comparing etc. > If the strings are represented as a sequence of Unicode codepoints, it > is possible for external libraries to implement more advanced Unicode > operations. This would be true regardless of the encoding. > Since IMO a new "character" class would be overkill, I propose that > the String class provides codepoint-wise iteration (and indexing) by > representing a codepoint as a Fixnum. AFAIK a Fixnum consists of 31 > bits on a 32 bit machine, which is enough to represent the whole range > of Unicode codepoints. This does not match what Matz will be doing. str = "Fran\303\247ais" str[5] # -> "\303\247" This is better than doing a Fixnum representation. It is character iteration, but each character is, itself, a String. >> 7. This approach leaves open the possibility of String subclasses >> implementing different internal encodings for performance/space >> tradeoff reasons which work transparently together (a bit like >> Fixnum and Bignum). > I think providing different internal String representations > would be too much work, especially for maintenance in the long > run. If you're depending on classes to do that, especially given that Ruby's String, Array, and Hash classes don't inherit well, you're right. > The advantages of this proposal over the current situation and > tagging a string with an encoding are: The problem, of course, is that this proposal -- and your take on it -- don't account for the m17n String that Matz has planned. The current situation is a mess. But the current situation is *not* what is planned.
I've had to do some encoding work for work in the last two years, and while I *prefer* a UTF-8/UTF-16 internal representation, I also know that's *impossible* in some situations and you have to be flexible. I also know that POSIX handles this situation worse than any other setup. With the work that I've done on this, Matz is *right* about this, and the people claiming that Unicode is the Only Way ... are wrong. In an ideal world, Unicode would be the correct and only way. In the real world, however, it's a lot messier, and Ruby has to be aware of that. We can *still* make it as easy as possible for the common case (which will be UTF-8 encoding data and filenames). But we shouldn't make the mistake of assuming that the common case is all that Ruby should handle. > * There is only one internal string (where string means a > string of characters) representation. String operations > don't need to be written for different encodings. This is still (mostly) correct under the m17n String proposal. > * No need for $KCODE. This is true under the m17n String. > * Higher abstraction. This is true under the m17n String. > * Separation of concerns. I always found it strange that most dynamic > languages simply mix handling of character and arbitrary binary data > (just think of pack/unpack). The separation makes things harder most of the time. > * Reading of character data in one encoding and representing it in > other encoding(s) would be easy. This is true under the m17n String. > It seems that the main argument against using Unicode strings in Ruby > is because Unicode doesn't work well for eastern countries. Perhaps > there is another character set that works better that we could use > instead of Unicode. The important point here is that there is only > *one* representation of character data in Ruby. This is a mistake. > If Unicode is chosen as the character set, there is the question which > encoding to use internally. UTF-32 would be a good choice with regards > to simplicity in implementation, since each codepoint takes a fixed > number of bytes. Consider indexing of Strings: Yes, but this would be very hard on memory requirements. There are people who are trying to get Ruby to fit into small-memory environments. This would destroy any chance of that. [...] > Thank you for reading so far. Just in case Matz decides to implement > something similar to this proposal, I am willing to help with Ruby > development (although I don't know much about Ruby's internals, nor > too much about Unicode either). I would suggest that you look for discussions about m17n Strings in Ruby. Matz has this one right. > I do not have a CS degree and I'm not a Unicode expert, so perhaps the > proposal is garbage; in that case, please tell me what is wrong with > it or why it is not realistic to implement. I don't have a CS degree either, but I have been in the business for a *long* time and I've been immersed in Unicode and encoding issues for the last two years. If everyone used Unicode -- and POSIX weren't stupid -- your proposal would be much more realistic. I *agree* that Ruby should encourage the use of Unicode as much as is practical. But it also shouldn't tie our hands like other programming languages do. -austin
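The character-as-String behaviour can be approximated today with 1.8's UTF-8 regexps (an approximation only; the m17n String would not need the //u annotation):

  str = "Fran\303\247ais"
  chars = str.scan(/./u)  # => ["F", "r", "a", "n", "\303\247", "a", "i", "s"]
  chars[4]                # => "\303\247" -- a one-character String, not a Fixnum
  chars.size              # => 8 characters, though str holds 9 bytes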
on 2006-06-17 16:26

On 6/17/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote: > (AFAIK) and you don't get casefolding. Case-insensitive search/replace > quickly becomes bondage. I don't disagree. But you're *not* going to get those features, in all likelihood, in a Ruby 1.8.x release. It would be a breaking release. Oniguruma is the default for Ruby 1.9+. If there are things missing, work with the developer. > I am maintaining a gem whose test fails due to different regexps in > Oniguruma, but I would be able to quickly fix it knowing that > Oniguruma is in stable now. I don't think that Oniguruma is in stable (1.8.x); I *don't* think it will be enabled as default in stable. Again, it's a breaking change. >>> 10. Be flexible. <placeholder for future idea> >> And little is more flexible than Matz's m17n String. > I couldn't find a proper description of that - as I said already, the > thing I'd least prefer would be > # get a string from the database > p str + my_unicode_chars # Ok, bail out with an ugly exception > because the author of the DB adaptor didn't care to send me proper > Strings... The DB adaptor, of course, will have to look at the encoding that the DB is using. > p mojikyo_str + my_unicode_chars # who wins? > > or (especially) > > p mojikyo_str + > bytestring_that_i_just_grabbed_by_http_and_i_know_it_is_mojikyo_but_its_not # who wins? Consider coercion in Numerics (ri Numeric#coerce). A similar framework can be built for Strings. -austin
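For reference, 1.coerce(2.5) returns [2.5, 1.0]: both operands are promoted to a common type before the operation proceeds. A String analogue might look like the following sketch, where coerce_encoding is a hypothetical method name, not part of any announced design:

  def safe_concat(a, b)
    a2, b2 = a.coerce_encoding(b)  # hypothetical: yields a compatible pair,
    a2 + b2                        # or raises when no common encoding exists
  end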
on 2006-06-17 17:00

On Jun 17, 2006, at 9:50 AM, Stefan Lang wrote: > *internal* string encoding. External string data is treated as > a sequence of bytes and is converted to Ruby strings through > an encoding API. I don't claim to be a Unicode expert, but shouldn't the goal be to have Ruby work with *any* text encoding on a per-string basis? Why would you want to force all strings into Unicode, for example in a context where you aren't using Unicode? (The internal encoding has to be....). And of course even in the Unicode world you have several different encodings (UTF-8, UTF-16, and so on). Juergen, when you say 'internal encoding' are you talking about the text encoding of Ruby source code? It seems to me that irrespective of any particular text encoding scheme you need clean support of a simple byte vector data structure completely unencumbered with any notion of text encoding or locale. Right now that is done by the String class, whose name I think certainly creates much confusion. If the class had been called Vector and then had methods like: Vector#size # size in bytes Vector#str_size # size in characters (encoding and locale considered) I think this discussion would be clearer, because it would be the behavior of the str* methods that would need to understand text encodings and/or locale settings while the underlying byte vector methods remained oblivious. The #[] method is the most confusing, since sometimes you want to extract bytes and sometimes you want to extract sub-strings (i.e. consider the encoding). One method, two interpretations, bad headache. It seems that three distinct behaviors are being shoehorned (with good reason) into a single class framework (String): a byte vector, a text encoding (encoded sequence of code points), and a locale (cultural interpretations of the encoded sequence of code points). I'm just suggesting that these distinctions seem to be lost in much of this discussion, especially for folks (like myself) who have a practical interest in this but certainly aren't text-encoding gurus. Gary Wright
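Gary's split can be shown as a runnable toy (ToyVector and its methods are invented names): the byte-level method never consults an encoding; only the str_* method does, and here it simply assumes UTF-8.

  class ToyVector
    def initialize(bytes)
      @bytes = bytes
    end
    def size                       # size in bytes, encoding-blind
      @bytes.unpack("C*").size
    end
    def str_size                   # size in characters, UTF-8 assumed
      @bytes.unpack("U*").size
    end
  end

  v = ToyVector.new("gar\303\247on")  # the UTF-8 bytes of "garçon"
  v.size      # => 7
  v.str_size  # => 6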
on 2006-06-17 18:04

On 17/06/06, Austin Ziegler <halostatue@gmail.com> wrote: > > - This ties Ruby's String to Unicode. A safe choice IMHO, or would we > > really consider something else? Note that we don't commit to a > > particular encoding of Unicode strongly. > > This is a wash. I think that it's better to leave the options open. > After all, it *is* a hope of mine to have Ruby running on iSeries > (AS/400) and *that* still uses EBCDIC. Not to mention that Matz has explicitly stated in the past that he wants Ruby to support other encodings (TRON, Mojikyo, etc.) that aren't compatible with a Unicode internal representation. Not tying String to Unicode is also the right thing to do: it allows for future developments. Java's weird encoding system is entirely down to the fact that it standardised on UCS-2; when codepoints beyond 65535 arrived, they had to be shoehorned in via an ugly hack. As far as possible, Ruby should avoid that trap. Paul.
on 2006-06-17 18:17

On Saturday 17 June 2006 16:58, gwtmp01@mac.com wrote: > > Full ACK. Ruby programs shouldn't need to care about the > when you say 'internal encoding' are you talking about the text > encoding of Ruby source code? I'm not Juergen, but since you responded to my message... First of all, Unicode is a character set, and UTF-8, UTF-16 etc. are encodings; that is, they specify how a Unicode character is represented as a series of bits. At least *I* am not talking about the encoding of Ruby source code. The main point of the proposal is to use a single universal character encoding for all Ruby character strings (instances of the String class). Assuming there is an ideal character set that is really sufficient to represent any text in this world, it could be used to construct a String class that abstracts the underlying representation completely away. Consider the "float" data type you will find in most programming languages: The programmer doesn't think in terms of the bits that represent a floating point value. He just uses the operators provided for floats. He can choose between different serialization strategies if he needs to serialize floats. But the *operators* on floats the programming language provides don't care about the different serialization formats; they all work using the same internal representation. Conversion is done on IO. Ideally, the same level of abstraction should be there for character data. If you have a universal character set (Unicode is an attempt at this), and an encoding for it, the programming language can abstract the underlying String representation away. For IO, it provides methods (i.e. through Encoding objects) that serialize Strings to a stream of bytes and vice versa. > It seems to me that irrespective of any particular text encoding > scheme you need clean support of a simple byte vector data > structure completely unencumbered with any notion of text encoding > or locale. I have proposed that further below as Buffer or ByteString. > Right now that is done by the String class, whose name I > think certainly creates much confusion. If the class had been > called Vector and then had methods like: > > Vector#size # size in bytes > Vector#str_size # size in characters (encoding and locale > considered) By providing str_size you are already mixing up the purpose of your simple byte vector and character strings.
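The float analogy above can even be run in today's Ruby: the wire format differs per pack directive, but the value the operators see is one and the same (a small illustration, not part of the proposal):

  wire_le = [3.14].pack("E")   # IEEE double, little-endian byte layout
  wire_be = [3.14].pack("G")   # IEEE double, big-endian ("network") layout
  wire_le == wire_be                           # => false: different bytes
  wire_le.unpack("E") == wire_be.unpack("G")   # => true: one internal float,
                                               # conversion only at the boundary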
on 2006-06-17 18:38

On Jun 17, 2006, at 12:16 PM, Stefan Lang wrote: > Assuming there is an ideal > character set that is really sufficient to represent any > text in this world, it could be used to construct a String > class that abstracts the underlying representation completely > away. So all we need is an ideal character set? That sounds simple. :-) > By providing str_size you are already mixing up the purpose of > your simple byte vector and character strings. Yes. I was pointing out that there were multiple concerns that were being solved by a single class, and I said that there were good reasons for this. My point was that even if you choose to handle all those concerns in a single class, it was important to keep the concerns distinct during discussion. Something that I thought wasn't happening in this discussion. I think this is another example of the Humane Interface discussion started by Martin Fowler (http://www.martinfowler.com/bliki/HumaneInterface.html). In Ruby, arrays have an interface that allows them to be used as pure arrays, as lists, as queues, as stacks and so on, instead of having lots of additional classes. Similarly I think it makes sense for all M17N issues to be packaged up in a single class (String) instead of breaking up those concerns into a class hierarchy. Gary Wright
on 2006-06-17 19:37

On Saturday 17 June 2006 16:16, Austin Ziegler wrote: > On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote: > > Full ACK. Ruby programs shouldn't need to care about the > > *internal* string encoding. External string data is treated as > > a sequence of bytes and is converted to Ruby strings through > > an encoding API. > > This is incorrect. *Most* Ruby programs won't need to care about > the internal string encoding. Experience suggests, however, that it > is *most*. Definitely not all. As long as one treats a character string as a character string, the internal encoding is irrelevant, and as soon as a decision for an internal string encoding is made, every programmer can read in the docs "Ruby internally encodes strings using the XYZ encoding". [...] > Unnecessarily complex and inflexible. Before you go too much > further, I *really* suggest that you look in the archives and > Google to find more about Matz's m17n String proposal. It's a > really good one, as it allows developers (both pure Ruby and > extension) to choose what is appropriate with the ability to > transparently convert as well. I couldn't find much (in English, I don't understand Japanese), do you have a link at hand? [...] > already. That is easy to handle with the proposed scheme: Read as much as you need with the binary interface until you know the encoding and then do the conversion of the byte buffer to string. For file input, you can close the file when you have determined the encoding and reopen it using the "normal" (character oriented) interface. Or do you mean Ruby should determine the encoding automatically? IMO, that would be bad magic and error-prone. [...] > > If the strings are represented as a sequence of Unicode > > codepoints, it is possible for external libraries to implement > > more advanced Unicode operations. > > This would be true regardless of the encoding. But a conversion from [insert arbitrary encoding here] to unicode codepoints would be needed. > > This is better than doing a Fixnum representation. It is character > iteration, but each character is, itself, a String. I wouldn't mind additionally having: str.codepoint_at(5) => a Fixnum (a stand-in sketch follows below) [...] > and the people claiming that Unicode is the Only Way ... are wrong. > > string of characters) representation. String operations > > don't need to be written for different encodings. > > This is still (mostly) correct under the m17n String proposal. How does the regular expression engine work then? And all String methods that have to combine two or more strings in some way? [...] > > * Separation of concerns. I always found it strange that most > > dynamic languages simply mix handling of character and arbitrary > > binary data (just think of pack/unpack). > > The separation makes things harder most of the time. Why? In which cases? [...] > > It seems that the main argument against using Unicode strings in > > Ruby is because Unicode doesn't work well for eastern countries. > > Perhaps there is another character set that works better that we > > could use instead of Unicode. The important point here is that > > there is only *one* representation of character data in Ruby. > > This is a mistake. OK, Unicode was enough for me until now, but I see that Unicode is not enough for everyone. > > If Unicode is chosen as the character set, there is the question > > which encoding to use internally. UTF-32 would be a good choice > > with regards to simplicity in implementation, since each > > codepoint takes a fixed number of bytes.
> > Consider indexing of Strings: > Yes, but this would be very hard on memory requirements. There are > people who are trying to get Ruby to fit into small-memory > environments. This would destroy any chance of that. I can hardly believe that. There is still the binary IO interface and ByteString that I proposed. And I still think that the memory used for pure character data is a small fraction of the overall memory consumption of typical Ruby programs.
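str.codepoint_at(5) as requested above is hypothetical, but a UTF-8-only stand-in is a one-liner (an illustration of the interface, not an implementation proposal):

  class String
    def codepoint_at(i)
      unpack("U*")[i]   # decode the UTF-8 bytes to code points, then index
    end
  end

  "Stra\303\237e".codepoint_at(4)  # => 223, i.e. U+00DF ("ß"), as a Fixnum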
on 2006-06-17 22:34

On Sun, Jun 18, 2006 at 01:02:39AM +0900, Paul Battley wrote: > On 17/06/06, Austin Ziegler <halostatue@gmail.com> wrote: > >> - This ties Ruby's String to Unicode. A safe choice IMHO, or would we > >> really consider something else? Note that we don't commit to a > >> particular encoding of Unicode strongly. > > > >This is a wash. I think that it's better to leave the options open. > >After all, it *is* a hope of mine to have Ruby running on iSeries > >(AS/400) and *that* still uses EBCDIC. AFAIK, EBCDIC can be losslessly converted to Unicode and back. Right? On the other hand, do you really trust all Ruby library writers to accept your strings tagged with EBCDIC encoding? Or do you look forward to a lot of manual conversions? > Paul. That's why I explicitly stated it ties Ruby's String class to Unicode Character Code Points, but not to a particular Unicode encoding or character class, and *that* was Java's main folly. (UCS-2 is a strictly 16-bit-per-character encoding, but new Unicode standards specify 21-bit characters, so they had to "extend" it.) I am unaware of unsolvable problems with Unicode and Eastern languages, I asked specifically about it. If you think Unicode is unfixably flawed in this respect, I guess we all should write off Unicode now rather than later? Can you detail why Unicode is unacceptable as a single world-wide unifying character set? Especially, are there character sets which cannot be converted to Unicode and back, which is the main requirement to have Unicode Strings in a non-Unicode environment? Jürgen
on 2006-06-17 22:37

On Sun, Jun 18, 2006 at 01:16:12AM +0900, Stefan Lang wrote: > > several different encodings (UTF-8, UTF-16, and so on). Juergen, > > when you say 'internal encoding' are you talking about the text > > encoding of Ruby source code? > The main point of the proposal is to use a single universal character > encoding for all Ruby character strings (instances of the String > class). Assuming there is an ideal character set that is really > sufficient to represent any text in this world, it could be used to > construct a String class that abstracts the underlying representation > completely away. That's what I meant, yes. And that is the most important point too. Jürgen
on 2006-06-17 23:02

On 17/06/06, Juergen Strobel <strobel@secure.at> wrote: > I am unaware of unsolveable problems with Unicode and Eastern > languages, I asked specifically about it. If you think Unicode is > unfixably flawed in this respect, I guess we all should write off > Unicode now rather than later? Can you detail why Unicode is > unacceptable as a single world wide unifying character set? > Especially, are there character sets which cannot be converted to > Unicode and back, which is the main requirement to have Unicode > Strings in a non-Unicode environment? They aren't so much unsolvable problems as mutually incompatible approaches. Unicode is concerned with the semantic meaning of a character, and ignores glyph variations through the 'Han unification' process. TRON encoding doesn't use Han unification: it encodes the historically-same Chinese character differently for different languages/regions where they are written differently today. Mojikyo encodes each graphically distinct character differently and includes a very wide range of historical characters, and is therefore particularly suited to certain linguistic and literary niches. In spite of this, I think that Unicode is an excellent choice for everyday usage. Unicode does have a solution to the problem of character variants, but it's not a universal back end for all encodings. Incidentally, it is said that TRON is the world's most widely-used operating system, so supporting that encoding is not necessarily a minor concern. Paul.
on 2006-06-17 23:51

On Sat, Jun 17, 2006 at 10:52:24PM +0900, Austin Ziegler wrote: > >2. Strings should neither have an internal encoding tag, nor an > >external one via $KCODE. The internal encoding should be encapsulated > >by the string class completely, except for a few related classes which > >may opt to work with the gory details for performance reasons. > >The internal encoding has to be decided, probably between UTF-8, > >UTF-16, and UTF-32 by the String class implementor. > > Completely disagree. Matz has the right choice on this one. You can't > think in just terms of a pure Ruby implementation -- you *must* think > in terms of the Ruby/C interface for extensions as well. I admit I don't know about Ruby's C extensions. Are they unable to access String's methods? That is all that is needed to work with them. And since this String class does not have a parametric encoding attribute, it should be easier to crunch in C even. > fails because your #2 is unacceptable. Note that explicit conversion to characters, arrays, etc., is possible for any supported character set and encoding. I have even given method examples. "External" is to be seen in the context of the String class. > >case folding, sorting, comparing etc. > > Agreed, but this would be expected regardless of the actual encoding of > a String. I am unaware of Matz's exact plan. Any good English-language links? I was under the impression users of Matz's String instances need to look at the encoding tag to implement e.g. #version_sort. If that is not the case, our proposals are not that much different, only Matz's is even more complex to implement than mine. > >tradeoff reasons which work transparently together (a bit like Fixnum > >and Bignum). > > Um. Disagree. Matz's proposed approach does this; yours does not. Yours, > in fact, makes things *much* harder. If Matz's approach requires looking at the encoding tag from the outside, it is not as transparent as mine. If it isn't, it just boils down to a parametric class versus subclass hierarchy design decision, and I don't see much difference and would be happy with either one. > > >8. Because Strings are tightly integrated into the language with the > >source reader and are used pervasively, much of this cannot be > >provided by add-on libraries, even with open classes. Therefore the > >need to have it in Ruby's canonical String class. This will break some > >old uses of String, but now is the right time for that. > > "Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1. My original title, somewhere snipped out, was "A Plan for Unicode Strings in Ruby 2.0". I don't want to rush things or break 1.8 either. > > >9. The String class does not worry over character representation > >on-screen; the mapping to glyphs must be done by UI frameworks or the > >terminal attached to stdout. > > The String class doesn't worry about that now. I was just playing safe here. > >10. Be flexible. <placeholder for future idea> > > And little is more flexible than Matz's m17n String. I've had flexibility with respect to Unicode standards in mind, to not fall into traps similar to Java's. A simple-to-use String class, powerful enough to include every character of the world, was my goal, with the ability to convert to and from other external (from the String class's point of view) representations.
The flexibility to have parametric String encodings inside the String class was not what I was going for; rather, I would have that inaccessible or at least unnecessary to access for the common String user, and I provided a somewhat weaker but maybe still sufficient technique via subclassing. > Remember: POLS is not an acceptable reason for anything. Matz's m17n > Strings would be predictable, too. a + b would be possible if and only > if a and b are the same encoding or one of them is "raw" (which would > mean that the other is treated as the defined encoding) *or* there is a > built-in conversion for them. Since I probably cannot control which Strings I get from libraries, and don't want to worry which ones I'll have to provide to them, this is weaker than my approach in this respect, see my next point. > work for Ruby/C interfaced items. Sorry. Please elaborate this or provide pointers. I cannot believe C cannot crunch my Strings, which are less parametric than Matz's ones are. > whether it's *actually* UTF-8 or not until I get HTTP headers -- or > worse, a <meta http-equiv> tag. Assuming UTF-8 reading in today's world > is doomed to failure. Read it as binary, and decide later. These problems should be locally containable, and methods are still able to return Strings after determining the encoding. > tags. Merely that they *could*. I suspect that there will be pragma-like > behaviours to enforce a particular internal representation at all > times. Previously you stated users need to look at the encoding to determine if simple operations like a + b work. Can you point to more info? I am interested in how this pragma stuff works, and if not doing it "right" can break things. > >*Disadvantages* (with mitigating reasoning of course) > >- String users need to learn that #byte_length(encoding=:utf8) >= > >#size, but that's not too hard, and applies everywhere. Users do not > >need to learn about an encoding tag, which is surely worse to handle > >for them. > > True, but the encoding tag is not worse. Anyone who assumes that > developers can ignore encoding at any time simply *doesn't* know about > the level of problems that can be encountered. For String concatenation, substring access, search, etc., I expect to be able to ignore encoding totally. Only when interfacing with non-String-class objects (I/O and/or explicit conversion) would I need encoding info. > >- Strings cannot be used as simple byte buffers any more. Either use > >an array of bytes, or an optimized ByteBuffer class. If you need > >regular expression support, Regexp can be extended for ByteBuffers or > >even more. > > I see no reason for this. In my proposal, Unicode Strings cannot represent arbitrary binary data in their internal representation, since not everything would be valid characters. In fact, you cannot set the internal representation directly. The interface could accept a code point sequence of values (0..255), but that would be wasteful compared to an array of bytes. > >- Some String operations may perform worse than might be expected from > >a naive user, in both the time and space domains. But we do this so the > >String user doesn't need to himself, and we are probably better at it > >than the user too. > > This is a wash. Only trying to refute weak arguments in advance. > >- For very simple uses of String, there might be unnecessary > >conversions. If a String is just to be passed through somewhere, > >without inspecting or modifying it at all, in- and outwards conversion > >will still take place.
You could and should use a ByteBuffer to avoid > >this. > > This is a wash. Not a big problem either, but someone was bound to bring it up. > >users really do get unexpected foreign characters in their Strings. I > >concluded case folding. I think it is more than that: we are lazy and > >understood this could be handled by future Unicode revisions The way I see it, we have to choose a character set. I proposed Unicode, because their official goal is to be the one unifying set, and if they ain't yet, I hope they'll be sometime. If that is not enough, we will effectively create our own character set, let's call it RubyCode, which will contain characters from the union of Unicode and a few other sets. Each String will have a particular encoding, which will determine which characters of RubyCode are valid in this particular String instance. Hopefully many characters will be valid in multiple encodings. But it doesn't sound like a very clear design to me. Jürgen
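The #byte_length(encoding) >= #size relation discussed above can be checked with today's pack/unpack (both method names are proposals; unpack stands in for them here):

  chars = "\303\234berraschung".unpack("U*")  # "Überraschung" as code points
  chars.size                          # => 12, the proposed #size
  chars.pack("U*").unpack("C*").size  # => 13, the proposed #byte_length(:utf8)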
on 2006-06-17 23:57

On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote: > > As long as one treats a character string as a character > string, the internal encoding is irrelevant, and as soon as a No, it is not. First, for reasons of efficiency. If an application is going to perform lots of slicing and poking on strings it will want some encoding that is suitable for that, such as UTF-32. If an application runs on a system with little memory it will want a space-efficient encoding (i.e. UTF-8, or UTF-16 for Asian languages). And if an application runs on a system that uses some legacy codepage it can read, write, and process all strings in that codepage. And in JRuby it will be useful to convert strings to UTF-16 so that the native Java functions can be used for manipulation. Second, not all characters are equal. If you lived in a world where everything was Unicode you would be fine. But it is not so. Unicode is suboptimal for encoding CJK characters. So some people might want to use another encoding for their texts (IIRC, TRON mentioned earlier is one such encoding). In your model you can modify Ruby to use strings composed of TRON characters instead of Unicode characters. But how would Unicode Ruby and TRON Ruby exchange strings? And how would you write an application that handles _both_ TRON and Unicode? (I suspect TRON would not be much good, e.g., for Runic script.) Such an application has to be written very carefully because neither character set would be a subset of the other, so it is not possible to convert strings back and forth without thinking. But in your model such an application is not possible at all. > decision for an internal string encoding is made, every > programmer can read in the docs "Ruby internally encodes > strings using the XYZ encoding". > > [...] > > I indicated to Juergen, sometimes *impossible* to determine the > > encoding to be used for an IO until you have some data from the IO > > already. > > That is easy to handle with the proposed scheme: Read as much > as you need with the binary interface until you know the > encoding and then do the conversion of the byte buffer to > string. For file input, you can close the file when you have > determined the encoding and reopen it using the "normal" > (character oriented) interface. Why reopen or convert if you can simply tag a string that you had to read anyway? > > Or do you mean Ruby should determine the encoding > automatically? IMO, that would be bad magic and error-prone. No. But if you read part of an HTML/XML document before the encoding was specified, there is no reason why that part has to be converted or reread. You apparently got it right if you were able to determine the encoding from what you read. > > [...] > > > If the strings are represented as a sequence of Unicode > > > codepoints, it is possible for external libraries to implement > > > more advanced Unicode operations. > > > > This would be true regardless of the encoding. > > But a conversion from [insert arbitrary encoding here] to > unicode codepoints would be needed. That will be needed anyway. You cannot expect all libraries to use the arbitrary encoding you chose for Ruby strings. But if you can choose the encoding of your strings there is nothing stopping you from converting your strings so that they best suit your library of choice. > > > > > * There is only one internal string (where string means a > > > string of characters) representation. String operations > > > don't need to be written for different encodings.
> > > > This is still (mostly) correct under the m17n String proposal. > > How does the regular expression engine work then? And all > String methods that have to combine two or more strings in > some way? If they are both subsets of Unicode I see no problem with converting both to Unicode. If they are incompatible, things may break. But that is because of real incompatibility, not because of some restriction of the approach. > > [...] > > > * Separation of concerns. I always found it strange that most > > > dynamic languages simply mix handling of character and arbitrary > > > binary data (just think of pack/unpack). > > > > The separation makes things harder most of the time. > > Why? In which cases? Such as when you have to read the start of an HTML page as a ByteBuffer and then convert it to a String once you determine the encoding. Especially if string operations do not exist on the ByteBuffer to allow parsing it. > > I can hardly believe that. There is still the binary IO > interface and ByteString that I proposed. And I still think > that the memory used for pure character data is a small > fraction of the overall memory consumption of typical Ruby > programs. It depends on the program. For programs that do only text processing, the portion of memory taken by text may be large. Michal
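The HTML case is exactly where today's read-bytes-then-convert dance shows up. A sketch with 1.8's Iconv (the regexp sniff is naive, and as noted earlier in the thread, HTTP headers may override the <meta> tag):

  require 'iconv'

  raw = File.open("page.html", "rb") { |f| f.read(4096) }  # bytes; encoding unknown
  charset = raw[/<meta[^>]*charset=["']?([\w-]+)/i, 1] || "ISO-8859-1"
  text = Iconv.conv("UTF-8", charset, raw)  # convert only once we know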
on 2006-06-18 00:16

On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote: > internal string encoding is made, every programmer can read in the > docs "Ruby internally encodes strings using the XYZ encoding". And I'm saying that it's a mistake to do that (standardize on a single encoding). Every programmer will instead be able to read: "Ruby supports encoded strings in a variety of encodings. The default behaviour for all strings is XYZ, but this can be changed and individual strings may be recoded for performance or compatibility reasons." Language and character encodings are hard. Hiding that fact is a mistake. That doesn't mean we have to make the APIs difficult, but that we aren't going to be buzzworded into compliance, either. > [...] >> Unnecessarily complex and inflexible. Before you go too much further, >> I *really* suggest that you look in the archives and Google to find >> more about Matz's m17n String proposal. It's a really good one, as it >> allows developers (both pure Ruby and extension) to choose what is >> appropriate with the ability to transparently convert as well. > I couldn't find much (in English, I don't understand Japanese), do you > have a link at hand? I do not. I've been reading about this, talking about this, and discussing it with Matz for the last two years or so, and I've been dealing with Unicode and other character encoding issues extensively at work. However, the gist of it is that every String is still a byte vector. Each string will also have an encoding flag. Substrings of a single character width will always return the String required for the *character*. The supported encodings will probably start with UTF-8, UTF-16, various ISO-8859-* encodings, EUC-JP, SJIS, and other Asian encodings. > > Or do you mean Ruby should determine the encoding automatically? IMO, > that would be bad magic and error-prone. I mean that what you're suggesting *exposes* problems with encoding stuff extensively and unnecessarily. I certainly wouldn't want to program in it if the API involved were as stupid as you're suggesting it should be. > [...] >>> If the strings are represented as a sequence of Unicode codepoints, >>> it is possible for external libraries to implement more advanced >>> Unicode operations. >> This would be true regardless of the encoding. > But a conversion from [insert arbitrary encoding here] to unicode > codepoints would be needed. Why? What if the library that I'm interfacing with requires EUC-JP? Sorry, but Unicode is *not necessarily* the right answer. >> This is better than doing a Fixnum representation. It is character >> iteration, but each character is, itself, a String. > I wouldn't mind additionally having: > > str.codepoint_at(5) => a Fixnum Since Ruby isn't *only* using Unicode, this isn't necessarily going to be possible or meaningful. > [...] >>> * There is only one internal string (where string means a >>> string of characters) representation. String operations >>> don't need to be written for different encodings. >> This is still (mostly) correct under the m17n String proposal. > How does the regular expression engine work then? And all > String methods that have to combine two or more strings in > some way? Matz will have that figured and detailed before he starts writing it. > [...] >>> * Separation of concerns. I always found it strange that most >>> dynamic languages simply mix handling of character and arbitrary >>> binary data (just think of pack/unpack). >> The separation makes things harder most of the time. > Why? In which cases? 
In *reality*, the separation is not nearly as clean as people who
advocate such separations would like to pretend. It's less of a problem
in dynamic languages like Ruby, but it's also far less necessary in
dynamic languages like Ruby. I have found it far more useful to not
have to care whether I'm reading a binary or string value. I despise
dealing with C++ and Java where I am forced to care because of stupid
API design.

> [...]
>>> It seems that the main argument against using Unicode strings in
>>> Ruby is because Unicode doesn't work well for eastern countries.
>>> Perhaps there is another character set that works better that we
>>> could use instead of Unicode. The important point here is that there
>>> is only *one* representation of character data in Ruby.
>> This is a mistake.
> OK, Unicode was enough for me until now, but I see that Unicode is not
> enough for everyone.

Thank you. Unicode needs to -- will -- work *very* well. I know enough
about Unicode handling to make sure that what I deal with *will*. But I
have come to believe that choosing a single encoding as your String
representation is a mistake, even if it means making your job harder by
defining and implementing rules for mixed-encoding handling.

> consumption of typical Ruby programs.

I can believe it; it's very domain and program specific, but you've
just proposed multiplying the memory usage of that amount of space by
four. (Rails would suffer terribly under your proposal to use UTF-32.)

-austin
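To make the shape of the proposal concrete, here is a minimal sketch of
how such an encoding-tagged String might behave. The method names
(#encoding, #recode) are placeholders for illustration, not a confirmed
API, and assume the IO layer has tagged the incoming bytes:

  s = File.read("kanji.txt")   # still a byte vector underneath
  s.encoding                   # => "SJIS"  (the tag, not the bytes)
  u = s.recode("UTF-8")        # explicit transcoding when you need it
  u.encoding                   # => "UTF-8"
  s[0]                         # => the first SJIS *character*, itself a String
  s + u                        # combined per the m17n compatibility rules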
on 2006-06-18 00:22

On 6/17/06, Juergen Strobel <strobel@secure.at> wrote:
> On Sun, Jun 18, 2006 at 01:02:39AM +0900, Paul Battley wrote:
>> On 17/06/06, Austin Ziegler <halostatue@gmail.com> wrote:
>>>> - This ties Ruby's String to Unicode. A safe choice IMHO, or would
>>>> we really consider something else? Note that we don't commit to a
>>>> particular encoding of Unicode strongly.
>>> This is a wash. I think that it's better to leave the options open.
>>> After all, it *is* a hope of mine to have Ruby running on iSeries
>>> (AS/400) and *that* still uses EBCDIC.
> AFAIK, EBCDIC can be losslessly converted to Unicode and back. Right?

Which code page? EBCDIC has as many code pages (including a
UTF-EBCDIC) as exist in other 8-bit encodings.

> On the other hand, do you really trust all ruby library writers to
> accept your strings tagged with EBCDIC encoding? Or do you look
> forward to a lot of manual conversions?

It depends on the purpose of the library. Very few libraries end up
using byte vectors for strings or completely treat them as such. I
would expect that some of the libraries that I've written would work
without any problems in EBCDIC.

> Character Code Points, but not to a particular Unicode encoding or
> character class, and *that* was Java's main folly. (UCS-2 is a
> strictly 16 bit per character encoding, but new Unicode standards
> specify 21 bit characters, so they had to "extend" it.)

Um. Do you mean UTF-32? Because there's *no* binary representation of
Unicode Character Code Points that isn't an encoding of some sort. If
that's the case, that's unacceptable as a memory representation.

> I am unaware of unsolvable problems with Unicode and Eastern
> languages, I asked specifically about it. If you think Unicode is
> unfixably flawed in this respect, I guess we all should write off
> Unicode now rather than later? Can you detail why Unicode is
> unacceptable as a single world wide unifying character set?
> Especially, are there character sets which cannot be converted to
> Unicode and back, which is the main requirement to have Unicode
> Strings in a non-Unicode environment?

Legacy data and performance.

-austin
on 2006-06-18 00:25

On Jun 17, 2006, at 5:48 PM, Juergen Strobel wrote:
> The way I see it we have to choose a character set.
What leads you to this conclusion? I don't think it can be refuted
that there exists today an almost endless number of character sets
and text encodings in use. I don't understand why the core facilities
of a language should be intimately tied to any one of those
representations. Once you do that you've decided that all other
representations are second class citizens. Why not have the language
be agnostic about these things but still provide a coherent framework
for building libraries and applications that can be locale and
encoding-aware?
Gary Wright
on 2006-06-18 00:49

On 6/17/06, Juergen Strobel <strobel@secure.at> wrote:
> On Sat, Jun 17, 2006 at 10:52:24PM +0900, Austin Ziegler wrote:
>> On 6/17/06, Juergen Strobel <strobel@secure.at> wrote:
>> mean that the other is treated as the defined encoding) *or* there is
>> a built-in conversion for them.
> Since I probably cannot control which Strings I get from libraries,
> and don't want to worry which ones I'll have to provide to them, this
> is weaker than my approach in this respect, see my next point.

It's apparent from the explanation above. You do not have to look at
string encodings or worry which encoding they are as long as they are
compatible (i.e. iso-8859-1 and utf-8) - there is a conversion for
them. The string methods have to use (internally) the encoding tag, and
you can look at it if you are interested.

If the strings are incompatible it is a real problem. Not one created
by the implementation but one originating from the fact that the
strings cannot be automatically converted from one encoding to another.

But you can keep all your strings, even if they are in several
incompatible encodings. You are not limited to using just one encoding.

Michal
on 2006-06-18 01:17

On 17-jun-2006, at 23:55, Michal Suchanek wrote:
> First for reasons of efficiency. If an application is going to perform
> lots of slicing and poking on strings it will want some encoding that
> is suitable for that, such as UTF-32.

I would much rather prefer UTF-8 in a language such as Ruby, which is
often used as glue between other systems. UTF-8 is used for interchange
and it's indisputable. If you go for UTF-16 or UTF-32, you are most
likely to convert every single character of the text files you read (of
the text files present in the wild, UTF-16 and UTF-32 are AFAIK a
minority, thanks to the BOM and other setbacks).

> If an application runs on a system
> with little memory it will want a space-efficient encoding (i.e. UTF-8
> or UTF-16 for Asian languages). And if an application runs on a system
> that uses some legacy codepage it can read, write, and process all
> strings in that codepage. And in JRuby it will be useful to convert
> strings to UTF-16 so that the native Java functions can be used for
> manipulation.
>
> In your model you can modify Ruby to use
> strings composed of TRON characters instead of Unicode characters. But
> how would Unicode Ruby and TRON Ruby exchange strings?

I think Alan Little summed it up very well. The problem with Unicode in
Ruby is the striving for perfection (i.e. satisfying the users of every
conceivable or needed encoding). It's very noble and I personally can't
imagine it (even with the "democratic coerce" approach Austin cited).
What I don't know is whether a system having this type of handling can
be built at all, and how it will interoperate. Up until now all the
scripting languages I used somewhat (Perl, Python, Ruby) allowed all
encodings in strings, and doing Unicode in them hurts.

Bluntly put, I am selfish and I don't believe in the "saving grace" of
M17N (because I just can't wrap it around my head and I sure as hell
know it's going to be VERY complex). It's also something that bothers
me the most about Ruby's "unicode discussions" (I've read all of them
on this list dating back to 2002 because I need it to work NOW) - they
always transcend into this kind of religious discussion in the spirit
of "but your encoding is not good enough", "but my bad encoding isn't
that one and I still need it to work" etc. While for me the greatest
thing about Unicode is that it's Just Good Enough.

And it doesn't seem Unicode is indeed THAT useless for CJK languages
either (although I'm sure Paul can correct me - all the 4 languages I
am in control of use only 2 scripting systems with some odd additions
here and there). And no, I didn't have a chance to see a TRON system in
the wild. If someone would show me one within 200 km distance I would
be glad to take a look.
on 2006-06-18 01:20

On 18-jun-2006, at 0:21, Austin Ziegler wrote:
> Legacy data and performance.
Yes, you will spend those cycles to count the letters in my language
RIGHT :-)) (evil grin)
It's actually the most common case when apps damage strings in my
language - their authors wanted to be smart
and _conserve_. And yes, normalization etc. is complex and you DO
need to have a case-conversion table in memory. Please do have one
(Ruby doesn't).
No offense, just observation.
on 2006-06-18 02:18

On Saturday 17 June 2006 23:55, Michal Suchanek wrote:
> On 6/17/06, Stefan Lang <langstefan@gmx.at> wrote:
[...]
> And if an application runs on a system that uses some legacy codepage
> it can read, write, and process all strings in that codepage. And
> in JRuby it will be useful to convert strings to UTF-16 so that the
> native Java functions can be used for manipulation.

If you really need this level of efficiency, Ruby is probably the wrong
language anyway.

Regarding JRuby: Of course each implementation would be free to choose
an internal Unicode encoding. If somebody has enough time and
motivation he can even implement support for multiple encodings and let
the user choose at build-time.

[...]
>> Or do you mean Ruby should determine the encoding
>> automatically? IMO, that would be bad magic and error-prone.
>
> No. But if you read part of an html/xml document before the encoding
> was specified there is no reason why that part has to be converted
> or reread. You apparently got it right if you were able to
> determine the encoding from what you read.

The conversion would be done anyway, iff a single internal encoding was
chosen and iff the encoding of the input doesn't match the internal
encoding.

> That will be needed anyway. You cannot expect all libraries to use
> the arbitrary encoding you chose for Ruby strings.

I assume you mean C libraries here.
on 2006-06-18 04:38

On 6/17/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> Yes, you will spend those cycles to count the letters in my language
> RIGHT :-)) (evil grin) It's actually the most common case when apps
> damage strings in my language - their authors wanted to be smart and
> _conserve_. And yes, normalization etc. is complex and you DO need to
> have a case-conversion table in memory. Please do have one (Ruby
> doesn't).

I think you're overthinking the problem. Let's consider the guarantees
that an m17n String would make:

* #size and #length would return the number of glyphs
* #[] would return glyphs

Presumably, in Regexen with an m17n String, \w would indicate only
"word" glyphs. Other guarantees *would* be made along that line.
Therefore, if your input data is UTF-8, anything that deals with #size,
#length, and character-based indexing *will just work*. The same will
apply to SJIS or any other encoding.

The number of times that people are dealing with mixed-encoding data is
vanishingly small, and even when a developer must, they will probably
use a Unicode encoding to deal with that. But if you're using SJIS,
you're just going to want to use *that*. That's what the m17n String is
about. It's not about dictating a single encoding, but enabling people
to use Strings intelligently.

> No offense, just observation.

I agree -- we *need* full Unicode support. But not at the cost of
legacy code pages in favour of Unicode. It's not always appropriate.

-austin
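A sketch of what those guarantees would mean in practice, assuming a
string tagged as UTF-8 (the byte counts are real UTF-8; the behaviour
shown is the proposal, not anything Ruby 1.8 does):

  s = "résumé"    # six characters in eight UTF-8 bytes
  s.size          # => 6 under the proposal (plain 1.8 answers 8)
  s[1]            # => "é", a one-character String, not a Fixnum
  s =~ /\w+/      # \w matches word *glyphs*, not individual bytes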
on 2006-06-18 06:19

On Jun 17, 2006, at 4:08 AM, Juergen Strobel wrote:
> 1. Strings should deal in characters (code points in Unicode) and not
> in bytes, and the public interface should reflect this.

Be careful. People who care about this stuff might want to read
http://www.w3.org/TR/2005/REC-charmod-20050215/

It turns out that characters do not correspond one-to-one with units of
sound, or units of input, or units of display. Except for low-level
stuff like regexps, it's very difficult to write any code that goes
character-at-a-time that doesn't contain horrible i18n bugs. For
practical purposes, a String is a more useful basic tool than a
character.

> 5. Since the String class is quite smart already, it can implement
> generally useful and hard (in the domain of Unicode) operations like
> case folding, sorting, comparing etc.

Be careful. Case folding is a horrible can of worms, is rarely
implemented correctly, and when it is (the Java library tries really
hard) is insanely expensive. The reason is that case conversion is not
only language-sensitive but jurisdiction-sensitive (in some respects
different in France & Québec). Trying to do case-folding on text that
is not known to be ASCII is likely a symptom of a bug.

> - This ties Ruby's String to Unicode. A safe choice IMHO, or would we
> really consider something else? Note that we don't commit to a
> particular encoding of Unicode strongly.

For information: The XML view is that Shift-JIS, KOI8-R, EBCDIC, and
many others are all encodings of Unicode and a best effort should be
made to accept and emit all sane encodings on demand. Most XML software
sticks to a single encoding, internally.

-Tim
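The classic illustration of language-sensitive casing is Turkish
dotted/dotless i. The locale argument below is hypothetical - no such
Ruby API exists - but the character facts are real:

  "i".upcase        # => "I" - right for English, wrong for Turkish
  # In Turkish the uppercase of "i" is the dotted "İ" (U+0130) and the
  # lowercase of "I" is the dotless "ı" (U+0131), so a correct
  # case-folding API would need something like:
  "i".upcase(:tr)   # => "İ"  (hypothetical locale-aware call)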
on 2006-06-18 06:29

On Jun 17, 2006, at 6:50 AM, Stefan Lang wrote:
> It seems that the main argument against using Unicode strings
> in Ruby is because Unicode doesn't work well for eastern
> countries.

Point of information: there are highly successful word-processing
products selling well in countries whose writing systems include Han
characters, which internally use Unicode. So while the Han-unification
problems have been much discussed and are regarded as important by
people who are not fools, in fact there is existence proof that Unicode
does work well enough for wide deployment in commercial software.

> If Unicode is chosen as character set, there is the
> question which encoding to use internally. UTF-32 would be a
> good choice with regards to simplicity in implementation,

UTF-32 has a practical problem in that in C code, you can't use
strcmp() and friends because it's full of null bytes. Of course if
you're careful to code everything using wchar_t you'll be OK, but lots
of code isn't. (UTF-8 doesn't have this problem and is much more
compact.)

> Consider
> indexing of Strings:
>
> "some string"[4]
>
> If UTF-32 is used, this operation can internally be
> implemented as a simple, constant array lookup. If UTF-16 or
> UTF-8 is used, this is not possible to implement as an array

Correct. But in practice this seems not to be too huge a problem, since
in practice text is most often accessed sequentially. The times that
you really need true random access to the N'th character are rare
enough that for some problems, the advantages of UTF-8 are big enough
to compensate for this problem. Note that in a variable-length
character encoding, there's no trouble whatever with a table of
pointers into text; the *only* problem is when you need to find the Nth
character cheaply.

> An advantage of using UTF-8 would be that for pure ASCII files
> no conversion would be necessary for IO.

Be careful. There are almost no pure ASCII files left. Café. Ordoñez.
“Smart quotes.”

-Tim
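The indexing trade-off is easy to see from Ruby 1.8, where the real
'U*' directive of String#unpack decodes UTF-8 into an array of
codepoints (a minimal sketch):

  s = "café"             # four characters, five bytes in UTF-8
  s[3]                   # => 195 in 1.8: a byte of "é", not a character
  cps = s.unpack('U*')   # => [99, 97, 102, 233] - an O(n) decode pass
  cps[3]                 # => 233, constant-time once decoded
  # UTF-32 is, in effect, that array kept around permanently:
  # O(1) indexing bought with four bytes per character.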
on 2006-06-18 06:36

On Jun 17, 2006, at 6:52 AM, Austin Ziegler wrote:
>> The internal encoding has to be decided, probably between UTF-8,
>> UTF-16, and UTF-32 by the String class implementor.
>
> Completely disagree. Matz has the right choice on this one. You can't
> think in just terms of a pure Ruby implementation -- you *must* think
> in terms of the Ruby/C interface for extensions as well.

Point of information: Of all the widely-used methods of encoding
international strings, UTF-8 is by far the easiest to deal with in C.

> Trust me on this
> one: I *have* done some low-level encoding work. Additionally, even
> though I might have marked a network object as "UTF-8", I may not know
> whether it's *actually* UTF-8 or not until

That's an incredibly important point in a networked world. One of the
reasons XML has had so much success, probably more than it deserves, is
that its encoding is self-descriptive. To quote Larry Wall: "An XML
document knows what encoding it's in." Since HTTP headers are (sigh)
known to be wrong on occasion, this is a pretty big value-add.

>> - This ties Ruby's String to Unicode. A safe choice IMHO, or would we
>> really consider something else? Note that we don't commit to a
>> particular encoding of Unicode strongly.
>
> This is a wash. I think that it's better to leave the options open.
> After all, it *is* a hope of mine to have Ruby running on iSeries
> (AS/400) and *that* still uses EBCDIC.

EBCDIC is in fact an encoding of Unicode. Just saying that it's
necessary to be clear both as to what character set is being supported,
and what limitations on encoding are enforced.

-Tim
on 2006-06-18 06:45

On Jun 17, 2006, at 10:34 AM, Stefan Lang wrote:
> Or do you mean Ruby should determine the encoding
> automatically? IMO, that would be bad magic and error-prone.

Not possible in the general case. There are a few data formats,
including XML and ASN.1, which make it possible to reliably infer the
encoding from the instance, but a lot of Web processing these days is
best-guess, and often fails.

> How does the regular expression engine work then?

The two sane options are (a) have a fixed encoding for Strings and
compile the regex in such a way that it runs directly on the encoding.
This has been done for both UTF-8 and UTF-16 and is insanely efficient,
but it locks you into the fixed encoding. (b) have an iterator which
produces abstract characters from whatever encoding is in use and run
the regex over the characters, not the bytes of the representation. The
implementation is trickier and performance is an issue, but you're not
locked to an encoding.

-Tim
on 2006-06-18 06:51

On Jun 17, 2006, at 2:55 PM, Michal Suchanek wrote:
> First for reasons of efficiency. If an application is going to perform
> lots of slicing and poking on strings it will want some encoding that
> is suitable for that, such as UTF-32.

Um, the practical experience is that the code required to unpack a
UTF-8 stream into a sequence of integer codepoints (and reverse the
process) is easy and very efficient; to the point that for "slicing and
poking", UTF-8 vs UTF-16 vs UTF-32 is pretty well a wash.

-Tim
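For a sense of how little code that unpacking takes, here is a minimal
pure-Ruby decoder - a sketch only: it assumes well-formed input and
skips validation of overlong or truncated sequences:

  def each_codepoint(str)
    bytes = str.unpack('C*')            # raw byte values
    i = 0
    while i < bytes.length
      b = bytes[i]
      if    b < 0x80 then cp, n = b,        1   # 0xxxxxxx: ASCII
      elsif b < 0xE0 then cp, n = b & 0x1F, 2   # 110xxxxx
      elsif b < 0xF0 then cp, n = b & 0x0F, 3   # 1110xxxx
      else                cp, n = b & 0x07, 4   # 11110xxx
      end
      # each continuation byte contributes its low six bits
      (1...n).each { |k| cp = (cp << 6) | (bytes[i + k] & 0x3F) }
      yield cp
      i += n
    end
  end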
on 2006-06-18 06:57

On Jun 17, 2006, at 3:15 PM, Austin Ziegler wrote:
> Why? What if the library that I'm interfacing with requires EUC-JP?
> Sorry, but Unicode is *not necessarily* the right answer.

Indeed it's not, but this argument escapes me. If you try to feed that
library an Arabic string, something will break, because EUC-JP can't
represent Arabic. So what? Whatever character set(s) you standardize
on, there is going to be existing software that won't be able to handle
all of it... I'm just not following your argument.

-Tim
on 2006-06-18 07:00

On Jun 17, 2006, at 3:22 PM, gwtmp01@mac.com wrote:
> be locale and encoding-aware?
I'm not close enough to Ruby to have a useful opinion, but for many
other software systems, the designers decided that the performance
and interoperability gains achievable by limiting themselves to
Unicode were a compelling enough argument, and so chose.
In particular, these days, both the W3C and the IETF overwhelmingly
specify the use of Unicode characters when text is to be included in
protocols or data delivery formats. So even if you can handle lots
of non-Unicode stuff, the Net may have difficulty getting it to you.

-Tim
on 2006-06-18 07:03

I'll chime back in with my not-so-expert opinion, so it's known where I
stand. Take it for whatever it's worth.

- I almost entirely agree with Juergen's longer post on what unicode
support should look like in 2.0. I won't go into the details of what I
disagree with because I'm a little squishy in those areas.

- I believe that supporting encoding-tagged strings would be a
horrible, horrible mess for both Ruby VM/interpreter implementers and
extension implementers while not adding any serious benefits for Ruby
the language. When it comes down to it, you're going to have string A
using encoding X and string B using encoding Y, and in order to work
with them both together you'll have to find some common ground. Settle
on common ground early or you pay the price to do it EVERY time you
work with strings later.

- I have no intention to ever write a C extension for Ruby. I know many
out there do. However, I think the important thing about Ruby is Ruby,
and making the language bend over backwards to make life easier for C
hackers is absurd. Making unicode support needlessly complex in Ruby
(the language) only ends up hurting its usability. I for one would not
want to sacrifice the beauty and simplicity of Ruby solely to appease
the C community. Flame on if you will, but The Ruby Way should rule
here.

- In the end, I should not have to care what encoding strings use
internally unless I absolutely have to know. Every time questions come
up about unicode support in Java, I have to look it up...UTF-8? UTF-16?
UCS-2? I rarely need to know this information, and I rarely remember
it. That's exactly the point. Make the one internal encoding whatever
is deemed most flexible, most performant, and above all *most global*.
Nobody writing Ruby code should have to care.

- I so rarely work with Strings on a character-by-character basis, and
when I do, all I should have to say is get_character and know that what
I have represents a full and complete character representation. If
you're dealing with bytes, call it what it is--the aforementioned
ByteBuffer. Ruby needs to support the concepts of Strings and
ByteBuffers independently.

I think it all comes back to a simple question: Which method of
supporting unicode would feel the most "Ruby"? Which one is DRY and
KISS and all the other lovely acronyms this community holds so dear?
Figure that out, and there's your answer. I'd be willing to bet it's
not every-string-can-encode-differently, because I don't see how that
would ever help me write better Ruby code...and improving Ruby is the
point of all this, right?
on 2006-06-18 07:07

On Jun 17, 2006, at 4:15 PM, Julian 'Julik' Tarkhanov wrote:
> I would much rather prefer UTF-8 in a language such as Ruby which
> is often used as glue between
> other systems. UTF-8 is used for interchange and it's indisputable.
> If you go for UTF-16 or UTF-32, you are most likely
> to convert every single character of text files you read (in text
> files present in the wild AFAIK UTF-16 and UTF-32 are a minority,
> thanks to the BOM and other setbacks).

There's a lot of UTF-16 out there. There's more ISO-8859-* than that,
and more Microsoft code-page-* text than everything else put together.
Yes, with UTF-16 & -32 you do a lot of byte swapping but it's pretty
cheap and pretty reliable. (I like UTF-8 too, but it's not without
issues.)

-Tim
on 2006-06-18 12:54

On 6/18/06, Stefan Lang <langstefan@gmx.at> wrote:
>> encoding that is suitable for that such as UTF-32. If an
> encoding. If somebody has enough time and motivation he can
> even implement support for multiple encodings and let the user
> choose at build-time.

Why? It can already handle utf-8 strings or arrays of unicode
codepoints. They just do not feel like strings with ruby 1.8. What I
want is glue in the string class that makes them feel so.

>> had to read anyway?
> encoding was chosen and iff the encoding of the input doesn't
> match the internal encoding.

However, if you can choose the encoding there is no need to recode at
all. You just keep the string as is, and there is a good chance the
output encoding will match the input encoding. And in case you need to
recode the string you have the encoding information, and the recoding
can be done automatically, and only when needed.

Michal
on 2006-06-18 13:09

On 6/18/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
> If you go for UTF-16 or UTF-32, you are most likely
> to convert every single character of text files you read (in text
> files present in the wild AFAIK UTF-16 and UTF-32 are a minority,
> thanks to the BOM and other setbacks).

Here you go. You can have the strings in UTF-8, and I can have them in
UTF-32. That is the flexibility of the solution without a fixed
encoding.

>> how would Unicode Ruby and TRON Ruby exchange strings?
>
> I think Alan Little summed it up very well. The problem with Unicode
> in Ruby is the striving for perfection
> (i.e. satisfying the users of every conceivable or needed encoding).
> It's very noble and I personally can't imagine it
> (even with the "democratic coerce" approach Austin cited). What I
> don't know is whether a system having this type of handling can be
> built at all and how it will interoperate.

But quite a few people here look like they do know. I do not know much
about regexes but I can imagine just about any other string operation.
And the current regexes already do operate on multiple encodings.

> Up until now all scripting languages I used somewhat (Perl, Python,
> Ruby) allowed all encodings in strings and doing Unicode in them
> hurts.

And how does that lead to the conclusion that there should be only one
encoding?

> Bluntly put, I am selfish and I don't believe in the "saving grace"
> of the M17N (because I just can't wrap it around my head and I sure
> as hell know it's going to be VERY complex).

That's the point. If it is wrapped into the string class you do not
have to wrap it around your head.

> It's also something that bothers me the most about Ruby's "unicode
> discussions" (I've read all of them on this list dating back to 2002
> because I need it to work NOW) and they
> always transcend into this kind of religious discussion in the spirit
> of "but your encoding is not good enough", "but my bad encoding isn't
> that one and I still need it to work" etc.

And that is exactly why a fixed encoding is bad. If strings can be
encoded in any way there is no point in religious discussions about
which encoding you like the most.

> While for me the greatest thing about Unicode is that it's Just Good
> Enough. And it doesn't seem Unicode is indeed THAT useless for CJK
> languages either
> (although I'm sure Paul can correct me - all the 4 languages I am in
> control of use only 2 scripting systems with some odd additions here
> and there).

It is Just Good Enough for most cases but not for all. It is not
useless for CJK, just suboptimal because of the Han unification. And it
also does not try to include the historic characters.

> And no, I didn't have a chance to see a TRON system in the wild. If
> someone would show me one within 200 km distance I would be glad to
> take a look.

I do not care. Some people find that encoding useful. Since the
potential to support any encoding, including TRON, does not get in the
way when I deal with my text, I am fine with that.

Michal
on 2006-06-18 16:18

On Sun, Jun 18, 2006 at 07:21:25AM +0900, Austin Ziegler wrote:
>
> Which code page? EBCDIC has as many code pages (including a
> UTF-EBCDIC) as exist in other 8-bit encodings.

Obviously, EBCDIC -> Unicode -> same EBCDIC codepage as before.

>>> Not to mention that Matz has explicitly stated in the past that he
>> character class, and *that* was Java's main folly. (UCS-2 is a
>> strictly 16 bit per character encoding, but new Unicode standards
>> specify 21 bit characters, so they had to "extend" it.)
>
> Um. Do you mean UTF-32? Because there's *no* binary representation of
> Unicode Character Code Points that isn't an encoding of some sort. If
> that's the case, that's unacceptable as a memory representation.

Yes, I do mean the String *interface* to be UTF-32, or pure code
points, which is the same but less susceptible to standard changes, if
accessed at character level. If accessed at substring level, a
substring of a String is obviously a String, and you don't need a
bitwise representation at all.

According to my proposal, Strings do not need an encoding from the
String user's point of view when working just with Strings, and users
won't care apart from memory/performance consumption, which I believe
can be made good enough with a totally encapsulated, internal storage
format to be decided later. I will avoid a premature optimization
debate here now.

Of course encoding matters when Strings are read or written somewhere,
or converted to bit-/bytewise representation explicitly. The Encoding
Framework, however it'll look, needs to be able to convert to and from
Unicode code points for these operations only, and not between
arbitrary encodings. (You *may* code this to recode directly from the
internal storage format for performance reasons, but that'll be
transparent to the String user.)

This breaks down for characters not represented in Unicode at all, and
is a nuisance for some characters affected by the Han Unification
issue. But Unicode set out to prevent exactly this, and if we believe
in Unicode at all, we can only hope they'll fix this in an upcoming
revision. Meanwhile we could map any additional characters (or sets of
them) we need to higher, unused Unicode planes; that'll be no worse
than having different, possibly incompatible kinds of Strings.

We'll need an additional class for pure byte vectors, or we could just
use Array for this kind of work, and I think this is cleaner.

Regarding Java, they switched from UCS-2 to UTF-16 (mostly). UCS-2 is a
pure 16 bit per character encoding and cannot represent codepoints
above 0xffff. UTF-16 works like UTF-8, but with 16 bit chunks. But
their abstraction of a single character, the class Char(acter), is
still only 16 bits wide, which leads to confusion and is similar to the
C type char, which cannot represent all real characters either. It is
even worse than in C, because C explicitly defines char to be a memory
cell of 8 bits or more, whereas Java really meant Char to be a
character.

>> I am unaware of unsolvable problems with Unicode and Eastern
>> languages, I asked specifically about it. If you think Unicode is
>> unfixably flawed in this respect, I guess we all should write off
>> Unicode now rather than later? Can you detail why Unicode is
>> unacceptable as a single world wide unifying character set?
>> Especially, are there character sets which cannot be converted to
>> Unicode and back, which is the main requirement to have Unicode
>> Strings in a non-Unicode environment?
>
> Legacy data and performance.
Map legacy data, that is, characters still not in Unicode, to a high
plane in Unicode. That way all characters can be used together all the
time. When Unicode includes them, we can change that to the official
code points. Note there are no files in String's internal storage
format, so we don't have to worry about reencoding them.

I am not worried about performance. I'd code in C if I were, or Lisp.
For one, Moore's law is at work and my whole proposal was for 2.0. My
proposal only adds a constant factor to String handling; it doesn't
have higher order complexity.

On the other hand, conversions need to be done at other times with my
proposal than for M17N Strings, and it depends on the application
whether that is more or less often. String-String operations never need
to do recoding, as opposed to M17N Strings. I/O always needs
conversion, and may need conversion with M17N too. I have a hunch that
allowing different kinds of Strings around (as in M17N presumably)
should require recoding far more often.

Jürgen
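A minimal sketch of the interface shape Jürgen is arguing for, with
encoding pushed entirely to the I/O boundary. Every name here
(#encode_to, String.decode_from) is hypothetical, standing in for
whatever his Encoding Framework would provide:

  s = "Grüße"                    # the user sees only code points
  s.length                       # => 5, independent of any storage format
  s[1]                           # => code point level access
  bytes = s.encode_to("ISO-8859-1")          # bytes exist only at the edges
  t = String.decode_from(bytes, "ISO-8859-1")  # back to pure code points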
on 2006-06-18 16:49

On Sun, Jun 18, 2006 at 07:22:34AM +0900, gwtmp01@mac.com wrote:
> be agnostic about these things but still provide a coherent framework
> for building libraries and applications that can be locale and
> encoding-aware?
>
> Gary Wright

Maybe I was unclear. I didn't mean Ruby has to choose an existing
standard, but Ruby has to choose which set of characters to handle in
Strings, in the mathematical sense. Language implementation and usage
of the String class should be easier if this set is

- well defined. Unicode code points are pretty good in this respect,
better than the union of all characters in all encodings of possible
M17N Strings. And we may use private extensions to Unicode for legacy
characters not included in Unicode already.

- All characters are equally allowed in all Strings. M17N fails this
one. a[5] = b[3] if their encodings are incompatible? At best it'll
coerce a to an encoding which can handle both, which would be Unicode
98% of the time anyway, 1% something else, and 1% totally fail. Don't
nail me down on the numbers.

Mathematically, String functions should be defined on the whole set,
not subsets, or their application becomes a chore.

Jürgen
on 2006-06-18 16:56

On 18-jun-2006, at 6:17, Tim Bray wrote:
> Be careful. Case folding is a horrible can of worms, is rarely
> implemented correctly, and when it is (the Java library tries
> really hard) is insanely expensive. The reason is that case
> conversion is not only language-sensitive but jurisdiction
> sensitive (in some respects different in France & Québec). Trying
> to do case-folding on text that is not known to be ASCII is likely
> a symptom of a bug.

Let's write a specification.
on 2006-06-18 17:32

On 6/18/06, Juergen Strobel <strobel@secure.at> wrote:
> On Sun, Jun 18, 2006 at 07:21:25AM +0900, Austin Ziegler wrote:
>> Um. Do you mean UTF-32? Because there's *no* binary representation of
>> Unicode Character Code Points that isn't an encoding of some sort. If
>> that's the case, that's unacceptable as a memory representation.
> Yes, I do mean the String *interface* to be UTF-32, or pure code
> points, which is the same but less susceptible to standard changes, if
> accessed at character level. If accessed at substring level, a
> substring of a String is obviously a String, and you don't need a
> bitwise representation at all.

Again, this is completely unacceptable from a memory usage perspective.
I certainly don't want my programs taking up 4x the additional memory
for string handling.

But "pure code points" is a red herring and a mistake in any case. Code
points aren't sufficient. You need glyphs, and some glyphs can be
produced with multiple code points (e.g., LOWERCASE A + COMBINING ACUTE
ACCENT as opposed to A ACUTE). Indeed, some glyphs can *only* be
produced with multiple code points. Dealing with this intelligently
requires a *lot* of smarts, but it's precisely what we should do.

> According to my proposal, Strings do not need an encoding from the
> String user's point of view when working just with Strings, and users
> won't care apart from memory/performance consumption, which I believe
> can be made good enough with a totally encapsulated, internal storage
> format to be decided later. I will avoid a premature optimization
> debate here now.

Again, you are incorrect. I *do* care about the encoding of each String
that I deal with, because only that allows me (or String) to deal with
conversions appropriately. Granted, *most* of the time, I won't care.
But I do work with legacy code page stuff from time to time, and
pronouncements that I won't care are just arrogance or ignorance.

> Of course encoding matters when Strings are read or written somewhere,
> or converted to bit-/bytewise representation explicitly. The Encoding
> Framework, however it'll look, needs to be able to convert to and from
> Unicode code points for these operations only, and not between
> arbitrary encodings. (You *may* code this to recode directly from
> the internal storage format for performance reasons, but that'll be
> transparent to the String user.)

I prefer arbitrary encoding conversion capability.

> This breaks down for characters not represented in Unicode at all, and
> is a nuisance for some characters affected by the Han Unification
> issue. But Unicode set out to prevent exactly this, and if we
> believe in Unicode at all, we can only hope they'll fix this in an
> upcoming revision. Meanwhile we could map any additional characters
> (or sets of them) we need to higher, unused Unicode planes; that'll be
> no worse than having different, possibly incompatible kinds of
> Strings.

Those choices aren't ours to make.

> We'll need an additional class for pure byte vectors, or just use
> Array for this kind of work, and I think this is cleaner.

I don't. Such an additional class adds unnecessary complexity to
interfaces. This is the *main* reason that I oppose the foolish choice
to pick a fixed encoding for Ruby Strings.

>> Legacy data and performance.
> Map legacy data, that is, characters still not in Unicode, to a high
> plane in Unicode. That way all characters can be used together all the
> time. When Unicode includes them we can change that to the official
> code points.
> Note there are no files in String's internal storage
> format, so we don't have to worry about reencoding them.

Um. This is the statement of someone who is ignoring legacy issues.
Performance *is* a big issue when you're dealing with enough legacy
data. Don't punish people because of your own arrogance about encoding
choices.

Again: Unicode Is Not Always The Right Choice. Anyone who tells you
otherwise is selling you a Unicode toolkit and only has their wallet in
mind. Unicode is *often* the right choice, but it's *not* the only
choice and there are times when having the *flexibility* to work in
other encodings without having to work through Unicode as an
intermediary is the right choice. And from an API perspective,
separating String and "ByteVector" is a mistake.

> On the other hand, conversions need to be done at other times with my
> proposal than for M17N Strings, and it depends on the application
> whether that is more or less often. String-String operations never
> need to do recoding, as opposed to M17N Strings. I/O always needs
> conversion, and may need conversion with M17N too. I have a hunch that
> allowing different kinds of Strings around (as in M17N presumably)
> should require recoding far more often.

Unlikely. Mixed-encoding data handling is uncommon.

-austin
on 2006-06-18 17:32

On 18-jun-2006, at 13:08, Michal Suchanek wrote:
> But quite a few people here look like they do know. I do not know much
> about regexes but I can imagine just about any other string operation.
> And the current regexes already do operate on multiple encodings.

Oh, lord... Have you at least tried that to make such assumptions? In
other words, tell me, can Ruby's regexes cope with the following:

/[а-я]/
/[а-я]/i

or something like this:
http://rubyforge.org/cgi-bin/viewvc.cgi/icu4r/samp...?revision=1.2&root=icu4r&view=markup

> And how does that lead to the conclusion that there should be only one
> encoding?

Very simply - I use many pieces of software written in many languages
all the time, with non-Latin text. I know that when they want to get
"historically compatible", problems arise. And the software that
settles on Unicode internally or somehow enforces it on the programmer
usually works best (all Cocoa and all C#, and to a certain extent, yes,
Java).

>> Bluntly put, I am selfish and I don't believe in the "saving grace"
>> of the M17N (because I just can't wrap it around my head and I sure
>> as hell know it's going to be VERY complex).
>
> That's the point. If it is wrapped into the string class you do not
> have to wrap it around your head.

This is rather naive.

> And that is exactly why a fixed encoding is bad. If strings can be
> encoded in any way there is no point in religious discussions about
> which encoding you like the most.

Yes, it just becomes hard and error-prone to process them.

> It is Just Good Enough for most cases but not for all. It is not
> useless for CJK, just suboptimal because of the Han unification. And
> it also does not try to include the historic characters.

I think this thread is going to end the same way the one in 2002 did.
on 2006-06-18 18:37

Hi, In message "Re: Unicode roadmap?" on Sun, 18 Jun 2006 23:46:40 +0900, Juergen Strobel <strobel@secure.at> writes: |Language implementation, and usage of the String class should be |easier if this set is | |- well defined |- All characters are equally allowed in all Strings. I understand these attributes might make implementation easier. But who cares if I don't care. And I am not sure how these make usage easier, really. Somebody who owns gigabytes of text data in legacy encoding (e.g. me), wants to avoid encoding conversion back and forth between Unicode and legacy encoding everytime. Another somebody want text processing on historical text which character set is far bigger than Unicode. The "well-defined" simple implementation just prohibits those demands. On the contrary, M17N approach does not bother Universal Character Set solution. You just need to choose Unicode (UTF-8 or UTF-16) as internal string representation, and convert encoding on I/O as you might have done in Unicode centric languages. Nothing lost. You may worry about implementation difficulty (and performance), but don't. It's _my_ concern. I made a prototype, and have convinced that I can implement it with acceptable performance. |Unicode code points are pretty good in this respect, better than the |union of all characters in all encodings of possible M17N Strings. |And we may use private extensions to Unicode for legacy characters not |included in Unicode already. "private extensions". No. It just cause another nightmare. matz.
on 2006-06-18 19:27

On Jun 18, 2006, at 8:29 AM, Austin Ziegler wrote:
> You need glyphs, and some glyphs can be
> produced with multiple code points (e.g., LOWERCASE A + COMBINING
> ACUTE ACCENT as opposed to A ACUTE).

This is another thing you need your String class to be smart about. You
want an equality test between "más" and "más" to always be true even if
their "á" characters are encoded differently. The right way to solve
this is called "Early Uniform Normalization" (see
http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization); the
idea is you normalize the composed characters at the time you create
the string, then the internal equality test can be done with strcmp()
or equivalent.

>> Map legacy data, that is, characters still not in Unicode, to a high
>> plane in Unicode. That way all characters can be used together all
>> the time. When Unicode includes them we can change that to the
>> official code points. Note there are no files in String's internal
>> storage format, so we don't have to worry about reencoding them.
>
> Um. This is the statement of someone who is ignoring legacy issues.
> Performance *is* a big issue when you're dealing with enough legacy
> data.

Note that you don't have to use a high plane. The Private Use Area in
the Basic Multilingual Plane has 6,400 code points, which is quite a
few. Even if you did use a high plane, it's not obvious there'd be a
detectable runtime performance penalty.

> Unicode is *often* the right choice, but it's *not* the only
> choice and there are times when having the *flexibility* to work in
> other encodings without having to work through Unicode as an
> intermediary is the right choice.

That may be the case. You need to do a cost-benefit analysis; you could
buy a lot of simplicity by decreeing all-Unicode-internally; would the
benefits of allowing non-Unicode characters be big enough to compensate
for the loss of simplicity? I don't know the answer, but it needs
thinking about.

-Tim
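The two spellings of "más" Tim mentions can be inspected with Ruby
1.8's real 'U*' unpack directive; only the normalize call at the end is
hypothetical, standing in for whatever API would do the normalization:

  composed   = "m\xC3\xA1s"    # "á" as one code point, U+00E1
  decomposed = "ma\xCC\x81s"   # "a" + COMBINING ACUTE ACCENT, U+0301
  composed == decomposed       # => false: different byte sequences
  composed.unpack('U*')        # => [109, 225, 115]
  decomposed.unpack('U*')      # => [109, 97, 769, 115]
  # With Early Uniform Normalization, both are reduced to the same
  # form when the string is created, so plain comparison works:
  normalize(composed) == normalize(decomposed)   # => true (hypothetical)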
on 2006-06-18 21:21

Tim Bray <tbray@textuality.com> writes:

> is you normalize the composed characters at the time you create the
> string, then the internal equality test can be done with strcmp() or
> equivalent.

Does that mean that binary.to_unicode.to_binary != binary is possible?
That could turn out pretty bad, no?
on 2006-06-18 21:33

On 18-jun-2006, at 21:17, Christian Neukirchen wrote: >> solve this is called "Early Uniform Normalization" (see http:// >> www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization); the idea >> is you normalize the composed characters at the time you create the >> string, then the internal equality test can be done with strcmp() or >> equivalent. > > Does that mean that binary.to_unicode.to_binary != binary is > possible? > That could turn out pretty bad, no? And it does as long as you are not careful. One of the things I do is normalize all that come IN into something that is suitable and predictable.
on 2006-06-18 22:53

On Jun 18, 2006, at 12:17 PM, Christian Neukirchen wrote:
> possible?
> That could turn out pretty bad, no?

Yes, but having "más" != "más" is pretty bad too; the alternative is
normalizing at comparison time, which would really hurt for example in
a big sort, so you'd need to cache the normalized form, which would be
a lot more code.

binary.to_unicode looks a little weird to me... can you do that without
knowing what the binary is? If it's text in a known encoding, no
breakage should occur. If it's unknown bit patterns, you can't really
expect anything sensible to happen... or am I missing an obvious
scenario?

-Tim
on 2006-06-18 22:53

On Sat, Jun 17, 2006 at 11:24:45PM +0900, Austin Ziegler wrote:
> On 6/17/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:
>> On 17-jun-2006, at 15:52, Austin Ziegler wrote:
>>>> 8. Because Strings are tightly integrated into the language with
>>>> the source reader and are used pervasively, much of this cannot be
>>>> provided by add-on libraries, even with open classes. Therefore the
>>>> need to have it in Ruby's canonical String class. This will break
>>>> some old uses of String, but now is the right time for that.
>>> "Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1.

My title was "A Plan for Unicode Strings in Ruby 2.0". I don't want to
rush things or break 1.8.

Jürgen
on 2006-06-18 23:15

On Mon, Jun 19, 2006 at 01:33:54AM +0900, Yukihiro Matsumoto wrote:
> solution. You just need to choose Unicode (UTF-8 or UTF-16) as
> the internal string representation, and convert encodings on I/O as
> you might have done in Unicode-centric languages. Nothing lost.
>
> You may worry about implementation difficulty (and performance), but
> don't. It's _my_ concern. I made a prototype, and have convinced
> myself that I can implement it with acceptable performance.

I never worried about performance much, that's Austin. :P

Thanks for clarifying that. So far I could not find much info on how
exactly M17N will work, especially on the role of the encoding tag, so
I had to guess a lot. Given your explanation, it seems our ways are
quite similar on the interface side of things, so far as Unicode is
concerned. You chose a more powerful (and more complex) parametric
class design where I would have left open only the possibility of
transparently usable subclasses for performance reasons. I am happy
we've worked that out now.

And you are right, I am not that much interested in the implementation;
thank you for doing it. My concern was with the interface of the String
class, but several posters misunderstood me and tried to draw me into
implementation issues.

Jürgen
on 2006-06-19 01:02

Hi, In message "Re: Unicode roadmap?" on Mon, 19 Jun 2006 00:29:46 +0900, Julian 'Julik' Tarkhanov <listbox@julik.nl> writes: |In other words, tell me, can Ruby's regexes cope with the following: | |/[Á-Ñ]/ |/[Á-Ñ]/i 1.9 Oniguruma regexp engine should handle these, otherwise it's a bug. matz.
on 2006-06-19 01:11

On 19-jun-2006, at 1:00, Yukihiro Matsumoto wrote:
>
> 1.9 Oniguruma regexp engine should handle these, otherwise it's a bug.

I'll try to check. Oniguruma on 1.8.4 didn't cope, but maybe it just
wasn't hooked in properly.
on 2006-06-19 01:57

Hi, In message "Re: Unicode roadmap?" on Mon, 19 Jun 2006 08:09:29 +0900, Julian 'Julik' Tarkhanov <listbox@julik.nl> writes: |> |/[Á-Ñ]/ |> |/[Á-Ñ]/i |> |> 1.9 Oniguruma regexp engine should handle these, otherwise it's a bug. | |I'll try to check. Oniguruma on 1.8.4. didn't cope, but maybe it just |weren't hooked in properly. If you have any problem, send us a report with what you expect and what you get. matz.
on 2006-06-19 03:34

On 19-jun-2006, at 1:56, Yukihiro Matsumoto wrote:
> a bug.
> |
> |I'll try to check. Oniguruma on 1.8.4 didn't cope, but maybe it just
> |wasn't hooked in properly.
>
> If you have any problem, send us a report with what you expect and
> what you get.

Well, I tried on the CVS latest (1.9) and I get:

irb(main):011:0> "НеблагодарНая" =~ /[а-я]/i
=> 6 (should be zero)

That is - character classes work, casefolding doesn't.
on 2006-06-19 06:06

Hi, In message "Re: Unicode roadmap?" on Mon, 19 Jun 2006 10:32:08 +0900, Julian 'Julik' Tarkhanov <listbox@julik.nl> writes: |Well, I tried on the CVS latest (1.9) and I get: | |irb(main):011:0> "îåâÌÁÇÏÄÁÒîÁÑ" =~ /[Á-Ñ]/i |=> 6 (should be zero) | |That is - character classes work, casefolding doesn't. I found out that Oniguruma casefolding works only for characters within iso8869-*. Considering the size of the casefolding table it is compromise for the time being. I will fix this in the future. matz.
on 2006-06-19 07:24

On 19-jun-2006, at 6:05, Yukihiro Matsumoto wrote:
> |
> |That is - character classes work, casefolding doesn't.
>
> I found out that Oniguruma casefolding works only for characters
> within iso8859-*. Considering the size of the casefolding table it is
> a compromise for the time being. I will fix this in the future.

Thanks for the clarification :-)
on 2006-06-19 07:58

Correct me if I'm wrong, but Matz's plan for M17N, in summary, is:

1. String internally will remain the same: char *ptr, long len - in
   bytes
2. String instances will have an encoding tag
3. All String/Regexp methods will respect that encoding tag and return
   char (glyph) indexes
4. Methods like byte_size, codepoints, each_char, each_codepoint will
   be introduced(?)
5. slice will always accept char indices and return substrings

I'd say that WOULD BE GOOD, and with methods like
String#enforce_encoding!(encoding) and
String#coerce_encoding!(otherstring) it won't require developers (for C
extensions also) to look at the encoding tag, just set it when needed
(a sketch of these two calls follows below).

But I can see several implementation issues and possible options that
should be considered:

- what will happen if one tries to perform str1.operation(str2) on two
  strings with different encodings:
  a) raise an exception
  b) silently coerce one or both strings to some "compatible"
     charset/encoding, update the encoding of the result, replacing
     non-convertible chars using fallback mappings? (ouch, this can be
     split into a set of options)
  c) same as b) but raise an exception if lossless conversion is not
     possible?
  d) same as b) but warn if lossless conversion is not possible?
  e) downgrade the encoding tag of the acceptor to "raw/bytes" and
     process it?

- what will happen if one changes the encoding tag for a String
  instance:
  a) check and raise an exception if the current bytes don't represent
     a valid encoding sequence?
  b) just set the new tag?
  c) convert the byte sequence to the given encoding, using fallback
     mappings?

- what to do with IO:
  a) IO will return strings in "raw/bytes"?
  b) IO can be tagged and will return Strings with the given encoding
     tag?
  c) IO can be tagged and is by default tagged with the global encoding
     tag?
  d) IO can be tagged, but is not tagged by default, although methods
     returning strings (such as read, readlines) will use the global
     encoding tag?
  e) if IO is tagged and one tries to write to it a String with a
     different encoding, what will happen?

- what will be the default encoding tag for new Strings:
  a) "raw/bytes"
  b) derived from the system properties of the host platform
  c) option b) but overridable in the application (btw, $KCODE, as
     present, must definitely go away!!!)

- how to process source code files:
  a) restrict them to ASCII and require all non-ASCII strings to be
     externalized?
  b) process them as "raw/bytes"?
  c) introduce some kind of commented pragma for source files allowing
     to set the encoding?

- at present the Ruby parser can parse only sources in an
  ASCII-compatible encoding. Will that change?

- what encodings will Numeric.to_s, Time.to_s etc. produce, and what
  does String have to conform to for String#to_f, String#to_i?

On Unicode:

- case-insensitive canonical string matches/searches DO MATTER. And
  even for encodings that code variants of glyphs with different
  codepoints, a "variant-insensitive" search is, as for me, desired.
  Will there be such functionality?

- string comparison: will <=> use at least UCA rules for Unicode
  strings, or will only byte-order comparisons stay?

- is_digit, is_space, is_alpha, is_foobarbaz etc. could matter when
  writing a custom parser. Will those methods be provided for one-char
  strings?

Yes, this is a short and incomplete list, but you should get my point:
it's not that easy -- there are dozens of decisions, with their pros
and cons, to be made and implemented :(
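A minimal sketch of how Dmitry's two hypothetical helpers might behave.
Neither method exists; the names and division of duties are his
proposal, and `socket` is just a stand-in byte source:

  raw = socket.read(1024)           # raw bytes, no tag yet
  raw.enforce_encoding!("UTF-8")    # set/assert the tag; no transcoding
  s = "latin1 text"                 # suppose this one is tagged ISO-8859-1
  s.coerce_encoding!(raw)           # transcode s so the two tags agree
  raw + s                           # now safe to combine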
on 2006-06-19 09:57

Hi, In message "Re: Unicode roadmap?" on Mon, 19 Jun 2006 14:57:22 +0900, "Dmitry Severin" <dmitry.severin@gmail.com> writes: |But, I can see several imlementation issues and possible options, that |should be considered: Thank you for the ideas. |- what will happen if one tries to perfom str1.operation(str2) on two |strings with different encodings: | a) raise exception | b) silent coerce one or both strings to some "compatible" |charset/encoding, update encoding of result, replacing non-convertable chars |using fallback mappings? (ouch, this can be split to set of options) | c) same as b) but raise exception if non-loss conversion is not possible? | d) same as b) but warn if non-loss conversion is not possible? | e) downgrade encoding tag of acceptor to "raw/bytes" and process it? a), unless either of strings is "ascii" and the other is "ascii" compatible. This point is arguable. |- what will happen if one changes encoding tag for String instance: | a) check and raise exception if current bytes don't represent valid |encoding sequence? | b) just set new tag? | c) convert byte sequence to given encoding, using fallback mappings? b), encoding conformance check shall done lazily. I think there's a need for explicit encoding conformance check method. |- what to do with IO: | a) IO will return strings in "raw/bytes"? | b) IO can be tagged and will return Strings with given econding tag? | c) IO can be tagged and is by default tagged with global encoding tag? | d) IO can be tagged, but is not tagged by default, although methods |returning strings (such as read, readlines) will use global encoding tag? | e) if IO is tagged and one tries to write to it a String with different |encoding, what will happen? c), the global default shall be set from locale setting. |- what will be default encoding tag for new Strings: | a) "raw/bytes" | b) derived from system properties of host platform | c) option b) and can be overriden in application (btw, $KCODE, as present, |must definitely go away!!!) Encoding for literal strings are set by pragma. |- how to process source code files: | a) restrict them to ASCII and require all non-ASCII strings to be |externalized? | b) process them as "raw/bytes"? | c) introduce some kind of commented pragma for source files allowing to |set encoding, 1.9 already has encoding pragma a la Python PEP263. |- at present time Ruby parser can parse only sources in ASCII compatible |encoding. Would it change? No. Ruby would not allow scripts in EBCDIC, nor UTF-16, although it allows processing of those encoding. |- what encodings will have Numeric.to_s, Time.to_s etc., or String has to |have/conform for String#to_f, String#to_i? Good point. Currently, I think they should work on ASCII. |On Unicode: |- case-independent canonical string matches/searches DO MATTER. And even for |encodings, that code variants of glyphs with different codepoints |"variant-insensitive" search, as for me, is desired. Will there be such |functionality? Casefold search/match will be provided for Regexp. "variant insensitive" search should be accomplished by explicit normalization or collation. |- string comparison: will <=> use at least UCA rules for Unicode strings, or |only byte-order comparisons will stay? Byte order comparison. UCA rules or such should be done explicitly via normalization or collation. |- is_digit, is_space, is_alpha, is_foobarbaz etc. could matter, when writing |a custom parser. Will those methods be provided for one-char strings? Those functions will be provided via Regexp. 
I am not sure if we will provide character classification methods for
strings.

matz.
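The pragma Matz refers to follows the style of Python's PEP 263 magic
comment. The exact spelling below is illustrative of that style rather
than a frozen syntax, and the #encoding accessor is assumed, not a 1.8
method:

  # -*- coding: utf-8 -*-
  # With the pragma above, literal strings in this file carry the
  # UTF-8 encoding tag:
  s = "grüße"
  s.encoding    # => "UTF-8"  (assuming a tag accessor like this)
  s.length      # counted in characters under the m17n rules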
on 2006-06-19 13:17

Tim Bray <tbray@textuality.com> writes:

> without knowing what the binary is? If it's text in a known
> encoding, no breakage should occur. If it's unknown bit patterns,
> you can't really expect anything sensible to happen... or am I
> missing an obvious scenario? -Tim

Those were just fictive method calls. But let's say I read from a pipe
and I know it contains UTF-16 with a BOM; then .to_unicode would make
perfect sense, no? In case of binary bit patterns, I sooner or later
would expect some kind of EncodingError, given this API. (I haven't
yet seen drafts of how the API really will be.)
on 2006-06-19 14:40

On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> a), unless either of the strings is "ascii" and the other is
> ascii-compatible. This point is arguable.

What is "ascii"? Specifically, I would like string operations to
succeed when both strings are encoded as different subsets of Unicode
(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
string should result in a UTF-* string, not an error. However, this
would make the errors from incompatible encodings more surprising, as
they would be very infrequent.

I wonder what operations on raw strings (ones without a specified
encoding) would do. Or where one of the strings is raw and the other
is not.

> c), the global default shall be set from the locale setting.

I am not sure this is good for network IO as well. For diagnostics it
might be useful to set the default to none, and have String raise an
exception when such strings are combined with other strings.

It is only obvious for STDIN and STDOUT that they should follow the
locale setting.

Hmm, but one would need to consider carefully which operations should
work on raw strings and which should not. Perhaps it is not as nice as
it looks at first glance.

Thanks

Michal
on 2006-06-19 15:02

Hi, In message "Re: Unicode roadmap?" on Mon, 19 Jun 2006 21:39:33 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes: |> a), unless either of strings is "ascii" and the other is "ascii" |> compatible. This point is arguable. | |What is "ascii"? Specifically I would like string operations to suceed |in cases when both strings are encoded as different subset of Unicode |(or anything else). ie concatenating an ISO-8859-2 and an ISO-8859-1 |string sould result in UTF-* string, not an error. Every encoding has an attribute named ascii_compat. EUC_JP, SJIS, ISO-8859-* and UTF-8 are declared ascii compatible, where EBCDIC, UTF-16 and UTF-32 are not. No other auto conversion shall be done, since we don't particularly encourage mixed encoding model. |> |- what to do with IO: |> | a) IO will return strings in "raw/bytes"? |> | b) IO can be tagged and will return Strings with given econding tag? |> | c) IO can be tagged and is by default tagged with global encoding tag? |> | d) IO can be tagged, but is not tagged by default, although methods |> |returning strings (such as read, readlines) will use global encoding tag? |> | e) if IO is tagged and one tries to write to it a String with different |> |encoding, what will happen? |> |> c), the global default shall be set from locale setting. | |I am not sure this is good for network IO as well. For diagnostics it |might be useful to set the default to none, and have string raise an |exception when such strings are combined with other strings. | |It is only obvious for STDIN and STDOUT that they should follow the |locale setting. Restricting default encoding from locale to STDIO may be a good idea. There's still open issues, since default encoding from locale is not covered by the prototype, so we need more experience. matz.
on 2006-06-19 15:28

On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote: > |(or anything else). ie concatenating an ISO-8859-2 and an ISO-8859-1 > |string sould result in UTF-* string, not an error. > > Every encoding has an attribute named ascii_compat. EUC_JP, SJIS, > ISO-8859-* and UTF-8 are declared ascii compatible, where EBCDIC, > UTF-16 and UTF-32 are not. No other auto conversion shall be done, > since we don't particularly encourage mixed encoding model. > I wonder. Why cannot Strings throughout Ruby be _always_ represented as Unicode and why no let ICU handle the conversion between various encodings for incoming and outgoing data? (http://www.ibm.com/software/globalization/icu/). I know, it is a long-stanbding issue on Unicode's Han unification process, but without proper Unicode support Ruby is destined to be a toy for English-speaking and Japanese communities only. (And as I'm gearing up to prepare a web-site in Russian, Turkish and English, I feel that using Ruby could prove to be a major pain in the nether regions of my body :) )
on 2006-06-19 15:34

On 6/19/06, Dmitrii Dimandt <dmitriid@gmail.com> wrote:
> I wonder: why can't Strings throughout Ruby be _always_ represented
> as Unicode, letting ICU handle the conversion between various
> encodings for incoming and outgoing data?
> (http://www.ibm.com/software/globalization/icu/). I know there is a
> long-standing issue with Unicode's Han unification process, but
> without proper Unicode support Ruby is destined to be a toy for the
> English-speaking and Japanese communities only.

This entire discussion is centered around a proposal to do exactly
that. There are many *very good* reasons to avoid doing this. Unicode
Is Not Always The Answer.

It's *usually* the answer, but there are times when it's just easier
to work with data in an established code page.

-austin
on 2006-06-19 15:47

On 6/19/06, Austin Ziegler <halostatue@gmail.com> wrote:
> This entire discussion is centered around a proposal to do exactly
> that. There are many *very good* reasons to avoid doing this. Unicode
> Is Not Always The Answer.
>
> It's *usually* the answer, but there are times when it's just easier
> to work with data in an established code page.

I totally agree with that. IMO, the point lies exactly in this
"*usually* the answer". When was the last time 90% of developers had
to wonder what encoding their data was in? ;-) And with the advent of
Unicode (and storage becoming cheaper and cheaper, and developers
becoming lazier and lazier) more and more of that data is going to be
Unicode.

So, since Unicode is *usually* the answer, make it as painless as
possible. Make all String methods and any other functions that work
with strings accept Unicode straight out of the box, without any
worries on the developer's part. And provide alternatives (or optional
parameters?) that would allow the few more encoding-aware gurus :) to
do whatever they want with encodings.

Because otherwise we run the risk of ending up with incompatible
extensions to strings that "simplify" a developer's life (and the
trend has already begun). I wouldn't want a C/C++ scenario with string
class upon string class upon extension upon extension that aim to do
something String should do from the start.

All is IMHO, of course :)
on 2006-06-19 16:35

On 6/19/06, Dmitrii Dimandt <dmitriid@gmail.com> wrote:
> Because otherwise we run the risk of ending up with incompatible
> extensions to strings that "simplify" a developer's life (and the
> trend has already begun). I wouldn't want a C/C++ scenario with string
> class upon string class upon extension upon extension that aim to do
> something String should do from the start.

I think that's more likely with (a) what we have now and (b) a
Unicode-internal approach. (Indeed, a Unicode-internal approach
*requires* separating a byte vector from String, which doubles
interface complexity.) I would suggest that you look through the whole
discussion and pay particular attention to Matz's statements.

-austin
on 2006-06-19 18:09

On Jun 19, 2006, at 4:16 AM, Christian Neukirchen wrote: >> without knowing what the binary is? If it's text in a known >> encoding, no breakage should occur. If it's unknown bit patterns, >> you can't really expect anything sensible to happen... or am I >> missing an obvious scenario? -Tim > > Those were just fictive method calls. But let's say I read from > a pipe and I know it contains UTF-16 with BOM, then .to_unicode > would make perfect sense, no? Yep. And yes, calling to_unicode on it might in fact change the bit patterns if you adopted Early Uniform Normalization (which would be a good thing to do). -Tim
on 2006-06-19 19:22

On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Every encoding has an attribute named ascii_compat. EUC_JP, SJIS,
> ISO-8859-* and UTF-8 are declared ascii-compatible, whereas EBCDIC,
> UTF-16 and UTF-32 are not. No other auto-conversion shall be done,
> since we don't particularly encourage the mixed encoding model.

Reading what you said, it appears it would only be possible to add
ascii strings to ascii-compatible strings. That does not sound very
useful. If the intended meaning was rather that operations on two
ascii-compatible strings should always be possible, and that the
result is again ascii-compatible, that would sound better.

But it makes these "ascii" encodings a special case. In particular, it
makes UTF-32 less convenient to use. I guess that for a calculation so
complex that it would really benefit from the fast random access of
UTF-32 it is reasonable to create a wrapper that converts the
arguments and results. However, if one wants to perform several such
(different) consecutive calculations, there are going to be several
useless conversions. It is certainly possible to make the input
interface clever enough to get it right for both UTF-32 and ascii
strings, but requiring the user to do the conversion on results does
not look nice.

The compatibility could also be just a general value that specifies
the encoding family, i.e.:

  " ".compatibility  # => :ascii
  ASCII = "".encode(:utf8).compatibility
  raise "Incompatible encoding #{str.encoding}" unless str.compatibility == ASCII

But different families could be possible. I am not sure if any other
encoding families of any significance exist, though.

Thanks

Michal
on 2006-06-19 19:48

On Jun 19, 2006, at 6:31 AM, Austin Ziegler wrote: > This entire discussion is centered around a proposal to do exactly > that. There are many *very good* reasons to avoid doing this. Unicode > Is Not Always The Answer. > > It's *usually* the answer, but there are times when it's just easier > to work with data in an established code page. To enlighten the ignorant, could you describe one or two scenarios where a Unicode-based String class would get in the way? To use your words, make things less easy? I would probably not agree that there are "*many good*" reasons to avoid this, but probably that's just because I've been fortunate enough to not encounter the problem scenarios. This material would have application in a far larger domain than just Ruby, obviously. -Tim
on 2006-06-19 20:35

On 6/19/06, Tim Bray <tbray@textuality.com> wrote:
> I would probably not agree that there are "*many good*" reasons to
> avoid this, but probably that's just because I've been fortunate
> enough to not encounter the problem scenarios. This material would
> have application in a far larger domain than just Ruby, obviously. -Tim

I've found that a Unicode-based string class gets in the way when it
forces you to work around it. For most text-processing purposes, it
*isn't* an issue. But when you've got text whose origin encoding you
don't *know* (and you're probably working in a different code page), a
Unicode-based string class usually guesses wrong. Transparent Unicode
conversion only works when it is guaranteed that the starting code
page and the ending code page are identical. It's *definitely* a
legacy data issue, and doesn't affect most people, but it has affected
me in dealing with (in a non-Ruby context) NetWare. Additionally, the
overhead of converting to Unicode if your entire data set is in
ISO-8859-1 is unnecessary; again, this is a specialized case.

More problematic, from the Ruby perspective, is that a Unicode-based
string class would require that there be a wholly separate byte vector
class; I am not sure that is necessary or wise. The first time I read
a JPEG into a String, I was delighted -- the interface presented was
clean and nice, as opposed to having to muck around in languages that
force multiple interfaces because of such a separation.

Like I said, I'm not anti-Unicode, and I want Ruby's Unicode support
to be the best, bar none. I'm not willing to compromise on API or
flexibility to gain that, though.

-austin
on 2006-06-20 01:40

Hi, In message "Re: Unicode roadmap?" on Tue, 20 Jun 2006 02:20:10 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes: |Reading what you said it appears it would be only possible to add |ascii strings to ascii-compatible sttings. That does not sound very |useful. You will have all your strings in the encoding you choose as a internal encoding in the usual case, so that you will have a few compatibility problem. Only if you want to handle multiple encodings at a time, you need explicit code conversion for mix encoding operations. |I guess that for calculation so complex that it would really benefit |form the fast random access of UTF-32 it is reasonable to create a |wrapper that converts the arguments and results. However, If one wants |to perform several such (different) consecutive calculations there are |going to be several useless conversions. I am not sure what you mean. I feel like that my plan does not have anything against UTF-32 in this regard. Perhaps, I am missing something. What is going to cause useless conversions? matz.
on 2006-06-20 14:13

On 6/20/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> In the usual case you will have all your strings in the encoding you
> choose as the internal encoding, so you will have few compatibility
> problems. Only if you want to handle multiple encodings at a time do
> you need explicit code conversion for mixed-encoding operations.

If I read pieces of text from web pages, they can be in different
encodings. I do not see any reason why such pieces of text could not
be automatically concatenated as long as they are all subsets of
Unicode.

It was the complaint of one of the people here that in Python strings
with different encodings exist but the operations on them fail. And it
makes the life of anybody working with such strings unnecessarily
hard. They have to be converted explicitly.

> I am not sure what you mean. I feel like my plan does not have
> anything against UTF-32 in this regard. Perhaps I am missing
> something. What is going to cause useless conversions?

If automatic conversions aren't implemented at all, UTF-32 does not
really stand out in this regard.

Thanks

Michal
on 2006-06-20 15:57

On 6/20/06, Michal Suchanek <hramrach@centrum.cz> wrote:
> If I read pieces of text from web pages, they can be in different
> encodings. I do not see any reason why such pieces of text could not
> be automatically concatenated as long as they are all subsets of
> Unicode.

Having different encodings on one web page is a good way to make sure
that the page won't display correctly, since all the browsers I know
of display all text on a page using just one encoding. Granted, if the
encoding is a subset of Unicode, it may still manage to work out, but
personally I keep running into pages that display some of the
characters as garbage no matter what encoding I instruct my browser to
use. So, no, I don't think it should be valid to concatenate strings
with different encodings.
on 2006-06-20 16:34

On Jun 20, 2006, at 14:54, Timothy Bennett wrote:
> Having different encodings on one web page is a good way to make
> sure that the page won't display correctly [...] So, no, I don't
> think it should be valid to concatenate strings with different
> encodings.

So we shouldn't do it because it doesn't work in web browsers?
Hopefully we don't apply that criterion globally, or we'd never get
anything done.
on 2006-06-20 16:41

On 6/20/06, Timothy Bennett <timothy.s.bennett@gmail.com> wrote: > the page won't display correctly, since all the browsers I know of display > all text on a page using just one encoding. Granted, if the encoding is a > subset of unicode, it may still manage to work out, but personally I keep > running in to pages that display some of the characters as garbage no matter > what encoding I instruct my browser to use. So, no, I don't think it should > be valid to concatenate strings with different encodings. No, I meant that the strings are, of course, converted to a common encoding such as utf-8 before they are concatenated. The point is that you do not have to care in which encoding you obtained the pieces and convert them manually to a common encoding if the string class can do it automatically for you. Thanks Michal
on 2006-06-20 17:46

On Jun 20, 2006, at 8:09 AM, Michal Suchanek wrote: > If I read pieces of text from web pages they can be in different > encodings. I do not see any reason why such pieces of text could not > be automatically concatenated as long as they are all subset of > unicode. I'm not sure I understand what 'subset of unicode' means. Do you mean two different encodings of Unicode code points? As in 'UTF-8 and UTF-16 are subsets of Unicode'? That usage seems unusual to me. Are you using 'subset' and 'encoding' as synonyms or am I missing subtle difference? Gary Wright
on 2006-06-20 18:08

On Jun 20, 2006, at 6:54 AM, Timothy Bennett wrote: > Having different encodings on one web page is a good way to make > sure that > the page won't display correctly ... > So, no, I don't think it should > be valid to concatenate strings with different encodings. Well, unless you had a String class that took care of the encoding details and, when you were ready to output, allowed you to say "Give me that in ISO-8859 or UTF-8 or whatever". -Tim
on 2006-06-20 18:20

Hi, In message "Re: Unicode roadmap?" on Tue, 20 Jun 2006 23:33:43 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes: |No, I meant that the strings are, of course, converted to a common |encoding such as utf-8 before they are concatenated. |The point is that you do not have to care in which encoding you |obtained the pieces and convert them manually to a common encoding if |the string class can do it automatically for you. If you choose to convert all input text data into Unicode (and convert them back at output), there's no need for unreliable automatic conversion. matz.
on 2006-06-20 19:52

On 6/20/06, gwtmp01@mac.com <gwtmp01@mac.com> wrote:
> I'm not sure I understand what 'subset of unicode' means. Do you mean
> two different encodings of Unicode code points? As in 'UTF-8 and
> UTF-16 are subsets of Unicode'?
>
> That usage seems unusual to me. Are you using 'subset' and 'encoding'
> as synonyms or am I missing a subtle difference?

I mean that the iso-8859-1 and iso-8859-2 encodings (as well as many
others) encode a subset of the characters available in Unicode, and in
any of its utf-* encodings. Thus any string in such an encoding can be
losslessly and automatically converted to an encoding of full Unicode
such as utf-8, and operations on several such converted strings make
sense even if the strings were encoded using different encodings
before the conversion.

The automatic conversion would simplify things if you get strings in
different encodings from outside sources such as various web pages,
databases, etc.

Thanks

Michal
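Concretely, the kind of merge Michal describes, done by hand with 1.8's Iconv; the proposal is that String would perform this step implicitly:

  require 'iconv'

  pl = "\xB3"   # the byte for "l-stroke" in ISO-8859-2
  fr = "\xE9"   # the byte for "e-acute" in ISO-8859-1

  # Both charsets map into Unicode, so the up-conversion loses nothing:
  s = Iconv.conv("UTF-8", "ISO-8859-2", pl) +
      Iconv.conv("UTF-8", "ISO-8859-1", fr)
  # s now holds both characters in one UTF-8 string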
on 2006-06-21 13:46

On 6/20/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> If you choose to convert all input text data into Unicode (and convert
> them back at output), there's no need for unreliable automatic
> conversion.

Well, it's actually you who chose the conversion on input for me.
Since the strings aren't automatically converted, I have to ensure
that I always have strings encoded using the same encoding. And the
only reasonable way I can think of is to convert any string that
enters my application (or class) to an arbitrary encoding I choose in
advance. This is no more reliable than automatic conversion.

The reliability or (un)reliability of the conversion is based on the
(un)reliability with which the actual encoding of the string is
determined when it is obtained. If the encoding tag is wrong, the
string will be converted incorrectly. That is the only cause of
incorrect conversion, whether it happens manually or automatically.

If conversion were done automatically by the string class, it could be
performed lazily. The strings would be kept in the encoding in which
they were obtained, and only converted when needed because they are
combined with a string in a different encoding. And users of the
strings would still have the choice to convert them explicitly when
they see fit.

When such automatic conversion is not available, interfacing with
libraries that fetch external data becomes more difficult:

a) I could instruct the library that fetches data from a database or
the web to always return it in the encoding I chose for representing
strings in my application, regardless of the encoding the data was
originally obtained in. The disadvantage is that if the encoding was
determined incorrectly on input to the library, the data is already
garbled.

b) I could get the data from the library in the original encoding in
which it was obtained -- either because I would like to check that the
encoding is correct before converting the data, or because the library
does not implement the interface for (a). The disadvantage is that I
have to traverse a potentially complex data structure and convert all
strings so that they work with the other strings inside my
application.

c) Every time I perform a string operation I should first check
(manually) that the two strings are compatible (or catch the exception
very near the operation so that I can convert the arguments and
retry). I do not think this is a reasonable option for the common case
that should be made as simple as possible: the strings can be
represented in Unicode. This may be necessary to some extent in
applications dealing with encodings that are incompatible with
Unicode, but it should not be required for the common case.

The people with experience from other languages are complaining that
they have to do (b) or (c) because (a) is usually not implemented. And
ensuring any of the three looks like an additional problem that could
be solved elsewhere -- in the string class.

Thanks

Michal
on 2006-06-21 16:04

Hi, In message "Re: Unicode roadmap?" on Wed, 21 Jun 2006 20:45:38 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes: |> If you choose to convert all input text data into Unicode (and convert |> them back at output), there's no need for unreliable automatic |> conversion. | |Well, it's actually you who chose the conversion on input for me. |Since the strings aren't automatically converted I have to ensure that |I have always strings encoded using the same encoding. And the only |reasonable way I can think of is to convert any string that enters my |application (or class) to an arbitrary encoding I choose in advance. Agreed. It is me. Perhaps you don't know how terrible code conversion can be. In the ideal world, lazy conversion seems attractive, but reality bites. Conversions fail so easily. Characters lost, text broken. Failures can not be avoided for various reasons, mostly historical reasons we can't fix anymore. When error happens (often) it's good to detect errors as early as possible, i.e. on input/output. So I encourage universal character set model as far as it is applicable. You may use UTF-8 or ISO8859-1 for universal character set. I may use EUC-JP for it. For only rare case, there might be need to handle multiple encoding in an application. I do want to allow it. But I am not sure how we can help that kind of applications, since they are fundamentally complex. And we don't have enough experience to design a framework for such applications. matz.
on 2006-06-21 16:59

On 6/21/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Only in rare cases might there be a need to handle multiple encodings
> in one application. I do want to allow it. But I am not sure how we
> can help that kind of application, since they are fundamentally
> complex. And we don't have enough experience to design a framework for
> such applications.

I can see one more problem with setting the encoding per file and
tagging string literals in it accordingly. If operations on strings
with different encodings always throw an exception, problems can arise
when one calls such a third-party library from a script with a
different encoding. Here's a small example:

library code in file some_utility.rb:

  # -*- coding: EUC-JP -*-
  module SomeUtility
    def SomeUtility.fancy_format(str)
      "<text>" + str + "</text>"  # these literals are tagged as EUC-JP, right?
    end
  end

application code in file my_app.rb:

  # -*- coding: UTF-8 -*-
  require 'some_utility'
  puts SomeUtility.fancy_format("an utf8 string")  # this literal is tagged as UTF-8

If the last call throws some kind of EncodingMismatchError, how do we
deal with that?
on 2006-06-21 17:19

Hi, In message "Re: Unicode roadmap?" on Wed, 21 Jun 2006 23:56:47 +0900, "Dmitry Severin" <dmitry.severin@gmail.com> writes: |I can see one more problem with setting encoding per file and tagging |accordingly string literals in it. Indeed. |Here's small example: | |library code in file some_utility.rb: |# -*- coding: EUC-JP -*- |module SomeUtility | def SomeUtility.fancy_format(str) | "<text>" + str + "</text>" # these literals are tagged as EUC-JP, right? | end |end | |application code in file my_app.rb: |# -*- coding: UTF-8 -*- |require 'some_utility' |puts SomeUtility.fancy_format("an utf8 string") # this literal is tagged as |UTF8 | |If the last call will throw some kind of EncodingMismatchError, how to deal |with that? I recommend using "ascii" encoding, which is default, for library files, unless you are sure in what encoding your input data are. For localization, tools like gettext would help dealing with strings in the native encoding. matz.
on 2006-06-21 17:36

On 6/21/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> I recommend using the "ascii" encoding, which is the default, for
> library files, unless you are sure what encoding your input data are
> in. For localization, tools like gettext would help in dealing with
> strings in the native encoding.

Just a thought. Might it be possible to have a new String literal for
what will be, I think, the most common encoding chosen (UTF-8)? That
is, in addition to:

  # -*- coding: EUC-JP -*-
  "<text>"   # tagged as EUC-JP

We allow:

  # -*- coding: EUC-JP -*-
  "<text>"   # tagged as EUC-JP
  u"<text>"  # tagged as UTF-8

Despite my belief that we should avoid an enforced universal encoding
as the String representation, I *do* plan on making most of my
applications and libraries UTF-8 friendly and aware. It's extremely
important that we be able to work with this cleanly, and if I can
simply do either u"foo" or U"foo" I would find it much easier to deal
with in those places where I need UTF-8/Unicode support.

-austin
on 2006-06-21 17:42

On 21-jun-2006, at 17:18, Yukihiro Matsumoto wrote:
> I recommend using the "ascii" encoding, which is the default, for
> library files, unless you are sure what encoding your input data are
> in. For localization, tools like gettext would help in dealing with
> strings in the native encoding.

Matz, this would be a disaster (if in such a situation a library
throws). It's going to be like Python. Because it means that 99
percent of the libraries will throw.
on 2006-06-21 18:21

Hi, In message "Re: Unicode roadmap?" on Thu, 22 Jun 2006 00:41:02 +0900, Julian 'Julik' Tarkhanov <listbox@julik.nl> writes: |Matz, this would be a disaster (if in such a situation a library |throws). It's gonna be like python. |Because it means that 99 percent of the libraries will throw. Can you elaborate? I don't want to see disaster whatever it is. matz.
on 2006-06-21 18:47

Hi, In message "Re: Unicode roadmap?" on Thu, 22 Jun 2006 00:34:27 +0900, "Austin Ziegler" <halostatue@gmail.com> writes: |Just a thought. Might it be possible to have a new String literal for |what will be, I think, the most common encoding chosen (UTF-8)? That is, |in addition to: | | # -*- coding: EUC-JP -*- | "<text>" # tagged as EUC-JP | |We allow: | | # -*- coding: EUC-JP -*- | "<text>" # tagged as EUC-JP | u"<text>" # tagged as UTF-8 I am not sure this is a good idea or not (yet). If your "u" text contains only ASCII characters, I see no need to tag it "UTF-8", and if it's not, how do we prepare them? I think, for example, u"\346\235\276\346\234\254" => my family name in Kanji is too ugly. matz.
on 2006-06-21 19:18

On 6/21/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Can you elaborate? I don't want to see disaster whatever it is.

Single scripts and small self-contained applications are almost always
written in the same codepage. Usually text data processing is also
done for the same codepage; that simplifies life a lot, even with the
current String as byte vector. So recoding is an overhead here, and
external data is only recoded on input/output in a relatively small
number of well-defined places, using a known subset of source and
target encodings. In this case, when you know what to expect from your
file/network IO, things are OK. It is also OK when part of a script is
extracted and evolves into a library, as long as you use it in the
same environment.

But let's view a case when several third-party libraries are used, all
returning strings with different encodings. gettext for libraries
won't solve everything, as even externalized strings will have some
particular encoding. E.g. localization libraries can't fit in only
ASCII. And now calls to methods will behave like some kind of IO with
respect to the encoding of passed parameters. The number of I/O points
grows drastically. How can it be solved in a consistent and reliable
manner?

a) just declare in the documentation: "Methods in these classes
*require* strings to be in UTF16, you've been warned!!!" So users of
that code will have to remember those constraints and enforce the
encoding of their data before calling those methods. With the dynamic
nature of Ruby, things will break in unexpected places. No, I dislike
the idea of writing:

  str.enforce_encoding!(BooClass::INTERNAL_ENCODING)
  b = BooClass.new(str)

b) take care in the called methods to enforce the encoding:

  def process_formatting(str)
    str.enforce_encoding!(MY_INTERNAL_ENCODING)
    # now it is compatible with the rest of my code
    # and I can do something with it
  end

This is also too error-prone :( And what about processing the results
of calls? Taking care of it in the caller code?

  res_str = SomeUtil.fancy_format(str)
  res_str.enforce_encoding!(MY_INTERNAL_ENCODING)

With input parameters and returned results which represent complex
structures with some String fields, things get even worse.

Who will ever cope with these issues? Probably this is what Julik
meant by "disaster"?

Things shouldn't be that complicated.
on 2006-06-21 21:20

On 21-jun-2006, at 18:20, Yukihiro Matsumoto wrote:
> Can you elaborate? I don't want to see disaster whatever it is.
I imagine that in the case mentioned the encoding assumed for a
library will depend on the pragma in the source.
For instance, I am writing a program that needs to work with UTF-8
data, but one of the libraries I am using has ASCII in the pragma.
What is going to happen if I ship this library UTF-8 strings? Python
libraries just throw, because they do all kinds of non-Unicode-aware
operations on strings
or request Unicode strings explicitly. So anytime you want to ship
something to a library (or get something from STDIN) you have to
decode and encode.
As soon as you forget to, you get exceptions everywhere.
on 2006-06-21 21:23

On 21-jun-2006, at 19:17, Dmitry Severin wrote:
> Who will ever cope with these issues?
> Probably this is what Julik meant by "disaster"?
>
> Things shouldn't be that complicated.

What I meant is the description of how you get a Python program
assembled from different libraries to be Unicode-aware. If Ruby works
like that, I won't be happy. Basically, some libraries accept Unicode
in Python's 16-bit form, some accept UTF-8 bytestrings, and some can
only grok ASCII and will throw up anyway. These are not going to work
on Python 3000, as I understand it.
on 2006-06-22 01:47

On 6/21/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> Conversions fail so easily. Characters are lost, text broken.
> Failures cannot be avoided for various reasons, mostly historical
> ones we can't fix anymore. When an error happens (often), it's good
> to detect it as early as possible, i.e. on input/output. So I
> encourage the universal character set model as far as it is
> applicable. You may use UTF-8 or ISO8859-1 as your universal
> character set. I may use EUC-JP for mine.

I do not see how converting the strings on input will make the
situation better than converting them later. The exact place where the
text is garbled because it is converted incorrectly does not change
the fact that it is no longer usable, does it?

Well, it may be possible to detect characters that are invalid for a
certain encoding, either by scanning the string or by attempting a
conversion. But I would rather have optional checks that can be added
when something breaks or is likely to break than forced conversion.

Or to put it another way: if I get a string whose encoding is marked
incorrectly, it is wrong and it should be expected to fail. And I can
do some checks if I think my source of data is not reliable in this
respect. But if I get a string that is marked correctly and it fails
because I did not manually convert it, that is frustrating. And
needlessly so.

> Only in rare cases might there be a need to handle multiple encodings
> in one application. I do want to allow it. But I am not sure how we
> can help that kind of application, since they are fundamentally
> complex. And we don't have enough experience to design a framework for
> such applications.

I do not think it is that rare. Most people want new web (or any
other) stuff in utf-8, but there is a need to interface with legacy
databases or applications. Sometimes converting the data to fit the
new application is not practical. For one, the legacy application may
still be in use as well.

Anyway, Ruby being as dynamic as it is, I should be able to add
support for automatic recoding myself quite easily. The problem is
that I would not be able to use it in libraries (should I ever write
some) without risking a clash with a similar feature added by somebody
else.

Thanks

Michal
on 2006-06-22 04:36

Hi, In message "Re: Unicode roadmap?" on Thu, 22 Jun 2006 02:17:53 +0900, "Dmitry Severin" <dmitry.severin@gmail.com> writes: |Things shouldn't be that complicated. Agreed in principle. But it seems to be fundamental complexity of the world of multiple encoding. I don't think automatic conversion would improve the situation. It would cause conversion error almost randomly. Do you have any idea to simplify things? I am eager to hear. matz.
on 2006-06-22 04:46

Hi, In message "Re: Unicode roadmap?" on Thu, 22 Jun 2006 08:46:08 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes: |I do not see how converting the strings on input will make the |situation better than converting them later. The exact place where the |text is garbled because it is converted incorrectly does not change |the fact it is no longer usable, does it? It does. But if you convert encoding lazily, you will have hard time to track down the source of the error causing data. It may be input data from IO, or from some GUI toolkit, or the result of operation with variety of sources. |> For only rare case, there might be need to handle multiple encoding in |> an application. I do want to allow it. But I am not sure how we can |> help that kind of applications, since they are fundamentally complex. |> And we don't have enough experience to design a framework for such |> applications. | |I do no think it is that rare. Most people want new web (or any other) |stuff in utf-8 but there is need to interface legacy databases or |applications. Sometimes converting the data to fit the new application |is not practical. For one, the legacy application may be still used as |well. I understand the challenge, but I don't think it is common to run some part of your program in legacy encoding (without conversion), and other part in UTF-8. You need to convert them into universal encoding anyway for most of the cases. That's why I said it rare. matz.
on 2006-06-22 08:58

2006/6/22, Yukihiro Matsumoto <matz@ruby-lang.org>:
> randomly. Do you have any idea to simplify things?
>
> I am eager to hear.

So what will the semantics of the encoding tag be:
 a) a weak suggestion?
 b) a strong assertion?

If the encoding tag is only a weak suggestion (and for now I see it
will be just that), it will imply:
- performance win (no need to check conformance to the declared encoding)
- win in having less complexity (most tasks use source code, text data
  input and output all in the same [default host] encoding)
- portability drawbacks (assumptions made by the original coders will be
  implicit, but they have to be figured out when porting to another
  environment)
- reliability drawbacks (weak suggestions are too often ignored, and you
  don't know when, where and why they will hit your app, but someday
  they will!)

If the encoding tag is a strong assertion, it will imply:
- probable performance loss:
  * assuring that a string tagged "none" (raw) is a valid byte sequence
    for a given encoding costs as much as String#length
  * need to recode bytes when changing the tag
- slightly more complexity (developers will have to declare these
  assertions explicitly)
- portability win
- reliability win

What compromise on these issues would be acceptable?

I'd prefer the encoding tag as a strong assertion, mostly for
reliability reasons. And for operations on Strings with different
encodings, I'd like implicit automatic encoding coercion:

-------------------------------
#
# NOTES:
# a) String#recode!(new_encoding) replaces the current internal byte
#    representation with a new byte sequence that is the recoded current
#    one. It must raise IncompatibleCharError if it can't convert a char
#    to the destination encoding.
# b) downgrading a string from some stated encoding to the "none" tag
#    must be done only explicitly; it is not an option for implicit
#    conversion.
# c) $APPLICATION_UNIVERSAL_ENCODING is a global var, allowed to be set
#    once and only once per application run.
#    Intent: we want all strings which aren't raw bytes to be in one
#    single predefined encoding, so all operations on strings must
#    return strings in a conformant encoding. The desired encoding is
#    the value of $APPLICATION_UNIVERSAL_ENCODING.
#    If $APPLICATION_UNIVERSAL_ENCODING is nil, we go into "democracy
#    mode", see below.
#
def coerce_encodings(str1, str2)
  enc1 = str1.encoding
  enc2 = str2.encoding

  # simple case, same encodings, will return fast in most cases
  return if enc1 == enc2

  # another simple but rare case: totally incompatible encodings, as
  # they represent incompatible charsets
  if fully_incompatible_charsets?(enc1, enc2)
    raise IncompatibleCharError, "incompatible charsets #{enc1} and #{enc2}"
  end

  # uncertainty: handling "none" alongside a preset encoding
  if enc1 == "none" || enc2 == "none"
    raise UnknownIntentEncodingError,
          "can't implicitly coerce encodings #{enc1} and #{enc2}, use explicit conversion"
  end

  # Tyranny mode:
  # we want all strings which aren't raw bytes in one single predefined
  # encoding
  if $APPLICATION_UNIVERSAL_ENCODING
    str1.recode!($APPLICATION_UNIVERSAL_ENCODING)
    str2.recode!($APPLICATION_UNIVERSAL_ENCODING)
    return
  end

  # Democracy mode:
  # first try to perform a lossless conversion from one encoding to the
  # other:
  # 1) direct lossless conversion to the other encoding, e.g. UTF8 + UTF16
  if exists_direct_non_loss_conversion?(enc1, enc2)
    if exists_direct_non_loss_conversion?(enc2, enc1)
      # performance hint if both directions are available
      if str1.byte_length < str2.byte_length
        str1.recode!(enc2)
      else
        str2.recode!(enc1)
      end
    else
      str1.recode!(enc2)
    end
    return
  end
  if exists_direct_non_loss_conversion?(enc2, enc1)
    str2.recode!(enc1)
    return
  end

  # 2) lossless conversion to a superset
  # (I see no reason to raise an exception on KOI8R + CP1251;
  # returning a string in Unicode will be OK)
  if superset_encoding = find_superset_non_loss_conversion?(enc1, enc2)
    str1.recode!(superset_encoding)
    str2.recode!(superset_encoding)
    return
  end

  # A case of incomplete compatibility:
  # check if a subset of enc1 is also a subset of enc2, so some strings
  # in enc1 can be safely recoded to enc2, e.g. two pure ASCII strings,
  # whatever ASCII-compatible encodings they have
  if exists_partial_loss_conversion?(enc1, enc2)
    if exists_partial_loss_conversion?(enc2, enc1)
      # performance hint if both directions are available
      if str1.byte_length < str2.byte_length
        str1.recode!(enc2)
      else
        str2.recode!(enc1)
      end
    else
      str1.recode!(enc2)
    end
    return
  end

  # the last thing we can try
  str2.recode!(enc1)
end
---------------------------

So, when an operation involves two Strings, or a String and a Regexp,
with different encodings, automatic coercion should be done as
described above. That will probably solve the coding problems (no need
to think about encodings most of the time), but it can have the
following impacts:
1) after several operations, when one sends a string to external IO,
it might be internally encoded in a superset of that IO's encoding.
One has to remember that and perform external IO accordingly, i.e.
decide whether to fail on invalid chars or use replacement chars (like
U+FFFD) -- but that is unavoidable.
2) some performance hits, which I expect to be rare.

Besides, there can be another class of problems with automatic
coercion: how to ensure consistent behaviour of character ranges in
Regexps and of String methods like [count, delete, squeeze, tr, succ,
next, upto] when encodings are coerced?

What I, as a Ruby user, wish for in Unicode/M17N support:
1) reliability and consistency:
  a) String should be an abstraction for a character sequence;
  b) String methods shouldn't allow me to garble the internal
     representation;
  c) treating a String as a byte sequence is handy, but must be
     explicitly stated.
2) coding comfort:
  a) no need to care what encodings strings have while working with
     them;
  b) no need to care what encodings strings returned from third-party
     code have;
  c) explicitly stated conversion options for external IO.
3) on Unicode and i18n: at least a set of classes for Unicode-specific
tasks (collation, normalization, string search, locale-aware
formatting etc.) that would efficiently work with Ruby strings.

And, for all out there, just ask "Which charset/encoding will fit all
the [present and future] needs?". You know the exact answer: "NONE".

> I understand the challenge, but I don't think it is common to run some
> part of your program in a legacy encoding (without conversion) and
> another part in UTF-8. You need to convert them into a universal
> encoding anyway for most of the cases. That's why I said it is rare.

Uhm, how do we convert a compiled extension library?
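For illustration, the call site such coercion is meant to serve; coerce_encodings is Nikolai's sketch above, and raw_concat is a hypothetical byte-level primitive:

  class String
    def +(other)
      coerce_encodings(self, other)  # may recode! either operand in place
      raw_concat(other)              # hypothetical: concatenate the raw bytes
    end
  end

With that in place, str1 + str2 either succeeds with a sensibly tagged result or raises one of the two errors defined in the sketch.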
on 2006-06-22 10:18

Hi, In message "Re: Unicode roadmap?" on Thu, 22 Jun 2006 15:55:18 +0900, "Lugovoi Nikolai" <meadow.nnick@gmail.com> writes: |> I am eager to hear. | |So what will be semantic for encoding tag: | a) weak suggestion? | b) strong assertion? Weak suggestion, if I understand you correctly. |I'd prefer encoding tag as strong assertion, mostly for reliability reasons. Hmm, your idea of combination of strong assertion and automatic conversion seems too complex for me, but it may be worth considering. Thank you for idea. |uhm, how to convert compiled extension library? Every extension that does input/output need to specify (either explicitly or implicitly) encoding it uses anyway. I will add an encoding option to rb_tainted_str_new() and its family. If it's possible, I'd like to allow extensions to declare their default encoding in their initialize function (Init_xxx). matz.
on 2006-06-22 12:42

On 6/22/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> A weak suggestion, if I understand you correctly.
>
> Hmm, your idea of combining strong assertion and automatic conversion
> seems too complex for me, but it may be worth considering. Thank you
> for the idea.

What I had in mind was much simpler. If the strings do not match, just
try to recode to the default encoding, which would be Unicode most of
the time. Or just try to find a superset.

> Every extension that does input/output needs to specify (either
> explicitly or implicitly) the encoding it uses anyway. I will add an
> encoding option to rb_tainted_str_new() and its family. If it's
> possible, I'd like to allow extensions to declare their default
> encoding in their initialization function (Init_xxx).

But if recoding is not automatic, you still have to recode the strings
manually -- both the input to the extension and the results. That is
an annoyance and repetitive code everywhere.

Thanks

Michal
on 2006-06-22 17:32

On Wed, Jun 21, 2006 at 01:04:55AM +0900, Tim Bray wrote:
> Well, unless you had a String class that took care of the encoding
> details and, when you were ready to output, allowed you to say "Give
> me that in ISO-8859 or UTF-8 or whatever". -Tim

That's basically what I suggested. The problem seems to be mainly
non-Unicode demands on the one hand, and performance issues on the
other. And it makes Strings useless as byte buffers, since you have to
specify the encoding of the external representation you create the
String from at creation time.

To recap:

Private extensions to Unicode are deemed too complex to implement
(Matz).

Transforming legacy or special (non-Unicode) data to a Ruby-private
internal storage format on I/O is too performance/space intensive
(Matz).

Strings as byte buffers are important to some people, and they don't
want to use another class or array for it, even if RegExp et al. would
be extended to handle these too.

While it would be proper OO design, encapsulating the internal String
implementation hampers direct access to the "raw" data for C hackers,
creating unwanted hurdles, and again performance issues.

I am still not convinced the arguments against this approach will
really hold in the long run, but since I am not the one implementing
it and can't really participate there due to language barriers, I can
only lean back and wait for the first release of M17N. Learning
English was hard enough for me.

-Jürgen
on 2006-06-25 16:41
Yukihiro Matsumoto wrote:
> Alright, then what specific features are you (both) missing? I don't
> think it is a method to get the number of characters in a string. It
> can't be THAT crucial. I do want to cover "your missing features" in
> the future M17N support in Ruby.

Sorry for butting in, but here are my five cents.

When I first found out about Ruby, I practically fell in love with the
language. Unfortunately, after some studying and experimenting I found
that it lacks proper Unicode support on Win32, in particular with file
IO and OLE automation, i.e. in the two cases where I had to
interoperate with the rest of the world.

Win32 really differs from Linux and maybe other Unixes in its API,
because on *nix you don't have to worry about Unicode or whatnot: the
whole system depends on your current locale. On Win32 there are two
sets of APIs, ANSI and Unicode; maybe that was a bad decision on
Microsoft's part, but that's the reality.

Now, I am Russian, and when I write scripts I have to make sure that
not only Russian characters don't get messed up, but characters of
other languages as well. So if I receive, say, an Excel file with a
lot of languages in it, and I have to process that file somehow, I
have to be sure that no letters will be lost or mangled; thus
converting it to the current codepage (1251) is not an option for me.
The same goes for filenames: the fact that I'm running Russian WinXP
doesn't mean I only have filenames that fall into the 1251 codepage. I
also have filenames with European characters (umlauts and such), as
well as Japanese, and when I want to write a script that processes
these files, I have to be able to work with them.

At the time, this caused me to move to Tcl (it has UTF-8 encoding
everywhere, and it converts to the required encoding when
interoperating with the world). Since then I'm still waiting for
proper Unicode support in Ruby (read: proper interoperability with the
operating system and its components using the Unicode API versions,
the ones ending with W) and maybe a way to define in which locale (a
specific code page, UTF-8, etc.) the current script is running.

Hope that clarifies what is currently missing for me (and maybe
others, I don't know).
on 2006-06-25 17:11

Hi, In message "Re: Unicode roadmap?" on Sun, 25 Jun 2006 23:41:48 +0900, Snaury Miyoto <snaury@gmail.com> writes: |Hope that clarifies what is currently missing for me (and maybe others, |I don't know). Unfortunately, not. I understand Russian people having problem with multiple encoding, but I don't know how can we help you. You said Tcl has Unicode support that works well with you. So that I think treating all of them in UTF-8 is OK for you. Then how can it determine which should be in the current code page, or in Unicode? Or using Win32 API ending with W could allow you living in the Unicode? matz.
on 2006-06-25 19:19

On 22.6.2006, at 10:17, Yukihiro Matsumoto wrote:
> Hmm, your idea of combining strong assertion and automatic conversion
> seems too complex for me, but it may be worth considering.

Strong assertion + auto conversion is the only solution which will
relieve programmers from manually checking/changing string encodings
in their programs.

Remember, the string input/output points in a program are not only the
system IO classes, but also all the third-party libraries/classes
which deal with strings -- which means most existing Ruby libraries,
as well as external (e.g. Java) libraries that can be used from Ruby.
The assumption that only system IO is the entry/exit point for string
encoding is very wrong. This assumption holds only for scripts which
use no third-party libraries.

So we have two possibilities:

a) every programmer is forced to implement the above solution in every
program (this is starting to happen already, and current experience
tells us that the future in this direction is a disaster!)

b) the Ruby interpreter implements this solution, and programmers
happily ignore all the complexity.

So, it is true that we move the complexity into Ruby, but this is
(IMHO) much less complicated and much more needed than e.g. the
infinitely big integers which we already have.

If Ruby wants to move forward, it needs transparent String support and
hopefully separation of String and ByteArray, since this
non-separation brought us code which is mostly wrong (currently most
of the existing Ruby code breaks if string encoding is honoured, as
can be seen from the experience of the brave people who modified the
String class).

Ruby is my favourite language, and if it had String support as
suggested, software development would be pure joy...

Please listen to the people who tell of disastrous experience in other
languages. And for good experience: I have developed in Cocoa on Mac
OS X for many, many years, and it has a great String class (OK, the
suggested Ruby class would be even better, but still). Plus it has
separated String and byte array. The results are superb. There are no
problems, and nobody ever worries about strings and encodings. Ever.
You can check the mailing lists.

izidor
on 2006-06-25 19:28

On 25-jun-2006, at 19:18, Izidor Jerebic wrote:
> Please listen to the people who tell of disastrous experience in
> other languages. And for good experience: I have developed in Cocoa
> on Mac OS X for many, many years, and it has a great String class
> (OK, the suggested Ruby class would be even better, but still). Plus
> it has separated String and byte array. The results are superb. There
> are no problems, and nobody ever worries about strings and encodings.
> Ever. You can check the mailing lists.

The greatest thing about Cocoa is that I can assume that 99 percent of
the programs I use do The Right Thing when I want to input Russian
text, and NOT because the programmer did something special to make it
work. Because if he had to, he wouldn't. In contrast, 70 percent of
Carbon applications are not even capable of displaying the text
properly (let alone letting me type it in).
on 2006-06-25 21:08

On 6/25/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote: > Or using Win32 API ending with W could allow you living in the > Unicode? Matz, I've mentioned it before, but I will be happy to make the Windows APIs work with Unicode once the m17n Strings exist. Yes, I will be making them use either UTF-8 (conversion required, most likely to be compatible with existing code) or UTF-16 (no conversion required). It will work well: I have done a similar implementation for code that I have written at work. -austin
on 2006-06-25 21:15

On 6/25/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> If Ruby wants to move forward, it needs transparent String support and
> hopefully separation of String and ByteArray, since this
> non-separation brought us code which is mostly wrong (currently most
> of the existing Ruby code breaks if string encoding is honoured, as
> can be seen from the experience of the brave people who modified the
> String class).

This is an incorrect and unsupportable statement. It is completely
unnecessary to separate unencoded (e.g., binary) String support into
String and ByteArray.

Please don't assume that the problem is this completely unnecessary
division. The problem is that existing strings are completely
unencoded and have no way of being flagged with an encoding that is
supported across all of Ruby. People are making really *stupid*
assumptions based on the choices other development teams have made,
and it's irritating.

Ruby does not need a String with an internal representation in
Unicode; Ruby does not need a separate byte vector. An unencoded
string can be treated as a byte vector with no problems; if it is
determined to have textual meaning, it can be tagged with an encoding
very simply and from that point be treated as a meaningful string.
There are times when the encoding is *not* best treated as Unicode,
especially if there are potential conversion errors.

-austin
on 2006-06-25 22:40

On 6/25/06, Austin Ziegler <halostatue@gmail.com> wrote:
> Ruby does not need a String with an internal representation in
> Unicode; Ruby does not need a separate byte vector. An unencoded
> string can be treated as a byte vector with no problems; if it is
> determined to have textual meaning, it can be tagged with an encoding
> very simply and from that point be treated as a meaningful string.
> There are times when the encoding is *not* best treated as Unicode,
> especially if there are potential conversion errors.

When is a ByteArray not a ByteArray? When is a String not a String? Is
it correct to mingle the two concepts perpetually, when they each have
fairly specific definitions?

My problem with continuing to treat String as a byte vector is that it
forces two somewhat incompatible concepts onto the same class and the
same methods. If you can use a String as both a byte vector and as a
sequence of characters by calling the same methods, then setting or
clearing the encoding suddenly has the side effect of changing how
elements of the String are to be treated. If you are providing
separate methods for working with bytes as opposed to working with
characters, then you are already splitting the two concepts.

(As an aside, does it make sense that I read from a binary file into a
String? Can I reliably assume that binary content in a String can be
logically manipulated as text strings are? Should my binary String
work anywhere and everywhere a text-based String does? I would think
that binary content neither walks nor quacks like a String.)

By your definition, a String can be treated as a ByteArray so long as
its internal data does not have an encoding. What do I use if I want
to have an encoding and still use byte vector semantics? Is it
appropriate that a String is no longer usable as a ByteArray as a
result of changing some state? If there exists any state where String
cannot logically be treated as a byte array, then String != ByteArray
in the general case either. The encoding of a String's internal
representation should not dictate the outward behavior of the String.

If, however, you completely separate the two concepts, there's no
dichotomy. In that case, a String deals with characters, and you do
not have guarantees about byte boundaries or indexed elements. You
only have guarantees about characters, as it should be.
Simultaneously, ByteArray would allow you to always work with a vector
(array) of bytes, regardless of what those bytes contain.

I'll end by saying this: I think it's a no-brainer that for dealing
with streams of bytes, there should be a non-String byte vector class.
If folks are insistent on keeping them the same class, you can't
logically continue to call it a String and have it fulfill the dual
purposes of byte vector and character vector at the same time. If you
plan to provide methods for supporting both behaviors, you're putting
two distinct behaviors into the same type. I understand the
unwillingness to move away from String as a byte vector, but with
multibyte support coming you really can't have String == ByteArray
without causing problems somewhere. They simply don't have the same
behavior, and trying to pretend they do is asking for trouble.
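A tiny sketch of the ambiguity being debated; the character-oriented behavior in the comments is hypothetical, since in today's Ruby indexing is byte-based either way:

  jpeg = File.open("photo.jpg", "rb") { |f| f.read }  # bytes in a String
  jpeg[5]   # unambiguous: the sixth byte

  text = "na\xC3\xAFve"   # the word "naive" with i-diaeresis, in UTF-8 bytes
  text[2]   # byte semantics: 0xC3, half of a character
            # character semantics: the third character, i-diaeresis
            # one method, two reasonable answers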
on 2006-06-25 22:46

On 25.6.2006, at 21:12, Austin Ziegler wrote: > String and ByteArray. Well, if it is a byte array, it is not a String (an array of characters), is it? If Ruby had RegEx operations on byte arrays, there would be no need for an untyped quasi-String. An API that has two incompatible things as one class is just plain ugly and wrong. Reading a jpeg image into a String is totally wrong. You need bytes. You get characters, but they aren't really characters, they are bytes. Until something happens (maybe) and they are characters (maybe), or they are not (maybe). img_var[5] is what? The 6th byte? The 6th character (possibly several bytes, if the encoding is utf8)? What exactly? Is this a clear API? There is no need for bytes masquerading as Strings. None. This practice just confuses the writer and the reader of the code. You need either bytes or Strings. Never both in the same variable. They are semantically totally different. At least they should be (we would not have problems if people would honour this distinction). > > Please don't assume that the problem is this completely > unnecessary division. The problem is that existing strings are > completely unencoded and have no way of being flagged with an encoding > that is supported in any way across all of Ruby. The problem is exactly this: the separation between bytes and characters. This is the general problem we have and discuss right now. The API should help us solve the problem. And you apparently missed all the attempts to extend String (also with encodings a la 1.9) that failed because of existing software, not because of Ruby. > > Ruby does not need a String with an internal representation in > Unicode; Nobody says at this point of the conversation that we need an internal representation in unicode for all strings. We just want to avoid thinking about ANY encoding. We have other things to do. So having transparent conversions between compatible encodings is a must. > Ruby does not need a separate byte vector. An unencoded string can be > treated as a byte vector with no problems; if it is determined to have > textual meaning, it can be tagged with an encoding very simply It can be, but it is not and will not be. Do you read emails? The problem is that people do not do things like that. And then other people have problems. If all the code you run is yours, then you are right. For many people that is not true. > There are times when the > encoding is *not* best treated in Unicode, especially if there are > potential conversion errors. Why do you keep on about this? Once again - WE DO NOT CARE WHAT ENCODING IS THERE. We just want the string operations to work without any extra programming work when operands have compatible encodings. As written very well by Lugovoi Nikolai: > b) no need to care what encodings have strings returned from third- > party code; > c) using explicitly stated conversion options for external IO. > 3) on Unicode and i18n : at least to have a set of classes for > Unicode-specific tasks (collation, normalization, string search, > locale-aware formatting etc.) that would efficiently work with Ruby > strings. Me too, please. izidor
on 2006-06-26 00:16

On 6/25/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote: > > > > This is an incorrect and unsupportable statement. It is completely > > unnecessary to separate unencoded (e.g., binary) String support into > > String and ByteArray. > > Well, if it is a byte array, it is not a String (an array of > characters), is it? > > If Ruby had RegEx operations on byte arrays, there would be no > need for an untyped quasi-String. An API that has two incompatible things > as one class is just plain ugly and wrong. Here you contradict yourself. Regexes are string (character) operations, and you want them on byte arrays. So the concepts aren't really separate. Similarly, when you read part of a file and use it to determine what kind of file it was, you do not want to convert that part into another class or re-read it because somebody decided String and ByteVector are separate. Plus, this has already been mentioned here. Michal
on 2006-06-26 00:25

> Here you contradict yourself. Regexes are string (character) > operations, and you want them on byte arrays. So the concepts aren't > Similarly, when you read part of a file, and use it > to determine what kind of file it was you do not want to convert that > part into another class or re-read it because somebody decided String > and ByteVector are separate. Why not? When I read CGI params I get them as strings, but if I want to add them together I need to convert them to integers, because someone decided that "1" != 1. This is a good thing, so you don't get "5 purple elephants"+"3 monkeys" = 8, like you do in PHP. Likewise, when you read from a file/socket/whatever you might not be getting a real string, you might be getting a byte array. They are fundamentally different things: a byte array may happen to contain text at some point, but some time later it may be just a stream of data. Conversely, a String _always_ contains human-readable text in whatever encoding you want. As someone who has to work with Unicode in PHP, I'd say it's important to separate the types. If you want to display something to a user you have to know what it is, but when you're reading a file you don't care, unless you know what's in it. A Unicode String could be a subclass of the byte array with some niceties for dealing with multibyte characters. Just a thought.
on 2006-06-26 00:37

On Jun 25, 2006, at 1:45 PM, Izidor Jerebic wrote: > Well, if it is a byte array, it is not a String (an array of > characters), is it? +1 to this and to Nutter previously. Text strings and byte arrays are different kinds of things and both are useful and I don't see any benefit from trying to pretend they're the same thing. But some apparently-smart people seem to think there is a benefit; perhaps they could explain it in simple terms for those of us insufficiently clued to see it? -Tim
on 2006-06-26 00:59

Hi, In message "Re: Unicode roadmap?" on Mon, 26 Jun 2006 05:38:46 +0900, "Charles O Nutter" <headius@headius.com> writes: |When is a ByteArray not a ByteArray? When is a String not a String? Is it |correct to mingle the two concepts perpetually, when they each have fairly |specific definitions? My problem with continuing to treat String as a byte |vector is that it forces two somewhat incompatible concepts on the same |class and the same methods. A string is a sequence of data that can be represented by small integers. Some may want to treat them as CharacterStrings, others may want to treat them as ByteStrings. They are not as different as you say. On many platforms, a file can contain text data or binary data. Is a chunk of data read from an open file text, or binary? If you separate ByteArray and (Character) String, you will need two separate IO classes, BinaryIO and TextIO, etc. Or you will need an explicit conversion from a read ByteArray to a CharacterString. That makes Ruby programs look a lot like Java programs, which I don't want them to be. One of the good properties of the Ruby class library is its small number of classes. A class might have multiple roles. For example, a Ruby Array can be treated as a Stack, a Queue, etc. And it is a good thing, rather than having separate classes for each role. Why can't Strings be both sequences of text and binary data? matz.
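To make the Array analogy concrete, one class covers both roles using only standard, existing Array methods -- no Stack or Queue class needed:

  stack = []
  stack.push 1
  stack.push 2
  stack.pop      # => 2  -- LIFO: Array acting as a stack

  queue = []
  queue.push 1
  queue.push 2
  queue.shift    # => 1  -- FIFO: the very same API acting as a queue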
on 2006-06-26 03:02

On 6/25/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote: > Well, if it is a byte array, it is not a String (an array of > characters), is it? It could be indistinguishable from such. Even a Unicode string is ultimately an array of bytes in memory. It just happens that there's a higher level abstraction that can be used to interpret that particular array of bytes. What you're asking for is rather like the difference between std::string and std::vector<unsigned char>. They represent the same thing, but don't work the same. If you're going to have a String and ByteVector that work the same (except that the String also has the higher-level interpretation of characters), is it meaningfully a different object? I think not. Indeed, I think that having a separate object for these would increase the overall complexity and reduce the usability overall. >> Please don't assume that the problem is this completely >> unnecessary division. The problem is that existing strings are >> completely unencoded and have no way of being flagged with an >> encoding that is supported in any way across all of Ruby. > The problem is exactly this: the separation between bytes and > characters. This is the general problem we have and discuss right now. > The API should help us solve the problem. > And you apparently missed all the attempts to extend String (also with > encodings a la 1.9) that failed because of existing software, not > because of Ruby. Excuse me? You don't know what you're talking about here. No existing version of Ruby has a String with encodings. Not even Ruby 1.9. Any extension which tries to do this *will fail* because there is no way to enforce this extension's semantics on all of Ruby and all extensions. Ruby 1.9 will be different because the m17n String will be a guaranteed behaviour in Ruby. The problem is not the separation between bytes and characters, but that there's no way *in Ruby* to distinguish between the two, at least not reliably. >> Ruby does not need a String with an internal representation in >> Unicode; > Nobody says at this point of the conversation that we need an internal > representation in unicode for all strings. We just want to avoid > thinking about ANY encoding. We have other things to do. So having > transparent conversions between compatible encodings is a must. I think that you're confusing me with someone else. Most people who have advocated a separate ByteVector have been unable to articulate exactly what this would buy us, and most have also advocated an internal Unicode representation of Strings. I have been one of the ones who have advocated transparent conversions all along. Frankly, with coercion, it would be possible to upconvert any two encodings to a compatible common encoding. >> Ruby does not need a separate byte vector. An unencoded string can be >> treated as a byte vector with no problems; if it is determined to >> have textual meaning, it can be tagged with an encoding very simply > It can be, but it is not and will not be. Do you read emails? The > problem is that people do not do things like that. And then other > people have problems. If all the code you run is yours, then you are > right. For many people that is not true. "Is not" is a useless term. OF COURSE IT ISN'T -- right now. In the future, with the m17n Strings, it could be -- and would be. And yes, I have read every single one of these emails about Unicode. Most of them have been ignorant of anything but their own narrow needs and clueless about good API design.
>> There are times when the encoding is *not* best treated in Unicode, >> especially if there are potential conversion errors. > Why do you keep on about this? > > Once again - WE DO NOT CARE WHAT ENCODING IS THERE. We just want the > string operations to work without any extra programming work when > operands have compatible encodings. I suggest you look through the Unicode threads again. You'll find your statement is untrue. There are a lot of people who (foolishly) want Unicode to be the only internal representation of Strings in Ruby. > As written very well by Lugovoi Nikolai: >> What I, as Ruby user, wish for Unicode/M17N support: >> 1) reliability and consistency: >> a) String should be abstraction for character sequence, >> b) String methods shouldn't allow me to garble internal >> representation; >> c) treating String as byte sequence is handy, but must be explicitly >> stated. An unencoded -- raw -- String would be *only* interpretable as a byte sequence unless "recoded." Aside from that, everything said above would be true. >> 2) coding comfort: >> a) no need to care what encodings have strings while working with >> them; >> b) no need to care what encodings have strings returned from third- >> party code; >> c) using explicitly stated conversion options for external IO. You'll always need to care, even if you're using Unicode. You can't *not* care and claim to be doing Unicode or m17n work. We can *reduce* those concerns, but you *CANNOT* be ignorant of this at any time. >> 3) on Unicode and i18n : at least to have a set of classes for >> Unicode-specific tasks (collation, normalization, string search, >> locale-aware formatting etc.) that would efficiently work with Ruby >> strings. > Me too, please. That would be useful. -austin
on 2006-06-26 03:14

On 6/25/06, Phillip Hutchings <sitharus@sitharus.com> wrote: >> Here you contradict yourself. Regexes are string (character) >> operations, and you want them on byte arrays. So the concepts aren't >> Similarly, when you read part of a file, and use it to determine >> what kind of file it was you do not want to convert that part into >> another class or re-read it because somebody decided String and >> ByteVector are separate. > Why not? When I read CGI params I get them as strings, but if I want > to add them together I need to convert them to integers, because > someone decided that "1" != 1. This is a good thing, so you don't get > "5 purple elephants"+"3 monkeys" = 8, like you do in PHP. Sorry, but "reading" CGI params is a red herring. You may get it as one thing and then convert it to something else. > Likewise, when you read from a file/socket/whatever you might not be > getting a real string, you might be getting a byte array. They are > fundamentally different things, a byte array may happen to contain > text at some point, but some time later it may be just a stream of > data. Conversely a String _always_ contains human-readable text in > whatever encoding you want. Okay. What class should I get here? data = File.open("file.txt", "rb") { |f| f.read } Under the scheme of those who want separate ByteVector and String classes, I'll need *two* APIs: st = File.open("file.txt", "rb") { |f| f.read_string } bv = File.open("file.txt", "rb") { |f| f.read_bytes } Stupid, stupid, stupid, stupid. If I have guessed wrong about the contents of file.txt, I have to rewind and read it again. Better to *always* read as bytes and then say, "this is actually UTF-8". This would be as stupid in C++, Java, or C#: class File { bool read(string& st); bool read(byte_vector& bv); } Yes, I can't actually read into the item, but have to call an accessor. Moronic design, mostly because I can't do: class File { string read(void); byte_vector read(void); } That would help in static languages, but they can't do that -- and Ruby can't do it either, since variables are just labels. > As someone who has to work with Unicode in PHP, I'd say it's important > to separate the types. If you want to display something to a user you > have to know what it is, but when you're reading a file you don't > care, unless you know what's in it. The problem here is not unification. The problem here is that PHP is stupid. It is generally recognised that Ruby's API decisions are much smarter than those of most other languages, and this is a good example of where this would happen. > A Unicode String could be a subclass of the byte array with some > niceties for dealing with multibyte characters. Just a thought. Unnecessary and overcomplex. -austin
on 2006-06-26 03:23

> Sorry, but "reading" CGI params is a red herring. You may get it as one > thing and then convert it to something else. Exactly. > > Likewise, when you read from a file/socket/whatever you might not be > getting a real string, you might be getting a byte array. They are > fundamentally different things, a byte array may happen to contain > text at some point, but some time later it may be just a stream of > data. Conversely a String _always_ contains human-readable text in > whatever encoding you want. > > Okay. What class should I get here? > > data = File.open("file.txt", "rb") { |f| f.read } A byte vector. Unknown input, so you just get a stream of bytes. > Under the scheme of those who want separate ByteVector and String classes, I'll > need *two* APIs: > > st = File.open("file.txt", "rb") { |f| f.read_string } > bv = File.open("file.txt", "rb") { |f| f.read_bytes } Why? This looks needlessly complex. string = File.open('file.txt', 'r') { |f| f.read.to_s(:"utf-8") } Or possibly string = File.open('file.txt', 'r') { |f| f.read(:utf8) } bytes = File.open('file.txt', 'r') { |f| f.read(:bytearray) } with no argument assuming it's a default encoding. But with this approach the same class could be used for both, which takes us full circle ;) > > As someone who has to work with Unicode in PHP, I'd say it's important > to separate the types. If you want to display something to a user you > have to know what it is, but when you're reading a file you don't > care, unless you know what's in it. > > The problem here is not unification. The problem here is that PHP is > stupid. It is generally recognised that Ruby's API decisions are much > smarter than those of most other languages, and this is a good example of where > this would happen. Hence why I'm using Ruby, but I'm paid for PHP. Ruby is by far the nicer language. The best approach to my untrained eye would be for some sort of global setting for all libraries to operate on, and the developer has to ensure that all data are read in that encoding. Hopefully that will make dealing with legacy data easier. The ideal situation would be for everything to be in one encoding, but that just doesn't happen.
on 2006-06-26 04:24

Hi, In message "Re: Unicode roadmap?" on Mon, 26 Jun 2006 10:22:15 +0900, "Phillip Hutchings" <sitharus@sitharus.com> writes: |> st = File.open("file.txt", "rb") { |f| f.read_string } |> bv = File.open("file.txt", "rb") { |f| f.read_bytes } | |Why? This looks needlessly complex. | |string = File.open('file.txt', 'r') { |f| f.read.to_s(:"utf-8") } | |Or possibly |string = File.open('file.txt', 'r') { |f| f.read(:utf8) } |bytes = File.open('file.txt', 'r') { |f| f.read(:bytearray) } Both are more complex than the current design. If File can return String or ByteArray, why shouldn't a String with "no encoding" behave as a sequence of bytes, instead of separating the classes? Are there any specific operations that should be in ByteArray but not in String, or vice versa? matz.
on 2006-06-26 04:43

One clarification I'd like to add to this: I'm not saying that a ByteArray needs to be added, but if you're going to treat String as a ByteArray, then perhaps there should be another type for character vectors? Perhaps through some logic (perhaps the fact that this is the "way it is" in Ruby 1.8) String does == ByteArray. If I could play devil's advocate for a moment, maybe the new, fancy m17n String, however it's implemented, should be a different class? String == ByteArray in form and function CharString == a string of characters with some particular encoding, character logic, and so on Perhaps even CharString < String, so it retains byte-level read/write operations. There's another obvious advantage here...APIs that currently return a byte array String will continue to do so, as they work in Ruby 1.8. CharString could also be implemented today for Ruby 1.8, providing an encoding and character-aware String implementation for applications that need it. My only point about the dichotomy between <byte collection treated as a string> and <character collection treated as a string> is that at some level, they imply different behaviors, different APIs, different interfaces. Perhaps the answer is not to change existing Ruby code to use a m17n String while trying to retain byte array capabilities at the same time...but maybe it's worth considering that the new behavior warrants a separate type? String.to_cs(:utf8) => CharString String retains current interface and semantics CharString gains the [n] => character or single-char string rather than int, etc. I know you (matz) want to break as much as possible with the 2.0 release, but I still don't see the advantage of marrying the "byte array string" and "char string" types in the same class when separate types and behaviors would be more logical and break far less.
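A rough 1.8-flavoured sketch of what that split might look like -- only the class name CharString comes from the proposal above; the implementation is purely illustrative and assumes $KCODE is set to match the encoding:

  class CharString < String
    alias_method :byte_slice, :[]   # keep String's byte-level [] reachable
    attr_reader :encoding
    def initialize(bytes, encoding)
      super(bytes)
      @encoding = encoding
    end
    # [] now deals in characters: a one-character string, never a byte value
    def [](index)
      scan(/./m)[index]
    end
  end

  $KCODE = 'u'
  s = CharString.new("日本語", :utf8)
  s[1]             # => "本"  (the second character)
  s.byte_slice(1)  # => 151   (the second byte, via the old String behaviour)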
on 2006-06-26 04:43

On 6/25/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote: >| bytes = File.open('file.txt', 'r') {f.read(:bytearray)} > They are equally more complex than the current design. If File can > return String or ByteArray, why shouldn't String with "no encoding" > behave as sequence of bytes instead of separating? Are there any > specific operations that should be in ByteArray but not in String, or > vise versa? There are operations for Strings (#each_character, perhaps) that make less sense for ByteVectors than for character-based Strings. But everything or nearly everything you would want to do with a ByteVector you would want to do with a String, and some operations from Strings make sense on ByteVectors (regexp operations). I would much rather keep the API -- and the class library -- simple. I would rather do this: st = File.open("file.txt", "rb", :encoding => :utf8) { |f| f.read } or bv = File.open("file.txt", "rb") { |f| f.read } st = bv.to_encoding(:utf8) -austin
on 2006-06-26 05:02

Hi, In message "Re: Unicode roadmap?" on Mon, 26 Jun 2006 11:37:45 +0900, "Charles O Nutter" <headius@headius.com> writes: |I know you (matz) want to break as much as possible with the 2.0 release, |but I still don't see the advantage of marrying the "byte array string" and |"char string" types in the same class when separate types and behaviors |would be more logical and break far less. I still don't see how separate types and behaviors would be more logical and break far less. For example, if I want to check EXIF conformance of a jpeg file, I do def self.exif_file?(filename) exif_header = "\xff\xd8\xff\xe1" magic = File.open(filename) {|f| f.read(4) } magic == exif_header end I am not sure what you expect from the separation, but I doubt separation would make the above code "more logical and break far less". matz.
on 2006-06-26 06:43

On Jun 25, 2006, at 6:11 PM, Austin Ziegler wrote: > Under the scheme of those who want separate ByteVector and String classes, I'll > need *two* APIs: > > st = File.open("file.txt", "rb") { |f| f.read_string } > bv = File.open("file.txt", "rb") { |f| f.read_bytes } Maybe I'm missing something, but in today's networked heterogeneous environment, that first call looks deeply dangerous to me. I don't see how you can expect to get a String out of a file in the general case. Files contain bytes, strings contain characters, and pretending you can get from one to the other without explicit encoding specification or inference is unsound. Pardon me if I'm missing something obvious. -Tim
on 2006-06-26 06:49

On Jun 25, 2006, at 7:21 PM, Yukihiro Matsumoto wrote: > Are there any > specific operations that should be in ByteArray but not in String, or > vice versa? Well, on strings, indexing and substring operations and iterators and regular expressions should (at least optionally) have character rather than byte semantics, right? Another example is encoding normalization (combining diacritics, etc) which doesn't apply to byte arrays. -Tim
on 2006-06-26 06:53

On 26.6.2006, at 5:01, Yukihiro Matsumoto wrote: > I am not sure what you expect from the separation, but I doubt separation > would make the above code "more logical and break far less". The above code assumes all file operations return byte arrays. What is the code when we want to obtain a String of characters? What if there is some $KCODE (or equivalent) setting somewhere in the program before these lines? What would be the effect of that? The problem is the auto-magic encoding handling which is required to keep text processing as simple as it is now. You can have either text processing (which handles encodings for us, combines bytes into characters etc.) or byte processing (which does not). How do we distinguish between the two modes of operation? The obvious way is by adding a ByteArray. But maybe there is a better way... izidor
on 2006-06-26 07:26

Yukihiro Matsumoto wrote: > I am not sure what you expect from the separation, but I doubt separation > would make the above code "more logical and break far less". Just jumping into the discussion here, I have to agree with Matz. A char-vector is simply a higher-level representation of a byte-vector, not different enough to warrant two entirely separate classes. I think the real issue is not technical but rather a problem of perception and education. Ever since C-style strings, programmers have learned to view a string as an array of chars. So when we need to do char-string manipulation, we resort to pointer arithmetic when in fact the "correct" and ruby-native way of manipulating strings is with regular expressions. Instead of giving in to this old string-as-array mentality, maybe we should teach people to use regular expressions? Hmmm, probably impossible. A string can be interpreted as both a sequence of bytes or a sequence of characters, but the methods can be confusing. Obviously, upcase and downcase are operations at the character level, but what is [] supposed to do? From the ruby point of view, str[0..3] gives you the first 4 bytes and str[/\A..../] gives you the first 4 characters. But for the majority with the string-as-array mentality, [] is ambiguous; does it give you access to the bytes or to the characters of the string? In the interest of facilitating education, there needs to be a clear disambiguation; instead of str[0..3] it should be str.byte(0..3) and str.char(0..3) -- with maybe the latter one giving a warning along the lines of "use regular expressions!" ;-) That way the ambiguity between byte-vector and char-vector could be resolved. Daniel
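A quick 1.8 sketch of those two accessors -- the names byte and char are from the suggestion above; the bodies are illustrative only and assume $KCODE matches the string's encoding:

  class String
    # operate on raw bytes, whatever the encoding
    def byte(range)
      unpack('C*')[range].pack('C*')
    end
    # operate on characters; under $KCODE, /./m matches one character
    def char(range)
      scan(/./m)[range].join
    end
  end

  $KCODE = 'u'
  "日本語".byte(0..2)  # => "日"   (the first three UTF-8 bytes)
  "日本語".char(0..1)  # => "日本" (the first two characters)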
on 2006-06-26 07:36

Hello Tim, TB> Well, on strings, indexing and substring operations and iterators and TB> regular expressions should (at least optionally) have character TB> rather than byte semantics, right? For UTF-8, which hopefully will rule the world soon, the worst libraries I have seen are trying to do this. But it is not the intention of the designers, and with an implementation that works on characters you lose the genius encoding style of UTF-8. Of course some operations are more difficult, but this is left to the application programmer for good reasons. Only a few cases of string manipulation need special (non-ASCII) character handling.
on 2006-06-26 07:51

On 6/25/06, Charles O Nutter <headius@headius.com> wrote: > One clarification I'd like to add to this: I'm not saying that a ByteArray > needs to be added, but if you're going to treat String as a ByteArray, then > perhaps there should be another type for character vectors? There's no meaningful distinction between the division of ByteArray/String and String/CharString. I do *not* believe that this is a viable option. The *sole* argument in favour is that we could add a CharString to Ruby 1.8 -- but I believe that this would be stampeding us in the wrong direction. Even if CharString < String, there will be problems -- people already note that there are issues with subclasses of the built-in classes. > My only point about the dichotomy between <byte collection treated as a > string> and <character collection treated as a string> is that at some > level, they imply different behaviors, different APIs, different interfaces. > Perhaps the answer is not to change existing Ruby code to use a m17n String > while trying to retain byte array capabilities at the same time...but maybe > it's worth considering that the new behavior warrants a separate type? This is where I disagree with you completely. If I have a String that contains ISO-8859-15 data, it *happens* that s#byte_count and s#length are the same value. It differs with UTF-8 data, but the interpretation of a Character is, at best, a *trait* of the data being stored. I have *really* given this a lot of thought, and I really do think that Matz is right about this and that the people who want Unicode-native strings are wrong. This sort of sucks for JRuby because of problems with Java. But I do not think that Sun made the right decision with Java. If nothing else, they ended up backing a dead "standard" during the initial phases, and have had to hack around it since then. > I know you (matz) want to break as much as possible with the 2.0 release, > but I still don't see the advantage of marrying the "byte array string" and > "char string" types in the same class when separate types and behaviors > would be more logical and break far less. It *isn't* more logical. It doubles the number of required APIs for IO. It *completely* complicates things from that perspective, with little value for the people who have to implement character-oriented data routines. -austin
on 2006-06-26 07:51

On Jun 26, 2006, at 1:24 AM, Daniel DeLorme wrote: > I think the real issue is not technical but rather a problem of > perception and education. Ever since C-style strings, programmers > have learned to view a string as an array of chars. So when we need > to do char-string manipulation, we resort to pointer arithmetic > when in fact the "correct" and ruby-native way of manipulating > strings is with regular expressions. Instead of giving in to this > old string-as-array mentality, maybe we should teach people to use > regular expressions? Hmmm, probably impossible. Regular expressions are a very powerful tool, but they do not describe the entire set of operations one would reasonably want to perform on a string. Or perhaps they do, but in a needlessly complex way. Say I want to get the first letter (character?) of a sentence; in pure regexp terms I'd do this: str.match(/\A./)[0]. It's needlessly cryptic. Note that I'm not trying to make a commentary on whether or not character string/byte string should be separate, just trying to point out that "use regular expressions" shouldn't always be the answer.
on 2006-06-26 08:10

Hi, In message "Re: Unicode roadmap?" on Mon, 26 Jun 2006 13:51:33 +0900, Izidor Jerebic <ij.rubylist@gmail.com> writes: |> |> I am not sure what you expect from the separation, but I doubt separation |> would make the above code "more logical and break far less". | |The above code assumes all file operations return byte arrays. What is |the code when we want to obtain a String of characters? line = File.open(filename, "r", "utf8") {|f| f.gets } |What if there is some $KCODE (or equivalent) setting somewhere in the |program before these lines? What would be the effect of that? I think IO#read shall always return a "binary" string, since its specified length should always be in bytes. Anyway, when in doubt, you can explicitly specify the "binary" encoding. |The problem is the auto-magic encoding handling which is required to |keep text processing as simple as it is now. You can have either |text processing (which handles encodings for us, combines bytes |into characters etc.) or byte processing (which does not). How do we |distinguish between the two modes of operation? By explicitly setting their encoding to "binary", e.g. text = obtain_string_data() text.encoding = "binary" ... |The obvious way is by adding a ByteArray. But maybe there is a better |way... Show me the pseudo code using ByteArray, and I will show you its counterpart using a String with an encoding tag. matz.
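For comparison, the two styles of pseudo code side by side (the method names on both sides are hypothetical):

  # Separate-class style:
  bytes = io.read                # => ByteArray, always
  text  = bytes.decode("utf8")   # explicit conversion step to a String

  # Tagged-string style (m17n):
  text = io.read                 # => String, tagged "binary"
  text.encoding = "utf8"         # reinterpret the same object in place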
on 2006-06-26 08:16

Logan Capaldo wrote: > Regular expressions are a very powerful tool, but they do not describe > the entire set of operations one would reasonably want to perform on a > string. Or perhaps they do but in a needlessly complex way. I want to > get the first letter (character?) of a sentence, in pure regexp terms > I'd do this: str.match(/\A./)[0] It's needlessly cryptic. Note that I'm > not trying to make a commentary on whether or not character string/byte > string should be separate, just trying to point out that "use regular > expressions" shouldn't always be the answer. > irb(main):001:0> "It's needlessly cryptic."[/./] => "I" Not disagreeing, just trying to get more credit for regexes. irb(main):010:0> "It's needlessly cryptic."[/.{17}(.)/, 1] => "r" That's a bit more cryptic.
on 2006-06-26 08:16

On 6/26/06, Tim Bray <tbray@textuality.com> wrote: > pretending you can get from one to the other without explicit > encoding specification or inference is unsound. Um. You're not missing anything -- I'm mocking the API pair that would be required to make this work as certain advocates have suggested. > Pardon me if I'm missing something obvious. -Tim You're not. IO should be done on byte buffers. There's no meaningful and useful distinction between a byte buffer and a string at the most basic level. There's an additional interpretation that's possible at a higher level (giving character-oriented operations), but that in and of itself does not imply a need for a separation of the two concepts. (Indeed, I find myself infuriated in C++ when I have to do something that would work well with std::vector<unsigned char> and I'm actually working with std::string -- or vice versa.) -austin
on 2006-06-26 08:16

On 6/26/06, Tim Bray <tbray@textuality.com> wrote: > On Jun 25, 2006, at 7:21 PM, Yukihiro Matsumoto wrote: > > Are there any > > specific operations that should be in ByteArray but not in String, or > > vise versa? > Well, on strings, indexing and substring operations and iterators and > regular expressions should (at least optionally) have character > rather than byte semantics, right? Another example is encoding- > normalization (combining diacritics, etc) which doesn't apply to byte > arrays. -Tim Those are interpretations of the data underlying the String, though. Nothing says we can't use these sort of operations still, especially with Ruby's dynamic objects. But I *firmly* believe that it can be done in a way so as to not require the separation of a String from a Byte Array. -austin
on 2006-06-26 08:23

On 6/26/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote: > > > > I am not sure what you expect about separation, but I doubt separation > > would make above code to "be more logical and break far less". > Above code assumes all file operations return byte arrays. What is > the code when we want to obtain String of characters? As Tim Bray pointed out in a response to me, trying to get a String from a file is a ludicrous operation. I was mocking the API required (e.g., File#read_string or something equally bozonic). You need to read your data and *then* mark it as a String with a particular encoding. And if you *globally* change the interpretation of File#read to be String, you will be breaking the ability to read truly binary data. > The problem is the auto-magic encoding handling which is required to > have text processing be as simple as it is now. You can have either > text processing (which adds encoding handling for us, combines bytes > in characters etc.) or byte processing (which does not). How do we > distinguish between the two modes of operation? > > The obvious way is by adding a ByteArray. But maybe there is better > way... Yes. It's to actually read what has been suggested. The m17n String won't be a magic bullet. But you'll be able to do something like: bv = File.open("file.txt", "rb") { |f| f.read } sv = bv.with_encoding(:utf8) Or something like that. And you can still do bv == "\xff\xd8\xff\xe1" as appropriate. -austin
on 2006-06-26 08:29

Logan Capaldo wrote: > > Regular expressions are a very powerful tool, but they do not describe > the entire set of operations one would reasonably want to perform on a > string. Or perhaps they do but in a needlessly complex way. I want to > get the first letter (character?) of a sentence, in pure regexp terms > I'd do this: str.match(/\A./)[0] It's needlessly cryptic. Note that I'm > not trying to make a commentary on whether or not character string/byte > string should be separate, just trying to point out that "use regular > expressions" shouldn't always be the answer. It's funny, maybe I'm just dumb but I can't think of a single *real-world* example where you'd want to access particular characters of a string. Why do you want the first char? In the context of a byte string there might be something special at position n (e.g. exif header), but in the context of a human-readable string what is there? For example, if you want that first char in order to check if it's a space or not, you should use str =~ /^ /, etc, etc. I honestly can't think of any real-world examples where regular expressions are less appropriate than pointer arithmetic. Can you illuminate me with some? Daniel
on 2006-06-26 09:00

On 26.6.2006, at 8:08, Yukihiro Matsumoto wrote: > > |What if there is some $KCODE (or equivalent) setting somewhere in the > |program before these lines? What would be the effect of that? > > I think IO#read shall always return a "binary" string, since its > specified length should always be in bytes. Anyway, when in doubt, > you can explicitly specify the "binary" encoding. Oh, I see. So basically IO always returns a ByteArray, and one needs to convert it to a String of characters explicitly (or implicitly by specifying a parameter to IO). No magic tagging with an encoding. Well, this is nice and easy to understand. But how will this influence the simplicity of small programs in Ruby which deal with data in a known (single) encoding? I was under the impression that there would be some magic global setting which would enable such programs to use Strings in the correct encoding. Thank you for the clarifications. They are most welcome... izidor
on 2006-06-26 09:34

Hi, In message "Re: Unicode roadmap?" on Mon, 26 Jun 2006 15:58:30 +0900, Izidor Jerebic <ij.rubylist@gmail.com> writes: |But how will this influence the simplicity of small programs in Ruby |which deal with data in a known (single) encoding? I was under the |impression that there would be some magic global setting which would |enable such programs to use Strings in the correct encoding. The details are not fixed yet, but it would honor the locale for the default encoding. matz.
on 2006-06-26 09:38

On 6/26/06, Daniel DeLorme <dan-ml@dan42.com> wrote: > > It's funny, maybe I'm just dumb but I can't think of a single *real-world* > example where you'd want to access particular characters of a string. Why do you > want the first char? In the context of a byte string there might be something > special at position n (e.g. exif header), but in the context of a human-readable > string what is there? For example, if you want that first char in order to check > if it's a space or not, you should use str =~ /^ /, etc, etc. I honestly can't > think of any real-world examples where regular expressions are less appropriate > than pointer arithmetic. Can you illuminate me with some? Substrings? Finding an occurrence of a string in another string? Why shouldn't str[0..3] work on characters (for a string with an encoding set)? Maybe I want to do something like str[0] = Unicode::upcase(str[0])? :) Isn't that what Humane Interface Design (http://www.martinfowler.com/bliki/HumaneInterface.html) is all about ;-) Regular expressions _are_ cryptic. They are powerful, but do I need a sledgehammer when I need a paperclip?
on 2006-06-26 09:45
Yukihiro Matsumoto wrote: > You said Tcl has Unicode support that works well with you. So that I > think treating all of them in UTF-8 is OK for you. It's actually not about treating everything in UTF-8; it just unifies everything in Tcl in a way that lets you have all varieties of characters in strings. > Then how can it > determine which should be in the current code page, or in Unicode? > Or using Win32 API ending with W could allow you living in the > Unicode? Well, currently (just downloaded latest cvs sources) ruby uses ansi versions of CreateFile and FindFirstFile/FindNextFile APIs, so even if I set, for example, KCODE to UTF-8 (not sure how you can currently make ruby work with UTF-8) ansi versions of APIs are still called, and that means that 1) if there are filenames with characters that don't fall in the range of the current codepage, I will receive '?' in place of them when I enumerate directory contents. 2) I receive filenames in the current code page, and not in UTF-8 3) There is no way for me to open a file with these characters using standard ruby classes The same with the win32ole extension: I can see a lot of ole_wc2mb/ole_mb2wc there, which breaks things horribly when interoperating with, for example, Excel and trying to work with russian/greek/japanese and all other languages all on the same sheet (after I process the sheet, modifying all of the cells, it will just strip all languages except russian from it). In *nixes you can just change your locale to *.UTF-8 and you're ok with that, because everything you receive when enumerating a directory is UTF-8, and File.open will expect UTF-8. Unfortunately, for Windows that is not possible: MS already provides 'wide' versions of APIs for those who need them, and there is no UTF-8 ANSI codepage you can set as default (because the UTF-8 codepage in Windows is somewhat 'virtual', for conversion purposes only). In Tcl you have all of your strings in UTF-8, and when Tcl interoperates with the rest of the world, it converts strings appropriately (for example, on Win9x there are mostly no 'wide' APIs, so it converts strings to the current code page and uses ansi APIs, but on WinNT it converts them to unicode and uses 'wide' APIs). What I was thinking of is maybe a way of setting a "current codepage" for ruby on win32 (including the possibility of setting it to UTF-8), so that when ruby talks to the world it would use 'wide' APIs when possible, converting to and from this codepage (unlike Tcl, where UTF-8 is hard-coded, there would be a possibility to choose), because there is no other way for the user to do that on Windows (the user can't set the current codepage to UTF-8).
on 2006-06-26 09:50
Snaury Miyoto wrote: > Yukihiro Matsumoto wrote: >> Then how can it >> determine which should be in the current code page, or in Unicode? >> Or using Win32 API ending with W could allow you living in the >> Unicode? > Well, currently (just downloaded latest cvs sources) ruby uses ansi > versions of CreateFile and FindFirstFile/FindNextFile APIs, so even if I > set, for example, KCODE to UTF-8 (not sure how you can currently make > ruby work with UTF-8) ansi versions of APIs are still called, and that > means that > The same with the win32ole extension: I can see a lot of ole_wc2mb/ole_mb2wc > there, which breaks things horribly when interoperating with, for > example, Excel and trying to work with russian/greek/japanese and all > other languages all on the same sheet (after I process the sheet, > modifying all of the cells, it will just strip all languages except > russian from it). Ah, well, for ole that's not true; only now did I realize I can set the codepage there to UTF-8. But a similar thing for win32 file IO (and maybe for other things where the win32 API or win32 C runtime is used) would still be great.
on 2006-06-26 10:08

Dmitrii Dimandt wrote: > Substrings? Finding an occurrence of a string in another string? Those operations are precisely what regexes are best at. > shouldn't str[0..3] work on characters (for a string with an encoding > set)? Maybe I want to do something like > str[0] = Unicode::upcase(str[0])? :) What about str.sub!(/^./) { |c| Unicode::upcase(c) }? That hardly seems more cryptic to me. It's not that I don't understand the attraction; it's just that I think when handling char-strings it's best to change your mental model to something further away from char/byte arrays. BTW, if str[0..3] returns the first 4 characters, then how do I get the first 4 bytes? Daniel
on 2006-06-26 14:21

On 6/26/06, Daniel DeLorme <dan-ml@dan42.com> wrote: > > It's funny, maybe I'm just dumb but I can't think of a single *real-world* > example where you'd want to access particular characters of a string. Why do you > want the first char? In the context of a byte string there might be something > special at position n (e.g. exif header), but in the context of a human-readable > string what is there? For example, if you want that first char in order to check > if it's a space or not, you should use str =~ /^ /, etc, etc. I honestly can't > think of any real-world examples where regular expressions are less appropriate > than pointer arithmetic. Can you illuminate me with some? Have you looked at the "short but unique" ruby quiz? Also, when you are building search trees or the like, you want access to letters one by one. Thanks Michal
on 2006-06-26 14:51

On 26-jun-2006, at 3:01, Austin Ziegler wrote: > I suggest you look through the Unicode threads again. You'll find > your statement is untrue. There are a lot of people who (foolishly) > want > Unicode to be the only internal representation of Strings in Ruby. Let's say there are people who not-so-foolishly believe that trying to have strings in all possible encodings is not technically possible, and who don't understand how a system can reliably handle them. Especially since they remember that Strings in Ruby are mutable and can transition from being Unicode to being "something-else" in one method call.
on 2006-06-26 14:55

On 26-jun-2006, at 3:11, Austin Ziegler wrote: > > Stupid, stupid, stupid, stupid. If I have guessed wrong about the > contents of file.txt, I have to rewind and read it again. Better to > *always* read as bytes and then say, "this is actually UTF-8". This > would be as stupid in C++, Java, or C#: Not so fast. Let's say you read from a file: > st = File.open("file.txt", "rb") { |f| f.read(4096) } and you receive a PART of a unicode string (because you cannot know where to stop reading before you look into the structure). The only way to make what you read valid now is to slide along the byte length and try to catch the bytes that you skipped. Should I continue?
on 2006-06-26 14:58

On 26-jun-2006, at 8:27, Daniel DeLorme wrote: > It's funny, maybe I'm just dumb but I can't think of a single *real- > world* example where you'd want to access particular characters of > a string. Well, think again. You have a truncate(text) helper in Rails which truncates the text to X characters and "dot dot dot". The easiest example. Or you have excerpts... etc.
on 2006-06-26 15:07

On 26-jun-2006, at 10:07, Daniel DeLorme wrote: > str.sub!(/^./){ |c| Unicode::upcase(c) } > That hardly seems more cryptic to me. It does seem unnatural, and hints that you are working with an encoding-incapable language, because people who are lucky enough to be in ASCII will be able to do str[0] = str[0].upcase but people who are not will have to invent silly workarounds. > > It's not that I don't understand the attraction; it's just that I > think when handling char-strings it's best to change your mental > model to something further away from char/byte arrays. > > BTW, if str[0..3] returns the first 4 characters, then how do I get > the first 4 bytes? str.bytes[0..3] seems OK to me. That is: for Strings the character-based routines are the base ones, and byte routines are secondary. Not the "chars" accessor I had to bolt on right now. The problem is that you have to PROTECT an ignorant programmer from things like normalization and character unity and NEVER allow him to cut into a character of a multibyte string UNLESS he specifically mentions that he wants it that way.
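That accessor is easy to prototype today (1.8 has no String#bytes; the name follows the usage above, and the body is just a sketch):

  class String
    # a byte-oriented view of the string: a plain array of byte values
    def bytes
      unpack('C*')
    end
  end

  "日本語".bytes[0..3]  # => [230, 151, 165, 230]  (the first four UTF-8 bytes)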
on 2006-06-26 15:29

On 6/26/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote: > > need *two* APIs: > > > st = File.open("file.txt", "rb") { |f| f.read(4096) } > > and you receive a PART of a unicode string (because you cannot know > where to stop reading before you look into the structure). > The only way to make what you read valid now is to slide along the > byte length and try to catch the bytes that you skipped. > Should I continue? Why would you read 4096 bytes in the first place? If you knew the file is in some weird multibyte encoding you should have set it for the stream, and read something meaningful. If it is "ascii compatible" (ISO-8859-*, cp*, utf-8, ...) you can just use gets. Otherwise there is no meaningful string content. Note that 4096 bytes is always OK for UTF-32 (or similar plain wide character encodings), and may at worst get you half of a surrogate character for UTF-16. And strings will have to handle incomplete characters anyway - they may result from some delays/buffering in network IO or such. Thanks Michal
on 2006-06-26 15:42

On 26-jun-2006, at 15:27, Michal Suchanek wrote: > > Why would you read 4096 bytes in the first place? This is a pattern. If a file has no line endings, but is just one (very long) stream of characters - can you really use gets? > > If you knew the file is in some weird multibyte encoding you should > have set it for the stream, and read something meaningful. Or there should be a facility that prevents you from reading incomplete strings. But is it implied that if I set IO.encoding = foo the IO objects will prevent me? Will they go out to the provider of the io and get the missing remaining bytes? In the case of Unicode the absolute, rigorous minimum is to NEVER EVER slice into a codepoint, and it can go anywhere you want in terms of complexity (because slicing between codepoints is also not the way). > > If it is "ascii compatible" (ISO-8859-*, cp*, utf-8, ...) you can > just use gets. > > Otherwise there is no meaningful string content. > > Note that 4096 bytes is always OK for UTF-32 (or similar plain wide > character encodings), Of which UTF-32 is the only one that is relevant for Unicode, and if you investigated the subject a little you would know that slicing Unicode strings at codepoint boundaries is often NOT enough. That way you can cut a part of a compound character, a modifier codepoint or an RTL override remarkably easily, which will just give you a different character altogether (or alter your string display in a particularly nasty way - that is, _reverse_ your string display for the remaining output of your program if you remove an RTL override terminator). > and may at worst get you half of a surrogate > character for UTF-16. And strings will have to handle incomplete > characters anyway - they may result from some delays/buffering in > network IO or such. This is exactly why the notion of having strings both as byte buffers and character vectors seems a little difficult. 90 percent of my use cases for Ruby need characters, not bytes - and I would love to hint it specifically should that be needed. The problem right now is that Ruby does not distinguish these at the moment.
on 2006-06-26 16:36

On 6/26/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote: > > On 26-jun-2006, at 15:27, Michal Suchanek wrote: > > > > Why would you read 4096 bytes in the first place? > This is a pattern. If a file has no line endings, but is just one (very > long) stream of characters - can you really use gets? But can you work with the file in parts then? If there is no meaningful internal structure you have to work with the file in its entirety (or do a block copy, but then you should not be concerned with characters). If there is a structure you may use alternate line endings. > of complexity (because > slicing between codepoints is also not the way). At most you can expect it to hold incomplete codepoints until they are read fully, I guess. However, incomplete codepoints are going to exist anyway, so the strings must deal with them in one way or another. > you would know that slicing Unicode strings at codepoint boundaries > is often NOT enough. That way you can cut a part of > a compound character, a modifier codepoint or an RTL override > remarkably easily, which will just give you a different character > altogether (or alter your string > display in a particularly nasty way - that is, _reverse_ your string > display for the remaining output of your program if you remove an RTL > override terminator). If the file has some meaningful structure (like line endings or XML) you should get the complete parts. If it does not, you have to deal with it. And nobody can do it for you except the one who chose the format in which the file was saved. > problem right now is that Ruby does not distinguish these at the moment. But the problem is you cannot distinguish them, not that you do not have separate classes for them. Michal
on 2006-06-26 16:52

On 6/26/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote: > On 26-jun-2006, at 3:11, Austin Ziegler wrote: > > st = File.open("file.txt", "rb") { |f| f.read(4096) } > and you receive a PART of a unicode string (because you cannot know > where to stop reading before you look into the structure). > The only way to make what you read valid now is to slide along the > byte length and try to catch the bytes that you skipped. > Should I continue? Sure. It won't make you any more correct. Let's play with your example: st = File.open("file.txt", "rb", :encoding => :utf8) { |f| f.read(4096) } Okay. Am I reading 4096 bytes or 4096 characters? The *correct* and *least surprising* behaviour is to read the specified number of bytes. Instead it would be better to expose the minimum amount required to work with this: bv = File.open("file.txt", "rb") { |f| f.read(4096) } bv.encoding = :utf8 bv.encoding_valid? # will return false if the whole string isn't a valid UTF-8 sequence. You're really looking for something that is, in the end, completely unworkable and unnecessarily complex in doing so. The m17n String -- with byte vector characteristics retained -- maintains a clear, simple API with few exceptions that would have to be memorised or understood. Adding another class *doubles* the size of the class hierarchy that has to be understood, and if there are *any* variances between them the number of exceptions effectively doubles. If there *aren't* any variances between the class APIs, then what's the point of separating them in the first place? A string is an ordered sequence of characters. A byte vector is an ordered sequence of bytes. If your string is suitably flexible, then it can say that a byte vector is a string where each character is one byte long and that collation (etc.) are determined by the byte value. We're not talking rocket science here. Stop trying to make it such. -austin
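For UTF-8 at least, #encoding_valid? needs nothing exotic -- a byte-oriented check suffices. A sketch using the well-known UTF-8 validation pattern (the method name here only mirrors the hypothetical API above):

  class String
    UTF8_PATTERN = /\A(?:
        [\x00-\x7F]                       # ASCII
      | [\xC2-\xDF][\x80-\xBF]            # 2-byte sequences
      | \xE0[\xA0-\xBF][\x80-\xBF]        # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
      | \xED[\x80-\x9F][\x80-\xBF]        # excluding UTF-16 surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}     # 4-byte, U+10000 and up
      | [\xF1-\xF3][\x80-\xBF]{3}
      | \xF4[\x80-\x8F][\x80-\xBF]{2}     # up to U+10FFFF
    )*\z/xn

    def encoding_valid_utf8?
      !!(self =~ UTF8_PATTERN)
    end
  end

  "abc".encoding_valid_utf8?       # => true
  "\xff\xd8".encoding_valid_utf8?  # => false (a JPEG header is not UTF-8)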
on 2006-06-26 17:05

On 6/26/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote: > On 26-jun-2006, at 15:27, Michal Suchanek wrote: >> Why would you read 4096 bytes in the first place? > This is a pattern. If a file has no line endings, but is just one (very > long) stream of characters - can you really use gets? >> If you knew the file is in some weird multibyte encoding you should >> have set it for the stream, and read something meaningful. > Or there should be a facility that prevents you from reading > incomplete strings. But is it implied that if I set IO.encoding = foo > the IO objects will prevent me? Will they go out to the provider of > the io and get the missing remaining bytes? In the case of Unicode the > absolute, rigorous minimum is to NEVER EVER slice into a codepoint, > and it can go anywhere you want in terms of complexity (because > slicing between codepoints is also not the way). Anyone who wants to set all IO operations to a particular encoding is making a huge mistake. Individual IO operations or handles could be set to a particular encoding, but you would have a high probability of breaking code external to you that did any IO operations if you forced all IO to use your encodings. > you can cut a part of a compound character, a modifier codepoint or an > RTL override remarkably easily, which will just give you a different > character altogether (or alter your string display in a particularly > nasty way - that is, _reverse_ your string display for the remaining > output of your program if you remove an RTL override terminator). Oh, I understand that very well. At least as well as you do. However, that is independent of whether IO works on encoded or unencoded values. It's easy enough to check the validity of your encoding, too. If you're not checking external input for taintedness, then you're doing silly things, too. One *cannot* hide too much of the complexity of Unicode, because to do so will increase the chance that programmers not as smart as you are will, well, screw the pooch royally. >> and may at worst get you half of a surrogate character for UTF-16. >> And strings will have to handle incomplete characters anyway - they >> may result from some delays/buffering in network IO or such. > This is exactly why the notion of having strings both as byte buffers > and character vectors seems a little difficult. 90 percent of my use > cases for Ruby need characters, not bytes - and I would love to hint > it specifically should that be needed. The problem right now is that > Ruby does not distinguish these at the moment. Yes, and that's where your opposition to maintaining this is persistently misguided. Ruby *will* distinguish between a String without an encoding and a String with an encoding. You're basing your opposition to tomorrow's behaviour on today's (known bad) behaviour. Please, stop doing that. And while most of your use cases deal with characters, code that I've written deals with both bytes and characters in equal measures. -austin
on 2006-06-26 17:08

On 6/19/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |- at present time Ruby parser can parse only sources in ASCII compatible
> |encoding. Would it change?
>
> No. Ruby would not allow scripts in EBCDIC, nor UTF-16, although it
> allows processing of those encodings.

And what about the minilanguages incorporated in Ruby: regexp patterns,
sprintf, strftime patterns etc.?

Regexp syntax uses several metachars ( []{}()+-*?.\: ) and latin
letters - lower and upper. But there are charsets/encodings which don't
have some of them, e.g.: GB_2312-80 has none of them, JIS_X0201 doesn't
have backslash, ebcdic-cp-ar1 doesn't have backslash or square and
curly brackets. So, regexp patterns can't be constructed for these
charsets/encodings.
on 2006-06-26 17:54
I've been following this debate with some interest. Alas, since my
unicode/m17n experience is quite limited, I don't have a strong opinion
in the matter. But the following caught my eye:

Austin Ziegler wrote:
> [...] Ruby *will* distinguish between a String without
> an encoding and a String with an encoding. You're basing your opposition
> to tomorrow's behaviour based on today's (known bad) behaviour.

Part of the problem is that we are basing our discussions on
descriptions of what will happen in the future, but that makes it
difficult to understand the issues involved without real code. What I
would like to see is prototype implementations of both approaches, and
see the differences in how they affect the code.

I'm more interested in answering questions like "How do I safely
concatenate strings with potentially different encodings" and "How do I
do I/O with encoded strings" rather than addressing efficiency
questions. In other words, how do the different approaches affect the
way I write code.

I think it would be a great idea to prototype these ideas in real code
to understand the advantages and disadvantages of each.

-- Jim Weirich
on 2006-06-26 18:00

On Monday 26 June 2006 11:54 am, Jim Weirich wrote: > I think it would be a great idea to prototype these ideas in real > code to understand the advantages and disadvantages of each. +1^2
on 2006-06-26 18:49

On Jun 26, 2006, at 2:13 AM, Joel VanderWerf wrote:
>
It's funny I'm always forgetting you can index by regexp. But this
brings up a good point, this is Ruby, with the new Hash / named
argument syntax we can do:
"It's needlessly cryptic."[byte:2]
This doesn't add anything at all to the conversation, but I think it
looks good, and it's in the "make similar things look similar" vein.
Indexing Strings
s[0] # The first character
s[/./] # The first character
s[byte:0] # The first byte (of a string with some non ascii
compatible encoding)
on 2006-06-26 19:01

On 6/26/06, Jim Weirich <jim@weirichhouse.org> wrote: > descriptions of what will happen in the future, but that makes it > I think it would be a great idea to prototype these ideas in real code > to understand the advantages and disadvantages of each. I mostly agree with you here (about prototyping), Jim. There are a few things that I think can be done without working code. I often start from this point in my own programs, anyway. I'll try to address each of your questions as I understand them. Hopefully, Matz or other participants will step in and correct me where I'm wrong. Before I get started, there are two orthogonal divisions here. The first division is about the internal representation of a String. There is a camp that very strongly believes that some Unicode encoding is the only right way to internally represent String data. Sort of like Java's String without the mistake of char being UCS-2. The other camp strongly believes that forcing a single universal encoding is a mistake for a variety of reasons and would rather have an unencoded internal representation with an interpretive encoding tag available. These two camps can be referred to as UnicodeString and m17nString. I think that I can be safely classified as in the m17nString camp -- but there are caveats to that which I will address in a moment. The second division is about the suitability of a String as a ByteVector. Some folks believe that the twain should never meet, others believe that there's little to meaningfully distinguish them in practice and that the resulting API would be unnecessarily complex. I can safely be classified in the latter camp. There is an open question about the resulting String class about how well it will work with various arcane features of Unicode such as combining characters, RTL/LTR marks, etc. and these are good questions. Ultimately, I believe that the answer is that it should support them as transparently as possible without (a) hiding *too* much and (b) compromising support for multiple encodings. Your first question: How do I safely concatenate strings with potentially different encodings? This deals with the first division. Under the UnicodeString camp, you would *always* be able to safely concatenate strings because they never have a separate encoding. All incoming data would have to be classified as binary or character data and the character data would have to be converted from its incoming code page to the internal representation. Under the m17nString camp, Matz has promised that compatible encodings would work transparently. I have gone a little further and suggested that we have a conversion mechanism similar to #coerce for Number values. I could then combine text from Win1252 and SJIS to get a Unicode result. Or, if I knew that my target could *only* handle SJIS, I would force that to result in an error. Your second question: How do I do I/O with encoded strings? This also sort of deals with the first, but it also deals with the second. Note, by the way, that the UnicodeString camp would *require* a completely separate ByteArray class because you could not then read a JPEG into a String -- its values would be converted to Unicode representations, rendering it unusable as a JPEG. The two class (String/ByteArray) camp would probably require that you either (1) change all IO operations using a pragma-style setting to encoded strings, (2) change individual IO operations, (3) use a separate API, or (4) read a ByteArray and *convert* it to a UnicodeString. 
Either way, they seem to want an API where they can say "read this IO
and give me a UnicodeString as output" and conversely "read this IO and
give me a ByteArray as output." (Note: this could apply whether we have
a UnicodeString or an m17nString -- but the requests have come most
often from UnicodeString supporters.)

The one class camp keeps file IO as it is. You can "encourage" a
particular encoding with a variant of #2:

  d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }

  d2 = File.open("file.txt", "rb") { |f|
    f.encoding = :utf8
    f.read
  }

However, whether you use an encoding or not, you still get a String
back. Consider:

  s1 = File.open("file.txt", "rb") { |f| f.read }
  s2 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }

  s1.class == s2.class        # true
  s1.encoding == s2.encoding  # false

But that doesn't mean I have to keep treating s1 as a raw data byte
array -- or even convert it.

  s1.encoding = :utf8
  s1.encoding == s2.encoding  # true

I think that the fundamental difference here is whether you view
encoded strings as fundamentally different objects, or whether you view
the encodings as *lenses* on how to interpret the object data. I prefer
the latter view.

-austin
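(To make the #coerce analogy concrete: a toy model, not a real or
proposed API -- every name here, EncodedString included, is invented
for illustration. It leans on the 1.8 iconv standard library for the
actual recoding:)

  require 'iconv'

  # Toy model: a byte string plus an encoding tag, with Numeric#coerce-
  # style reconciliation before combining.
  class EncodedString
    attr_reader :bytes, :encoding

    def initialize(bytes, encoding)
      @bytes, @encoding = bytes, encoding
    end

    # Recode the underlying bytes into another encoding.
    def recode(enc)
      EncodedString.new(Iconv.iconv(enc, @encoding, @bytes).first, enc)
    end

    # Bring both operands to a common encoding (here, naively, UTF-8).
    def coerce(other)
      return [self, other] if @encoding == other.encoding
      [recode('UTF-8'), other.recode('UTF-8')]
    end

    def +(other)
      a, b = coerce(other)
      EncodedString.new(a.bytes + b.bytes, a.encoding)
    end
  end

  latin1 = EncodedString.new("gr\xfc\xdfe", 'ISO-8859-1') # "grüße"
  utf8   = EncodedString.new(" welt", 'UTF-8')
  (latin1 + utf8).encoding # => "UTF-8"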
on 2006-06-26 19:04

On 6/26/06, Logan Capaldo <logancapaldo@gmail.com> wrote: > > compatible encoding) I kinda like that. -austin
on 2006-06-26 19:45

"Austin Ziegler" <halostatue@gmail.com> writes: > I would much rather keep the API -- and the class library -- simple. I > would rather do this: > > st = File.open("file.txt", "rb", :encoding => :utf8) { |f| f.read } > > or > > bv = File.open("file.txt", "rb") { |f| f.read } > st = bv.to_encoding(:utf8) Partly off-topic, but important nevertheless: *Then* it's the right time to drop that damn "rb" by making it default and let the people stuck in the \r\n-age use :encoding => "win-ansi" or "dos" or whatever.
on 2006-06-26 19:49

On 6/26/06, Christian Neukirchen <chneukirchen@gmail.com> wrote: > > Partly off-topic, but important nevertheless: *Then* it's the right > time to drop that damn "rb" by making it default and let the people > stuck in the \r\n-age use :encoding => "win-ansi" or "dos" or whatever. Oh, please, yes. I get tired of libraries breaking because people don't use "rb" and I'm on Windows. -austin
on 2006-06-26 20:39

On 6/26/06, Austin Ziegler <halostatue@gmail.com> wrote:
> On 6/26/06, Jim Weirich <jim@weirichhouse.org> wrote:
> caveats to that which I will address in a moment.

Note that a fixed-encoding UnicodeString has several caveats:

- you have only one encoding, and while it may be optimal in some
  respects it may be suboptimal in others. This leads to a split among
  UnicodeString supporters about which encoding to choose. m17n solves
  this neatly by allowing you to choose the encoding for every
  application at least.
  - utf-8: most likely encountered on io (especially network) = fewer
    conversions. Space efficient for languages using Latin script
  - utf-16: sometimes encountered on io (file names on certain
    systems). Space efficient for most(?) other languages
  - utf-32: fast indexing/slicing. Generally easier manipulation (but
    only inside the string class)
- you cannot use a non-unicode encoding, or even have both unicode and
  non-unicode (with characters outside of unicode) strings, without
  changing the interpreter incompatibly

Another subdivision exists among the m17n camp about what strings are
compatible. The behavior in some other languages (which some find
unfortunate) is that strings with different encodings are incompatible
(ie operations on two strings always have to take strings with the same
encoding). In Matz's current proposal the only improvement over this is
allowing adding a 7-bit ascii string to strings where this makes sense
(ie. to ISO-8859-[12], cp85[02], utf-8). The other position is to make
strings coerce themselves automatically if a lossless conversion exists
(ie cp1251, cp852, and iso-8859-2 should be the same set of characters
ordered differently iirc, and most character sets can be safely
converted to utf-8). I could count myself into the autoconversion camp.

Yet another subdivision is about the exact meaning of string.encoding =
:utf8. It can either just change the tag or check that the string is
indeed a valid utf-8 character sequence. Matz thinks that without
checking autoconversion would be too unreliable. I think that checking
would be good for debugging or when one wants to be paranoid. But the
ability to turn it off when I think (or find out) that my application
spends lots of time checking needlessly could be handy.

> Ultimately, I believe that the answer is that it should support them as
> have a separate encoding. All incoming data would have to be classified
> as binary or character data and the character data would have to be
> converted from its incoming code page to the internal representation.
>
> Under the m17nString camp, Matz has promised that compatible encodings
> would work transparently. I have gone a little further and suggested
> that we have a conversion mechanism similar to #coerce for Number
> values. I could then combine text from Win1252 and SJIS to get a
> Unicode result. Or, if I knew that my target could *only* handle SJIS, I
> would force that to result in an error.

The answer also depends on what strings are compatible. If most strings
are incompatible, you would convert all strings and other data
structures you get from IO or external libraries to your chosen
encoding, and you will only concatenate strings with the same encoding.
With autoconversion it will just work most of the time (ie when you
work with strings that can be converted to unicode). Writing to streams
that do not support all unicode characters is going to be a problem
most of the time (when you do not work in the output encoding).
Unless write attempts the conversion first, and only fails when there
are non-convertible characters.

> Your second question:
>
> How do I do I/O with encoded strings?
> ...
> However, whether you use an encoding or not, you still get a String
>
>   s1.encoding = :utf8
>   s1.encoding == s2.encoding # true
>
> I think that the fundamental difference here is whether you view encoded
> strings as fundamentally different objects, or whether you view the
> encodings as *lenses* on how to interpret the object data. I prefer the
> latter view.

If you consider

  s3 = File.open('legacy.txt', 'rb', :iso885915) { |f| f.read }

without autoconversion you would have to immediately do

  s3.recode :utf8

otherwise s1 + s3 would not work. The same for stuff you get from
database queries (unless you are sure you always get the right
encoding), text you get from the web, emails, third party libraries,
etc.

Thanks

Michal
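(For reference, that "immediately recode" step looks like this with the
tools available today -- iconv from the 1.8 standard library;
'legacy.txt' and the encodings are just the example's assumptions:)

  require 'iconv'

  s3 = File.open('legacy.txt', 'rb') { |f| f.read }
  # recode the ISO-8859-15 bytes before mixing them with UTF-8 data
  s3 = Iconv.iconv('UTF-8', 'ISO-8859-15', s3).first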
on 2006-06-26 21:28

On 26.6.2006, at 20:37, Michal Suchanek wrote:
>> array -- or even convert it.
>
> If you consider
>   s3 = File.open('legacy.txt', 'rb', :iso885915) { |f| f.read }
> without autoconversion you would have to immediately do
>   s3.recode :utf8
> otherwise s1 + s3 would not work.

Yes. This shows that if there is no autoconversion, the programmer will
always need to recode to a common app encoding if the application is to
work without problems. And if we always need to recode strings which we
receive from third-party classes/libraries, encoding handling will
either consume half of the program lines or people won't do it and
programs will be full of errors. As can be seen from the experience of
other languages (and Ruby), the second option will prevail and we will
be in a mess not much better than today.

Therefore m17n without autoconversion (as is Matz's current proposal)
gains us almost nothing. If we have no autoconversion, my vote goes to
Unicode internal encoding (because it implicitly handles autoconversion
problems).

On the topic of ByteArray: my concern is that the distinction between
bytes and characters will not be clear and therefore we need to
introduce ByteArray to separate bytes from characters, to ensure
reliability and predictability of code like

  result = File.open("file") { |f| f.read 1000 }

(now tell me what 'result' is?). If there will be clear and simple
rules, such as "IO always returns binary strings if not given an
encoding parameter", then this distinction will not need to be
additionally enforced by separating classes. One String class will do.
On the other hand, if there will be all kinds of automatic encoding
tagging for the convenience of simple-script-writers, then we need
ByteArray to prevent error-prone code with undefined results.

izidor
on 2006-06-26 21:47

On 6/26/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> of other languages (and Ruby), the second option will prevail and we
> will be in a mess not much better than today.

I doubt this is in the least bit true. The real problem is that you're
trying to suggest a doomsday scenario based on what currently exists
and emotion. I'm saying that your cure is far worse than the disease.

> Therefore m17n without autoconversion (as is Matz's current proposal)
> gains us almost nothing. If we have no autoconversion, my vote goes to
> Unicode internal encoding (because it implicitly handles
> autoconversion problems).

So does the coercion proposal that I've made, without locking ourselves
into Unicode. If I have a thousand files that are Mojikyo-encoded, it
becomes very inefficient for me to work with them in Unicode and far
easier to work with Mojikyo directly.

I couldn't make sense of your last paragraph.

-austin
on 2006-06-26 22:21

On 26.6.2006, at 21:46, Austin Ziegler wrote:
>> and programs will be full of errors. As can be seen from experience
>> of other languages (and Ruby), the second option will prevail and we
>> will be in a mess not much better than today.
>
> I doubt this is in the least bit true.
> I'm saying that your cure is far worse than the disease.

Basically, I am just advocating to get autoconversion into the
"official" proposal. I am not proposing unicode. But if there is no
autoconversion, unicode is better. This claim is supposed to get
support for autoconversion :-)

BTW, you may have no problems at all. We, on the other hand, have lots
of problems (in Ruby and other languages) which can be traced to
exactly this hope of "all programmers will be doing lots of manual work
to make things safe for others". You are deluded. In environments which
already have this cure (internal unicode), there are no such enormous
problems as we experience in those without this cure. So the successes
and failures I describe are based on real experience. Unlike your
claims, which are just opinions.

I am not saying that unicode encoding is the ideal solution. But it
turned out to be quite a good one, and for sure much better than manual
checking/changing of encoding.

>> Therefore m17n without autoconversion (as is Matz's current proposal)
>> gains us almost nothing. If we have no autoconversion, my vote goes to
>> Unicode internal encoding (because it implicitly handles
>> autoconversion problems).
>
> So does the coercion proposal that I've made without locking ourselves
> into Unicode.

But that is your proposal (and mine and several others'), not Matz's.
The current "official" proposal will make a mess.

> I couldn't make sense of your last paragraph.

Well, tell me what exactly do I get when this code executes:

  result = File.open( "file ) { |f| f.read( 1000 ) }

What is 'result'? A binary string under all circumstances? Or maybe
sometimes I get a String and sometimes I get a binary String? Which one
under what circumstances? This is called error-prone code with
undefined results.

We have two equally good options:
1. If we change the API and IO returns ByteArray, we have no confusion.
2. If we have clear and simple rules about IO returning Strings, we
   also have no confusion.

Therefore, if there will be complex auto-magic String tagging with
encoding, I prefer introducing ByteArray, because it will prevent
errors.

izidor
on 2006-06-26 22:49

On 6/26/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote: > On 26.6.2006, at 21:46, Austin Ziegler wrote: > > I doubt this is in the least bit true. > > I'm saying that your cure is far worse than disease. > BTW, you may have no problems at all. We, on the other hand, have > lots of problems (in Ruby and other languages) which can be traced to > exactly this hope of "all programmers will be doing lots of manual > work to make things safe for others". You are deluded. Um. Not what I'm saying. I want as much clean autoconversion as possible without being forced into it. But much *more* than that, I want an API that works reasonably well with all sorts of encodings. I want String#[] to work equally well with Mojikyo, ASCII, ISO-8859-12, and UTF-8. > In environments which already have this cure (internal unicode), > there are no such enormous problems as we experience in those without > this cure. So sucessess and failures I describe are based on real > experience. Unlike your claims, which are just opinions. No, they're not just opinions. They're experiences that I've had with real situations as well where we have a hard time dealing with autoconversion. Stupid automatic behaviour is worse than manual behaviour *every time*. > > I couldn't make sense of your last paragraph. > Well, tell me what exactly do I get when this code executes: > > result = File.open( "file ) { |f| f.read( 1000 ) } Aside from a syntax error from your missing quote? ;) This would probably be an unencoded String. If you want an encoded String, you would specify it on the File object either during construction or afterwards. The need for ByteArray is nonexistent. -austin
on 2006-06-26 22:56

On 6/26/06, Austin Ziegler <halostatue@gmail.com> wrote: > > So does the coersion proposal that I've made without locking ourselves > into Unicode. If I have a thousand files that are Mojikyo-encoded, it > becomes very inefficient for me to work with it in Unicode and far > easier to work with Mojikyo directly. > Perhaps this debate should be weighing those encodings that could not reasonably (or perhaps, easily) be represented in a pure-unicode String versus those that could. Would it be reasonable to say that if 90% of Ruby users would never have a pressing need for a non-unicode-encodable String, then an uber-String that's entirely encoding-agnostic would be better written as an extension for those special cases? Do we really need to encumber all of Ruby for the needs of a relative few?
on 2006-06-26 23:03
Austin Ziegler wrote:
> Um. Not what I'm saying. I want as much clean autoconversion as [...]
Clarification question: When you say autoconversion, do you mean:
(A) Automatically convert input strings to a given encoding (independent
of the question of a single vs multiple encodings).
(B) When combining strings, autoconvert incompatible encodings into
compatible encodings before combining.
I was thinking you meant (B), but I get the impression that Austin is
replying to (A) (since Austin's coerce suggestion sounds a lot like
(B)).
Thanks.
-- Jim Weirich
on 2006-06-26 23:16

On 26.6.2006, at 22:46, Austin Ziegler wrote: > This would probably be an unencoded String. If you want an encoded > String, you would specify it on the File object either during > construction or afterwards. This seems too good to be true :-) How will e.g. Japanese (or we non-English Europeans), which now use default $KCODE, write their Ruby scripts? Will we need to specify encoding in every script for every IO? This can get cumbersome very fast. Not really Ruby style. But if there will be some default encoding, it will interfere with said rules about return values. And that may cause errors when I run script meant for some other default encoding. This problem makes me think that rules won't be so simple as described now (actually, Matz said that this detail is not fixed yet). We'll see. I have just voiced my concerns about separation between bytes and characters. Must wait for the master to present solution (and hope he considers these problems)... izidor
on 2006-06-26 23:20
Thanks for the response, Austin. It seemed to help clarify the issues
(at least for me).

Austin Ziegler wrote:
> d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }

Question: Does the encoding parameter specify the encoding of the file,
or the encoding of the strings you get back (my guess is both)?

Related question: In environments that use a lot of different
encodings, are there ways or conventions for specifying the encoding,
or do you just have to "know"?

> s1.encoding = :utf8

Another question: When you set the encoding, are you:

(A) Just changing the encoding specifier without changing the
    underlying string.
(B) Re-encoding the string according to the new encoding specifier.

(B) seems to be implied by the attribute notation, but that seems a bit
dangerous in my mind.

Thanks.

-- Jim Weirich
on 2006-06-26 23:22

On 26.6.2006, at 23:04, Jim Weirich wrote: > Clarification question: When you say autoconversion, do you mean: > > (A) Automatically convert input strings to a given encoding > (independent > of the question of a single vs multiple encodings). > > (B) When combining strings, autoconvert incompatible encodings into > compatible encodings before combining. Autoconversion (as suggested by many people in this thread) is meant to convert string in *compatible but different* encoding to the encoding of other string (or common compatible superset encoding), to facilitate the operation using those two strings. Point A is the great can of worms and source of errors, which I suggested can be avoided by either: 1. Very simple and strict rules on String encoding of return values 2. Introduction of ByteArray as return values izidor
on 2006-06-26 23:25

On 26.6.2006, at 22:55, Charles O Nutter wrote:
> reasonably (or perhaps, easily) be represented in a pure-unicode String
> versus those that could. Would it be reasonable to say that if 90% of Ruby
> users would never have a pressing need for a non-unicode-encodable String,
> then an uber-String that's entirely encoding-agnostic would be better
> written as an extension for those special cases?

Ahem, no. 100% of Ruby language creators say that they need something
better than Unicode :-)

And if we get both unicode and other stuff, there is no point in
discussing it, no?

Provided we get autoconversion, of course.

izidor
on 2006-06-26 23:44

On 6/26/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> Ahem, no.
> 100% of Ruby language creators say that they need something better
> than Unicode :-)
>
> And if we get both unicode and other stuff, there is no point in
> discussing it, no?
>
> Provided we get autoconversion, of course.

All due respect to matz and company and the wondrous thing they have
wrought, but *nobody* is perfect. Accepting a decision blindly based on
who is making it is a recipe for trouble. My only concern is that while
the proposed m17n implementation may make Ruby more perfect and more
ideal for at least one person, it may (emphasis on 'may') make it
harder for many thousands of others. Does that make sense? I'm sure
there will be those who argue that Ruby is matz's creation and matz's
creation alone, but there's a lot of people with a vested interest in
"the Ruby way". A little critical analysis of the "benevolent
dictator's" decisions is always prudent.

If we get unicode and it's a lot harder than people like, or if it
causes unpleasant compatibility, portability, or interoperability
issues, then we're no better off.

Hey, the uber-string m17n impl might be the most amazing, remarkable
thing ever to come along. It just seems based on a lot of anecdotal
evidence that this approach is very complex and very dangerous, and
arguably has never been done right yet. matz and company are amazing
hackers, but is it a good risk to take? Is it worth it for 10% of Ruby
users or less?

And again, I mean no disrespect by questioning the Ruby elders. It's
just my way.
on 2006-06-26 23:47

On 6/26/06, Charles O Nutter <headius@headius.com> wrote:
> written as an extension for those special cases? Do we really need to
> encumber all of Ruby for the needs of a relative few?

I do not believe that this is a viable argument for "killing". At best,
this is an argument for making sure that Unicode support *rocks* in
Ruby. It doesn't mean we need to make those "special" cases harder than
they need to be.

-austin
on 2006-06-26 23:48

On Jun 26, 2006, at 2:15 PM, Izidor Jerebic wrote: > How will e.g. Japanese (or we non-English Europeans), which now use > default $KCODE, write their Ruby scripts? Will we need to specify > encoding in every script for every IO? This can get cumbersome very > fast. Not really Ruby style. I think that anyone, living in any country, working in any language, who counts on one global variable to specify the encoding of any file they might want to read, will very soon have lots of nasty surprises. Ten years ago, you could do this; no longer. -Tim
on 2006-06-26 23:54

On 6/26/06, Jim Weirich <jim@weirichhouse.org> wrote:
> Thanks for the response, Austin. It seemed to help clarify the issues
> (at least for me).
>
> Austin Ziegler wrote:
> > d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }
> Question: Does the encoding parameter specify the encoding of the file,
> or the encoding of the strings you get back (my guess is both).

I would assume both, based on what I've seen from Matz.

> Related question: In environments that use a lot of different encodings,
> are there ways or conventions for specifying the encoding, or do you
> just have to "know".

In my experience, you just have to "know" unless you can do some
detection of the encoding. I think that only UTF-16 or UTF-32 is really
amenable to this ;) This is one of the problems that I've seen with the
encoding work that I've done. If I'm reading a list of files from a
NetWare server, what encoding is the data in? I don't necessarily have
a Unicode interface -- and my code page may not match the server's code
page. *Whenever* you're dealing with legacy data, you have to "agree"
or guess and hope you're right.

>> s1.encoding = :utf8
> Another Question: When you set the encoding, are you:
>
> (A) Just changing the encoding specifier without changing the
>     underlying string.
> (B) Re-encoding the string according to the new encoding specifier.
>
> (B) seems to be implied by the attribute notation, but that seems a bit
> dangerous in my mind.

I personally consider it to be (A) because I believe that encoding is a
lens. If you want (B) it should be s1.recode(:utf8). But #recode would
not work on an encoding of "binary" (or "raw"); #recode would be
similar to the Iconv steps you would use today.

-austin
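(The difference between (A) and (B) is easy to see with today's iconv:
retagging would leave the bytes alone, while recoding rewrites them --
here "café" goes from one byte per character to a two-byte é:)

  require 'iconv'

  s = "caf\xe9"  # "café" as ISO-8859-1 bytes
  recoded = Iconv.iconv('UTF-8', 'ISO-8859-1', s).first

  s.length        # => 4 (option A, tagging, would keep these bytes)
  recoded.length  # => 5 (option B: é becomes "\xc3\xa9" in UTF-8)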
on 2006-06-27 01:52

On 26-Jun-06, at 1:03 PM, Austin Ziegler wrote:
>> argument syntax we can do:
>> s[byte:0] # The first byte (of a string with some non ascii
>> compatible encoding)
>
> I kinda like that.

Presumably this is general arm waving, because s[/./] need not return
the first character of a non-empty string, unless you mean s[/./m] or
some uglier alternative:

  ratdog:~ mike$ irb --simple-prompt
  >> "\nx"[/./]
  => "x"
  >> "\nx"[/./m]
  => "\n"

Mike

--
Mike Stok <mike@stok.ca>
http://www.stok.ca/~mike/
The "`Stok' disclaimers" apply.
on 2006-06-27 01:55

On 6/27/06, Austin Ziegler <halostatue@gmail.com> wrote: > construction or afterwards. > > The need for ByteArray is nonexistent. ..or, to put that another way, when you see "unencoded String", feel free to say "ByteArray" in your head. ;D
on 2006-06-27 03:59

On 6/26/06, Mike Stok <mike@stok.ca> wrote: > >> s[0] # The first character > >> s[/./] # The first character > >> s[byte:0] # The first byte (of a string with some non ascii > >> compatible encoding) > > I kinda like that. > Presumably this is general arm waving, because s[/./] need not return > the first character of a non-empty string, unless you mean s[/./m] or > some uglier alternative I'm referring to s[byte: 0]. It's elegant. -austin
on 2006-06-27 03:59

On 6/26/06, Daniel Baird <danielbaird@gmail.com> wrote: > > The need for ByteArray is nonexistent. > ..or, to put that another way, when you see "unencoded String", feel free to > say "ByteArray" in your head. There's a point where you're right. But there's a point where you're wrong. My point is simply that we don't need a separate class for this, because character encodings are *ways* of interpreting a vector of bytes. -austin
on 2006-06-27 04:18

On Jun 26, 2006, at 9:57 PM, Austin Ziegler wrote:
> I'm referring to s[byte: 0]. It's elegant.
It seems a bit weighty. It requires the allocation of a Hash simply
to index a byte vector.
s.byte(0)
seems just as readable without the overhead.
Gary Wright
on 2006-06-27 04:21

Charles O Nutter wrote: > Hey, the uber-string m17n impl might be the most amazing, remarkable thing > ever to come along. It just seems based on a lot of anecdotal evidence that > this approach is very complex and very dangerous, and arguably has never > been done right yet. matz and company are amazing hackers, but is it a good > risk to take? Is it worth it for 10% of Ruby users or less? I'd like to point out that MySQL has m17n strings, and it rocks. Daniel
on 2006-06-27 06:05

On Jun 26, 2006, at 10:16 PM, gwtmp01@mac.com wrote:
>
> Gary Wright

**Must defend random syntax that I invented ;-)** It only has to
allocate a hash depending on the named argument interface, e.g.

  # not real ruby syntax, afaik
  def [](char_index = nil, byte: nil)
    ...
  end
on 2006-06-27 08:32

On Jun 26, 2006, at 7:20 PM, Daniel DeLorme wrote:
> I'd like to point out that MySQL has m17n strings, and it rocks.
I am often unable to get Unicode strings from Perl into MySQL and
back out without breaking them. Haven't tried the Ruby/MySQL combo;
does it work better? -Tim
on 2006-06-27 09:46

Hi, In message "Re: Unicode roadmap?" on Tue, 27 Jun 2006 06:52:14 +0900, "Austin Ziegler" <halostatue@gmail.com> writes: |> Austin Ziegler wrote: |> > d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read } |> Question: Does the encoding parameter specify the encoding of the file, |> or the encoding of the strings you get back (my guess is both). | |I would assume both, based on what I've seen from Matz. I think so. |> Another Question: When you set the encoding, are you: |> |> (A) Just changing the encoding specifier without changing the |> underlaying string. |> (B) Re-encoding the string according to the new encoding specifier. | |> (B) seems to be implied by the attribute notation, but that seems a bit |> dangerous in my mind. | |I personally consider it to be (A) because I believe that encoding is |a lens. If you want (B) it should be s1.recode(:utf8). But #recode |would not work on an encoding of "binary" (or "raw"); #recode would be |similar to the Iconv steps you would use today. str.encoding="ascii" would cause (A). matz.
on 2006-06-27 10:05

Hi, In message "Re: Unicode roadmap?" on Tue, 27 Jun 2006 00:05:22 +0900, "Dmitry Severin" <dmitry.severin@gmail.com> writes: |And what about minilanguages, incorporated in Ruby: regexp patterns, |sprintf, strftime patterns etc.? Good point. Currently they don't support non ASCII compatible encoding (including UTF-16 and UTF-32, but this is not fundamental restriction). matz.
on 2006-06-27 10:24

Hi, In message "Re: Unicode roadmap?" on Tue, 27 Jun 2006 06:43:30 +0900, "Charles O Nutter" <headius@headius.com> writes: |All due respect to matz and companyand the wondrous thing they have wrought, |but *nobody* is perfect. Accepting a decision blindly based on who is making |it is a recipe for trouble. My only concern is that while the proposed m17n |implementation may make Ruby more perfect and more ideal for at least one |person, it may (emphasis on 'may') make it harder for many thousands of |others. Does that make sense? I'm sure there will be those who argue that |Ruby is matz's creation and matz's creation alone, but there's a lot of |people with a vested interest in "the Ruby way". A little critical analysis |of the "benevolent dictator's" decisions is always prudent. Good point. |If we get unicode and it's a lot harder than people like, or if it causes |unpleasant compatibility, portability, or interoperability issues, then |we're no better off. | |Hey, the uber-string m17n impl might be the most amazing, remarkable thing |ever to come along. It just seems based on a lot of anecdotal evidence that |this approach is very complex and very dangerous, and arguably has never |been done right yet. matz and company are amazing hackers, but is it a good |risk to take? Is it worth it for 10% of Ruby users or less? But unfortunately, the implementer is living among those "10% or less". So it's a risk already taken, choosing a language designed by such a person. ;-) Anyway, please give me a chance to be proven wrong (or right). I will try not to make lives of thousands of others hard. matz.
on 2006-06-27 10:27

Charles O Nutter schrieb: > (...) > Hey, the uber-string m17n impl might be the most amazing, remarkable > thing ever to come along. It just seems based on a lot of anecdotal > evidence that this approach is very complex and very dangerous, and > arguably has never been done right yet. matz and company are amazing > hackers, but is it a good risk to take? Is it worth it for 10% of > Ruby users or less? > (...) Charles, could it be that "the uber-string m17n implementation" would make your life as JRuby implementer a lot harder? ;-> Regards, Pit
on 2006-06-27 15:30

On 6/26/06, Charles O Nutter <headius@headius.com> wrote:
>> non-unicode-encodable String, then an uber-String that's entirely
>> encoding-agnostic would be better written as an extension for those
>> special cases? Do we really need to encumber all of Ruby for the
>> needs of a relative few?

It's been asked already.

Again: How does the possibility to store non-unicode characters in
strings encumber you?

Michal
on 2006-06-27 16:31

It won't matter much either way to JRuby, since Java's going to internalize all strings as UTF-16 anyway. Those encodings that can't be represented in unicode simply won't work, since that's just a platform limitation we'll probably live with. There's always the option of building our own uber-string based on what matz creates (porting to Java wouldn't be impossible, or perhaps even difficult) but we'll cross that bridge when we come to it. I'm just trying to play both sides of the fence here since there seems to be a number of people opposed to or doubtful of the m17n uberstring. As a Ruby platform implementer of a sort I'd like to make sure those concerns are considered.
on 2006-06-27 16:34

On 6/27/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> But unfortunately, the implementer is living among those "10% or
> less". So it's a risk already taken, choosing a language designed by
> such a person. ;-)

That's certainly fair to say, and I'm optimistic that whatever the best
decision is you'll make it right. It's especially heartening that you
are an active participant in this debate; I know certain other language
designers that are less open to comment and criticism.

> Anyway, please give me a chance to be proven wrong (or right).
> I will try not to make lives of thousands of others hard.
>
> matz.

It seems you're giving yourself the chance to be proven wrong already.
I'll just watch that process as it moves forward and do what I can to
mix things up.
on 2006-06-27 16:54

On 6/27/06, Michal Suchanek <hramrach@centrum.cz> wrote:
>> non-unicode-encodable String, then an uber-String that's entirely
>> encoding-agnostic would be better written as an extension for those
>> special cases? Do we really need to encumber all of Ruby for the
>> needs of a relative few?
> It's been asked already.
>
> Again: How does the possibility to store non-unicode characters in
> strings encumber you?

To be fair to Charles, he would benefit immensely from a Unicode
internal representation because he could then simply *and cleanly* use
Java Strings as Ruby Strings in JRuby. With an m17n String, he will
need to have something else that isn't compatible with Java Strings,
which hurts JRuby's use as a Java glue language.

I think that there are ways around this. Maybe make the JRuby String
class have an internal something like:

  class JRubyString {
    private java.lang.String unicode;
    private ByteVector m17n;      // hypothetical byte-vector class
    private java.lang.String encoding;
    private boolean isUnicode;
  }

That way, if it's a Unicode encoding -- regardless of what's desired --
he could use the unicode member; otherwise internally he uses the
ByteVector. (Strictly speaking, for non-"raw" or "binary" encodings, he
could always use the unicode member and convert as necessary.)

-austin
on 2006-06-27 17:23

On 6/27/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> But unfortunately, the implementer is living among those "10% or
> less". So it's a risk already taken, choosing a language designed by
> such a person. ;-)

That also means that the implementer has a much better understanding of
internationalization issues than those who live in the US ;-) This
should give us at least a sound base string class. And since the class
is open in Ruby, automatic this-or-that can be added.

Thanks

Michal
on 2006-06-27 18:28

On 6/27/06, Austin Ziegler <halostatue@gmail.com> wrote:
>   private java.lang.String encoding;
>   private boolean isUnicode;
> }

This would certainly be an option once matz has solved all the hard
problems of an encoding-free String. Some minimal testing of a byte[]
based UTF-8 Java String replacement has shown that there are very few
general performance issues arising from reimplementing string with a
different data structure (a testament to Java's JIT, since most Java
code runs faster without native bits). When there's something concrete
in the m17n plan, we shouldn't have much difficulty supporting it. We
could also run with pure unicode internally as well, for folks who
didn't need any unicode-incompatible encodings. Without the m17n code
ready for general consumption, it's hard to say what path will be best.

The other advantage of a byte[] or ByteVector-based JRuby string is for
IO; currently we use Java's StringBuffer for handling mutable string
operations. This works well, but StringBuffer maintains a char[]
internally, so for every byte of IO we waste a byte. We're considering
various options to improve that, and the end result may be closer to
the UberString than to Java's own.

So yes, there's some ulterior motive in my support for pure Unicode and
ByteArray, but any path taken will be implementable in JRuby. However,
I support those because I feel they simplify rather than complicate,
and not because they might be easier to implement in Java.
on 2006-06-27 19:21

On 6/27/06, Charles O Nutter <headius@headius.com> wrote:
> So yes, there's some ulterior motive in my support for pure Unicode and
> ByteArray, but any path taken will be implementable in JRuby. However, I
> support those because I feel they simplify rather than complicate, and not
> because they might be easier to implement in Java.

IME, more classes complicate. Sometimes the complexity is necessary
because it is simpler than the alternative, but I don't believe that
this is the case here. As I said, most of my opposition is based on (1)
stupid statically typed languages and (2) an inability to tell Ruby
what type you want back from a method call (this is a good thing,
because it in part prevents #1 ;).

-austin
on 2006-06-27 23:56

On 6/26/06, Daniel DeLorme <dan-ml@dan42.com> wrote: > > It's funny, maybe I'm just dumb but I can't think of a single *real-world* > example where you'd want to access particular characters of a string. If that is the case, then why doesn't Ruby remove *all* substring notation? If everyone is so comfortable with manipulating strings via regexp's, then why does the language bother to support my_str[a..b] , my_str[a...b] , and my_str[a,b] ? I don't mean to sound all-worked-up over this, but it does seem hard to believe that those method calls for String are never used in real-world code.
on 2006-06-28 03:00

Daniel DeLorme <dan-ml@dan42.com> writes:

> It's funny, maybe I'm just dumb but I can't think of a single
> *real-world* example where you'd want to access particular characters
> of a string.

I'll point you at my solution to ruby quiz #83: (short but unique)

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/...

How would you write the method string_similarity without access to each
character? (This method computes the length of the longest common
substring.)

How would you compute the Levenshtein distance (edit distance) between
two strings without access to each character?

How would you pull strings out of a file with fixed-width fields? With
regular expressions? Really? What if you had a hundred fields?
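(For the curious, the edit-distance case spelled out -- a standard
dynamic-programming sketch; split(//) splits per character in 1.8 once
$KCODE is set to 'u':)

  $KCODE = 'u' # make // split on characters, not bytes, in 1.8

  def levenshtein(a, b)
    a, b = a.split(//), b.split(//)
    d = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
    (0..a.size).each { |i| d[i][0] = i }
    (0..b.size).each { |j| d[0][j] = j }
    (1..a.size).each do |i|
      (1..b.size).each do |j|
        cost = (a[i - 1] == b[j - 1]) ? 0 : 1
        d[i][j] = [d[i - 1][j] + 1,              # deletion
                   d[i][j - 1] + 1,              # insertion
                   d[i - 1][j - 1] + cost].min   # substitution
      end
    end
    d[a.size][b.size]
  end

  levenshtein("kitten", "sitting") # => 3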
on 2006-06-28 03:28

Tim Bray wrote: > On Jun 26, 2006, at 7:20 PM, Daniel DeLorme wrote: > >> I'd like to point out that MySQL has m17n strings, and it rocks. > > I am often unable to get Unicode strings from Perl into MySQL and back > out without breaking them. Haven't tried the Ruby/MySQL combo; does it > work better? -Tim I've never had any problems. You just have to make sure the client correctly tells the server what encoding it is using. The only annoyance is that MySQL will silently change inconvertible characters to '?', but that's part of the MySQL design philosophy rather than inherent to m17n strings. Daniel
on 2006-06-28 04:03

Daniel Martin wrote:
> I'll point you at my solution to ruby quiz #83: (short but unique)
>
> http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/...
>
> How would you write the method string_similarity without access to
> each character? (This method computes the length of the longest
> common substring)
>
> How would you compute the Levenshtein distance (edit distance) between
> two strings without access to each character?

I'll grant that I don't have enough imagination and that there *are*
cases where you want character access. But it seems to me that the main
use case is for something like this:

  str = "cogito <b>ergo</b> sum"
  i = str.index("<b>") + 3
  j = str.index("</b>", i)
  str[i...j] # => "ergo"

and for that common case, regexes are far more appropriate:

  str.match(/<b>(.*?)<\/b>/)[1] # => "ergo"

Advocating regexes-only for character manipulation is certainly
extreme. I'm just saying that byte access and character access need to
have different semantics. If you look at the current ruby String API,
bytes are accessed through integer positions and characters are
accessed through regexes. The byte and char APIs are quite distinct,
it's just that everybody is using the byte API and expecting to get
characters as a result. From what I understand (and please correct me
if I'm wrong), ruby2 will fix that by changing the api so that integer
positions represent characters instead of bytes. For binary strings,
those two concepts map to the same reality so it won't be such a
backward-incompatible change. I just wonder what will be the behavior
of str[0]. Will it return a 0..255 integer in the case of a binary
string and a 1-character string in the case of an encoding-set string?
Now *that* would be an API nightmare.

> How would you pull strings out of a file with fixed-width fields?
> With regular expressions? Really? What if you had a hundred fields?

Hmm, fixed width records and fields were created for the purpose of
fast access to data, i.e. seek to position recnum*reclength and extract
reclength bytes; they only make sense in the case of single-byte
characters. So this is more a case of byte access.

Daniel
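(And for that byte-oriented fixed-width case, String#unpack already
fits; the record layout and field widths below are invented for the
example:)

  record = "John      " + "Smith     " + "Toronto   " # three 10-byte fields
  first, last, city = record.unpack("A10 A10 A10")
  # "A" extracts a fixed-width byte field and strips trailing spaces
  city # => "Toronto"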
on 2006-06-28 07:44

On 27.6.2006, at 19:19, Austin Ziegler wrote: > As I said, most of my opposition is based on > (1) stupid statically typed languages and (2) an inability to tell > Ruby what type you want back from a method call (this is a good thing, > because it in part prevents #1 ;). First, "most of my opposition" is not useful in discussion and is a straw-man, because we are not counting people here, we try to evaluate reasons for and against. One person with good reason should overcome 1000 not-so-good posts. This is not about winning the argument, it's about having the best solution. About (2), inability to tell in advance in your program whether you get bytes or characters from a method in core (or any other) API is NOT a good thing. This causes innumerable problems and unexpected behaviour if programmer expects one and code sometimes gets the other. The API should prevent such errors, either by very simple and strict rules that enable easy prediction, or by introducing ByteArray, which makes prediction trivial. This is not about duck- typing, it's about randomly having semantically different results. Since the rules are not fixed yet, nobody can say whether one or the other solution is better. But if the API is not very clear or requires lots of manual specifying in code, we will be in a mess, similar to today. izidor
on 2006-06-28 18:50

On 6/28/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> On 27.6.2006, at 19:19, Austin Ziegler wrote:
> > As I said, most of my opposition is based on
> > (1) stupid statically typed languages and (2) an inability to tell
> > Ruby what type you want back from a method call (this is a good thing,
> > because it in part prevents #1 ;).
> First, "most of my opposition" is not useful in discussion and is a
> straw-man, because we are not counting people here, we try to
> evaluate reasons for and against. One person with good reason should
> overcome 1000 not-so-good posts. This is not about winning the
> argument, it's about having the best solution.

You have misread my English. I am not referring to people who oppose my
position; I am referring to my opposition to a separate ByteArray
class. However, I have yet to see even a mediocre reason for a separate
ByteArray.

> About (2), inability to tell in advance in your program whether you
> get bytes or characters from a method in core (or any other) API is
> NOT a good thing. This causes innumerable problems and unexpected
> behaviour if programmer expects one and code sometimes gets the
> other. The API should prevent such errors, either by very simple and
> strict rules that enable easy prediction, or by introducing
> ByteArray, which makes prediction trivial. This is not about duck-
> typing, it's about randomly having semantically different results.

You'll *never* get that without type hinting. And type hinting for
return types would be as bad as anything else for Ruby. Consider this
copy function:

  def copy_file(inf, outf)
    open(inf, "rb") { |fin|
      File.open(outf, "wb") { |fout|
        fout.write fin.read
      }
    }
  end

Why didn't I use File.open? Because I can now do this:

  require 'open-uri'
  copy_file("http://www.ruby-lang.org/en", "ruby-lang-en.html")

I didn't get a "File" object from Kernel#open; I got (in this case) a
Tempfile.

> Since the rules are not fixed yet, nobody can say whether one or the
> other solution is better. But if the API is not very clear or
> requires lots of manual specifying in code, we will be in a mess,
> similar to today.

Quite simply, you're either wrong or you don't understand the
parameters of the problem. I'd rather assume the latter. However, if
you want to ensure a particular class is returned from a Ruby method,
you must have a method which guarantees that it will only return that
class (or nil, perhaps). Therefore, with a separate ByteArray class, we
would *of necessity* see parallel File operations or a separate IO
class hierarchy or (worst of all!) constructors which tell the File to
return String or ByteArray depending on how it was constructed.

There is *no possible good argument* for separating ByteArray from
String in Ruby. Not with what it would do to the rest of the API, and I
don't think that anyone who wants a ByteArray is thinking beyond String
issues.

-austin
on 2006-06-28 19:48

On Mon, Jun 26, 2006 at 11:21:59AM +0900, Yukihiro Matsumoto wrote:
> |string = File.open('file.txt', 'r') {f.read.to_s(:utf-8)}
>
> matz.

Any additional complexity here should be offset later, when doing
operations on the read data as appropriate for its type. Of course, the
first line should raise an exception if file.txt is not utf8 encoded;
this saves extra complexity down the line, and is a real difference
between the two. I imagine Bytevector would be implemented with maximum
performance and space efficiency in mind, while String is a higher
level class streamlined for ease of use.

There could be accessors for the Bytevector to convert it (or parts of
it) to a String, for cases where you really need to read mixed
string/data from somewhere:

  string = bytes.to_str(:utf8)
  string2 = bytes[1..5].to_str(:utf8)

Or maybe a StrStream-like interface:

  bytes.stream_open("r") do |b|
    s = b.read(:utf8)
    ...
  end

-Jürgen
on 2006-06-28 20:42

On 28.6.2006, at 18:48, Austin Ziegler wrote:
>> About (2), inability to tell in advance in your program whether you
>> get bytes or characters from a method in core (or any other) API is
>> NOT a good thing. This causes innumerable problems and unexpected
>> behaviour if programmer expects one and code sometimes gets the
>> other. The API should prevent such errors, either by very simple and
>> strict rules that enable easy prediction, or by introducing
>> ByteArray, which makes prediction trivial. This is not about duck-
>> typing, it's about randomly having semantically different results.
>
> You'll *never* get that without type hinting.

I think you do not understand what the problem is, because your claim
is so obviously false. How can I get that with a very simple rule: all
IO#read (and similar) calls always return binary Strings. No type
hinting in sight, but I always know whether my code receives Strings or
binary Strings.

But this simple option is clearly not possible, because it complicates
the text processing in simple scripts. We'll see how complicated the
final rules will be.

An alternative (actually equivalent) to the above is: all IO#readbytes
calls return ByteArray objects, and we need a separate call,
IO#readstring, which always returns Strings with encoding.

izidor
on 2006-06-28 20:45

On 6/28/06, Juergen Strobel <strobel@secure.at> wrote: > Any additional complexity here should be offset later, when doing > operations on the read data as appropriate for its type. It won't be. All of the complexity of the m17n String will be inside of the String, not exposed (by default) to the user. Stop thinking of the encoding of a String as something that makes the String a unique object; instead it is a lens that gives meaning to the bytes of the String. > Of course, the first line should raise an exception if file.txt is not > utf8 encoded, The internal format of String is not going to be Unicode by default. Matz has already said that. I happen to agree with him. > this saves extra complexity down the line, and is a real difference > between the two. I imagine Bytevector would be implemented with > maximum performance and space efficiency in mind, while String is a > higher level class streamlined for easy of use. These two items are not mutually exclusive. Think a little more about humane design and you'll see that two wholly separate classes require a lot more than what you're assuming and would end up in programmers making even dumber assumptions than they do today, because they'd think they're "protected" during IO because they're getting a String. This is not a safe assumption. Ever. The separate byte vector class is needlessly complex and solves exactly nothing that isn't already solved in a better way. -austin
on 2006-06-28 20:53

On 6/28/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote: > I think you do not understand what the problem is, because your claim > is so obviously false. Oh, bollocks. Go ahead, pull the other one. > IO#readstring which always return Strings with encoding. Nope. Not nearly equivalent and a lot dumber. I've just spent the last week explaining in simple terms why it's dumb. You want to *at least* double the complexity of the IO API because you're either unwilling or incapable of considering anything but your ByteArray concept. I, for one, am not willing to consider an extensively more complex API because your imagination is lacking. -austin
on 2006-06-28 20:56

On 28.6.2006, at 18:48, Austin Ziegler wrote:
> There is *no possible good argument* for separating ByteArray from
> String in Ruby. Not with what it would do to the rest of the API, and
> I don't think that anyone who wants a ByteArray is thinking beyond
> String issues.

Oh, really? So it is OK for this code to sometimes receive a binary
String and sometimes a String with encoding:

  io = SomeIO.open( .... )
  v = io.read( 1000 )

This is the most problematic part of String handling. Because if my
code expects this 'v' to be a binary string, v[0..15] is the first 16
bytes (maybe a message header or something). If this is an encoded
string (because some setting changed outside of my code), v[0..15] will
be some random amount of data.

This is the error that happens right now and will happen in the future
also, if the rules are not clear.

izidor
on 2006-06-28 21:06

On 28.6.2006, at 20:43, Austin Ziegler wrote: > Think a little more about > humane design and you'll see that two wholly separate classes > require a > lot more than what you're assuming and would end up in programmers > making even dumber assumptions than they do today, because they'd > think > they're "protected" during IO because they're getting a String. > This is > not a safe assumption. Ever. True. That's why most solutions do not offer String IO, but only ByteArray. But for language with large part of usage being text processing, this brings lots of conversions into code, which as Matz said, makes it like Java. But it is the safe way. Just the way you like it - no automatic conversion :-) But most of us would not like the language which makes you type all the conversions manually in code, even for single-line scripts. Which we would not have any more. Scripts would be at least two lines - one line for conversion code :-) izidor
on 2006-06-28 21:19

On Thu, 29 Jun 2006, Austin Ziegler wrote:
> There is *no possible good argument* for separating ByteArray from String in
> Ruby. Not with what it would do to the rest of the API, and I don't think
> that anyone who wants a ByteArray is thinking beyond String issues.

i wouldn't go that far. i'm wanting a byte array and thinking beyond
string issues about every 1-2 hrs in my job. for example

  f = open 'grayscale_image.dat'

  n_rows.times do
    row = f.read n_cols

    # now i have to do this
    row = row.split(//).map{|char| char[0]}

    # because here i need to do
    avg_pixel_value = row[31,5].inject(0){|avg,n| avg += n} / 5.0
    if some_range.include? avg_pixel_value
      ...
    end
  end

this may have nothing to do with unicode issues - but i would love to
have 'array of bytes' style io operations, though i've not thought
about the api for more than 1 second. anyhow - we actually want byte
arrays more often than strings.

regards.

-a
on 2006-06-28 21:19

On 28.6.2006, at 20:51, Austin Ziegler wrote:
> Nope. Not nearly equivalent and a lot dumber. I've just spent the last
> week explaining in simple terms why it's dumb.

Equivalent in their predictive power. This is the problem I discuss -
both give 100% results independent of the environment, and the
ByteArray version is maybe even somewhat firmer, because there is even
a different class of result, not only an encoding.

You have not given any solution to any of the problems in the code
examples I have given, related to the problem of predicting the
class/encoding of the result. I'd say a solution would prove you know
what the problem is.

Except if you say that this (random String encoding in the result) is
not a problem. Then this discussion really can't progress. And we can
agree that we disagree and stop right here.

izidor
on 2006-06-28 21:47

On 28.6.2006, at 21:18, Izidor Jerebic wrote:
> I'd say a solution would prove you know what the problem is.
>
> Except if you say that this (random String encoding in the result) is
> not a problem. Then this discussion really can't progress. And we can
> agree that we disagree and stop right here.

And to clear the air - I am not advocating ByteArray unconditionally. I
have just explained one crucial problem, and ByteArray is a simplistic
solution to that problem. I would much prefer some really creative and
simple String solution. But I do not have it and have not seen it yet.

Hopefully Matz (or anybody, really) will surprise us with an elegant,
balanced solution.

izidor
on 2006-06-28 22:19

On Jun 28, 2006, at 3:16 PM, ara.t.howard@noaa.gov wrote:
>
>     # now i have to do this
>     row = row.split(//).map{|char| char[0]}
>

This is off on a tangent here, but ara, why not just

  row = row.to_enum(:each_byte).to_a
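
[Note: plain Ruby 1.8 offers a couple of other routes to the raw bytes;
String#unpack is usually the fastest of them. A small illustration
using only standard methods - the sample bytes are made up:]

  require 'enumerator'            # needed on 1.8 for Object#to_enum

  row = "\x10\x80\xff"            # three arbitrary bytes

  row.unpack("C*")                # => [16, 128, 255]  (fastest in practice)
  row.to_enum(:each_byte).to_a    # => [16, 128, 255]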
on 2006-06-29 02:07

Hi,

In message "Re: Unicode roadmap?"
    on Thu, 29 Jun 2006 03:53:52 +0900, Izidor Jerebic <ij.rubylist@gmail.com> writes:

|Oh, really? So it is OK for this code to sometimes receive a binary
|String and sometimes a String with encoding?
|
|io = SomeIO.open( .... )
|v = io.read( 1000 )

No, as I said before, reading with length specified shall always return
binary strings, since it counts in bytes, whereas gets, readline, etc.
would return encoded strings.

matz.
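
[Note: this is exactly how IO behaved when Ruby 1.9 shipped - read with
a length argument returns ASCII-8BIT (binary) data, while gets honours
the stream's external encoding. A sketch of the 1.9+ semantics; the
file name is illustrative:]

  File.open("file.txt", "r:UTF-8") do |io|
    chunk = io.read(1000)    # length given: counted in bytes
    chunk.encoding           # => #<Encoding:ASCII-8BIT>

    io.rewind
    line = io.gets           # no length: a decoded line
    line.encoding            # => #<Encoding:UTF-8>
  end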
on 2006-06-29 07:13

On 6/28/06, Izidor Jerebic <ij.rubylist@gmail.com> wrote:
> io = SomeIO.open( .... )
> v = io.read( 1000 )
>
> This is the most problematic part of String handling. If my code
> expects this 'v' to be a binary string, v[0..15] is the first 16
> bytes (maybe a message header or something). If it is an encoded
> string (because some setting changed outside of my code), v[0..15]
> will be some random amount of data.
>
> This is the error that happens right now and will happen in the
> future also, if the rules are not clear.

I would think that STD* should use the locale (or equivalent) for the
default encoding. So should popen. And open should use the locale to
determine the encoding of *file names*. This might be different from
the encoding of STD* (e.g. on Windows).

For file IO it might be reasonable to set the default encoding from the
locale as well. However, there is no reason why files should contain
text. So to make things clear, IO should be binary by default for
files, network connections, and anything else (except the pipes
mentioned above).

For short scripts one could change that by assigning some global that
specifies the default encoding. For anything else it is reasonable to
demand that everybody sets the encoding when calling open, and even to
issue a warning about it. If you want to know what encoding you get,
there is no other way.

And it is not adding complexity: today you do not specify an encoding,
but you also do not get anything that deals with it.

Thanks

Michal
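
[Note: Ruby 1.9 ended up with knobs much like the ones Michal sketches
here - Encoding.default_external is initialised from the locale and can
be overridden globally or per stream. Shown with the 1.9+ API; the file
name is illustrative:]

  Encoding.default_external           # e.g. #<Encoding:UTF-8>, from the locale

  # Short scripts can flip the global default once:
  Encoding.default_external = "ISO-8859-1"

  # Longer programs state the encoding explicitly when calling open,
  # here reading Latin-1 data and transcoding it to UTF-8 on the way in:
  File.open("data.txt", "r:ISO-8859-1:UTF-8") do |io|
    io.gets.encoding                  # => #<Encoding:UTF-8>
  end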
on 2006-06-29 08:45

On Thu, Jun 29, 2006 at 03:43:54AM +0900, Austin Ziegler wrote:
> On 6/28/06, Juergen Strobel <strobel@secure.at> wrote:
> > Any additional complexity here should be offset later, when doing
> > operations on the read data as appropriate for its type.
>
> It won't be. All of the complexity of the m17n String will be inside
> the String, not exposed (by default) to the user. Stop thinking of the
> encoding of a String as something that makes the String a unique
> object; instead, it is a lens that gives meaning to the bytes of the
> String.

Having said lens adds complexity. I'll always have to think of the data
and the lens. You are very absolute in denying this, and I wonder why.

> > Of course, the first line should raise an exception if file.txt is
> > not utf8 encoded,
>
> The internal format of String is not going to be Unicode by default.
> Matz has already said that. I happen to agree with him.

Please stop beating this dead horse. No one is disputing Matz's right
to implement as he likes. And this is not about the String per se: in
that line of code, clearly something supposed to be UTF-8 is read from
a file, and if the file doesn't contain valid UTF-8, I'll expect an
exception. Not getting that exception adds complexity to my code,
because I'll have to verify it later on manually, and it may obscure
the source of the error if I forget. Complexity added in both cases.
Prior point proven.

> not a safe assumption. Ever.
>
> The separate byte vector class is needlessly complex and solves
> exactly nothing that isn't already solved in a better way.

Without a prototype, this is speculation at best. Programmers would be
protected by exceptions from invalid String I/O operations. Human
interface design hinges on a lot more and different things than this
one special detail; I can't imagine it will change a lot, and many Ruby
programmers aren't as dumb as you make them out to be. This is a red
herring.

OT: you should watch whom you call dumb, stupid or foolish here, even
by implication.

That said, I am waiting for M17N as Matz has decided on that, and I
suspect no one else is going to implement anything else for now. But
don't tell me it'll be just perfect for everyone, when the discussed
use cases already show it won't be. Matz himself said that, in order to
cater to his own special interest group, he is willing to sacrifice
some convenience for others.

-Jürgen
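
[Note: as it turned out, Ruby 1.9 does not raise at read time on
malformed input; the error only surfaces once the string is operated
on. Code that wants the fail-fast behaviour Jürgen asks for has to
check explicitly - a sketch with the 1.9+ API, reusing the file.txt
example:]

  text = File.open("file.txt", "r:UTF-8") { |f| f.read }

  unless text.valid_encoding?
    raise ArgumentError, "file.txt is not valid UTF-8"
  end

  # Without the check, the failure shows up later, e.g. on a match:
  # text =~ /x/   # ArgumentError: invalid byte sequence in UTF-8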
on 2006-06-29 08:57

Hi,

In message "Re: Unicode roadmap?"
    on Thu, 29 Jun 2006 15:44:10 +0900, Juergen Strobel <strobel@secure.at> writes:

|That said, I am waiting for M17N as Matz has decided on that, and I
|suspect no one else is going to implement anything else for now. But
|don't tell me it'll be just perfect for everyone, when the discussed
|use cases already show it won't be. Matz himself said that, in order
|to cater to his own special interest group, he is willing to sacrifice
|some convenience for others.

Did I say so? I am not going to sacrifice anybody. At least I am trying
not to, even though I cannot promise.

matz.
on 2006-06-29 09:35

On Thu, Jun 29, 2006 at 03:56:55PM +0900, Yukihiro Matsumoto wrote:
> |some convenience for others.
>
> Did I say so? I am not going to sacrifice anybody. At least I am
> trying not to, even though I cannot promise.
>
> matz.

I don't think you can possibly cater to everyone here. Simplicity,
Flexibility, Performance: take any two. My impression is that M17N is
going for maximum flexibility with good performance, but for e.g.
Unicode-only users there'll be some extra complexity to be aware of. I
don't think you'll sacrifice Unicode users totally, but they are not
your top priority either.

And I understood you expressed this yourself in the following quote.

On Tue, Jun 27, 2006 at 05:21:27PM +0900, Yukihiro Matsumoto wrote:
> |ever to come along. It just seems based on a lot of anecdotal
> |evidence that
>
> matz.

-Jürgen
on 2006-06-29 09:44

Hi,

In message "Re: Unicode roadmap?"
    on Thu, 29 Jun 2006 16:33:19 +0900, Juergen Strobel <strobel@secure.at> writes:

|I don't think you can possibly cater to everyone here. Simplicity,
|Flexibility, Performance: take any two. My impression is that M17N is
|going for maximum flexibility with good performance, but for e.g.
|Unicode-only users there'll be some extra complexity to be aware of. I
|don't think you'll sacrifice Unicode users totally, but they are not
|your top priority either.

I can't promise implementation simplicity, because the complexity would
be inside. But I am trying to build "pseudo simplicity", which means
simplicity in the appearance. For example, text processing code with
file I/O in Ruby will keep being much simpler than in Java.

|And I understood you expressed this yourself in the following quote.

Don't get me wrong by reading that without its context. You had said
that "this approach is complex, and worth it for 10% or less of Ruby
users". And I said, "unfortunately I am one of those 10% or less. You
cannot stop Ruby being (implementation) complex". Clear?

matz.
on 2006-06-29 13:13

On 6/29/06, Juergen Strobel <strobel@secure.at> wrote:
> I don't think you can possibly cater to everyone here. Simplicity,
> Flexibility, Performance: take any two. My impression is that M17N is
> going for maximum flexibility with good performance, but for e.g.
> Unicode-only users there'll be some extra complexity to be aware of. I
> don't think you'll sacrifice Unicode users totally, but they are not
> your top priority either.

Um. You make the same error, I think, that some others have. There are
two measures of complexity here. The first is implementation
complexity. The second is use complexity.

I fully expect that the implementation of the m17n String is going to
be complex. (I think it will be simpler than most naysayers are
suggesting, but it will certainly be more complex than anything that
currently exists.) However, I believe that the use complexity -- that
is, the external API in both C (for extensions) and Ruby -- is going to
be relatively low. Maybe a little more complex than what we have today.

The *actual* complexity in use is going to depend on your needs. If
you're dealing with Unicode and binary data only -- as will likely be
the case -- you will find it much easier to use than someone who has to
deal with multiple encodings at once.

-austin
on 2006-06-30 01:06

On Thu, Jun 29, 2006 at 04:42:49PM +0900, Yukihiro Matsumoto wrote:
> |not your top priority either.
>
> "unfortunately I am one of those 10% or less. You cannot stop Ruby
> being (implementation) complex". Clear?
>
> matz.

First, it wasn't me who brought this up; the quote about the 10% is
from Charles O Nutter. Second, I know a complex implementation doesn't
mean the interface has to be complex - on the contrary. My fear is that
the interface will still be more complex than really necessary for *me*
-- not that I would expect this to be reason enough for you to deviate
from your plans. Voicing my own concerns and wishes about the interface
design is a thing I can do, though, in the hope that such feedback will
be useful to you, or at least informative to other readers.

I still think that you won't be able to please everybody; that's just
not possible. No evangelist will ever convince me otherwise. But I am
eager to see for myself how close you can come (and where you will
compromise).

-Jürgen
on 2007-05-31 22:29
Hello, everyone. I am sorry - I was a bit embarrassed by the quantity
of text in this discussion, I may have read it not carefully enough to
figure out the answer, and the discussion itself seems to be a year
old, so I've decided to ask:

Finally, is there convenient support for Unicode in Ruby? Or, if not,
when will there be?

I am going to develop an international website (with pages in some
European languages, including those using non-Latin alphabets). I think
it should prove to be a good idea to make such a website totally in
Unicode (probably UTF-16), without using any legacy encodings at all.
The DBMS I am going to use is Oracle 10g (Express Edition, until it
runs into its limitations).

As well, I would like to ask when the next Ruby release is planned for.
If it comes this year, I should probably try nightly builds, as it
seems wise to start a new project targeting an early version of the
next release.

Thanks in advance.
on 2007-06-01 00:31

On 5/31/07, Ivan Mashchenko <ivan.mashchenko@gmail.com> wrote:
> Hello, everyone. I am sorry - I was a bit embarrassed by the quantity
> of text in this discussion, I may have read it not carefully enough to
> figure out the answer, and the discussion itself seems to be a year
> old, so I've decided to ask:
>
> Finally, is there convenient support for Unicode in Ruby? Or, if not,
> when will there be?

There are a lot of answers to that question, and I strongly suggest you
search, as this is a hotly debated discussion. Google is more useful
for searching this than ruby-forum.com. You will find out when there
will be a new release, and the current state of Unicode.

-austin
on 2007-06-01 08:16

On Fri, Jun 01, 2007 at 05:29:31AM +0900, Ivan Mashchenko wrote:
> Finally, is there convenient support for Unicode in Ruby? Or, if not,
> when will there be?

Well, Ruby 1.9 (which is due in December) will have some Unicode
support. (So you'll have a `chars` method on strings, like with Rails.)
Matz is working on it right now, even - he posted on his blog that he
was tooling around with string.c earlier this week. That is, nothing's
been checked in yet. Because he wants it to be good, you see?

_why
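
[Note: the Rails `chars` method _why mentions is the
ActiveSupport::Multibyte proxy that shipped with Rails 1.2; it wraps a
UTF-8 string so the familiar String methods count characters rather
than bytes. An illustration of the 1.2-era usage - exact method
coverage varied between releases:]

  require 'rubygems'
  require 'active_support'   # Rails 1.2's ActiveSupport::Multibyte

  $KCODE = 'u'               # the proxy activates for UTF-8 under Ruby 1.8

  s = "héllo"
  s.length                   # => 6 (bytes: é is two bytes in UTF-8)
  s.chars.length             # => 5 (characters, via the multibyte proxy)
  s.chars.upcase.to_s        # => "HÉLLO"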
on 2007-06-01 08:25

On 2007-05-31 15:30:50 -0700, "Austin Ziegler" <halostatue@gmail.com> said:
> There are a lot of answers to that question, and I strongly suggest
> you search, as this is a hotly debated discussion.
>
> Google is more useful for searching this than ruby-forum.com. You will
> find out when there will be a new release, and the current state of
> Unicode.

If it helps any, I've moved ~2000 web pages in an internal work project
that had mixed UTF-8/cp-1252 (in the content, not just between pages),
and Ruby handled it very gracefully. I was using 1.8.5-p12 and Hpricot
(but not Hpricot's encoding features, which, last I checked, were
broken) for the process.

While I'm certainly not an authority on the subject, I've thoroughly
battle-tested this and it works with a high degree of confidence.
Certainly better than Perl and libxml2, which was our original
implementation.
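
[Note: the standard-library Iconv binding is the usual Ruby 1.8 tool
for this kind of cleanup. A sketch of one way to normalise mixed
UTF-8/cp-1252 content - the round-trip heuristic is illustrative, not
the poster's actual code:]

  require 'iconv'

  # Iconv raises on bytes it cannot map, which makes it usable as a
  # validator: a UTF-8 -> UTF-8 round trip fails on malformed input.
  def normalize_to_utf8(str)
    Iconv.conv('UTF-8', 'UTF-8', str)         # passes if already valid UTF-8
    str
  rescue Iconv::IllegalSequence, Iconv::InvalidCharacter
    Iconv.conv('UTF-8', 'WINDOWS-1252', str)  # otherwise assume cp-1252
  end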
on 2007-06-01 11:51

On 5/31/07, Ivan Mashchenko <ivan.mashchenko@gmail.com> wrote:
> Finally, is there convenient support for Unicode in Ruby? Or, if not,
> when will there be?

It depends on your definition of 'convenient'. The short answer is that
Unicode applications can be made in Ruby, particularly web apps. It is
not especially difficult, but it is not 'for free' or seamless. You
generally have to use an encoding-aware string type, or modify the
existing String class to support multi-byte characters.

A longer answer would contain references to the fact that there are
multiple options here, that web apps (Rails in particular) are ahead of
pure Ruby in terms of Unicode, and that there are actually a lot of
projects to investigate.

The hardest part of Ruby and Unicode is that not all of the libraries
support it, or that some of the meta-hackery to the String class can
break libraries that expect chars.length to equal bytes.length (there
are other examples). Some of the more popular libraries are like this,
or they inherit the encoding from your OS settings and cannot be driven
from an API.

> I am going to develop an international website (with pages in some
> European languages, including those using non-Latin alphabets). I
> think it should prove to be a good idea to make such a website totally
> in Unicode (probably UTF-16), without using any legacy encodings at
> all.

Well yes, but I would use UTF-8 instead. It's the Unicode encoding
designed for the web (and UTF-16 is a bit weird in some ways - there
are at least three kinds of UTF-16 that I am aware of). Rails 1.2
introduced some pretty impressive support for Unicode in the last
release; all of the major i18n plugins should be compatible with these
changes by now.

> As well, I would like to ask when the next Ruby release is planned
> for. If it comes this year, I should probably try nightly builds, as
> it seems wise to start a new project targeting an early version of the
> next release.

AFAIK there is no release schedule. YARV is basically Ruby 1.9, and it
is scheduled for release around the end of the year. However, there is
no firm commitment to make it the next Ruby version.

Also, Ruby 1.9 is going to break/deprecate stuff - I wouldn't develop
against it; it will be a rough experience. Ruby 1.9 is kind of a
staging release: migrating from 1.8 -> 1.9 is going to be tricky, but
1.9 -> 2.0 should be a drop-in. That's the intention - isolate the
biggest changes to the 1.9 release.

If you are moving to Ruby 1.9, do it with a complete working
application. Or better still, develop against Rails versions, not Ruby
versions. Let the Rails team figure out the best Ruby migration
strategy for you.
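
[Note: the chars.length versus bytes.length pitfall is easy to see in
Ruby 1.8, where String indexing and length are byte-based even with
$KCODE set. A small illustration using only 1.8 core plus the jcode
standard library:]

  $KCODE = 'u'         # UTF-8-aware regexps in Ruby 1.8
  require 'jcode'      # stdlib: adds jlength, each_char, ...

  s = "日本語"         # 3 characters, 9 bytes in UTF-8

  s.length             # => 9 (bytes - what naive libraries see)
  s.jlength            # => 3 (characters, via jcode)
  s.scan(/./).length   # => 3 ($KCODE makes /./ match whole characters)
  s[0]                 # => 230 (the first *byte*, not the first character)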
on 2007-06-01 13:28
Richard Conroy wrote:
> It depends on your definition of 'convenient'.

IMHO convenient is as in C#. There I don't have to bother with how
strings are stored in memory; they just work and are international.

> Well yes, but I would use UTF-8 instead.

Won't there be a problem if the data is stored in UTF-16 (as far as I
know Oracle's NVARCHAR uses 16 bits per symbol)?

> Also, Ruby 1.9 is going to break/deprecate stuff - I wouldn't develop
> against it; migrating from 1.8 -> 1.9 is going to be tricky

So why should anyone develop a new project against 1.8 if it is going
to be deprecated?

> If you are moving to Ruby 1.9, do it with a complete working
> application.

But isn't it going to be tricky, as you've said? I don't have to be
moving for now, as I have no line of Ruby code (I have only an idea in
my head) for today. And no Ruby experience (I am a C++, C#, Java and
T-SQL developer). I've chosen Ruby as it seems almost good and free.

Have I understood you correctly - you think I should make it Ruby 1.8
and then do a tricky move when it comes?

> Or better still, develop against Rails versions, not Ruby versions.

This advice can prove useful. I'll think about it.
on 2007-06-01 16:25

On 6/1/07, Ivan Mashchenko <ivan.mashchenko@gmail.com> wrote:
> Richard Conroy wrote:
>
> > It depends on your definition of 'convenient'.
>
> IMHO convenient is as in C#. There I don't have to bother with how
> strings are stored in memory; they just work and are international.

It's not *that* convenient. By default Ruby strings are 8-bit. You can
make them behave as Unicode strings fairly easily through a library
($KCODE, IIRC), in a way that you don't have to think about, and you
don't have to use a different string type. The problem occurs when you
use code that you didn't write that expects strings to be single-byte.
So every time you evaluate a Ruby library, Rails plugin or gem, you
have to do more homework than you would in the Unicode-centric Java or
C#.

> > Well yes, but I would use UTF-8 instead.
>
> Won't there be a problem if the data is stored in UTF-16 (as far as I
> know Oracle's NVARCHAR uses 16 bits per symbol)?

Every database worth using lets you specify the encoding of your string
and character types. Check your manuals or the Oracle forums. Anything
that is in any way associated with web development supports UTF-8.

> > Also, Ruby 1.9 is going to break/deprecate stuff - I wouldn't
> > develop against it; migrating from 1.8 -> 1.9 is going to be tricky
>
> So why should anyone develop a new project against 1.8 if it is going
> to be deprecated?

Okay, you misunderstood me. There is a feature roadmap towards Ruby
2.0, where major changes are coming in; the two biggest that I recall
are Unicode support and native/pre-emptive threads. The only reasonable
way to implement them is by altering the behaviour of core classes and
the standard library. This means that Ruby code of any sophistication
written for Ruby 1.8, including many libraries, is likely to break.

Ruby 1.8 is not going away. Ruby is an open language, with a public
source repository, unlike .NET, say, where Microsoft distributes the
runtime in binary-only form and can make older versions difficult to
get. You have no obligation to migrate to the most recent version, and
there is no technical reason that multiple runtimes
(application-specific) cannot co-exist on the same machine. Chasing the
latest release is really something that you only do with commercial
languages; it's not something that is generally done with open
languages.

> > If you are moving to Ruby 1.9, do it with a complete working
> > application.
>
> But isn't it going to be tricky, as you've said?

It would be one hell of a lot easier than developing against a moving
target, not knowing if the issues in your code are your issues or due
to the latest release candidate. Bleeding-edge software development is
for people who can spare a lot of blood loss.

> I don't have to be moving for now, as I have no line of Ruby code (I
> have only an idea in my head) for today. And no Ruby experience (I am
> a C++, C#, Java and T-SQL developer). I've chosen Ruby as it seems
> almost good and free.

Yeah, it's a great language. Make a point of checking out the JRuby
project. It's an exceptionally well developed Ruby runtime; it is
considerably more than an interpreter or language bridge - the JRuby
guys have basically doubled the size of the Java platform (or the Ruby
platform, depending on your POV). Ruby is strong where Java is weak,
and vice versa.

> Have I understood you correctly - you think I should make it Ruby 1.8
> and then do a tricky move when it comes?
Use Rails, where the most compelling features in Ruby 1.9/2.0 are
already present: Unicode, native concurrency (via processes) and good
performance (via all those caching mechanisms). When the Rails guys go
to Ruby 1.9, you can too.

> > Or better still, develop against Rails versions, not Ruby versions.
>
> This advice can prove useful. I'll think about it.

regards,
Richard.
on 2007-06-02 01:00

On Jun 1, 2007, at 9:23 AM, Richard Conroy wrote:
> [...]
> regards,
> Richard.

Objective-C (through the Cocoa framework) also handles Unicode
superbly. The problem is that it is not cross-platform - it is in fact
strictly OS X stuff - but you could indeed use those libraries
(NSString, etc.) through RubyCocoa, though of course that is far from
convenient or optimal for most purposes.

Ideally, if the major OS vendors got behind Ruby in full force and put
their Unicode know-how into the codebase, things would be smoother.
They're the ones who have really already figured out pretty good ways
to handle that stuff, and all the major scripting languages could
benefit from it.