Unicode roadmap?

rhaus · June 19, 2006, 1:11am

On 19-jun-2006, at 1:00, Yukihiro M. wrote:

1.9 Oniguruma regexp engine should handle these, otherwise it’s a bug.

I’ll try to check. Oniguruma on 1.8.4 didn’t cope, but maybe it just
wasn’t hooked in properly.

rhaus · June 19, 2006, 1:02am

Hi,

In message “Re: Unicode roadmap?”
on Mon, 19 Jun 2006 00:29:46 +0900, Julian ‘Julik’ Tarkhanov
[email protected] writes:

|In other words, tell me, can Ruby’s regexes cope with the following:
|
|/[а-я]/
|/[а-я]/i

1.9 Oniguruma regexp engine should handle these, otherwise it’s a bug.

						matz.

rhaus · June 19, 2006, 3:34am

On 19-jun-2006, at 1:56, Yukihiro M. wrote:

a bug.
|
|I’ll try to check. Oniguruma on 1.8.4 didn’t cope, but maybe it just
|wasn’t hooked in properly.

If you have any problem, send us a report with what you expect and
what you get.

Well, I tried on the CVS latest (1.9) and I get:

irb(main):011:0> "НЕБлагодарНая" =~ /[а-я]/i
=> 6 (should be zero)

That is - character classes work, casefolding doesn’t.

rhaus · June 19, 2006, 6:06am

Hi,

In message “Re: Unicode roadmap?”
on Mon, 19 Jun 2006 10:32:08 +0900, Julian ‘Julik’ Tarkhanov
[email protected] writes:

|Well, I tried on the CVS latest (1.9) and I get:
|
|irb(main):011:0> "НЕБлагодарНая" =~ /[а-я]/i
|=> 6 (should be zero)
|
|That is - character classes work, casefolding doesn’t.

I found out that Oniguruma casefolding works only for characters
within ISO-8859-*. Considering the size of the casefolding table, it is
a compromise for the time being. I will fix this in the future.

						matz.
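
For anyone re-running this report today, here is a minimal sketch, assuming a current Ruby where casefolding covers Cyrillic in UTF-8 strings. The test word and the character class are the ones from the report, written with \u escapes so the snippet does not depend on the encoding of the source file. The variable names are only illustrative.

```ruby
# The word from the report: three upper-case letters, then lower-case ones.
text = "\u041D\u0415\u0411\u043B\u0430\u0433\u043E\u0434\u0430\u0440\u041D\u0430\u044F"

lowercase_only   = /[\u0430-\u044F]/   # Cyrillic a..ya, case-sensitive
case_insensitive = /[\u0430-\u044F]/i  # same class with casefolding

sensitive_index   = text =~ lowercase_only    # character index of the first lower-case letter
byte_offset       = $~.pre_match.bytesize     # the same position counted in bytes
insensitive_index = text =~ case_insensitive  # 0 when casefolding works
```

The byte offset of the case-sensitive match is 6, which is the number the report got from the /i pattern: the engine was matching as if the flag were absent.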

rhaus · June 19, 2006, 1:57am

Hi,

In message “Re: Unicode roadmap?”
on Mon, 19 Jun 2006 08:09:29 +0900, Julian ‘Julik’ Tarkhanov
[email protected] writes:

|> |/[а-я]/
|> |/[а-я]/i
|>
|> 1.9 Oniguruma regexp engine should handle these, otherwise it’s a bug.
|
|I’ll try to check. Oniguruma on 1.8.4 didn’t cope, but maybe it just
|wasn’t hooked in properly.

If you have any problem, send us a report with what you expect and
what you get.

						matz.

rhaus · June 19, 2006, 7:24am

On 19-jun-2006, at 6:05, Yukihiro M. wrote:

|
|That is - character classes work, casefolding doesn’t.

I found out that Oniguruma casefolding works only for characters
within ISO-8859-*. Considering the size of the casefolding table, it is
a compromise for the time being. I will fix this in the future.

Thanks for the clarification

rhaus · June 19, 2006, 7:58am

Correct me if I’m wrong, but the summary of Matz’s plan for M17N is:

- String will remain the same internally: char *ptr, long len (in bytes)
- String instances will have an encoding tag
- All String/Regexp methods will respect that encoding tag and return
  char (glyph) indexes
- Methods like byte_size, codepoints, each_char, each_codepoint will be
  introduced(?)
- slice will always accept char indices and return substrings

I’d say that WOULD BE GOOD, and with methods like
String#enforce_encoding!(encoding) and String#coerce_encoding!(otherstring)
it won’t require developers (of C extensions as well) to look at the
encoding tag, just set it when needed.
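
A minimal sketch of what such an interface looks like in Ruby as later released (1.9 and newer), which may differ from the prototype discussed here. The released method names are bytesize and force_encoding, not the byte_size and enforce_encoding! suggested above.

```ruby
word = "\u00E9t\u00E9"             # three characters, two of them accented

tag        = word.encoding         # the per-string encoding tag
char_count = word.length           # counts characters
byte_count = word.bytesize         # counts bytes in the underlying buffer
codepoints = word.codepoints       # Unicode code points as integers
chars      = word.each_char.to_a   # one-character strings
first_two  = word[0, 2]            # slicing takes character indices

# Retagging changes how the bytes are interpreted, not the bytes themselves.
retagged = word.dup.force_encoding(Encoding::ISO_8859_1)
```

After retagging as ISO-8859-1, the same five bytes count as five characters.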

But I can see several implementation issues and possible options that
should be considered:

- what will happen if one tries to perform str1.operation(str2) on two
  strings with different encodings:
  a) raise exception
  b) silently coerce one or both strings to some “compatible”
     charset/encoding, update encoding of result, replacing
     non-convertible chars using fallback mappings? (ouch, this can be
     split into a set of options)
  c) same as b) but raise exception if lossless conversion is not
     possible?
  d) same as b) but warn if lossless conversion is not possible?
  e) downgrade encoding tag of acceptor to “raw/bytes” and process it?
- what will happen if one changes the encoding tag of a String instance:
  a) check and raise exception if current bytes don’t represent a valid
     encoding sequence?
  b) just set new tag?
  c) convert byte sequence to given encoding, using fallback mappings?
- what to do with IO:
  a) IO will return strings in “raw/bytes”?
  b) IO can be tagged and will return Strings with given encoding tag?
  c) IO can be tagged and is by default tagged with global encoding tag?
  d) IO can be tagged, but is not tagged by default, although methods
     returning strings (such as read, readlines) will use global
     encoding tag?
  e) if IO is tagged and one tries to write to it a String with a
     different encoding, what will happen?
- what will be the default encoding tag for new Strings:
  a) “raw/bytes”
  b) derived from system properties of host platform
  c) option b) and can be overridden in application (btw, $KCODE, as
     present, must definitely go away!!!)
- how to process source code files:
  a) restrict them to ASCII and require all non-ASCII strings to be
     externalized?
  b) process them as “raw/bytes”?
  c) introduce some kind of commented pragma for source files allowing
     to set encoding?
- at present the Ruby parser can parse only sources in ASCII-compatible
  encodings. Would that change?
- what encoding will Numeric.to_s, Time.to_s etc. return, and what
  encoding must a String have/conform to for String#to_f, String#to_i?

On Unicode:

- case-independent canonical string matches/searches DO MATTER. And even
  for encodings that encode variants of glyphs with different
  codepoints, a “variant-insensitive” search is, to my mind, desirable.
  Will there be such functionality?
- string comparison: will <=> use at least UCA rules for Unicode
  strings, or will only byte-order comparison remain?
- is_digit, is_space, is_alpha, is_foobarbaz etc. could matter when
  writing a custom parser. Will those methods be provided for one-char
  strings?

Yes, this is a short and incomplete list, but you should get my point:
it’s not that easy; there are dozens of decisions, each with pros and
cons, to be made and implemented.

rhaus · June 19, 2006, 1:17pm

Tim B. [email protected] writes:

without knowing what the binary is? If it’s text in a known
encoding, no breakage should occur. If it’s unknown bit patterns,
you can’t really expect anything sensible to happen… or am I
missing an obvious scenario? -Tim

Those were just fictive method calls. But let’s say I read from
a pipe and I know it contains UTF-16 with BOM, then .to_unicode
would make perfect sense, no?

In the case of binary bit patterns, I would sooner or later expect some
kind of EncodingError, given this API. (I haven’t yet seen drafts of
what the API will really look like.)
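
A minimal sketch of the UTF-16-with-BOM case, assuming a current Ruby. The helper name utf16_with_bom_to_utf8 is hypothetical (the .to_unicode above was a fictive call), and the sketch only looks at the two-byte UTF-16 marks, so it makes no attempt to tell UTF-16LE from UTF-32LE.

```ruby
# Hypothetical helper: decode UTF-16 bytes that start with a byte order mark.
def utf16_with_bom_to_utf8(bytes)
  data = bytes.b   # binary copy, nothing is assumed about the content yet
  encoding =
    if data.start_with?("\xFF\xFE".b)
      Encoding::UTF_16LE
    elsif data.start_with?("\xFE\xFF".b)
      Encoding::UTF_16BE
    else
      raise ArgumentError, "no UTF-16 byte order mark found"
    end
  # Drop the two BOM bytes, tag the rest, then transcode to UTF-8.
  data.byteslice(2..-1).force_encoding(encoding).encode(Encoding::UTF_8)
end

little = utf16_with_bom_to_utf8("\xFF\xFEh\x00i\x00")
big    = utf16_with_bom_to_utf8("\xFE\xFF\x00h\x00i")
```

Both calls yield the same UTF-8 string. Input that carries a mark but is not well-formed UTF-16 would be expected to fail inside encode with a subclass of EncodingError, which matches the expectation stated above.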

rhaus · June 19, 2006, 2:40pm

On 6/19/06, Yukihiro M. [email protected] wrote:

|- what will happen if one tries to perform str1.operation(str2) on two
compatible. This point is arguable.
What is “ascii”? Specifically, I would like string operations to succeed
in cases when both strings are encoded as different subsets of Unicode
(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
string should result in a UTF-* string, not an error.

However, this would make the errors from incompatible encodings more
surprising as they would be very infrequent.

I wonder what operations on raw strings (ones without specified
encoding) would do. Or where one of the strings is raw, and the other
is not.

c), the global default shall be set from locale setting.

I am not sure this is good for network IO as well. For diagnostics it
might be useful to set the default to none, and have string raise an
exception when such strings are combined with other strings.

It is only obvious for STDIN and STDOUT that they should follow the
locale setting.

Hmm, but one would need to consider carefully which operations should
work on raw strings and which should not. Perhaps it is not as nice as
it looks at first glance.

Thanks

Michal

rhaus · June 19, 2006, 9:57am

Hi,

In message “Re: Unicode roadmap?”
on Mon, 19 Jun 2006 14:57:22 +0900, “Dmitry S.”
[email protected] writes:

|But I can see several implementation issues and possible options that
|should be considered:

Thank you for the ideas.

|- what will happen if one tries to perform str1.operation(str2) on two
|strings with different encodings:
| a) raise exception
| b) silently coerce one or both strings to some “compatible”
|charset/encoding, update encoding of result, replacing non-convertible chars
|using fallback mappings? (ouch, this can be split into a set of options)
| c) same as b) but raise exception if lossless conversion is not possible?
| d) same as b) but warn if lossless conversion is not possible?
| e) downgrade encoding tag of acceptor to “raw/bytes” and process it?

a), unless either of strings is “ascii” and the other is “ascii”
compatible. This point is arguable.
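
A minimal sketch of this rule as it behaves in Ruby as later released (1.9 and newer); the prototype under discussion may have differed. The variable names are only illustrative.

```ruby
latin1_accent = "\u00E9".encode(Encoding::ISO_8859_1)  # non-ASCII, Latin-1
utf8_accent   = "\u00E9"                               # non-ASCII, UTF-8
latin1_ascii  = "abc".encode(Encoding::ISO_8859_1)     # ASCII-only, Latin-1

# ASCII-only content mixes freely with another ASCII-compatible encoding.
mixed = latin1_ascii + utf8_accent

# Two non-ASCII strings in different encodings do not mix.
error =
  begin
    latin1_accent + utf8_accent
    nil
  rescue Encoding::CompatibilityError => e
    e
  end

# Encoding.compatible? answers the same question without raising.
common = Encoding.compatible?(latin1_accent, utf8_accent)
```

Here mixed ends up tagged as UTF-8, error holds the exception, and common is nil.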

|- what will happen if one changes encoding tag for String instance:
| a) check and raise exception if current bytes don’t represent valid
|encoding sequence?
| b) just set new tag?
| c) convert byte sequence to given encoding, using fallback mappings?

b), encoding conformance check shall be done lazily. I think there’s a
need for an explicit encoding conformance check method.
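
A minimal sketch of option b) with an explicit check, using the methods Ruby later shipped with (force_encoding and valid_encoding?, 1.9 and newer). Whether the prototype used these names is not stated in this thread.

```ruby
raw = "caf\xE9".b   # ISO-8859-1 bytes held as a binary string

# Setting the tag performs no validation.
as_utf8   = raw.dup.force_encoding(Encoding::UTF_8)
as_latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)

# The conformance check is a separate, explicit call.
utf8_ok   = as_utf8.valid_encoding?
latin1_ok = as_latin1.valid_encoding?
```

The lone 0xE9 byte is not a complete UTF-8 sequence, so utf8_ok is false while latin1_ok is true.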

|- what to do with IO:
| a) IO will return strings in “raw/bytes”?
| b) IO can be tagged and will return Strings with given encoding tag?
| c) IO can be tagged and is by default tagged with global encoding tag?
| d) IO can be tagged, but is not tagged by default, although methods
|returning strings (such as read, readlines) will use global encoding tag?
| e) if IO is tagged and one tries to write to it a String with different
|encoding, what will happen?

c), the global default shall be set from locale setting.
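
A minimal sketch of tagged IO in a current Ruby, where the encoding can be given in the open mode and the global default is exposed as Encoding.default_external. The file content and names are made up for illustration.

```ruby
require "tempfile"

raw, tagged, converted = Tempfile.create("encoding-demo") do |file|
  file.binmode
  file.write("caf\xE9".b)   # four bytes, the last one is e-acute in ISO-8859-1
  file.flush

  [
    File.open(file.path, "rb") { |io| io.read },                  # untagged bytes
    File.open(file.path, "r:ISO-8859-1") { |io| io.read },        # tagged on read
    File.open(file.path, "r:ISO-8859-1:UTF-8") { |io| io.read }   # tagged, then transcoded
  ]
end

locale_default = Encoding.default_external   # normally derived from the locale
```

The first read returns a binary string, the second the same bytes tagged as ISO-8859-1, and the third a five-byte UTF-8 string.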

|- what will be default encoding tag for new Strings:
| a) “raw/bytes”
| b) derived from system properties of host platform
| c) option b) and can be overridden in application (btw, $KCODE, as present,
|must definitely go away!!!)

Encoding for literal strings is set by pragma.

1.9 already has encoding pragma a la Python PEP263.
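
A minimal sketch of the pragma in a current Ruby, assuming the comment sits on the first line of the file. The demo writes a tiny script to a temporary file and loads it; the module name PragmaDemo is hypothetical.

```ruby
require "tempfile"

# A script whose first line is the encoding pragma (magic comment).
script = <<~RUBY
  # -*- coding: iso-8859-1 -*-
  module PragmaDemo
    SOURCE_ENCODING  = __ENCODING__
    LITERAL_ENCODING = "abc".encoding
  end
RUBY

Tempfile.create(["pragma-demo", ".rb"]) do |file|
  file.write(script)
  file.flush
  load file.path   # the pragma applies to this file only
end
```

After loading, both constants report ISO-8859-1: the pragma sets the source encoding, and plain string literals inherit it.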

|- at present time Ruby parser can parse only sources in ASCII compatible
|encoding. Would it change?

No. Ruby would not allow scripts in EBCDIC, nor UTF-16, although it
allows processing of those encodings.

|- what encodings will have Numeric.to_s, Time.to_s etc., or String has to
|have/conform for String#to_f, String#to_i?

Good point. Currently, I think they should work on ASCII.
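
A minimal sketch of how this turned out in a current Ruby, as far as the numeric conversions go: the text produced by to_s is tagged US-ASCII, and parsing accepts any ASCII-compatible string.

```ruby
integer_text = 42.to_s
float_text   = 3.5.to_s

integer_encoding = integer_text.encoding
float_encoding   = float_text.encoding

# Parsing goes the other way and works on an ASCII-compatible string.
parsed = "42".encode(Encoding::ISO_8859_1).to_i
```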

|On Unicode:
|- case-independent canonical string matches/searches DO MATTER. And even for
|encodings, that code variants of glyphs with different codepoints
|“variant-insensitive” search, as for me, is desired. Will there be such
|functionality?

Casefold search/match will be provided for Regexp. “variant
insensitive” search should be accomplished by explicit normalization
or collation.

|- string comparison: will <=> use at least UCA rules for Unicode strings, or
|only byte-order comparisons will stay?

Byte order comparison. UCA rules or such should be done explicitly
via normalization or collation.
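
A minimal sketch of "explicit normalization" next to byte-order comparison, assuming a current Ruby (String#unicode_normalize arrived in 2.2, well after this thread). Full UCA collation is not in the core library and is not shown.

```ruby
composed   = "\u00E9"    # e-acute as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

equal_as_bytes   = (composed == decomposed)
equal_normalized = (composed.unicode_normalize(:nfc) ==
                    decomposed.unicode_normalize(:nfc))

# <=> and sort compare bytes: upper case sorts before lower case, and
# accented letters sort after all of ASCII.
sorted = ["a", "Z", "\u00E9", "z"].sort
```

The two spellings compare unequal until both are normalized, and the sort order is Z, a, z, then the accented letter.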

|- is_digit, is_space, is_alpha, is_foobarbaz etc. could matter, when writing
|a custom parser. Will those methods be provided for one-char strings?

Those functions will be provided via Regexp. I am not sure if we will
provide character classification methods for strings.
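
A minimal sketch of classification via Regexp, assuming a current Ruby with Unicode property classes. The three lambdas are hypothetical stand-ins for is_alpha, is_digit and is_space.

```ruby
# Hypothetical one-character predicates built on Unicode properties.
letter = ->(ch) { ch.match?(/\A\p{L}\z/) }      # any letter
digit  = ->(ch) { ch.match?(/\A\p{Nd}\z/) }     # any decimal digit
space  = ->(ch) { ch.match?(/\A\p{Space}\z/) }  # whitespace

cyrillic_is_letter = letter.call("\u0436")   # Cyrillic zhe
arabic_is_digit    = digit.call("\u0663")    # Arabic-Indic digit three
latin_is_digit     = digit.call("x")
tab_is_space       = space.call("\t")
```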

						matz.

rhaus · June 19, 2006, 3:02pm

Hi,

In message “Re: Unicode roadmap?”
on Mon, 19 Jun 2006 21:39:33 +0900, “Michal S.”
[email protected] writes:

|> a), unless either of strings is “ascii” and the other is “ascii”
|> compatible. This point is arguable.
|
|What is “ascii”? Specifically, I would like string operations to succeed
|in cases when both strings are encoded as different subsets of Unicode
|(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
|string should result in a UTF-* string, not an error.

Every encoding has an attribute named ascii_compat. EUC_JP, SJIS,
ISO-8859-* and UTF-8 are declared ascii compatible, whereas EBCDIC,
UTF-16 and UTF-32 are not. No other auto conversion shall be done,
since we don’t particularly encourage a mixed encoding model.
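
A minimal sketch of this attribute as exposed in Ruby as later released, where it is the method Encoding#ascii_compatible?. Only encodings named above are listed; EBCDIC is left out because this sketch makes no assumption about how it is represented.

```ruby
compatibility = {
  "UTF-8"      => Encoding::UTF_8.ascii_compatible?,
  "EUC-JP"     => Encoding::EUC_JP.ascii_compatible?,
  "Shift_JIS"  => Encoding::Shift_JIS.ascii_compatible?,
  "ISO-8859-1" => Encoding::ISO_8859_1.ascii_compatible?,
  "UTF-16LE"   => Encoding::UTF_16LE.ascii_compatible?,
  "UTF-32BE"   => Encoding::UTF_32BE.ascii_compatible?
}

ascii_compatible_names = compatibility.select { |_, ok| ok }.keys
```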

Restricting the default encoding from locale to STDIO may be a good idea.
There are still open issues, since the default encoding from locale is
not covered by the prototype, so we need more experience.

						matz.

rhaus · June 19, 2006, 3:28pm

On 6/19/06, Yukihiro M. [email protected] wrote:

|(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
|string should result in a UTF-* string, not an error.

Every encoding has an attribute named ascii_compat. EUC_JP, SJIS,
ISO-8859-* and UTF-8 are declared ascii compatible, where EBCDIC,
UTF-16 and UTF-32 are not. No other auto conversion shall be done,
since we don’t particularly encourage mixed encoding model.

I wonder. Why can’t Strings throughout Ruby always be represented
as Unicode, and why not let ICU handle the conversion between various
encodings for incoming and outgoing data?
(What is Java? | IBM). I know there is a
long-standing issue with Unicode’s Han unification process, but without
proper Unicode support Ruby is destined to be a toy for
English-speaking and Japanese communities only. (And as I’m gearing up
to prepare a web-site in Russian, Turkish and English, I feel that
using Ruby could prove to be a major pain in the nether regions of my
body)

rhaus · June 19, 2006, 3:34pm

On 6/19/06, Dmitrii D. [email protected] wrote:

I wonder. Why can’t Strings throughout Ruby always be represented
as Unicode, and why not let ICU handle the conversion between various
encodings for incoming and outgoing data?
(What is Java? | IBM). I know there is a
long-standing issue with Unicode’s Han unification process, but without
proper Unicode support Ruby is destined to be a toy for
English-speaking and Japanese communities only. (And as I’m gearing up
to prepare a web-site in Russian, Turkish and English, I feel that
using Ruby could prove to be a major pain in the nether regions of my
body)

This entire discussion is centered around a proposal to do exactly
that. There are many very good reasons to avoid doing this. Unicode
Is Not Always The Answer.

It’s usually the answer, but there are times when it’s just easier
to work with data in an established code page.

-austin

rhaus · June 19, 2006, 4:35pm

On 6/19/06, Dmitrii D. [email protected] wrote:

Because otherwise we are at risk of ending up with incompatible
extensions to strings that “simplify” a developer’s life (and the
trend has already begun). I wouldn’t want a C/C++ scenario with string
class upon string class upon extension upon extension that aim to do
something String should do from the start.

I think that’s more likely with (a) what we have now and (b) a
Unicode-internal approach. (Indeed, a Unicode-internal approach
requires separating a byte vector from String, which doubles
interface complexity.) I would suggest that you look through the whole
discussion and pay particular attention to Matz’s statements.

-austin

rhaus · June 19, 2006, 3:47pm

On 6/19/06, Austin Z. [email protected] wrote:

body )

This entire discussion is centered around a proposal to do exactly
that. There are many very good reasons to avoid doing this. Unicode
Is Not Always The Answer.

It’s usually the answer, but there are times when it’s just easier
to work with data in an established code page.

I totally agree with that. IMO, the point lies exactly in this
“usually an answer”. When was the last time 90% of developers had to
wonder what encoding their data was in? And with the advent of
Unicode (and storage becoming cheaper and cheaper and developers
becoming lazier and lazier) more and more of that data is
going to be Unicode.

So, since Unicode is usually the answer, make it as painless as
possible. Make all String methods and any other functions that work
with strings accept Unicode straight out of the box without any
worries on the developer’s part. And provide alternatives (or optional
parameters?) that would allow the few more encoding-aware gurus to do
whatever they want with encodings.

Because otherwise we are at risk of ending up with incompatible
extensions to strings that “simplify” a developer’s life (and the
trend has already begun). I wouldn’t want a C/C++ scenario with string
class upon string class upon extension upon extension that aim to do
something String should do from the start.

All is IMHO, of course

rhaus · June 19, 2006, 6:09pm

On Jun 19, 2006, at 4:16 AM, Christian N. wrote:

without knowing what the binary is? If it’s text in a known
encoding, no breakage should occur. If it’s unknown bit patterns,
you can’t really expect anything sensible to happen… or am I
missing an obvious scenario? -Tim

Those were just fictive method calls. But let’s say I read from
a pipe and I know it contains UTF-16 with BOM, then .to_unicode
would make perfect sense, no?

Yep. And yes, calling to_unicode on it might in fact change the bit
patterns if you adopted Early Uniform Normalization (which would be a
good thing to do). -Tim

rhaus · June 19, 2006, 7:22pm

On 6/19/06, Yukihiro M. [email protected] wrote:

|(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
|string should result in a UTF-* string, not an error.

Every encoding has an attribute named ascii_compat. EUC_JP, SJIS,
ISO-8859-* and UTF-8 are declared ascii compatible, where EBCDIC,
UTF-16 and UTF-32 are not. No other auto conversion shall be done,
since we don’t particularly encourage mixed encoding model.

Reading what you said, it appears it would only be possible to add
ascii strings to ascii-compatible strings. That does not sound very
useful.
If the intended meaning was rather that operations on two
ascii-compatible strings should always be possible, and that the result
is again ascii-compatible, that would sound better.

But it makes these “ascii” encodings a special case. In particular, it
makes UTF-32 less convenient to use.
I guess that for a calculation so complex that it would really benefit
from the fast random access of UTF-32, it is reasonable to create a
wrapper that converts the arguments and results. However, if one wants
to perform several such (different) consecutive calculations, there are
going to be several useless conversions. It is certainly possible to
make the input interface clever enough to get it right for both UTF-32
and ascii strings, but requiring the user to do the conversion on
results does not look nice.

The compatibility could also be just a general value that specifies the
encoding family.

i.e. " ".compatibility => :ascii

ASCII = "".encode(:utf8).compatibility

raise "Incompatible encoding #{str.encoding}" unless str.compatibility == ASCII

But different families could be possible. I am not sure if any other
encoding families of any significance exist, though.

Thanks

Michal

rhaus · June 19, 2006, 8:35pm

On 6/19/06, Tim B. [email protected] wrote:

are “many good” reasons to avoid this, but probably that’s just
because I’ve been fortunate enough to not encounter the problem
scenarios. This material would have application in a far larger
domain than just Ruby, obviously. -Tim

I’ve found that a Unicode-based string class gets in the way when it
forces you to work around it. For most text-processing purposes, it
isn’t an issue. But when you’ve got text whose original encoding you
don’t know (and you’re probably working in a different code page),
a Unicode-based string class usually guesses wrong.

Transparent Unicode conversion only works when it is guaranteed that the
starting code page and the ending code page are identical. It’s
definitely a legacy data issue, and doesn’t affect most people, but it
has affected me in dealing with (in a non-Ruby context) NetWare.
Additionally, the overhead of converting to Unicode if your entire data
set is in ISO-8859-1 is unnecessary; again, this is a specialized case.

More problematic, from the Ruby perspective, is that a Unicode-based
string class would require that there be a wholly separate byte vector
class; I am not sure that is necessary or wise. The first time I read a
JPG into a String, I was delighted – the interface presented was so
clean and nice as opposed to having to muck around in languages that
force multiple interfaces because of such a presentation.
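
A minimal sketch of a String used as a byte vector in a current Ruby, where such strings carry a binary (ASCII-8BIT) tag. The four bytes are the start of a typical JPEG/JFIF file and stand in for real file content, which one would read with something like File.binread.

```ruby
# First four bytes of a typical JPEG/JFIF file, held in an ordinary String.
header = "\xFF\xD8\xFF\xE0".b

tag        = header.encoding                   # binary strings are tagged ASCII-8BIT
first_byte = header.getbyte(0)                 # byte-level access, no decoding involved
looks_jpeg = header.start_with?("\xFF\xD8".b)  # the usual String methods still apply
```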

Like I said, I’m not anti-Unicode, and I want Ruby’s Unicode support to
be the best, bar none. I’m not willing to compromise on API or
flexibility to gain that, though.

-austin

rhaus · June 20, 2006, 1:40am

Hi,

In message “Re: Unicode roadmap?”
on Tue, 20 Jun 2006 02:20:10 +0900, “Michal S.”
[email protected] writes:

|Reading what you said, it appears it would only be possible to add
|ascii strings to ascii-compatible strings. That does not sound very
|useful.

You will have all your strings in the encoding you choose as an
internal encoding in the usual case, so you will have few
compatibility problems. Only if you want to handle multiple encodings
at a time do you need explicit code conversion for mixed-encoding
operations.

|I guess that for a calculation so complex that it would really benefit
|from the fast random access of UTF-32, it is reasonable to create a
|wrapper that converts the arguments and results. However, if one wants
|to perform several such (different) consecutive calculations, there are
|going to be several useless conversions.

I am not sure what you mean. I feel that my plan does not have
anything against UTF-32 in this regard. Perhaps I am missing
something. What is going to cause useless conversions?

						matz.

rhaus · June 20, 2006, 2:13pm

On 6/20/06, Yukihiro M. [email protected] wrote:

internal encoding in the usual case, so that you will have a few
compatibility problem. Only if you want to handle multiple encodings
at a time, you need explicit code conversion for mix encoding
operations.

If I read pieces of text from web pages, they can be in different
encodings. I do not see any reason why such pieces of text could not
be automatically concatenated as long as they are all subsets of
Unicode.

It was the complaint of one of the people here that in Python strings
with different encodings exist but the operations on them fail. And it
makes the life of anybody working with such strings unnecessarily
hard. They have to be converted explicitly.
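
A minimal sketch of what "converted explicitly" looks like in a current Ruby, assuming each fragment’s encoding is already known. The two strings are made-up stand-ins for text fetched from pages in different encodings.

```ruby
# Stand-ins for fragments fetched from pages in different encodings.
from_latin1 = "caf\u00E9 ".encode(Encoding::ISO_8859_1)   # contains e-acute
from_latin2 = "\u0159eka".encode(Encoding::ISO_8859_2)    # contains r-caron

# Direct concatenation of the two tagged strings is refused.
direct_failed =
  begin
    from_latin1 + from_latin2
    false
  rescue Encoding::CompatibilityError
    true
  end

# Converting each piece to a common encoding first makes the join safe.
combined = [from_latin1, from_latin2].map { |s| s.encode(Encoding::UTF_8) }.join
```

The explicit step is one map over the pieces, and it forces the program to state which target encoding it wants.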

|I guess that for a calculation so complex that it would really benefit
|from the fast random access of UTF-32, it is reasonable to create a
|wrapper that converts the arguments and results. However, if one wants
|to perform several such (different) consecutive calculations, there are
|going to be several useless conversions.

I am not sure what you mean. I feel like that my plan does not have
anything against UTF-32 in this regard. Perhaps, I am missing
something. What is going to cause useless conversions?

If automatic conversions aren’t implemented at all, utf-32 does not
really stand out in this regard.

Thanks

Michal