Unicode roadmap?

On 19-jun-2006, at 1:00, Yukihiro M. wrote:

1.9 Oniguruma regexp engine should handle these, otherwise it’s a bug.

I’ll try to check. Oniguruma on 1.8.4 didn’t cope, but maybe it just
wasn’t hooked in properly.

Hi,

In message “Re: Unicode roadmap?”
on Mon, 19 Jun 2006 00:29:46 +0900, Julian ‘Julik’ Tarkhanov
writes:

|In other words, tell me, can Ruby’s regexes cope with the following:
|
|/[а-я]/
|/[а-я]/i

1.9 Oniguruma regexp engine should handle these, otherwise it’s a bug.

						matz.

On 19-jun-2006, at 1:56, Yukihiro M. wrote:

a bug.
|
|I’ll try to check. Oniguruma on 1.8.4. didn’t cope, but maybe it just
|weren’t hooked in properly.

If you have any problem, send us a report with what you expect and
what you get.

Well, I tried on the CVS latest (1.9) and I get:

irb(main):011:0> "НЕБлагодарНая" =~ /[а-я]/i
=> 6 (should be zero)

That is - character classes work, casefolding doesn’t.

Hi,

In message “Re: Unicode roadmap?”
on Mon, 19 Jun 2006 10:32:08 +0900, Julian ‘Julik’ Tarkhanov
writes:

|Well, I tried on the CVS latest (1.9) and I get:
|
|irb(main):011:0> "НЕБлагодарНая" =~ /[а-я]/i
|=> 6 (should be zero)
|
|That is - character classes work, casefolding doesn’t.

I found out that Oniguruma casefolding works only for characters
within ISO-8859-*. Considering the size of the casefolding table, it is
a compromise for the time being. I will fix this in the future.

						matz.
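
For anyone re-checking the report, the test can be written as a short script. This is only a sketch: it assumes a current Ruby (1.9 or later) and a UTF-8 source file, not the 2006 CVS build discussed above. In such a Ruby, match positions are character indexes. The 6 in the report is consistent with a byte offset, since three two-byte Cyrillic letters precede the first lowercase one.

```ruby
# Assumes a modern Ruby with UTF-8 as the source encoding.
text = "НЕБлагодарНая"

# Case-sensitive: position of the first lowercase letter.
sensitive = (text =~ /[а-я]/)

# Case-insensitive: the uppercase letter at the start should match too.
insensitive = (text =~ /[а-я]/i)

# Byte offset of the first lowercase letter, for comparison with the report.
byte_offset = text[0, sensitive].bytesize

puts "case-sensitive index:   #{sensitive}"
puts "case-insensitive index: #{insensitive}"
puts "byte offset:            #{byte_offset}"
```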

Hi,

In message “Re: Unicode roadmap?”
on Mon, 19 Jun 2006 08:09:29 +0900, Julian ‘Julik’ Tarkhanov
writes:

|> |/[а-я]/
|> |/[а-я]/i
|>
|> 1.9 Oniguruma regexp engine should handle these, otherwise it’s a bug.
|
|I’ll try to check. Oniguruma on 1.8.4. didn’t cope, but maybe it just
|weren’t hooked in properly.

If you have any problem, send us a report with what you expect and
what you get.

						matz.

On 19-jun-2006, at 6:05, Yukihiro M. wrote:

|
|That is - character classes work, casefolding doesn’t.

I found out that Oniguruma casefolding works only for characters
within ISO-8859-*. Considering the size of the casefolding table, it is
a compromise for the time being. I will fix this in the future.

Thanks for the clarification :)

Correct me if I’m wrong, but the summary of Matz’s plan for M17N is:

  1. String internally will remain the same : char *ptr, long len - in
    bytes
  2. String instances will have encoding tag
  3. All String/Regexp methods will respect that encoding tag and return
    char(glyph) indexes
  4. Methods like byte_size, codepoints, each_char, each_codepoint will be
    introduced(?)
  5. slice will always accept chars indices and return substrings
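
The five points above can be tried out in a Ruby that ships with encoding-tagged strings (1.9 and later). A minimal sketch follows; note that the released method is called bytesize, not the byte_size guessed at in the list, and that the sample word is an arbitrary choice.

```ruby
s = "Gr\u00FC\u00DFe"        # "Grüße"; \u escapes make the literal UTF-8

puts s.encoding              # the encoding tag carried by the instance
puts s.bytesize              # storage length in bytes
puts s.length                # length in characters
puts s[2, 2]                 # slicing takes character indexes
p s.each_char.to_a           # iterate character by character
p s.codepoints               # numeric code points
```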

I’d say that WOULD BE GOOD, and with methods like
String#enforce_encoding!(encoding) and String#coerce_encoding!(otherstring)
it won’t require developers (of C extensions too) to look at the encoding
tag; they can just set it when needed.

But I can see several implementation issues and possible options that
should be considered:

  • what will happen if one tries to perform str1.operation(str2) on two
    strings with different encodings:
    a) raise exception
    b) silently coerce one or both strings to some “compatible”
    charset/encoding, update encoding of result, replacing non-convertible
    chars using fallback mappings? (ouch, this can be split into a set of
    options)
    c) same as b) but raise exception if lossless conversion is not
    possible?
    d) same as b) but warn if lossless conversion is not possible?
    e) downgrade encoding tag of acceptor to “raw/bytes” and process it?

  • what will happen if one changes encoding tag for String instance:
    a) check and raise exception if current bytes don’t represent valid
    encoding sequence?
    b) just set new tag?
    c) convert byte sequence to given encoding, using fallback mappings?

  • what to do with IO:
    a) IO will return strings in “raw/bytes”?
    b) IO can be tagged and will return Strings with given encoding tag?
    c) IO can be tagged and is by default tagged with global encoding tag?
    d) IO can be tagged, but is not tagged by default, although methods
    returning strings (such as read, readlines) will use global encoding
    tag?
    e) if IO is tagged and one tries to write to it a String with a
    different encoding, what will happen?

  • what will be default encoding tag for new Strings:
    a) “raw/bytes”
    b) derived from system properties of host platform
    c) option b), but can be overridden in the application (btw, $KCODE,
    as it is at present, must definitely go away!!!)

  • how to process source code files:
    a) restrict them to ASCII and require all non-ASCII strings to be
    externalized?
    b) process them as “raw/bytes”?
    c) introduce some kind of commented pragma for source files, allowing
    the encoding to be set?

  • at present the Ruby parser can parse only sources in an
    ASCII-compatible encoding. Would that change?

  • what encodings will Numeric.to_s, Time.to_s etc. produce, and what
    encoding must a String have for String#to_f and String#to_i?

On Unicode:

  • case-independent canonical string matches/searches DO MATTER. And even
    for encodings that encode variants of a glyph with different
    codepoints, a “variant-insensitive” search is, to my mind, desirable.
    Will there be such functionality?

  • string comparison: will <=> use at least UCA rules for Unicode
    strings, or will only byte-order comparison remain?

  • is_digit, is_space, is_alpha, is_foobarbaz etc. could matter when
    writing a custom parser. Will those methods be provided for one-char
    strings?

Yes, this is a short and incomplete list, but you should get my point:
it’s not that easy – there are dozens of decisions, each with pros and
cons, to be made and implemented :(

Tim B. writes:

without knowing what the binary is? If it’s text in a known
encoding, no breakage should occur. If it’s unknown bit patterns,
you can’t really expect anything sensible to happen… or am I
missing an obvious scenario? -Tim

Those were just fictive method calls. But let’s say I read from
a pipe and I know it contains UTF-16 with BOM, then .to_unicode
would make perfect sense, no?

In case of binary bit patterns, I sooner or later would expect some
kind of EncodingError, given this API. (I haven’t seen yet drafts of
how the API really will be.)
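
The pipe scenario can be sketched with the encoding API that Ruby later shipped. The helper name below is hypothetical (there is no to_unicode in core Ruby), and the BOM handling is done by hand here purely for illustration. Malformed input surfaces as a subclass of EncodingError, which is roughly the behaviour anticipated above.

```ruby
# Hypothetical helper, not part of Ruby: decode UTF-16 data that starts
# with a byte order mark and return it as UTF-8.
def utf16_with_bom_to_utf8(bytes)
  data = bytes.b                          # binary copy; caller's string is untouched
  encoding =
    if data.start_with?("\xFF\xFE".b)
      Encoding::UTF_16LE
    elsif data.start_with?("\xFE\xFF".b)
      Encoding::UTF_16BE
    else
      raise ArgumentError, "no UTF-16 byte order mark found"
    end
  # Drop the 2-byte BOM, tag the remainder, then transcode.
  data.byteslice(2..-1).force_encoding(encoding).encode(Encoding::UTF_8)
end

from_pipe = "\xFF\xFEh\x00i\x00".b        # stand-in for data read from a pipe
puts utf16_with_bom_to_utf8(from_pipe)

begin
  utf16_with_bom_to_utf8("\xFF\xFEh".b)   # odd number of bytes after the BOM
rescue EncodingError => e
  puts "rejected: #{e.class}"
end
```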

On 6/19/06, Yukihiro M. wrote:

|- what will happen if one tries to perfom str1.operation(str2) on two
compatible. This point is arguable.
What is “ascii”? Specifically, I would like string operations to succeed
in cases when both strings are encoded as different subsets of Unicode
(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
string should result in a UTF-* string, not an error.

However, this would make the errors from incompatible encodings more
surprising as they would be very infrequent.

I wonder what operations on raw strings (ones without specified
encoding) would do. Or where one of the strings is raw, and the other
is not.

c), the global default shall be set from locale setting.

I am not sure this is good for network IO as well. For diagnostics it
might be useful to set the default to none, and have string raise an
exception when such strings are combined with other strings.

It is only obvious for STDIN and STDOUT that they should follow the
locale setting.

hmm, but it would need to carefully consider which operations should
work on raw strings and which not. Perhaps it is not as nice as it
looks at the first glance.

Thanks

Michal

Hi,

In message “Re: Unicode roadmap?”
on Mon, 19 Jun 2006 14:57:22 +0900, “Dmitry S.”
writes:

|But I can see several implementation issues and possible options that
|should be considered:

Thank you for the ideas.

|- what will happen if one tries to perform str1.operation(str2) on two
|strings with different encodings:
| a) raise exception
| b) silently coerce one or both strings to some “compatible”
|charset/encoding, update encoding of result, replacing non-convertible chars
|using fallback mappings? (ouch, this can be split into a set of options)
| c) same as b) but raise exception if non-loss conversion is not possible?
| d) same as b) but warn if non-loss conversion is not possible?
| e) downgrade encoding tag of acceptor to “raw/bytes” and process it?

a), unless one of the strings is “ascii” and the other is “ascii”
compatible. This point is arguable.

|- what will happen if one changes encoding tag for String instance:
| a) check and raise exception if current bytes don’t represent valid
|encoding sequence?
| b) just set new tag?
| c) convert byte sequence to given encoding, using fallback mappings?

b), the encoding conformance check shall be done lazily. I think there’s
a need for an explicit encoding conformance check method.
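
In Ruby as later released, this answer corresponds to String#force_encoding (set the tag, touch no bytes) and String#valid_encoding? (the explicit check). A small sketch; the sample bytes are an arbitrary Latin-1 spelling of "café":

```ruby
bytes = "caf\xE9".b                        # four raw bytes, tagged as binary

# Retagging never raises and never rewrites the bytes.
as_utf8   = bytes.dup.force_encoding(Encoding::UTF_8)
as_latin1 = bytes.dup.force_encoding(Encoding::ISO_8859_1)

# Conformance is only checked when asked for explicitly.
puts as_utf8.valid_encoding?               # 0xE9 alone is not valid UTF-8
puts as_latin1.valid_encoding?

# Real conversion is a separate, explicit step.
converted = as_latin1.encode(Encoding::UTF_8)
puts converted.bytesize
```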

|- what to do with IO:
| a) IO will return strings in “raw/bytes”?
| b) IO can be tagged and will return Strings with given encoding tag?
| c) IO can be tagged and is by default tagged with global encoding tag?
| d) IO can be tagged, but is not tagged by default, although methods
|returning strings (such as read, readlines) will use global encoding tag?
| e) if IO is tagged and one tries to write to it a String with different
|encoding, what will happen?

c), the global default shall be set from locale setting.

|- what will be default encoding tag for new Strings:
| a) “raw/bytes”
| b) derived from system properties of host platform
| c) option b) and can be overriden in application (btw, $KCODE, as present,
|must definitely go away!!!)

Encoding for literal strings is set by pragma.

|- how to process source code files:
| a) restrict them to ASCII and require all non-ASCII strings to be
|externalized?
| b) process them as “raw/bytes”?
| c) introduce some kind of commented pragma for source files allowing to
|set encoding,

1.9 already has an encoding pragma à la Python’s PEP 263.
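
A sketch of how such a pragma behaves in current Ruby: a magic comment on the first line sets the source encoding, which __ENCODING__ reports. The temporary file and the global variable name are illustrative choices only.

```ruby
require "tempfile"

# A tiny script whose first line is a magic comment. The global variable
# is arbitrary and only carries the result back to the caller.
script = <<~'RUBY'
  # -*- coding: iso-8859-1 -*-
  $source_encoding_seen = __ENCODING__
RUBY

Tempfile.create(["pragma_demo", ".rb"]) do |file|
  file.write(script)
  file.flush
  load file.path                  # parsed with the encoding named in the pragma
end

puts $source_encoding_seen
```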

|- at present time Ruby parser can parse only sources in ASCII compatible
|encoding. Would it change?

No. Ruby would not allow scripts in EBCDIC, nor UTF-16, although it
allows processing of those encodings.

|- what encodings will have Numeric.to_s, Time.to_s etc., or String has to
|have/conform for String#to_f, String#to_i?

Good point. Currently, I think they should work on ASCII.

|On Unicode:
|- case-independent canonical string matches/searches DO MATTER. And even for
|encodings, that code variants of glyphs with different codepoints
|“variant-insensitive” search, as for me, is desired. Will there be such
|functionality?

Casefold search/match will be provided for Regexp. “variant
insensitive” search should be accomplished by explicit normalization
or collation.

|- string comparison: will <=> use at least UCA rules for Unicode strings, or
|only byte-order comparisons will stay?

Byte order comparison. UCA rules or such should be done explicitly
via normalization or collation.
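
To make the last two answers concrete: in current Ruby, String#<=> and == work on the stored bytes, and canonical equivalence has to be requested explicitly. The sketch below uses String#unicode_normalize, which arrived in Ruby 2.2, well after this thread; collation proper (UCA) is not in core and is not shown.

```ruby
composed   = "\u00E9"        # é as one code point
decomposed = "e\u0301"       # e followed by a combining acute accent

# Same glyph, different bytes: plain comparison sees two different strings.
puts composed == decomposed

# Explicit normalization makes them comparable.
same_after_nfc =
  composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)
puts same_after_nfc

# Sorting is by byte order, so uppercase ASCII sorts before lowercase
# and accented letters sort after "z".
p %w[b a Z].sort
p(composed <=> "z")
```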

|- is_digit, is_space, is_alpha, is_foobarbaz etc. could matter, when writing
|a custom parser. Will those methods be provided for one-char strings?

Those functions will be provided via Regexp. I am not sure if we will
provide character classification methods for strings.
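
Classification through Regexp can be sketched with Unicode property classes, which current Ruby supports. The method names are hypothetical stand-ins for the is_digit / is_space / is_alpha mentioned in the question; each expects a one-character string.

```ruby
# Hypothetical helpers built on Regexp property classes.
def digit?(char)
  char.match?(/\A\p{Digit}\z/)
end

def space?(char)
  char.match?(/\A\p{Space}\z/)
end

def alpha?(char)
  char.match?(/\A\p{Alpha}\z/)
end

# ASCII digit, Arabic-Indic digit, Cyrillic letter, two kinds of space, punctuation.
["7", "\u0663", "\u044F", " ", "\u3000", "!"].each do |char|
  puts format("U+%04X digit=%s space=%s alpha=%s",
              char.ord, digit?(char), space?(char), alpha?(char))
end
```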

						matz.

Hi,

In message “Re: Unicode roadmap?”
on Mon, 19 Jun 2006 21:39:33 +0900, “Michal S.”
writes:

|> a), unless either of strings is “ascii” and the other is “ascii”
|> compatible. This point is arguable.
|
|What is “ascii”? Specifically, I would like string operations to succeed
|in cases when both strings are encoded as different subsets of Unicode
|(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
|string should result in a UTF-* string, not an error.

Every encoding has an attribute named ascii_compat. EUC_JP, SJIS,
ISO-8859-* and UTF-8 are declared ascii compatible, whereas EBCDIC,
UTF-16 and UTF-32 are not. No other auto conversion shall be done,
since we don’t particularly encourage a mixed-encoding model.
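
In released Ruby the attribute surfaced as Encoding#ascii_compatible?, and the rule can be observed directly. A sketch with arbitrary sample bytes: an ASCII-only string combines with any ASCII-compatible string, while two non-ASCII strings in different encodings do not.

```ruby
puts Encoding::UTF_8.ascii_compatible?
puts Encoding::UTF_16LE.ascii_compatible?

ascii_only = "abc".encode(Encoding::ISO_8859_1)
latin1     = "\xE9".b.force_encoding(Encoding::ISO_8859_1)   # é
latin2     = "\xB1".b.force_encoding(Encoding::ISO_8859_2)   # ą

# ASCII-only content is compatible with the other string's encoding.
mixed = ascii_only + latin2
puts mixed.encoding

# Two non-ASCII strings in different encodings have no common encoding.
p Encoding.compatible?(latin1, latin2)

begin
  latin1 + latin2
rescue Encoding::CompatibilityError => e
  puts "refused: #{e.class}"
end
```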

|> |- what to do with IO:
|> | a) IO will return strings in “raw/bytes”?
|> | b) IO can be tagged and will return Strings with given encoding tag?
|> | c) IO can be tagged and is by default tagged with global encoding tag?
|> | d) IO can be tagged, but is not tagged by default, although methods
|> |returning strings (such as read, readlines) will use global encoding tag?
|> | e) if IO is tagged and one tries to write to it a String with different
|> |encoding, what will happen?
|>
|> c), the global default shall be set from locale setting.
|
|I am not sure this is good for network IO as well. For diagnostics it
|might be useful to set the default to none, and have string raise an
|exception when such strings are combined with other strings.
|
|It is only obvious for STDIN and STDOUT that they should follow the
|locale setting.

Restricting the default encoding from locale to STDIO may be a good idea.
There are still open issues, since default encoding from locale is not
covered by the prototype, so we need more experience.

						matz.

On 6/19/06, Yukihiro M. wrote:

|(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
|string should result in a UTF-* string, not an error.

Every encoding has an attribute named ascii_compat. EUC_JP, SJIS,
ISO-8859-* and UTF-8 are declared ascii compatible, whereas EBCDIC,
UTF-16 and UTF-32 are not. No other auto conversion shall be done,
since we don’t particularly encourage a mixed-encoding model.

I wonder. Why can’t Strings throughout Ruby always be represented
as Unicode, and why not let ICU handle the conversion between various
encodings for incoming and outgoing data?
(What is Java? | IBM). I know, it is a
long-standing issue with Unicode’s Han unification process, but without
proper Unicode support Ruby is destined to be a toy for
English-speaking and Japanese communities only. (And as I’m gearing up
to prepare a web-site in Russian, Turkish and English, I feel that
using Ruby could prove to be a major pain in the nether regions of my
body :) )

On 6/19/06, Dmitrii D. wrote:

I wonder. Why can’t Strings throughout Ruby always be represented
as Unicode, and why not let ICU handle the conversion between various
encodings for incoming and outgoing data?
(What is Java? | IBM). I know, it is a
long-standing issue with Unicode’s Han unification process, but without
proper Unicode support Ruby is destined to be a toy for
English-speaking and Japanese communities only. (And as I’m gearing up
to prepare a web-site in Russian, Turkish and English, I feel that
using Ruby could prove to be a major pain in the nether regions of my
body :) )

This entire discussion is centered around a proposal to do exactly
that. There are many very good reasons to avoid doing this. Unicode
Is Not Always The Answer.

It’s usually the answer, but there are times when it’s just easier
to work with data in an established code page.

-austin

On 6/19/06, Dmitrii D. wrote:

Because otherwise we are at risk of ending up with incompatible
extensions to strings that “simplify” a developer’s life (and the
trend’s already begun). I wouldn’t want a C/C++ scenario with a string
class upon string class upon extension upon extension that aim to do
something String should do from the start.

I think that’s more likely with (a) what we have now and (b) a
Unicode-internal approach. (Indeed, a Unicode-internal approach
requires separating a byte vector from String, which doubles
interface complexity.) I would suggest that you look through the whole
discussion and pay particular attention to Matz’s statements.

-austin

On 6/19/06, Austin Z. wrote:

body :) )

This entire discussion is centered around a proposal to do exactly
that. There are many very good reasons to avoid doing this. Unicode
Is Not Always The Answer.

It’s usually the answer, but there are times when it’s just easier
to work with data in an established code page.

I totally agree with that. IMO, the point lies exactly in this
“usually an answer”. When was the last time 90% of developers had to
wonder what encoding their data was in? ;) And with the advent of
Unicode (and storage becoming cheaper and cheaper and developers
becoming lazier and lazier) more and more of that data is
going to be Unicode.

So, since Unicode is usually the answer, make it as painless as
possible. Make all String methods and any other functions that work
with strings accept Unicode straight out of the box without any
worries on the developer’s part. And provide alternatives (or optional
parameters?) that would allow the few more encoding-aware gurus :) to do
whatever they want with encodings.

Because otherwise we are at risk of ending up with incompatible
extensions to strings that “simplify” a developer’s life (and the
trend’s already begun). I wouldn’t want a C/C++ scenario with a string
class upon string class upon extension upon extension that aim to do
something String should do from the start.

All is IMHO, of course :)

On Jun 19, 2006, at 4:16 AM, Christian N. wrote:

without knowing what the binary is? If it’s text in a known
encoding, no breakage should occur. If it’s unknown bit patterns,
you can’t really expect anything sensible to happen… or am I
missing an obvious scenario? -Tim

Those were just fictive method calls. But let’s say I read from
a pipe and I know it contains UTF-16 with BOM, then .to_unicode
would make perfect sense, no?

Yep. And yes, calling to_unicode on it might in fact change the bit
patterns if you adopted Early Uniform Normalization (which would be a
good thing to do). -Tim

On 6/19/06, Yukihiro M. wrote:

|(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
|string should result in a UTF-* string, not an error.

Every encoding has an attribute named ascii_compat. EUC_JP, SJIS,
ISO-8859-* and UTF-8 are declared ascii compatible, whereas EBCDIC,
UTF-16 and UTF-32 are not. No other auto conversion shall be done,
since we don’t particularly encourage a mixed-encoding model.

Reading what you said, it appears it would only be possible to add
ascii strings to ascii-compatible strings. That does not sound very
useful.
If the intended meaning was rather that operations on two
ascii-compatible strings should always be possible, and that the result
is again ascii-compatible, that would sound better.

But it makes these “ascii” encodings a special case. In particular, it
makes UTF-32 less convenient to use.
I guess that for a calculation so complex that it would really benefit
from the fast random access of UTF-32 it is reasonable to create a
wrapper that converts the arguments and results. However, if one wants
to perform several such (different) consecutive calculations there are
going to be several useless conversions. It is certainly possible to
make the input interface clever enough to get it right for both UTF-32
and ascii strings, but requiring the user to do the conversion on
results does not look nice.

The compatibility could also be just general value that specifies the
encoding family.

ie " ".compatibility => :ascii

ASCII = "".encode(:utf8).compatibility

raise "Incompatible encoding #{str.encoding}" unless str.compatibility == ASCII

But different families could be possible. I am not sure if any other
encoding families of any significance exist, though.

Thanks

Michal

On 6/19/06, Tim B. wrote:

are “many good” reasons to avoid this, but probably that’s just
because I’ve been fortunate enough to not encounter the problem
scenarios. This material would have application in a far larger
domain than just Ruby, obviously. -Tim

I’ve found that a Unicode-based string class gets in the way when it
forces you to work around it. For most text-processing purposes, it
isn’t an issue. But when you’ve got text whose original encoding you
don’t know (and you’re probably working in a different code page),
a Unicode-based string class usually guesses wrong.

Transparent Unicode conversion only works when it is guaranteed that the
starting code page and the ending code page are identical. It’s
definitely a legacy data issue, and doesn’t affect most people, but it
has affected me in dealing with (in a non-Ruby context) NetWare.
Additionally, the overhead of converting to Unicode if your entire data
set is in ISO-8859-1 is unnecessary; again, this is a specialized case.

More problematic, from the Ruby perspective, is that a Unicode-based
string class would require that there be a wholly separate byte vector
class; I am not sure that is necessary or wise. The first time I read a
JPG into a String, I was delighted – the interface presented was so
clean and nice as opposed to having to muck around in languages that
force multiple interfaces because of such a presentation.

Like I said, I’m not anti-Unicode, and I want Ruby’s Unicode support to
be the best, bar none. I’m not willing to compromise on API or
flexibility to gain that, though.

-austin

Hi,

In message “Re: Unicode roadmap?”
on Tue, 20 Jun 2006 02:20:10 +0900, “Michal S.”
writes:

|Reading what you said, it appears it would only be possible to add
|ascii strings to ascii-compatible strings. That does not sound very
|useful.

You will have all your strings in the encoding you choose as an
internal encoding in the usual case, so you will have few
compatibility problems. Only if you want to handle multiple encodings
at a time do you need explicit code conversion for mixed-encoding
operations.

|I guess that for a calculation so complex that it would really benefit
|from the fast random access of UTF-32 it is reasonable to create a
|wrapper that converts the arguments and results. However, if one wants
|to perform several such (different) consecutive calculations there are
|going to be several useless conversions.

I am not sure what you mean. I feel like that my plan does not have
anything against UTF-32 in this regard. Perhaps, I am missing
something. What is going to cause useless conversions?

						matz.

On 6/20/06, Yukihiro M. wrote:

internal encoding in the usual case, so you will have few
compatibility problems. Only if you want to handle multiple encodings
at a time do you need explicit code conversion for mixed-encoding
operations.

If I read pieces of text from web pages they can be in different
encodings. I do not see any reason why such pieces of text could not
be automatically concatenated as long as they are all subsets of
Unicode.

It was the complaint of one of the people here that in Python strings
with different encodings exist but operations on them fail. And it
makes the life of anybody working with such strings unnecessarily
hard. They have to be converted explicitly.
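
For comparison, this is roughly what the explicit route looks like in Ruby as it was eventually released: each piece is transcoded to a common encoding before joining. A sketch only; the two pieces stand in for text scraped from pages with different declared encodings.

```ruby
# Stand-ins for fragments fetched from two differently encoded pages.
piece_latin1 = "caf\xE9".b.force_encoding(Encoding::ISO_8859_1)   # "café"
piece_latin2 = "\xB1".b.force_encoding(Encoding::ISO_8859_2)      # "ą"

# Transcode every piece to UTF-8, then concatenate.
combined = [piece_latin1, piece_latin2]
           .map { |piece| piece.encode(Encoding::UTF_8) }
           .join

puts combined
puts combined.encoding
```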

|I guess that for a calculation so complex that it would really benefit
|from the fast random access of UTF-32 it is reasonable to create a
|wrapper that converts the arguments and results. However, if one wants
|to perform several such (different) consecutive calculations there are
|going to be several useless conversions.

I am not sure what you mean. I feel like that my plan does not have
anything against UTF-32 in this regard. Perhaps, I am missing
something. What is going to cause useless conversions?

If automatic conversions aren’t implemented at all, utf-32 does not
really stand out in this regard.

Thanks

Michal