Unicode roadmap?


#81

On Jun 17, 2006, at 2:55 PM, Michal S. wrote:

First for reasons of efficiency. If an application is going to perform
lots of slicing and poking on strings it will want some encoding that
is suitable for that, such as UTF-32. If an application runs on a system
with little memory it will want a space-efficient encoding (i.e. UTF-8, or
UTF-16 for Asian languages).

Um, the practical experience is that the code required to unpack a
UTF-8 stream into a sequence of integer codepoints (and reverse the
process) is easy and very efficient; to the point that for “slicing
and poking”, UTF-8 vs UTF-16 vs UTF-32 is pretty well a wash.

-Tim
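
As a rough illustration of how little code the unpacking takes in Ruby itself, here is a minimal sketch using the "U" directive of pack/unpack. The sample text is made up for the example.

```ruby
# Decode a UTF-8 string into integer code points and back again.
text = "m\u00E1s \u65E5\u672C\u8A9E"   # "más" plus three CJK characters

codepoints = text.unpack("U*")      # UTF-8 bytes -> array of integers
rebuilt    = codepoints.pack("U*")  # integers -> UTF-8 string

puts codepoints.inspect
puts rebuilt == text
```

Slicing the integer array is then plain Array slicing, independent of how many bytes each character took in the original stream.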


#82

On Jun 17, 2006, at 3:15 PM, Austin Z. wrote:

Why? What if the library that I’m interfacing with requires EUC-JP?
Sorry, but Unicode is not necessarily the right answer.

Indeed it’s not, but this argument escapes me. If you try to feed that
library an Arabic string, something will break, because EUC-JP can’t
represent Arabic. So what? Whatever character set(s) you
standardize on, there is going to be existing software that won’t be
able to handle all of it… I’m just not following your argument.

-Tim
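
The EUC-JP point can be reproduced with the String#encode API that arrived in Ruby 1.9, after this thread was written. This is only a sketch of the failure mode, not of any particular library interface; the Arabic sample word is an assumption chosen for the example.

```ruby
arabic = "\u0645\u0631\u062D\u0628\u0627"   # an Arabic word, UTF-8 tagged

# Strict conversion: EUC-JP has no Arabic letters, so this raises.
begin
  arabic.encode("EUC-JP")
  outcome = :converted
rescue Encoding::UndefinedConversionError => e
  outcome = :unrepresentable
  puts "cannot convert: #{e.message}"
end

# Lossy fallback: substitute a placeholder for each unmappable character.
lossy = arabic.encode("EUC-JP", undef: :replace, replace: "?")
puts lossy
```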


#83

On Jun 17, 2006, at 10:34 AM, Stefan L. wrote:

Or do you mean Ruby should determine the encoding
automatically? IMO, that would be bad magic and error-prone.

Not possible in the general case. There are a few data formats,
including XML and ASN.1, that make it possible to reliably infer the
encoding from the instance, but a lot of Web processing these days is
best-guess, and often fails.

How does the regular expression engine work then?

The two sane options are
(a) have a fixed encoding for Strings and compile the regex in such a
way that it runs directly on the encoding. This has been done for
both UTF-8 and UTF-16 and is insanely efficient, but it locks you
into the fixed encoding.
(b) have an iterator which produces abstract characters from whatever
encoding is in use and run the regex over the characters, not the
bytes of the representation. The implementation is trickier and
performance is an issue, but you’re not locked to an encoding.

-Tim


#84

I’ll chime back in with my not-so-expert opinion, so it’s known where I
stand. Take it for whatever it’s worth.

  • I almost entirely agree with Juergen’s longer post on what Unicode support
    should look like in 2.0. I won’t go into the details of what I disagree
    with because I’m a little squishy in those areas.
  • I believe that supporting encoding-tagged strings would be a horrible,
    horrible mess for both Ruby VM/interpreter implementers and extension
    implementers while not adding any serious benefits for Ruby the language.
    When it comes down to it, you’re going to have string A using encoding X
    and string B using encoding Y, and in order to work with them both
    together you’ll have to find some common ground. Settle on common ground
    early or you pay the price to do it EVERY time you work with strings
    later.
  • I have no intention to ever write a C extension for Ruby. I know many out
    there do. However, I think the important thing about Ruby is Ruby, and
    making the language bend over backwards to make life easier for C hackers
    is absurd. Making Unicode support needlessly complex in Ruby (the
    language) only ends up hurting its usability. I for one would not want to
    sacrifice the beauty and simplicity of Ruby solely to appease the C
    community. Flame on if you will, but The Ruby Way should rule here.
  • In the end, I should not have to care what encoding strings use internally
    unless I absolutely have to know. Every time questions come up about
    Unicode support in Java, I have to look it up…UTF-8? UTF-16? UCS-2? I
    rarely need to know this information, and I rarely remember it. That’s
    exactly the point. Make the one internal encoding whatever is deemed most
    flexible, most performant, and above all most global. Nobody writing Ruby
    code should have to care.
  • I so rarely work with Strings on a character-by-character basis, and when
    I do all I should have to say is get_character and know that what I have
    represents a full and complete character representation. If you’re
    dealing with bytes, call it what it is–the aforementioned ByteBuffer.
    Ruby needs to support the concepts of Strings and ByteBuffers
    independently.
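
For comparison, later Ruby versions expose both views on the same String object rather than as two classes. A small sketch of the character view versus the byte view (the sample word is made up, and "character" here means code point, not a full grapheme):

```ruby
word = "na\u00EFve"    # "naïve", with the ï stored as one code point

chars = word.chars     # character view: one element per code point
bytes = word.bytes     # byte view: the raw UTF-8 octets
raw   = word.b         # binary-tagged copy of the same bytes

puts chars.length      # number of code points
puts bytes.length      # number of bytes
puts raw.encoding
```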

I think it all comes back to a simple question: Which method of supporting
Unicode would feel the most “Ruby”? Which one is DRY and KISS and all the
other lovely acronyms this community holds so dear? Figure that out, and
there’s your answer. I’d be willing to bet it’s not
every-string-can-encode-differently, because I don’t see how that would ever
help me write better Ruby code…and improving Ruby is the point of all
this, right?


#85

On Jun 17, 2006, at 3:22 PM, removed_email_address@domain.invalid wrote:

be locale and encoding-aware?

I’m not close enough to Ruby to have a useful opinion, but for many
other software systems, the designers decided that the performance
and interoperability gains achievable by limiting themselves to
Unicode were a compelling enough argument, and so chose.

In particular, these days, both the W3C and the IETF overwhelmingly
specify the use of Unicode characters when text is to be included in
protocols or data delivery formats. So even if you can handle lots
of non-Unicode stuff, the Net may have difficulty getting it to you.

-Tim


#86

On 6/18/06, Stefan L. removed_email_address@domain.invalid wrote:

encoding that is suitable for that such as UTF-32. If an
encoding. If somebody has enough time and motivation he can
even implement support for multiple encodings and let the user
choose at build-time.

Why? It can already handle UTF-8 strings or arrays of Unicode
codepoints. They just do not feel like strings in Ruby 1.8. What I
want is glue in the String class that makes them feel like strings.

had to read anyway?
encoding was chosen and iff the encoding of the input doesn’t
match the internal encoding.

However, if you can choose the encoding there is no need to recode at
all. You just keep the string as is, and there is a good chance the
output encoding will match the input encoding.

And in case you need to recode the string, you have the encoding
information, and the recoding can be done automatically, and only when
needed.

Michal
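
A minimal sketch of "recode only when needed", written against the encoding-tagged String of Ruby 1.9 and later. The helper name recode_if_needed is hypothetical and not part of Ruby.

```ruby
# Hypothetical helper: hand the string back untouched when its tag already
# matches the wanted encoding, and transcode only when the tags differ.
def recode_if_needed(str, target)
  target = Encoding.find(target) unless target.is_a?(Encoding)
  return str if str.encoding == target
  str.encode(target)
end

latin = "caf\u00E9".encode("ISO-8859-1")

same = recode_if_needed(latin, "ISO-8859-1")    # no work done
utf8 = recode_if_needed(latin, Encoding::UTF_8) # transcoded copy

puts same.equal?(latin)
puts utf8.encoding
```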


#87

On Jun 17, 2006, at 4:15 PM, Julian ‘Julik’ Tarkhanov wrote:

I would much rather prefer UTF-8 in a language such as Ruby which
is often used as glue between
other systems. UTF-8 is used for interchange and it’s indisputable.
If you go for UTF-16 or UTF-32, you are most likely
to convert every single character of text files you read (in text
files present in the wild AFAIK UTF-16 and UTF-32 are a minority,
thanks to the BOM and other setbacks).

There’s a lot of UTF-16 out there. There’s more ISO-8859-* than
that, and more Microsoft code-page-* text than everything else put
together. Yes, with UTF-16 & -32 you do a lot of byte swapping but
it’s pretty cheap and pretty reliable. (I like UTF-8 too, but it’s
not without issues).

-Tim


#88

On 6/18/06, Julian ‘Julik’ Tarkhanov removed_email_address@domain.invalid wrote:

If you go for UTF-16 or UTF-32, you are most likely
to convert every single character of text files you read (in text
files present in the wild AFAIK UTF-16 and UTF-32 are a minority,
thanks to the BOM and other setbacks).

Here you go. You can have the strings in UTF-8, and I can have them in
UTF-32. That is the flexibility of the solution without a fixed
encoding.

how would Unicode Ruby and TRON Ruby exchange strings?

I think Alan Little summed it up very well. The problem with Unicode
in Ruby is the striving for perfection (i.e. satisfying the users of
every conceivable or needed encoding). It’s very noble and I personally
can’t imagine it (even with the “democratic coerce” approach Austin
cited). The only thing I don’t know is whether a system with this type
of handling can be built at all, and how it would interoperate.

But quite a few people here look like they do know. I do not know much
about regexes but I can imagine just about any other string operation.
And the current regexes already do operate on multiple encodings.

Up until now all scripting languages I used somewhat (Perl, Python,
Ruby) allowed all encodings in strings and doing Unicode in them hurts.

And how does that lead to the conclusion that there should be only one
encoding?

Bluntly put, I am selfish and I don’t believe in the “saving grace”
of the M17N (because I just can’t wrap it around my head and I sure
as hell know it’s going to be VERY complex).

That’s the point. If it is wrapped into the string class you do not
have to wrap it around your head.

It’s also the thing that bothers me most about Ruby’s “Unicode
discussions” (I’ve read all of them on this list dating back to 2002
because I need it to work NOW): they always descend into this kind of
religious discussion in the spirit of “but your encoding is not good
enough”, “but my bad encoding isn’t that one and I still need it to
work” etc.

And that is exactly why a fixed encoding is bad. If strings can be
encoded in any way, there is no point in religious discussions about
which encoding you like the most.

While for me the greatest thing about Unicode is that it’s Just Good
Enough. And it doesn’t seem Unicode is indeed THAT useless for CJK
languages either
(although I’m sure Paul can correct me - all the 4 languages I am in
control of use only 2 scripting systems with some odd additions here
and there).

It is JustGoodEnough for most cases but not for all. It is not useless
for CJK, just suboptimal because of the Han unification. And it also
does not try to include the historic characters.

And no, I didn’t have a chance to see a TRON system in the wild. If
someone would show me one within 200 km distance I would be glad to
take a look.

I do not care. Some people find that encoding useful. Since the
potential to support any encoding including TRON does not get in the
way when I deal with my text I am fine with that.

Michal


#89

On Sun, Jun 18, 2006 at 07:21:25AM +0900, Austin Z. wrote:

Which code page? EBCDIC has as many code pages (including a UTF-EBCDIC)
as exist in other 8-bit encodings.

Obviously, EBCDIC -> UNICODE -> same EBCDIC Codepage as before.

Not to mention that Matz has explicitly stated in the past that he
character class, and that was Java’s main folly. (UCS-2 is a
strictly 16 bit per character encoding, but new Unicode standards
specify 21 bit characters, so they had to “extend” it).

Um. Do you mean UTF-32? Because there’s no binary representation of
Unicode Character Code Points that isn’t an encoding of some sort. If
that’s the case, that’s unacceptable from a memory representation.

Yes, I do mean the String interface to be UTF-32, or pure code
points, which is the same but less susceptible to standard changes, if
accessed at character level. If accessed at substring level, a
substring of a String is obviously a String, and you don’t need a
bitwise representation at all.

According to my proposal, Strings do not need an encoding from the
String user’s point of view when working just with Strings, and users
won’t care apart from memory/performance consumption, which I believe
can be made good enough with a totally encapsulated, internal storage
format to be decided later. I will avoid a premature optimization
debate here now.

Of course encoding matters when Strings are read or written somewhere,
or converted to bit-/bytewise representation explicitly. The Encoding
Framework, however it’ll look, needs to be able to convert to and from
Unicode code points for these operations only, and not between
arbitrary encodings. (You may code this to recode directly from
the internal storage format for performance reasons, but that’ll be
transparent to the String user.)

This breaks down for characters not represented in Unicode at all, and
is a nuisance for some characters affected by the Han Unification
issue. But Unicode set out to prevent exactly this, and if we
believe in Unicode at all, we can only hope they’ll fix this in an
upcoming revision. Meanwhile we could map any additional characters
(or sets of) we need to higher, unused Unicode planes; that’ll be no
worse than having different, possibly incompatible kinds of Strings.

We’ll need an additional class for pure byte vectors, or just use
Array for this kind of work, and I think this is cleaner.

Regarding Java, they switched from UCS-2 to UTF-16 (mostly). UCS-2 is
a pure 16 bit per character encoding and cannot represent codepoints
above 0xffff. UTF-16 works like UTF-8, but with 16 bit chunks. But
their abstraction of a single character, the class Char(acter), is
still only 16 bits wide, which leads to confusion, similar to the C
type char, which cannot represent all real characters either. It is
even worse than in C, because C explicitly defines char to be a memory
cell of 8 bits or more, whereas Java really meant Char to be a
character.
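
To make the 16-bit limitation concrete, here is a small sketch (in Ruby 1.9 or later) of how one code point above 0xffff turns into two 16-bit UTF-16 units, a surrogate pair. The choice of U+1D11E as the sample is arbitrary.

```ruby
clef = "\u{1D11E}"   # MUSICAL SYMBOL G CLEF, a code point above 0xFFFF

# Encode as big-endian UTF-16 and read the result back as 16-bit integers.
units = clef.encode("UTF-16BE").unpack("n*")

puts clef.length                                        # code points
puts units.length                                       # 16-bit units
puts units.map { |u| format("0x%04X", u) }.join(" ")    # the surrogate pair
```

A type that holds a single 16-bit unit can therefore only ever hold half of this character.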

I am unaware of unsolvable problems with Unicode and Eastern
languages, I asked specifically about it. If you think Unicode is
unfixably flawed in this respect, I guess we all should write off
Unicode now rather than later? Can you detail why Unicode is
unacceptable as a single world wide unifying character set?
Especially, are there character sets which cannot be converted to
Unicode and back, which is the main requirement to have Unicode
Strings in a non-Unicode environment?

Legacy data and performance.

Map legacy data, that is characters still not in Unicode, to a high
Plane in Unicode. That way all characters can be used together all the
time. When Unicode includes them we can change that to the official
code points. Note there are no files in String’s internal storage
format, so we don’t have to worry about reencoding them.

I am not worried about performance. I’d code in C if I were, or
Lisp.

For one, Moore’s law is at work and my whole proposal was for 2.0. My
proposal only adds a constant factor to String handling, it doesn’t
have higher order complexity.

On the other hand, conversions need to be done at other times with my
proposal than for M17N Strings, and it depends on the application if
that is more or less often. String-String operations never need to do
recoding, as opposed to M17N Strings. I/O always needs conversion, and
may need conversion with M17N too. I have a hunch that allowing
different kinds of Strings around (as in M17N, presumably) should
require recoding far more often.

Jürgen


#90

On Sun, Jun 18, 2006 at 07:22:34AM +0900, removed_email_address@domain.invalid wrote:

be agnostic about these things but still provide a coherent framework
for building libraries and applications that can be locale and
encoding-aware?

Gary W.

Maybe I was unclear. I didn’t mean Ruby has to choose an existing
standard, but Ruby has to choose which set of characters to handle in
Strings, in the mathematical sense.

Language implementation, and usage of the String class should be
easier if this set is

  • well defined

Unicode code points are pretty good in this respect, better than the
union of all characters in all encodings of possible M17N Strings.
And we may use private extensions to Unicode for legacy characters not
included in Unicode already.

  • All characters are equally allowed in all Strings.

M17N fails this one. What happens with a[5] = b[3] if their encodings
are incompatible?

At best it’ll coerce a to an encoding which can handle both, which
would be Unicode 98% of the time anyway, 1% something else, and 1%
totally fail. Don’t nail me down on the numbers.
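
For reference, this is roughly how the question plays out in the encoding-tagged String that eventually shipped in Ruby 1.9. It is a sketch of observed behaviour in current Ruby, not of the M17N design as it stood when this was written; the sample strings are made up.

```ruby
a = "r\u00E9sum\u00E9".encode("ISO-8859-1")   # Latin-1 tagged, non-ASCII
b = "\u0440\u0435\u0437\u044E\u043C\u0435"    # Cyrillic, UTF-8 tagged

# No encoding can be picked automatically for this pair.
puts Encoding.compatible?(a, b).inspect

refused = false
begin
  a[1] = b[3]          # character assignment across incompatible tags
rescue Encoding::CompatibilityError => e
  refused = true
  puts "refused: #{e.message}"
end

# Explicitly moving a to a common encoding first makes the assignment legal.
a_utf8 = a.encode("UTF-8")
a_utf8[1] = b[3]
puts a_utf8
```

So there is no silent coercion: the mixed operation fails until the caller picks the common ground.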

Mathematically, String functions should be defined on the whole set,
not subsets, or their application becomes a chore.

Jürgen


#91

On 18-jun-2006, at 6:17, Tim B. wrote:

Be careful. Case folding is a horrible can of worms, is rarely
implemented correctly, and when it is (the Java library tries
really hard) is insanely expensive. The reason is that case
conversion is not only language-sensitive but jurisdiction
sensitive (in some respects different in France & Québec). Trying
to do case-folding on text that is not known to be ASCII is likely
a symptom of a bug.

Let’s write a specification.
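
Two small cases show why case conversion depends on language. They are written with the Unicode case mapping options that Ruby gained in 2.4, long after this thread, so take them as a sketch of the problem rather than of any proposal here.

```ruby
default_i = "I".downcase            # default mapping
turkic_i  = "I".downcase(:turkic)   # Turkish/Azeri mapping: dotless i
turkic_up = "i".upcase(:turkic)     # and the reverse: dotted capital I

strasse = "stra\u00DFe".upcase      # sharp s expands to two letters

puts default_i
puts turkic_i
puts turkic_up
puts strasse
```

The same input letter gives different results depending on the language option, and upcasing can change the length of the string.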


#92

On 6/18/06, Juergen S. removed_email_address@domain.invalid wrote:

On Sun, Jun 18, 2006 at 07:21:25AM +0900, Austin Z. wrote:

Um. Do you mean UTF-32? Because there’s no binary representation of
Unicode Character Code Points that isn’t an encoding of some sort. If
that’s the case, that’s unacceptable from a memory representation.
Yes, I do mean the String interface to be UTF-32, or pure code
points, which is the same but less susceptible to standard changes, if
accessed at character level. If accessed at substring level, a
substring of a String is obviously a String, and you don’t need a
bitwise representation at all.

Again, this is completely unacceptable from a memory usage perspective.
I certainly don’t want my programs taking up 4x the additional memory
for string handling.
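
The size difference is easy to measure with String#encode and String#bytesize (Ruby 1.9 and later). A sketch with made-up sample strings; the helper name encoded_sizes is hypothetical.

```ruby
# Hypothetical helper: byte size of the same text in three Unicode encodings.
def encoded_sizes(text)
  %w[UTF-8 UTF-16LE UTF-32LE].map { |name|
    [name, text.encode(name).bytesize]
  }.to_h
end

ascii_sizes = encoded_sizes("hello world")
cjk_sizes   = encoded_sizes("\u65E5\u672C\u8A9E")   # three CJK characters

puts ascii_sizes.inspect
puts cjk_sizes.inspect
```

The 4x figure applies to ASCII-heavy text; for the CJK sample UTF-32 is still the largest, but the gap to UTF-8 is much smaller.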

But “pure code points” is a red herring and a mistake in any case. Code
points aren’t sufficient. You need glyphs, and some glyphs can be
produced with multiple code points (e.g., LOWERCASE A + COMBINING ACUTE
ACCENT as opposed to A ACUTE). Indeed, some glyphs can only be
produced with multiple code points. Dealing with this intelligently
requires a lot of smarts, but it’s precisely what we should do.

According to my proposal, Strings do not need an encoding from the
String user’s point of view when working just with Strings, and users
won’t care apart from memory/performance consumption, which I believe
can be made good enough with a totally encapsulated, internal storage
format to be decided later. I will avoid a premature optimization
debate here now.

Again, you are incorrect. I do care about the encoding of each String
that I deal with, because only that allows me (or String) to deal with
conversions appropriately. Granted, most of the time, I won’t care.
But I do work with legacy code page stuff from time to time, and
pronouncements that I won’t care are just arrogance or ignorance.

Of course encoding matters when Strings are read or written somewhere,
or converted to bit-/bytewise representation explicitly. The Encoding
Framework, however it’ll look, needs to be able to convert to and from
Unicode code points for these operations only, and not between
arbitrary encodings. (You may code this to recode directly from
the internal storage format for performance reasons, but that’ll be
transparent to the String user.)

I prefer arbitrary encoding conversion capability.

This breaks down for characters not represented in Unicode at all, and
is a nuisance for some characters affected by the Han Unification
issue. But Unicode set out to prevent exactly this, and if we
believe in Unicode at all, we can only hope they’ll fix this in an
upcoming revision. Meanwhile we could map any additional characters
(or sets of) we need to higher, unused Unicode planes; that’ll be no
worse than having different, possibly incompatible kinds of Strings.

Those choices aren’t ours to make.

We’ll need an additional class for pure byte vectors, or just use
Array for this kind of work, and I think this is cleaner.

I don’t. Such an additional class adds unnecessary complexity to
interfaces. This is the main reason that I oppose the foolish choice
to pick a fixed encoding for Ruby Strings.

Legacy data and performance.
Map legacy data, that is characters still not in Unicode, to a high
Plane in Unicode. That way all characters can be used together all the
time. When Unicode includes them we can change that to the official
code points. Note there are no files in String’s internal storage
format, so we don’t have to worry about reencoding them.

Um. This is the statement of someone who is ignoring legacy issues.
Performance is a big issue when you’re dealing with enough legacy
data. Don’t punish people because of your own arrogance about encoding
choices.

Again: Unicode Is Not Always The Right Choice. Anyone who tells you
otherwise is selling you a Unicode toolkit and only has their wallet in
mind. Unicode is often the right choice, but it’s not the only
choice and there are times when having the flexibility to work in
other encodings without having to work through Unicode as an
intermediary is the right choice. And from an API perspective,
separating String and “ByteVector” is a mistake.

On the other hand, conversions need to be done at other times with my
proposal than for M17N Strings, and it depends on the application if
that is more or less often. String-String operations never need to do
recoding, as opposed to M17N Strings. I/O always needs conversion, and
may need conversion with M17N too. I have a hunch that allowing
different kinds of Strings around (as in M17N, presumably) should
require recoding far more often.

Unlikely. Mixed-encoding data handling is uncommon.

-austin


#93

On 18-jun-2006, at 13:08, Michal S. wrote:

But quite a few people here look like they do know. I do not know much
about regexes but I can imagine just about any other string operation.
And the current regexes already do operate on multiple encodings.
Oh, lord… Have you at least tried that before making such assumptions? In
other words, tell me, can Ruby’s regexes cope with the following:

/[а-я]/
/[а-я]/i

or something like this:
http://rubyforge.org/cgi-bin/viewvc.cgi/icu4r/samples/demo_regexp.rb?revision=1.2&root=icu4r&view=markup
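
For readers trying the two Cyrillic patterns today: with UTF-8 source and the regex engine shipped since Ruby 1.9, they behave as sketched below. This says nothing about Ruby 1.8, which is what the question was aimed at; Regexp#match? needs Ruby 2.4 or later.

```ruby
lower       = /[а-я]/     # Cyrillic range U+0430..U+044F
insensitive = /[а-я]/i    # same range, ignoring case

lower_hits_small   = lower.match?("я")
lower_hits_capital = lower.match?("Я")
insens_hits_capital = insensitive.match?("Я")
lower_hits_yo      = lower.match?("ё")   # U+0451 lies outside the range

puts lower_hits_small
puts lower_hits_capital
puts insens_hits_capital
puts lower_hits_yo
```

The last line is a reminder that a code point range is not an alphabet: ё is a Russian letter but falls outside а-я.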

And how does that lead to the conclusion that there should be only one
encoding?
Very simply - I use many pieces of software written in many languages
all the time, with non-Latin text.
I know that when they want to get “historically compatible” problems
arise. And the software that settles on Unicode
internally or somehow enforces it on the programmer usually works
best (all Cocoa and all C#. And to a certain extent yes, Java).

Bluntly put, I am selfish and I don’t believe in the “saving grace”
of the M17N (because I just can’t wrap it around my head and I sure
as hell know it’s going to be VERY complex).

That’s the point. If it is wrapped into the string class you do not
have to wrap it around your head.

This is rather naive.

And that is exactly why a fixed encoding is bad. If strings can be
encoded in any way, there is no point in religious discussions about
which encoding you like the most.

Yes, it just becomes hard and error prone to process them.

It is JustGoodEnough for most cases but not for all. It is not useless
for CJK, just suboptimal because of the Han unification. And it also
does not try to include the historic characters.

I think this thread is going to end the same as the one in 2002 did.


#94

Hi,

In message “Re: Unicode roadmap?”
on Sun, 18 Jun 2006 23:46:40 +0900, Juergen S.
removed_email_address@domain.invalid writes:

|Language implementation, and usage of the String class should be
|easier if this set is
|
|- well defined
|- All characters are equally allowed in all Strings.

I understand these attributes might make implementation easier. But
who cares if I don’t care. And I am not sure how these make usage
easier, really.

Somebody who owns gigabytes of text data in a legacy encoding (e.g. me)
wants to avoid converting back and forth between Unicode and the
legacy encoding every time. Somebody else wants to do text processing on
historical text whose character set is far bigger than Unicode. The
“well-defined” simple implementation just prohibits those demands. On
the other hand, the M17N approach does not get in the way of a Universal
Character Set solution. You just need to choose Unicode (UTF-8 or UTF-16)
as the internal string representation, and convert encoding on I/O as you
might have done in Unicode centric languages. Nothing lost.
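
As a sketch of what "convert encoding on I/O" versus "keep the legacy encoding" looks like with the API that later shipped in Ruby 1.9: the open mode "r:EXTERNAL:INTERNAL" transcodes while reading, and "r:EXTERNAL" only tags the bytes. The file name and helper name are made up for the example.

```ruby
require "tmpdir"

# Hypothetical helper: read the same legacy file both ways.
def read_both_ways(legacy_bytes)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "legacy.txt")
    File.binwrite(path, legacy_bytes)

    # External EUC-JP, internal UTF-8: transcoded while reading.
    converted = File.open(path, "r:EUC-JP:UTF-8", &:read)
    # External EUC-JP only: bytes kept as they are, just tagged.
    untouched = File.open(path, "r:EUC-JP", &:read)

    [converted, untouched]
  end
end

legacy = "\u65E5\u672C\u8A9E".encode("EUC-JP")
converted, untouched = read_both_ways(legacy)

puts converted.encoding
puts untouched.encoding
```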

You may worry about implementation difficulty (and performance), but
don’t. It’s my concern. I made a prototype, and am convinced
that I can implement it with acceptable performance.

|Unicode code points are pretty good in this respect, better than the
|union of all characters in all encodings of possible M17N Strings.
|And we may use private extensions to Unicode for legacy characters not
|included in Unicode already.

“private extensions”. No. It would just cause another nightmare.

						matz.

#95

On Jun 18, 2006, at 8:29 AM, Austin Z. wrote:

You need glyphs, and some glyphs can be
produced with multiple code points (e.g., LOWERCASE A + COMBINING ACUTE
ACCENT as opposed to A ACUTE).

This is another thing you need your String class to be smart about.
You want an equality test between “más” and “más” to always be true
even if their “á” characters are encoded differently. The right way to
solve this is called “Early Uniform Normalization” (see
http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization); the idea
is you normalize the composed characters at the time you create the
string, then the internal equality test can be done with strcmp() or
equivalent.
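
The two spellings of “más” can be compared in a few lines using String#unicode_normalize, which Ruby gained in 2.2. This is a sketch of normalizing to NFC before comparing, not a claim about how a String class should do it internally.

```ruby
composed   = "m\u00E1s"    # á as one code point (U+00E1)
decomposed = "ma\u0301s"   # a followed by COMBINING ACUTE ACCENT (U+0301)

puts composed == decomposed          # byte-wise comparison of raw strings

a = composed.unicode_normalize(:nfc)
b = decomposed.unicode_normalize(:nfc)
puts a == b                          # comparison after normalizing both

# Normalizing changes the stored bytes of the decomposed input.
puts b.bytes == decomposed.bytes
```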

Map legacy data, that is characters still not in Unicode, to a high
Plane in Unicode. That way all characters can be used together all the
time. When Unicode includes them we can change that to the official
code points. Note there are no files in String’s internal storage
format, so we don’t have to worry about reencoding them.

Um. This is the statement of someone who is ignoring legacy issues.
Performance is a big issue when you’re dealing with enough legacy
data.

Note that you don’t have to use a high plane. The Private Use Area
in the Basic Multilingual Plane has 6,400 code points, which is quite
a few. Even if you did use a high plane, it’s not obvious there’d be
a detectable runtime performance penalty.
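
The 6,400 figure follows from the range U+E000 to U+F8FF. A sketch that checks the arithmetic and compares the UTF-8 size of a BMP private-use code point with one from plane 15; the idea of assigning a legacy character to the first slot is purely hypothetical.

```ruby
BMP_PUA = (0xE000..0xF8FF)   # Private Use Area in the Basic Multilingual Plane

puts BMP_PUA.size            # number of private-use code points in the BMP

# Hypothetical assignment: a legacy-only character mapped to the first slot.
bmp_private   = [BMP_PUA.first].pack("U")
plane15_first = [0xF0000].pack("U")   # first code point of plane 15

puts bmp_private.bytesize      # UTF-8 bytes for a BMP private-use character
puts plane15_first.bytesize    # UTF-8 bytes for a high-plane character
```

In UTF-8 the high plane costs one extra byte per character, which is in line with the guess that the runtime penalty would be hard to detect.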

Unicode is often the right choice, but it’s not the only
choice and there are times when having the flexibility to work in
other encodings without having to work through Unicode as an
intermediary is the right choice.

That may be the case. You need to do a cost-benefit analysis; you
could buy a lot of simplicity by decreeing all-Unicode-internally;
would the benefits of allowing non-Unicode characters be big enough
to compensate for the loss of simplicity? I don’t know the
answer, but it needs thinking about.

-Tim


#96

On 18-jun-2006, at 21:17, Christian N. wrote:

solve this is called “Early Uniform Normalization” (see
http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization); the idea
is you normalize the composed characters at the time you create the
string, then the internal equality test can be done with strcmp() or
equivalent.

Does that mean that binary.to_unicode.to_binary != binary is
possible?
That could turn out pretty bad, no?

And it does, as long as you are not careful. One of the things I do is
normalize everything that comes IN into something that is suitable and
predictable.


#97

On Jun 18, 2006, at 12:17 PM, Christian N. wrote:

possible?
That could turn out pretty bad, no?

Yes, but having “más” != “más” is pretty bad too; the alternative is
normalizing at comparison time, which would really hurt for example
in a big sort, so you’d need to cache the normalized form, which
would be a lot more code.

binary.to_unicode looks a little weird to me… can you do that
without knowing what the binary is? If it’s text in a known
encoding, no breakage should occur. If it’s unknown bit patterns,
you can’t really expect anything sensible to happen… or am I
missing an obvious scenario?

-Tim
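
On the cost of normalizing at comparison time in a big sort: in Ruby the caching can be had cheaply with sort_by, which computes each key once per element. A sketch assuming String#unicode_normalize (Ruby 2.2 or later) and a made-up word list; the resulting order is plain code point order, not a locale-aware collation.

```ruby
words = ["ma\u0301s", "mas", "m\u00E1s", "luz"]

# sort_by evaluates the block once per element and sorts on the cached keys,
# so each string is normalized a single time rather than on every comparison.
sorted = words.sort_by { |w| w.unicode_normalize(:nfc) }

puts sorted.inspect
```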


#98

Tim B. removed_email_address@domain.invalid writes:

is you normalize the composed characters at the time you create the
string, then the internal equality test can be done with strcmp() or
equivalent.

Does that mean that binary.to_unicode.to_binary != binary is possible?
That could turn out pretty bad, no?


#99

On Sat, Jun 17, 2006 at 11:24:45PM +0900, Austin Z. wrote:

On 6/17/06, Julian ‘Julik’ Tarkhanov removed_email_address@domain.invalid wrote:

On 17-jun-2006, at 15:52, Austin Z. wrote:

  1. Because Strings are tightly integrated into the language with the
    source reader and are used pervasively, much of this cannot be
    provided by add-on libraries, even with open classes. Therefore the
    need to have it in Ruby’s canonical String class. This will break
    some old uses of String, but now is the right time for that.

“Now” isn’t; Ruby 2.0 is. Maybe Ruby 1.9.1.

My title was “A Plan for Unicode Strings in Ruby 2.0”. I don’t want to
rush things or break 1.8.

Jürgen


#100

On Mon, Jun 19, 2006 at 01:33:54AM +0900, Yukihiro M. wrote:

solution. You just need to choose Unicode (UTF-8 or UTF-16) as
internal string representation, and convert encoding on I/O as you
might have done in Unicode centric languages. Nothing lost.

You may worry about implementation difficulty (and performance), but
don’t. It’s my concern. I made a prototype, and am convinced
that I can implement it with acceptable performance.

I never worried about performance much, that’s Austin. :-P

Thanks for clarifying that. So far I could not find much info on how
exactly M17N will work, especially on the role of the encoding tag, so
I had to guess a lot.

Given your explanation, it seems our ways are quite similar on the
interface side of things, so far as Unicode is concerned. You chose a
more powerful (and more complex) parametric class design where I
would have left open only the possibility of transparently usable
subclasses for performance reasons.

I am happy we’ve worked that out now. And you are right, I am not that
much interested in the implementation, thank you for doing it. My
concern was with the interface of the String class, but several
posters misunderstood me and tried to draw me into implementation
issues.

Jürgen