Brian C. wrote:
Marnen Laibow-Koser wrote:
Huh? Normalization transformations should be pretty easy to implement.
But the point is, you can’t do anything useful with this until you
transcode it anyway, which you can do using Iconv (in either 1.8 or
1.9).
Wrong. Normalization transformations are useful within one Unicode
encoding. In fact, they have little use in transcoding as I understand.
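To make that concrete, here is a quick irb sketch. (Note: String#unicode_normalize only appeared in Ruby 2.2, well after the 1.9 versions under discussion; on 1.8/1.9 you'd reach for a gem such as unicode_utils instead.)

```ruby
# Composed (NFC) and decomposed (NFD) forms of the same abstract text.
composed   = "\u00F1"   # "ñ" as a single precomposed codepoint
decomposed = "n\u0303"  # "n" followed by U+0303 COMBINING TILDE

composed == decomposed                          # => false: different codepoints
composed.unicode_normalize(:nfd) == decomposed  # => true
decomposed.unicode_normalize(:nfc) == composed  # => true
```

No transcoding happens anywhere here; both strings are UTF-8 throughout. Normalization just moves between equivalent codepoint sequences within the encoding.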
[…]
Notice that the n-accent is displayed as a single character by the
terminal, even though it is two codepoints (110 and 771)
I don’t think it’s meaningful to say that something is displayed as a
single character. You can’t see characters – they’re abstract ideas.
All you can see is the glyphs that represent those characters.
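The codepoint/glyph distinction is easy to see in irb:

```ruby
decomposed = "n\u0303"  # "n" + U+0303 COMBINING TILDE
decomposed.codepoints   # => [110, 771]  (two codepoints...)
decomposed.length       # => 2           (...but one glyph on screen)
```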
So you could argue that Dir[] on the Mac is at fault here, for tagging
the string as UTF-8 when it should be UTF-8-MAC.
But you’d be wrong, because UTF-8-MAC is valid UTF-8.
But you still need to transcode to UTF-8 before doing anything useful
with this string. Consider a string containing decomposed characters
tagged as UTF-8-MAC:
(1) The regexp /./ should match a series of decomposed codepoints as a
single ‘character’
I am not sure I agree with you.
str[n] should fetch the nth ‘character’;
Yes, but a combining sequence is not conceptually a character in many
cases.
and so on.
I don’t think this would be easy to implement, since a character
boundary is no longer the same as a codepoint boundary.
Sure it is. You are confusing characters and combining sequences.
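The regexp engine already makes this distinction, in fact: Onigmo's \X matches an extended grapheme cluster (i.e. a full combining sequence), while /./ matches a single character/codepoint. (\X is available from Ruby 2.0 on; I don't believe 1.9's Oniguruma had it.)

```ruby
decomp = "espan\u0303ol"  # decomposed "español"

decomp.scan(/./).length   # => 8  codepoint-sized matches
decomp.scan(/\X/).length  # => 7  grapheme clusters
decomp.scan(/\X/)[4]      # => "n\u0303" — the whole combining sequence
```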
What you actually get is this:
decomp.split(//)
=> ["e", "s", "p", "a", "n", "̃", "o", "l", ".", "l", "n", "g"]
Aside: note that "̃ is actually a single character,
It is nothing of the kind. It is a single combining sequence composed
of two characters. I would expect it to be matched by /…/ .
a double quote with
the accent applied!
Right.
(2) The OP wanted to match the regexp containing a single codepoint /ñ/
against the decomposed representation, which isn’t going to work anyway.
That is, ruby 1.9 does not automatically transcode strings so they are
compatible; it just raises an exception if they are not.
But UTF-8 NFC and UTF-8 NFD are compatible – they’re not even really
separate encodings. At this point I strongly suggest that you read the
article (I think it’s UAX #15) on Unicode normalization.
/ñ/ =~ decomp
Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8
regexp with UTF8-MAC string)
from (irb):5
from /usr/local/bin/irb19:12:in `<main>'
If the only difference between UTF-8 and UTF-8-MAC is normalization,
then this is brain-dead.
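Since the bytes of a UTF-8-MAC string are already valid UTF-8, the sane workaround is to retag and normalize rather than transcode. A sketch of what I mean (using force_encoding plus unicode_normalize, the latter being Ruby 2.2+; the UTF8-MAC tagging here just simulates what Dir[] does on a Mac):

```ruby
decomp = "espan\u0303ol.lng"               # NFD bytes, as Dir[] returns them
decomp = decomp.force_encoding("UTF8-MAC") # simulate Dir[]'s tagging

# Retag as UTF-8 (same bytes, so this is safe) and normalize to NFC:
nfc = decomp.force_encoding("UTF-8").unicode_normalize(:nfc)
/ñ/ =~ nfc  # => 4 — the match works, no Iconv needed
```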
(3) Since ruby 1.9 has a UTF-8-MAC encoding, it should be able to
transcode it to UTF-8 without using Iconv. However this is simply
broken, at least in the version I’m trying here.
/ñ/ =~ decomp.encode("UTF-8")
=> nil
decomp.encode("UTF-8")
=> "espa\xB1\x00ol.lng"
decomp.encode("UTF-8").codepoints.to_a
ArgumentError: invalid byte sequence in UTF-8
from (irb):10:in `codepoints'
from (irb):10:in `each'
from (irb):10:in `to_a'
from (irb):10
from /usr/local/bin/irb19:12:in `<main>'
RUBY_VERSION
=> "1.9.2"
RUBY_PATCHLEVEL
=> -1
RUBY_REVISION
=> 24186
Yikes! That’s bad.
(4) If general support for decomposed form would be added as further
‘Encodings’, there would be an explosion of encodings: UTF-8-D,
UTF-16LE-D, UTF-16BE-D etc, and that’s ignoring the “compatible” versus
“canonical” composed and decomposed forms.
Right. Different normal forms really aren’t separate encodings in the
usual sense.
(5) It is going to be very hard (if not impossible) to make a source
code string or regexp literal containing decomposed “n” and “̃” to be
distinct from a literal containing a composed “ñ”. Try it and see.
And that’s probably a good thing. In fact, that’s the point of
normalization.
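(If you really do need distinct literals — say, for a test case — \u escapes pin down the exact codepoints regardless of what your editor or input method normalizes to:)

```ruby
composed   = "\u00F1"   # precomposed ñ — one codepoint
decomposed = "n\u0303"  # n + U+0303   — two codepoints

composed == decomposed  # => false: String#== compares codepoints, so the
                        # literals stay distinct even though they render the same
```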
(In the above paragraph, the decomposed accent is applied to the
double-quote; that is, "̃ is actually a single ‘character’).
Combining sequence.
Most
editors are going to display both the composed and decomposed forms
identically.
And at least in the case of ñ versus n + combining ~, they normalize to
the same thing in all normal forms (precomposed ñ in C and KC; a 2-char
combining sequence in D and KD). Thus, under any normalization, they
are equivalent and should be treated as such.
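Checking that claim directly (again via unicode_normalize, Ruby 2.2+):

```ruby
composed   = "\u00F1"
decomposed = "n\u0303"

# Under every normal form the two normalize to the same string:
%i[nfc nfd nfkc nfkd].all? do |form|
  composed.unicode_normalize(form) == decomposed.unicode_normalize(form)
end
# => true

composed.unicode_normalize(:nfkc).length  # => 1 (precomposed in C and KC)
composed.unicode_normalize(:nfkd).length  # => 2 (combining sequence in D and KD)
```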
I think this just shows that ruby 1.9’s complexity is not helping in the
slightest. If you have to transcode to UTF-8 composed form, then ruby
1.8 does this just as well (and then you only need to tag the regexp as
UTF-8 using //u).
Normalization really isn’t transcoding in the usual sense.
Best,
Marnen Laibow-Koser
http://www.marnen.org
[email protected]