2006/6/22, Yukihiro M. [email protected]:
> randomly. Do you have any idea to simplify things?
> I am eager to hear.
So what will be the semantics of the encoding tag:
a) a weak suggestion?
b) a strong assertion?
If the encoding tag is only a weak suggestion (and for now I see it will
be just that), it implies:
- a performance win (no need to check conformance to the declared encoding)
- a complexity win (most tasks use source code, text data input and
output all in the same [default host] encoding)
- portability drawbacks (assumptions made by the original coders stay
implicit, but they have to be figured out when porting to another
environment)
- reliability drawbacks (weak suggestions are too often ignored, and
you don't know when, where and why they will hit your app, but someday
they will!)
If the encoding tag is a strong assertion, it implies:
- a probable performance loss:
  - asserting that a string tagged encoding = "none" (raw) is a valid
    byte sequence for the target encoding costs about the same as
    String#length
  - a need to recode the bytes when changing the tag
- slightly more complexity (developers will have to declare these
assertions explicitly)
- a portability win
- a reliability win
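For comparison, a minimal sketch of both models in today's Ruby (1.9+), where the tag is in fact a weak suggestion: force_encoding retags without inspecting the bytes, and the validity scan a strong assertion would require is a separate O(n) pass (valid_encoding?), comparable in cost to String#length.

```ruby
# Weak suggestion: force_encoding changes the tag without checking bytes.
valid = "\xD0\x9F\xD1\x80".dup.force_encoding("UTF-8")  # bytes of "Пр"
p valid.valid_encoding?   # => true (but only because we checked explicitly)

# A strong-assertion tag would have to run this scan on every retag:
bogus = "\xFF\xFE".dup.force_encoding("UTF-8")
p bogus.valid_encoding?   # => false
```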
What compromise on these issues would be acceptable?
I'd prefer the encoding tag as a strong assertion, mostly for
reliability reasons.
And for operations on Strings with different encodings, I’d like
implicit automatic encoding coercion:
NOTES:
a) String#recode!(new_encoding) replaces the current internal byte
representation with a new byte sequence, recoded from the current one.
It must raise IncompatibleCharError if a char can't be converted to
the destination encoding.
b) Downgrading a string from some stated encoding to the "none" tag
must be done only explicitly; it is not an option for implicit
conversion.
c) $APPLICATION_UNIVERSAL_ENCODING is a global var, allowed to be
set once and only once per application run.
Intent: we want all strings which aren't raw bytes to be in one
single predefined encoding, so all operations on strings must return
strings in the conformant encoding. The desired encoding is the value
of $APPLICATION_UNIVERSAL_ENCODING. If $APPLICATION_UNIVERSAL_ENCODING
is nil, we go into "democracy mode", see below.
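A hypothetical sketch of note (a): the proposed String#recode! does not exist, but it could be expressed with today's String#encode!, whose exception for unconvertible characters is Encoding::UndefinedConversionError rather than the proposed IncompatibleCharError.

```ruby
# Hypothetical: the proposed recode! expressed via today's encode!.
# (IncompatibleCharError is the name proposed above, not a real class.)
class String
  def recode!(new_encoding)
    encode!(new_encoding)  # rewrites the internal bytes in place;
                           # raises if a char has no representation
  end
end

s = "Пример".dup
s.recode!("UTF-16LE")
p s.encoding                        # => #<Encoding:UTF-16LE>

begin
  "Пример".dup.recode!("US-ASCII")  # Cyrillic has no ASCII mapping
rescue Encoding::UndefinedConversionError => e
  p e.class
end
```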
def coerce_encodings(str1, str2)
  enc1 = str1.encoding
  enc2 = str2.encoding

  # Simple case: same encodings; will return fast in most cases
  return if enc1 == enc2

  # Another simple but rare case: totally incompatible encodings,
  # i.e. they represent incompatible charsets
  if fully_incompatible_charsets?(enc1, enc2)
    raise IncompatibleCharError,
          format("incompatible charsets %s and %s", enc1, enc2)
  end

  # Uncertainty: handling "none" vs. a preset encoding
  if enc1 == "none" || enc2 == "none"
    raise UnknownIntentEncodingError,
          format("can't implicitly coerce encodings %s and %s, " \
                 "use explicit conversion", enc1, enc2)
  end

  # Tyranny mode:
  # we want all strings which aren't raw bytes to be in one single
  # predefined encoding
  if $APPLICATION_UNIVERSAL_ENCODING
    str1.recode!($APPLICATION_UNIVERSAL_ENCODING)
    str2.recode!($APPLICATION_UNIVERSAL_ENCODING)
    return
  end

  # Democracy mode:
  # first try to perform a lossless conversion from one encoding to
  # the other:
  # 1) direct lossless conversion to the other encoding,
  #    e.g. UTF-8 <-> UTF-16
  if exists_direct_non_loss_conversion?(enc1, enc2)
    if exists_direct_non_loss_conversion?(enc2, enc1)
      # performance hint if both directions are available
      if str1.byte_length < str2.byte_length
        str1.recode!(enc2)
      else
        str2.recode!(enc1)
      end
    else
      str1.recode!(enc2)
    end
    return
  end
  if exists_direct_non_loss_conversion?(enc2, enc1)
    str2.recode!(enc1)
    return
  end

  # 2) lossless conversion to a common superset
  #    (I see no reason to raise an exception on KOI8-R + CP1251;
  #    returning a string in Unicode will be OK)
  if superset_encoding = find_superset_non_loss_conversion?(enc1, enc2)
    str1.recode!(superset_encoding)
    str2.recode!(superset_encoding)
    return
  end

  # A case of incomplete compatibility:
  # check whether a subset of enc1 is also a subset of enc2, so that
  # some strings in enc1 can be safely recoded to enc2, e.g. two
  # pure-ASCII strings, whatever ASCII-compatible encodings they have
  if exists_partial_loss_conversion?(enc1, enc2)
    if exists_partial_loss_conversion?(enc2, enc1)
      # performance hint if both directions are available
      if str1.byte_length < str2.byte_length
        str1.recode!(enc2)
      else
        str2.recode!(enc1)
      end
    else
      str1.recode!(enc2)
    end
    return
  end

  # the last thing we can try
  str2.recode!(enc1)
end
So, when an operation involves two Strings, or a String and a Regexp,
with different encodings, automatic coercion should be done as
described above.
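For contrast, the route Ruby eventually shipped (1.9+) does no such coercion: mixing is allowed only when one side is ASCII-only in an ASCII-compatible encoding, and everything else raises rather than recoding implicitly. A small sketch:

```ruby
# ASCII-only strings mix freely across ASCII-compatible encodings;
# the result takes the "wider" encoding:
a = "abc".encode("US-ASCII")
b = "déf".encode("UTF-8")
p (a + b).encoding            # => #<Encoding:UTF-8>

# Anything else raises instead of coercing:
begin
  b + "ghi".encode("UTF-16LE")
rescue Encoding::CompatibilityError => e
  p e.class
end
```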
That will probably solve most coding problems (no need to think about
encodings most of the time), but it can have the following impacts:
- after several operations, when one sends a string to external IO, it
might be internally encoded in a superset of that IO's encoding. One
has to remember that and perform external IO accordingly, i.e. decide
whether to fail on invalid chars or to use replacement chars (like
U+FFFD), but that is unavoidable.
- some performance hits, which I expect to be rare.
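The fail-or-replace choice at the IO boundary maps directly onto options that today's String#encode provides; a sketch:

```ruby
s = "naïve"

# Option 1: fail on unconvertible chars
begin
  s.encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  p e.class
end

# Option 2: substitute a replacement char ("?" here; U+FFFD is the
# usual choice when the target is a Unicode encoding)
p s.encode("US-ASCII", undef: :replace, replace: "?")  # => "na?ve"
```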
Besides, there can be another class of problems with automatic
coercion: how do we ensure consistent behavior of character ranges in
Regexps and in String methods like count, delete, squeeze, tr, succ,
next and upto when encodings are coerced?
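The concern is real: character ranges are expanded in the codepoint order of the string's encoding, and KOI8-R orders Cyrillic letters differently from Unicode, so the same range selects different characters before and after recoding. A sketch, relying on how Ruby 1.9+ later defined this:

```ruby
s = "борщ"                # UTF-8
k = s.encode("KOI8-R")

# In Unicode, а..я is the contiguous block U+0430..U+044F,
# so all four letters match:
p s.count("а-я")                     # => 4

# KOI8-R orders Cyrillic roughly by Latin transliteration, so the
# bytes for "р" (0xD2) and "щ" (0xDD) fall outside а (0xC1)..я (0xD1):
p k.count("а-я".encode("KOI8-R"))    # => 2
```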
What I, as a Ruby user, wish for Unicode/M17N support:
- reliability and consistency:
a) String should be an abstraction for a character sequence;
b) String methods shouldn't allow me to garble the internal
representation;
c) treating a String as a byte sequence is handy, but must be
explicitly stated.
- coding comfort:
a) no need to care what encodings strings have while working with
them;
b) no need to care what encodings strings returned from third-party
code have;
c) using explicitly stated conversion options for external IO.
- on Unicode and i18n: at least have a set of classes for
Unicode-specific tasks (collation, normalization, string search,
locale-aware formatting, etc.) that would work efficiently with Ruby
strings.
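Parts of this wish were later granted in core: String#unicode_normalize (Ruby 2.2+) covers the normalization item, while collation and locale-aware formatting still live in external libraries. A small sketch of the normalization piece:

```ruby
composed   = "\u00E9"       # é as a single precomposed codepoint
decomposed = "e\u0301"      # e + combining acute accent

p composed == decomposed                           # => false
p composed.unicode_normalize(:nfd) == decomposed   # => true
p decomposed.unicode_normalize(:nfc) == composed   # => true
```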
And, for all out there, just ask “Which charset/encoding will fit all
the [present and future] needs?”. You know the exact answer: “NONE”.
> I understand the challenge, but I don't think it is common to run some
> part of your program in legacy encoding (without conversion), and
> other part in UTF-8. You need to convert them into universal encoding
> anyway for most of the cases. That's why I said it rare.
Uhm, and how would one convert a compiled extension library?