Chris W. wrote in post #1020283:
Python won’t save you from the complexities of encoding.
No, but the language does have very clearly defined semantics. It also
makes a clear distinction between “a character” and “an encoded
representation of that character”, and it has two distinct classes for
those things.
It is then pretty much foolproof: if you forget to decode your bytes
into characters, or encode your characters into bytes, or try to combine
characters with bytes, then you’ll get an immediate and consistent
runtime error.
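As a minimal sketch of that behaviour in Python 3 (the `try_concat` helper is hypothetical, written only for this illustration), mixing `str` and `bytes` raises a `TypeError` until one side is explicitly decoded:

```python
# str holds characters; bytes holds an encoded representation of them.
text = "café"                    # 4 characters
data = text.encode("utf-8")      # 5 bytes, because "é" takes two bytes in UTF-8


def try_concat(a, b):
    """Hypothetical helper: return a + b, or the TypeError raised by mixing types."""
    try:
        return a + b
    except TypeError as exc:
        return exc


mixed = try_concat(text, data)                   # str + bytes is rejected
fixed = try_concat(text, data.decode("utf-8"))   # str + str works once decoded

print(type(mixed).__name__)
print(fixed)
```

The error does not depend on the contents of the data: it is raised even when both values are pure ASCII.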
Ruby has a hazy notion of these things, and hazy(*) rules which allow
you sometimes to combine strings of characters and binary strings, and
sometimes not. If your program runs successfully once, it doesn’t mean
that it’s going to run successfully with different input data.
Furthermore, any library in Ruby 1.9 which either accepts a String or
returns a String needs to document its encoding-related behaviour;
almost none of them do. In Python 3, all you have to say is whether it
uses str or bytes.
(*) Even data which I explicitly tag as being BINARY is taken to be
ASCII-8BIT, whether that is true or not.
You have to
remember, Ruby has its base in Japan. In Japan you roughly have the
following encodings to deal with:
- UTF*
- EUC-JP
- SJIS
- ISO-2022-JP
The confusion between “encodings” and “character sets” is pretty
endemic, and I have fallen prey to it myself many times.
Python partly dodges this issue because it supports only one character
set, Unicode, and then various encodings of it (like UTF*) and
encodings of subsets of the character set (like ISO-8859-*).
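To make the “encodings of subsets” point concrete, here is a small sketch using only the standard `str.encode` method. The sample string is my own choice; it relies on ISO-8859-1 containing “ï” but not the euro sign:

```python
text = "naïve €5"

# UTF-8 can represent every Unicode character, so this always succeeds.
utf8 = text.encode("utf-8")

# ISO-8859-1 covers only a subset of Unicode: "ï" is in it, "€" is not.
try:
    text.encode("iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The caller chooses what happens to characters outside the subset.
replaced = text.encode("iso-8859-1", errors="replace")

print(strict_failed)
print(replaced)
```

The default `errors="strict"` refuses to lose data silently; `"replace"` substitutes `?` for anything the target encoding cannot hold.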
I understand that there are various Asian character sets which are not
proper subsets of Unicode, and so can’t be converted losslessly to and
from Unicode. If Python 3 were to be extended to handle them, then I
imagine there would be separate classes for EUCJPString and GB2312String
or whatever, and methods to transcode between them (and options for what
to do about missing characters).
And of course, Ruby 1.9 doesn’t really handle ISO-2022-JP anyway,
because it’s a stateful encoding; I’m pretty sure you can’t index,
take the length of, or regexp-match an ISO-2022-JP string in Ruby 1.9
without first transcoding it.
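For comparison, this is roughly how the same stateful encoding looks from the Python 3 side, assuming the standard `iso2022_jp` codec. The escape sequences live only in the bytes, so length and indexing on the decoded `str` work in characters:

```python
text = "日本語"                         # three characters

# Encoding wraps the JIS bytes in escape sequences that switch character sets.
encoded = text.encode("iso2022_jp")
decoded = encoded.decode("iso2022_jp")

print(len(text), len(encoded))          # character count vs. byte count
print(encoded[:1] == b"\x1b")           # the byte string starts with ESC
print(decoded[1])                       # indexing the str yields a character
```

Indexing into `encoded` directly would return raw byte values, including pieces of the escape sequences, which is why such data has to be decoded before it is treated as text.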
This is just a very broad generalization. There are even more issues
such as multiple versions of SJIS.
Absolutely. So it’s vital to have a clear distinction between
    encoded sequence  <----------->  set of characters
    of bytes
which Python 3 has; whereas Ruby 1.9 tries to work with the encoded
sequence of bytes as-is, hoping you’ve remembered to tag the encoding
correctly every time, and remorselessly tagging binary data as being
text anyway.
just remember that every language has its ups and
downs. Python 3 for example has many external libraries, including
Django and some of the ui toolkits, that are not supported.
That’s true, and it’s Django which keeps me from skipping Python 2
entirely and just going to 3.
Regards,
Brian.