Yukihiro M. wrote:
In message “Re: [ENCODING] UTF8 hell”
on Tue, 23 Feb 2010 20:10:20 +0900, Xavier Noëlle writes:
|self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
|
|233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
|self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
|byte sequence in UTF-8 (ArgumentError).
233 is not a valid UTF-8 character. The byte sequence for médicals is
<109 195 169 100 105 99 97 108 115>.
A general hint for debugging encoding troubles: the UTF-8 encoding
guarantees that every Unicode codepoint is encoded either as a
single octet with its most significant bit cleared to 0 (i.e. a
decimal value between 0 and 127) or as a sequence of 2 to 4
octets, all of which have their MSB set to 1 (i.e. a decimal value
between 128 and 255). (The original UTF-8 design allowed sequences of
up to 6 octets; RFC 3629 limits them to 4.)
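A minimal sketch of that guarantee, in case it helps to see it on real data. The helper name utf8_octets and the sample string are made up for illustration; the string is written with \u escapes so the source file encoding does not matter.

```ruby
# Hypothetical helper: pair each character of a string with its UTF-8 octets.
def utf8_octets(str)
  str.encode("UTF-8").each_char.map { |ch| [ch, ch.bytes] }
end

sample = "m\u00E9d \u20AC"   # the characters m, é, d, space, €

utf8_octets(sample).each do |ch, octets|
  kind = if octets.length == 1
           "single octet, MSB clear"
         else
           "#{octets.length} octets, all with MSB set"
         end
  puts "#{ch.inspect}: #{octets.join(' ')} (#{kind})"
end
```

Every ASCII character comes out as one octet below 128, and every other character as several octets that are all 128 or above.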
A single octet with its MSB set to 1 can never be a valid UTF-8
character on its own; it can only be part of a multi-octet character,
i.e. it must be immediately preceded or followed by at least one other
octet with its MSB set. However, in your string there is no
multi-octet sequence: there is only a single octet with
its MSB set (the second one, with the decimal value 233), so you can
see without having to look at any code tables that this string
cannot possibly be a UTF-8 string.
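The check described above can be sketched in a few lines of Ruby. The helper name lone_high_octets is hypothetical, and the test is only a necessary condition: a string that passes it can still be invalid UTF-8, so String#valid_encoding? remains the authoritative check.

```ruby
# Hypothetical helper: return the indices of octets that have their MSB set
# but have no direct neighbour with the MSB set. Such an octet cannot occur
# in valid UTF-8.
def lone_high_octets(bytes)
  bytes.each_index.select do |i|
    next false if bytes[i] < 128
    prev_high = i > 0 && bytes[i - 1] >= 128
    next_high = i + 1 < bytes.length && bytes[i + 1] >= 128
    !prev_high && !next_high
  end
end

reported = [109, 233, 100, 105, 99, 97, 108, 115]       # bytes from the original report
expected = [109, 195, 169, 100, 105, 99, 97, 108, 115]  # UTF-8 bytes for "médicals"

p lone_high_octets(reported)   # => [1]
p lone_high_octets(expected)   # => []

# Ruby's own validity check agrees for the reported bytes.
p reported.pack("C*").force_encoding("UTF-8").valid_encoding?   # => false
```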
As Rick already hinted, it is either an ISO/IEC 8859-1, ISO/IEC
8859-2, ISO/IEC 8859-3, ISO/IEC 8859-4, ISO/IEC 8859-9, ISO/IEC
8859-10, ISO/IEC 8859-13, ISO/IEC 8859-14, ISO/IEC 8859-15, ISO/IEC
8859-16, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-9,
ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16 or
Windows-1252 string (it’s impossible to tell, but makes no difference
in this case). My guess is on ISO-8859-15.
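If the guess is right, one possible fix is to tell Ruby what the bytes really are and then transcode. This is only a sketch under that assumption: the helper name to_utf8 is made up, and the source encoding has to be known or guessed, since Ruby cannot detect it for you.

```ruby
# Hypothetical helper: reinterpret raw bytes as `source_encoding`
# and transcode the result to UTF-8. The input string is not modified.
def to_utf8(raw, source_encoding)
  raw.dup.force_encoding(source_encoding).encode("UTF-8")
end

raw = [109, 233, 100, 105, 99, 97, 108, 115].pack("C*")   # the reported bytes

text = to_utf8(raw, "ISO-8859-15")   # assumption: the data is ISO-8859-15
p text.bytes            # => [109, 195, 169, 100, 105, 99, 97, 108, 115]
p text.valid_encoding?  # => true
p text.gsub("ruby", "zorglub") == text   # => true, gsub no longer raises

# For this particular string the choice among these candidates does not matter.
p %w[ISO-8859-1 ISO-8859-15 Windows-1252].map { |e| to_utf8(raw, e) }.uniq.length   # => 1
```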
[This property is BTW what makes UTF-8 compatible with ASCII, because
it guarantees that every Unicode character which is also in ASCII
will be encoded the same way as it would be in ASCII, and every Unicode
character which is not in ASCII will be encoded as a sequence of
octets each of which is illegal in ASCII. It also provides some
robustness against confusion with 8-bit encodings such as the ISO 8859
family, because statistically it is very likely that somewhere in such
a text there will be a single octet with its MSB set (in this case,
it’s the é and in my name it’s the ö) which is surrounded by octets
with their MSB cleared, which cannot ever happen in UTF-8.]
jwm