On 05/07/07, Yukihiro M. wrote:
|will return “\343”, which according to my screaming mysql utf-8 encoded
|database is not a valid utf-8 sequence
Not a bug, unfortunately. Since your client sent a binary sequence
“\343” in URL encoding, unescapeHTML() decoded it back. Specifying
$KCODE='u' does not affect the encoding your clients send. You have to
check (or convert) input from your clients explicitly, anyway.
If I understand HTML correctly, it is pretty much a bug, although it’s
perhaps more of a reflection of Ruby’s limited encoding support (which
has already been well discussed on this list!).
According to the HTML4 specification, ‘The syntax “&#xH;” or
“&#XH;”, where H is a hexadecimal number, refers to the ISO 10646
hexadecimal character number H.’ ISO 10646 is (more or less) Unicode,
so this should be a Unicode codepoint regardless of the document encoding.
CGI decodes the numerical entities into their byte representations:
this works for ISO-8859-1 (because ISO-8859-1 characters match Unicode
codepoints up to U+00FF), but an HTML document can specify entities
that cannot be represented in a single-byte encoding.
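To make the byte-versus-codepoint distinction concrete, here is a small sketch using only Array#pack and String#unpack. It takes the codepoint from this thread (U+00E3, i.e. "\343" in octal) and shows what you get when it is written out as one raw byte compared with when it is encoded as UTF-8. The variable names are mine, purely for illustration.

```ruby
# Codepoint U+00E3 (LATIN SMALL LETTER A WITH TILDE), as in "&#xE3;" or "&#227;".
codepoint = 0xE3

# Writing the codepoint out as one raw byte: only correct for ISO-8859-1.
latin1_bytes = [codepoint].pack("C")

# Encoding the same codepoint as UTF-8: two bytes, 0xC3 0xA3.
utf8_bytes = [codepoint].pack("U")

p latin1_bytes.unpack("C*")  # => [227]
p utf8_bytes.unpack("C*")    # => [195, 163]
```

The single byte 227 is exactly the "\343" that the database rejects, since on its own it is not a complete UTF-8 sequence.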
To process a received HTML or XHTML file properly, one needs to:
- Convert the document from the transfer encoding to a Unicode
- Convert any entities in the document to their corresponding
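As a rough sketch of the entity-conversion step only: the helper below (decode_numeric_entities is a made-up name, not part of CGI or any library) assumes the text has already been converted to UTF-8, and it rewrites decimal and hexadecimal numeric references as UTF-8 via pack("U"). Named entities such as &amp; are deliberately left untouched, and no validation of the codepoint is attempted.

```ruby
# Hypothetical helper: replace numeric character references with UTF-8 text.
# Assumes +text+ is already UTF-8. Named entities (&amp;, &lt;, ...) are not
# handled here, and out-of-range codepoints are not checked.
def decode_numeric_entities(text)
  text.gsub(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/) do
    # $1 is set for the hexadecimal form, $2 for the decimal form.
    codepoint = $1 ? $1.hex : $2.to_i
    [codepoint].pack("U")
  end
end

p decode_numeric_entities("&#xE3;").unpack("C*")    # => [195, 163]
p decode_numeric_entities("&#x3042;").unpack("C*")  # => [227, 129, 130]
```

Both "&#xE3;" and "&#227;" come out as the same two-byte UTF-8 sequence, rather than the lone "\343" byte.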
With a bit of self-promotion, one solution to Esad’s problem would
$KCODE = 'u'