In message “Re: Bug in CGI::unescapeHTML?”
on Thu, 5 Jul 2007 13:00:02 +0900, Esad H. writes:
|I think there’s a bug in CGI::unescapeHTML. Or am I doing something wrong?
|
|$KCODE='u'
|CGI::unescapeHTML("&#xE3;")
|
|will return “\343”, which according to my screaming mysql utf-8 encoded
|database is not a valid utf-8 sequence
Not a bug, unfortunately. Since your client sent a binary sequence
“\343” in URL encoding, unescapeHTML() decoded it back. Specifying
$KCODE='u' does not affect the encoding your clients send. You have to
check (or convert) input from your clients explicitly, anyway.
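As a rough illustration of "check (or convert)" on a current Ruby, here is a minimal sketch. It assumes Ruby 1.9 or later, where strings carry an encoding (the $KCODE mechanism in this thread predates that). The method name ensure_utf8 and the ISO-8859-1 fallback are assumptions made for the example, not something from the thread.

```ruby
# Hypothetical helper: return the input as valid UTF-8.
# Assumption: bytes that are not valid UTF-8 were sent as ISO-8859-1.
def ensure_utf8(input, fallback_encoding = Encoding::ISO_8859_1)
  candidate = input.dup.force_encoding(Encoding::UTF_8)
  # Check: if the bytes already form valid UTF-8, keep them unchanged.
  return candidate if candidate.valid_encoding?

  # Convert: reinterpret the bytes in the fallback encoding and transcode.
  input.dup.force_encoding(fallback_encoding).encode(Encoding::UTF_8)
end

puts ensure_utf8("\xE3".b).bytes.inspect
```

A real application would more likely take the encoding from the request's declared charset than guess a fallback.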
If I understand HTML correctly, it is pretty much a bug, although it’s
perhaps more of a reflection of Ruby’s limited encoding support (which
has already been well discussed on this list!).
According to the HTML4 specification[1], ‘The syntax “&#xH;” or
“&#XH;”, where H is a hexadecimal number, refers to the ISO 10646
hexadecimal character number H.’ ISO 10646 is (more or less) Unicode,
so this should be a Unicode codepoint regardless of the document
transfer encoding.
CGI decodes the numerical entities into their byte representations:
this works for ISO-8859-1 (because ISO-8859-1 characters match Unicode
codepoints up to U+00FF), but an HTML document can specify entities
that cannot be represented in a single-byte encoding.
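To make the byte-versus-codepoint distinction concrete, here is a small sketch for a current Ruby (1.9 or later is assumed; the helper names are made up for the example). It compares the single byte 0xE3 ("\343" in octal) with the UTF-8 encoding of the codepoint U+00E3.

```ruby
# The single byte a byte-oriented decoder yields for a codepoint below 256.
def byte_for(codepoint)
  codepoint.chr # binary (ASCII-8BIT) string for values 0x80..0xFF
end

# The UTF-8 encoding of the same codepoint.
def utf8_for(codepoint)
  [codepoint].pack("U")
end

# True if the given bytes form a valid UTF-8 sequence.
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

puts valid_utf8?(byte_for(0xE3))    # false
puts utf8_for(0xE3).bytes.inspect   # [195, 163]
```

The lone byte 0xE3 is the lead byte of a three-byte UTF-8 sequence with nothing following it, which is why a UTF-8 database rejects it.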
To process a received HTML or XHTML file properly, one needs to:
1. Convert the document from the transfer encoding to a Unicode
   representation.
2. Convert any entities in the document to their corresponding
   codepoints.
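The two steps above can be sketched with the standard library alone. This is an illustration rather than a complete decoder: it assumes Ruby 1.9 or later, handles only numeric character references (no named entities such as &amp;), and does not guard against out-of-range codepoints. All method and constant names are invented for the example.

```ruby
# Step 1: transcode the raw bytes from the transfer encoding to UTF-8.
def to_utf8(bytes, transfer_encoding)
  bytes.dup.force_encoding(transfer_encoding).encode(Encoding::UTF_8)
end

# Matches hexadecimal (&#xH; or &#XH;) and decimal (&#D;) references.
NUMERIC_REFERENCE = /&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/

# Step 2: replace each numeric reference with the UTF-8 form of its codepoint.
def decode_numeric_references(text)
  text.gsub(NUMERIC_REFERENCE) do
    codepoint = $1 ? $1.hex : $2.to_i
    [codepoint].pack("U")
  end
end

def decode_document(bytes, transfer_encoding)
  decode_numeric_references(to_utf8(bytes, transfer_encoding))
end

latin1 = "caf\xE9 &#xE3; &#8364;".b
puts decode_document(latin1, "ISO-8859-1")   # prints: café ã €
```

Transcoding first matters: after step 1 the whole string is UTF-8, so inserting the UTF-8 form of each codepoint in step 2 cannot produce a mixed-encoding result.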
With a bit of self-promotion[2], one solution to Esad’s problem would
be:
|If I understand HTML correctly, it is pretty much a bug, although it’s
|perhaps more of a reflection of Ruby’s limited encoding support (which
|has already been well discussed on this list!).
I tend to agree. I just fixed a bug in one of my apps where I blindly
used CGI.unescapeHTML which, as the original poster mentioned,
generates output that isn’t welcomed by a system configured to use UTF-8
all the way, especially the database (PostgreSQL in my case)…
Thanks for htmlentities, it saved my day.
Lionel