Bug in CGI::unescapeHTML?


#1

Hi,

I think there’s a bug in CGI::unescapeHTML. Or am I doing something
wrong?

$KCODE='u'
CGI::unescapeHTML("ã")

will return “\343”, which according to my screaming mysql utf-8 encoded
database is not a valid utf-8 sequence.

The source of CGI::unescapeHTML reveals that all values below 256 are simply
converted to a single byte using value.chr.

Greetings,

Esad


#2

Hi,

In message “Re: Bug in CGI::unescapeHTML?”
on Thu, 5 Jul 2007 13:00:02 +0900, Esad H.
removed_email_address@domain.invalid writes:

|I think there’s a bug in CGI::unescapeHTML. Or am I doing something wrong?
|
|$KCODE='u'
|CGI::unescapeHTML("ã")
|
|will return “\343”, which according to my screaming mysql utf-8 encoded
|database is not a valid utf-8 sequence

Not a bug, unfortunately. Since your client sent a binary sequence
“\343” in URL encoding, unescapeHTML() decoded it back. Specifying
$KCODE='u' does not affect encoding your clients send. You have to
check (or convert) input from your clients explicitly, anyway.

          matz.

#3

Hi,

On 05/07/07, Yukihiro M. removed_email_address@domain.invalid wrote:

|$KCODE='u'
|CGI::unescapeHTML("ã")
|
|will return “\343”, which according to my screaming mysql utf-8 encoded
|database is not a valid utf-8 sequence

|Not a bug, unfortunately. Since your client sent a binary sequence
|“\343” in URL encoding, unescapeHTML() decoded it back. Specifying
|$KCODE='u' does not affect encoding your clients send. You have to
|check (or convert) input from your clients explicitly, anyway.

If I understand HTML correctly, it is pretty much a bug, although it’s
perhaps more of a reflection of Ruby’s limited encoding support (which
has already been well discussed on this list!).

According to the HTML4 specification[1], ‘The syntax “&#xH;” or
“&#XH;”, where H is a hexadecimal number, refers to the ISO 10646
hexadecimal character number H.’ ISO 10646 is (more or less) Unicode,
so this should be a Unicode codepoint regardless of the document
transfer encoding.

[1] http://www.w3.org/TR/html4/charset.html#h-5.3.1

CGI decodes the numerical entities into their byte representations:
this works for ISO-8859-1 (because ISO-8859-1 characters match Unicode
codepoints up to U+00FF), but an HTML document can specify entities
that cannot be represented in a single-byte encoding.
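
To make the byte-versus-codepoint distinction concrete, here is a minimal sketch. It assumes the entity in the original post referred to code point U+00E3 (for example &#xE3;), and it is written for Ruby 1.9 or later rather than the 1.8 interpreter discussed in this thread. The helper names are hypothetical and exist only for illustration.

```ruby
# Illustrative only: the code point is assumed to be U+00E3, and the
# helper names below are hypothetical (not part of CGI).

# What a byte-oriented decoder produces: one raw byte (0xE3, octal \343).
def as_single_byte(codepoint)
  codepoint.chr
end

# The same code point encoded as UTF-8.
def as_utf8(codepoint)
  [codepoint].pack("U")
end

raw  = as_single_byte(0xE3)
utf8 = as_utf8(0xE3)

p raw.unpack("C*")    # byte values of the single-byte form
p utf8.unpack("C*")   # byte values of the UTF-8 form

# A lone 0xE3 byte announces a three-byte UTF-8 sequence that never
# arrives, so a strict UTF-8 consumer such as a database rejects it.
p raw.dup.force_encoding("UTF-8").valid_encoding?
p utf8.valid_encoding?
```

The single-byte form happens to be correct ISO-8859-1, which is why the problem only shows up once the rest of the system expects UTF-8.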

To process a received HTML or XHTML file properly, one needs to:

  • Convert the document from the transfer encoding to a Unicode
    representation
  • Convert any entities in the document to their corresponding
    codepoints.
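
The two steps above could be sketched roughly as follows. This is not how CGI or htmlentities is implemented; decode_numeric_entities is a hypothetical helper, it assumes Ruby 1.9 or later, and it handles only numeric references (no named entities such as &amp;) without validating the resulting code points.

```ruby
# Hypothetical helper, for illustration only.
# raw:             the bytes as received
# source_encoding: the transfer encoding declared for the document
def decode_numeric_entities(raw, source_encoding)
  # Step 1: convert from the transfer encoding to UTF-8.
  text = raw.dup.force_encoding(source_encoding).encode("UTF-8")

  # Step 2: replace decimal (&#227;) and hexadecimal (&#xE3;) references
  # with the UTF-8 encoding of the code point they name.
  text.gsub(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/) do
    codepoint = $1 ? $1.hex : $2.to_i
    [codepoint].pack("U")
  end
end

# An ISO-8859-1 input containing a raw 0xE9 byte plus two references.
puts decode_numeric_entities("caf\xE9 &#xE3; &#8364;", "ISO-8859-1")
```

Doing the transcoding first matters: once the text is UTF-8, every decoded code point can be written into the same string, including ones above U+00FF that no single-byte encoding can hold.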

With a bit of self-promotion[2], one solution to Esad’s problem would
be:

require 'htmlentities'
=> true

$KCODE = 'u'
=> "u"

HTMLEntities.new.decode('ã')
=> "ã"

[2] http://htmlentities.rubyforge.org/

Paul.


#4

Paul B. wrote the following on 05.07.2007 23:10 :

|If I understand HTML correctly, it is pretty much a bug, although it’s
|perhaps more of a reflection of Ruby’s limited encoding support (which
|has already been well discussed on this list!).

I tend to agree. I just fixed a bug in one of my apps where I blindly
used CGI.unescapeHTML which, as the original poster mentioned,
generates output that isn’t welcomed by a system configured to use UTF-8
all the way, especially the database (PostgreSQL in my case)…

Thanks for htmlentities, it saved my day.

Lionel