Bug in CGI::unescapeHTML?

Hi,

I think there’s a bug in CGI::unescapeHTML. Or am I doing something
wrong?

$KCODE='u'
CGI::unescapeHTML("&#227;")

will return “\343”, which according to my screaming mysql utf-8 encoded
database is not a valid utf-8 sequence

The source of CGI::unescapeHTML reveals that all values below 256 are
simply converted to a single byte using value.chr.

Greetings,

Esad

Hi,

In message “Re: Bug in CGI::unescapeHTML?”
on Thu, 5 Jul 2007 13:00:02 +0900, Esad H. writes:

|I think there’s a bug in CGI::unescapeHTML. Or am I doing something wrong?
|
|$KCODE='u'
|CGI::unescapeHTML("&#227;")
|
|will return “\343”, which according to my screaming mysql utf-8 encoded
|database is not a valid utf-8 sequence

Not a bug, unfortunately. Since your client sent a binary sequence
“\343” in URL encoding, unescapeHTML() decoded it back. Specifying
$KCODE=‘u’ does not affect encoding your clients send. You have to
check (or convert) input from your clients explicitly, anyway.
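
As a rough illustration of checking or converting input explicitly, here is a minimal sketch. It assumes Ruby 2.0 or later (String encodings postdate this thread), and both the helper name ensure_utf8 and the ISO-8859-1 fallback are assumptions made for the example, not part of CGI.

```ruby
# Hypothetical helper: return the input as valid UTF-8.
# Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
def ensure_utf8(input, fallback_encoding = "ISO-8859-1")
  candidate = input.dup.force_encoding("UTF-8")
  return candidate if candidate.valid_encoding?

  # Not valid UTF-8: reinterpret the bytes in the fallback encoding
  # and transcode them to UTF-8.
  input.dup.force_encoding(fallback_encoding).encode("UTF-8")
end

# The lone byte "\343" (0xE3) is not valid UTF-8, so it gets converted.
from_single_byte = ensure_utf8("\xE3".b)

# Bytes that are already valid UTF-8 are passed through unchanged.
already_utf8 = ensure_utf8("\xC3\xA3".b)
```

Both calls should end up with the same two-byte UTF-8 string for U+00E3. Guessing a fallback encoding is a design shortcut; a real application would normally know which encoding its clients send.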

          matz.

Hi,

On 05/07/07, Yukihiro M. wrote:

|$KCODE='u'
|CGI::unescapeHTML("&#227;")
|
|will return “\343”, which according to my screaming mysql utf-8 encoded
|database is not a valid utf-8 sequence

> Not a bug, unfortunately. Since your client sent a binary sequence
> “\343” in URL encoding, unescapeHTML() decoded it back. Specifying
> $KCODE=‘u’ does not affect encoding your clients send. You have to
> check (or convert) input from your clients explicitly, anyway.

If I understand HTML correctly, it is pretty much a bug, although it’s
perhaps more of a reflection of Ruby’s limited encoding support (which
has already been well discussed on this list!).

According to the HTML4 specification[1], ‘The syntax “&#xH;” or
“&#XH;”, where H is a hexadecimal number, refers to the ISO 10646
hexadecimal character number H.’ ISO 10646 is (more or less) Unicode,
so this should be a Unicode codepoint regardless of the document
transfer encoding.

  [1] http://www.w3.org/TR/html4/charset.html#h-5.3.1

CGI decodes the numerical entities into their byte representations:
this works for ISO-8859-1 (because ISO-8859-1 characters match Unicode
codepoints up to U+00FF), but an HTML document can specify entities
that cannot be represented in a single-byte encoding.
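
To make the difference concrete, here is a small sketch comparing the single byte produced by Integer#chr with the UTF-8 encoding of the same codepoint produced by Array#pack("U"). The variable names are only for illustration.

```ruby
# U+00E3 (LATIN SMALL LETTER A WITH TILDE), the codepoint behind "\343".
codepoint = 227

# Byte-oriented decoding: a single byte with value 0xE3.
single_byte = codepoint.chr

# Unicode-aware decoding: the UTF-8 encoding of the codepoint.
utf8 = [codepoint].pack("U")

single_byte_values = single_byte.unpack("C*")  # byte values of the first form
utf8_values = utf8.unpack("C*")                # byte values of the second form
```

The first form is one byte (227, i.e. octal 343), while the second is the two-byte sequence 195, 163 that a UTF-8 database expects.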

To process a received HTML or XHTML file properly, one needs to:

  • Convert the document from the transfer encoding to a Unicode
    representation
  • Convert any entities in the document to their corresponding
    codepoints.
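
The two steps above can be sketched roughly as follows. This assumes Ruby 2.0 or later for String#encode and String#b. The helper name decode_numeric_entities is hypothetical, it handles numeric references only (no named entities), and it does no validation of out-of-range codepoints.

```ruby
# Hypothetical helper, not part of CGI or htmlentities.
def decode_numeric_entities(bytes, source_encoding)
  # Step 1: convert from the transfer encoding to UTF-8.
  text = bytes.dup.force_encoding(source_encoding).encode("UTF-8")

  # Step 2: replace hexadecimal (&#xH;) and decimal (&#D;) references
  # with the UTF-8 encoding of the codepoint they name.
  text.gsub(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/) do
    codepoint = $1 ? $1.hex : $2.to_i
    [codepoint].pack("U")
  end
end

# ISO-8859-1 input: a raw 0xE9 byte plus two numeric references,
# one of which (U+263A) has no single-byte representation.
latin1_input = "caf\xE9 &#227; &#x263A;".b
result = decode_numeric_entities(latin1_input, "ISO-8859-1")
```

Transcoding first matters: doing the entity replacement on the raw bytes would mix UTF-8 sequences into a string that is still in the transfer encoding.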

With a bit of self-promotion[2], one solution to Esad’s problem would
be:

require 'htmlentities'
=> true

$KCODE = 'u'
=> "u"

HTMLEntities.new.decode('&#227;')
=> "ã"

  [2] http://htmlentities.rubyforge.org/

Paul.

Paul B. wrote the following on 05.07.2007 23:10 :

> If I understand HTML correctly, it is pretty much a bug, although it’s
> perhaps more of a reflection of Ruby’s limited encoding support (which
> has already been well discussed on this list!).

I tend to agree. I just fixed a bug in one of my apps where I blindly
used CGI.unescapeHTML which, as the original poster mentioned,
generates output that isn’t welcomed by a system configured to use UTF-8
all the way, especially the database (PostgreSQL in my case)…

Thanks for htmlentities, it saved my day.

Lionel
