Forum: Ruby Bug in CGI::unescapeHTML?

Esad Hajdarevic (Guest)
on 2007-07-05 06:00
(Received via mailing list)
Hi,

I think there's a bug in CGI::unescapeHTML. Or am I doing something
wrong?

$KCODE='u'
CGI::unescapeHTML("&#xE3;")

will return "\343", which according to my screaming MySQL UTF-8 encoded
database is not a valid UTF-8 sequence.

The source of CGI::unescapeHTML reveals that all values below 256 are simply
converted to a single byte using value.chr
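
For readers on a current Ruby, the difference between the single byte that value.chr produces and the UTF-8 encoding of the same code point can be sketched roughly as follows. This assumes Ruby 1.9 or later, where strings carry an encoding and $KCODE no longer applies; the variable names are only illustrative.

```ruby
# Assumption: Ruby 1.9+ (strings carry an encoding).
codepoint = 0xE3                      # U+00E3, "ã"

raw_byte = codepoint.chr              # one byte, "\xE3" (octal \343)
as_utf8  = [codepoint].pack("U")      # UTF-8 encoding of U+00E3

puts raw_byte.bytes.inspect                                  # [227]
puts raw_byte.dup.force_encoding("UTF-8").valid_encoding?    # false
puts as_utf8.bytes.inspect                                   # [195, 163]
puts as_utf8.valid_encoding?                                 # true
```

A lone 0xE3 byte is the lead byte of a three-byte UTF-8 sequence with nothing following it, which is why a strict UTF-8 database rejects it.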

Greetings,

Esad
Yukihiro Matsumoto (Guest)
on 2007-07-05 18:43
(Received via mailing list)
Hi,

In message "Re: Bug in CGI::unescapeHTML?"
    on Thu, 5 Jul 2007 13:00:02 +0900, Esad Hajdarevic
<esad.talks@esse.at> writes:

|I think there's a bug in CGI::unescapeHTML. Or am I doing something wrong?
|
|$KCODE='u'
|CGI::unescapeHTML("&#xE3;")
|
|will return "\343", which according to my screaming mysql utf-8 encoded
|database is not a valid utf-8 sequence

Not a bug, unfortunately.  Since your client sent a binary sequence
"\343" in URL encoding, unescapeHTML() decoded it back.  Specifying
$KCODE='u' does not affect encoding your clients send.  You have to
check (or convert) input from your clients explicitly, anyway.

              matz.
Paul Battley (Guest)
on 2007-07-05 23:11
(Received via mailing list)
Hi,

On 05/07/07, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |$KCODE='u'
> |CGI::unescapeHTML("&#xE3;")
> |
> |will return "\343", which according to my screaming mysql utf-8 encoded
> |database is not a valid utf-8 sequence
>
> Not a bug, unfortunately.  Since your client sent a binary sequence
> "\343" in URL encoding, unescapeHTML() decoded it back.  Specifying
> $KCODE='u' does not affect encoding your clients send.  You have to
> check (or convert) input from your clients explicitly, anyway.

If I understand HTML correctly, it is pretty much a bug, although it's
perhaps more of a reflection of Ruby's limited encoding support (which
has already been well discussed on this list!).

According to the HTML4 specification[1], 'The syntax "&#xH;" or
"&#XH;", where H is a hexadecimal number, refers to the ISO 10646
hexadecimal character number H.' ISO 10646 is (more or less) Unicode,
so this should be a Unicode codepoint regardless of the document
transfer encoding.

1. http://www.w3.org/TR/html4/charset.html#h-5.3.1

CGI decodes the numerical entities into their byte representations:
this works for ISO-8859-1 (because ISO-8859-1 characters match Unicode
codepoints up to U+00FF), but an HTML document can specify entities
that cannot be represented in a single-byte encoding.
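
A small sketch of both halves of that point, assuming Ruby 1.9 or later and no default internal encoding set (the variable names are made up for illustration):

```ruby
# Assumption: Ruby 1.9+, Encoding.default_internal unset (the default).

# Below U+0100 the ISO-8859-1 byte value equals the Unicode code point,
# so reading the single byte as ISO-8859-1 recovers the character.
latin1 = 0xE3.chr.force_encoding("ISO-8859-1")
same_char = latin1.encode("UTF-8") == [0xE3].pack("U")
puts same_char                          # true

# Above U+00FF there is no single byte to produce; Integer#chr
# without an encoding argument raises RangeError.
euro = 0x20AC
chr_failed = begin
  euro.chr
  false
rescue RangeError
  true
end
puts chr_failed                         # true
puts [euro].pack("U").bytes.inspect     # [226, 130, 172]
```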

To process a received HTML or XHTML file properly, one needs to:
- Convert the document from the transfer encoding to a Unicode
representation
- Convert any entities in the document to their corresponding
codepoints.
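
The two steps above might look something like this in Ruby 1.9 or later. This is only a sketch: decode_numeric_entities is a hypothetical helper (not part of CGI or htmlentities), it handles numeric references only, and the ISO-8859-1 transfer encoding is an assumption about the input.

```ruby
# Hypothetical helper: turns &#NNN; and &#xHHH; into UTF-8 characters.
# Named entities (&amp; etc.) and invalid code points are not handled.
def decode_numeric_entities(text)
  text.gsub(/&#(?:x([0-9a-f]+)|([0-9]+));/i) do
    codepoint = $1 ? $1.hex : $2.to_i
    [codepoint].pack("U")
  end
end

# Bytes as received; assumed here to be ISO-8859-1.
received = "caf\xE9 &#x20AC;5".dup.force_encoding("ISO-8859-1")

# Step 1: convert from the transfer encoding to Unicode (UTF-8).
unicode = received.encode("UTF-8")

# Step 2: replace entities with their code points.
decoded = decode_numeric_entities(unicode)

puts decoded
puts decoded.valid_encoding?
```

Doing the transcoding first matters: once the text is UTF-8, the characters produced from entities and the characters already in the document share one encoding.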

With a bit of self-promotion[2], one solution to Esad's problem would
be:

  >> require 'htmlentities'
  => true
  >> $KCODE = 'u'
  => "u"
  >> HTMLEntities.new.decode('&#xE3;')
  => "ã"

2. http://htmlentities.rubyforge.org/

Paul.
Lionel Bouton (Guest)
on 2007-07-06 18:33
(Received via mailing list)
Paul Battley wrote the following on 05.07.2007 23:10 :
>
> If I understand HTML correctly, it is pretty much a bug, although it's
> perhaps more of a reflection of Ruby's limited encoding support (which
> has already been well discussed on this list!).
>

I tend to agree. I just fixed a bug in one of my apps where I blindly
used CGI.unescapeHTML which, as the original poster mentioned,
generates output that isn't welcomed by a system configured to use UTF-8
all the way, especially the database (PostgreSQL in my case)...

Thanks for htmlentities, it saved my day.

Lionel