H() doesn't have any parameter for encoding being used?

winterheat · September 28, 2008, 5:01am

it seems that there is no parameter for the function h() (html_escape())
to indicate the character encoding being used?

for PHP, the htmlspecialchars() function accepts about a dozen possible
encodings, such as UTF-8, Chinese Big5, Chinese GB, Russian, and Japanese.

I think, though, that h() will work for UTF-8, since h() only touches the
4 special characters

< > & "

and replace them with &lt; etc. Those 4 characters are all in the
0x00 to 0x7F range, and h() leaves every other byte unchanged.

In UTF-8 a character takes 1 to 4 bytes. Any ASCII character is a single
byte, 0x00 to 0x7F, equal to its ASCII value. Every other Unicode
character (U+0080 and above) is 2 to 4 bytes long, and every one of those
bytes is in the 0x80 to 0xFF range (see the Wikipedia article on UTF-8).

So h() matches those 4 characters only where they really are 1-byte
characters, and it skips the bytes of every multi-byte character, because
a byte in the 0x80 to 0xFF range can never match one of the 4 special
characters. The replacement therefore has no side effect on non-ASCII
characters.
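
The byte-range argument above can be checked with a short sketch. It assumes a modern Ruby (1.9 or later) and a UTF-8 source file; escape_four_bytes is a hypothetical helper written for this illustration, not the real h(). It deliberately works on a binary copy of the string so that the replacement is purely byte-wise:

```ruby
# Hypothetical helper: replaces the four special characters byte-wise,
# treating the input as raw bytes with no knowledge of UTF-8.
def escape_four_bytes(s)
  raw = s.b # binary (ASCII-8BIT) copy of the string
  raw = raw.gsub("&", "&amp;").gsub('"', "&quot;")
           .gsub(">", "&gt;").gsub("<", "&lt;")
  raw.force_encoding("UTF-8") # reinterpret the resulting bytes as UTF-8
end

text = "<p>\u65E5\u672C\u8A9E & caf\u00E9</p>" # Japanese text plus an accented letter

# Every byte of every multi-byte character is 0x80 or above.
multibyte = text.each_char.select { |c| c.bytesize > 1 }
all_high = multibyte.flat_map(&:bytes).all? { |b| b >= 0x80 }

escaped = escape_four_bytes(text)
still_valid = escaped.valid_encoding?

# The non-ASCII characters come through untouched.
unchanged = text.each_char.reject(&:ascii_only?) ==
            escaped.each_char.reject(&:ascii_only?)
```

If the reasoning holds, all_high, still_valid and unchanged should all be true, and escaped should contain only the four entity replacements around the original non-ASCII text.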

winterheat · September 28, 2008, 6:18am

I don’t think Rails supports UTF8 yet… but I could be wrong.

Ryan B.
Freelancer

winterheat · September 28, 2008, 11:22am

On Sat, Sep 27, 2008 at 9:18 PM, Ryan B. wrote:

I don’t think Rails supports UTF8 yet… but I could be wrong.

The default charset for action renderings has been UTF-8 since Rails 1.2.

-Conrad

winterheat · September 28, 2008, 9:14pm

On Sun, Sep 28, 2008 at 5:01 AM, SpringFlowers AutumnMoon wrote:

0x80 to 0xFF range, and therefore can never be matched as one of those 4
special characters, so the job of replacing those 4 characters will be
done with no side effect whatsoever done to the non-ASCII characters.

Ruby 1.8 has a global idea of character encoding, which is configured
in the $KCODE global variable.

Rails 1.2 and above sets $KCODE by default to a value that means
everything is UTF-8: source code, strings, regexps, etc. It also sets
an HTTP header that tells the client the (X)HTML is UTF-8, so the
client sends form data back in UTF-8 as well, and everything works
transparently.

When you do I/O you are responsible for knowing the encoding of
incoming data, and the expected encoding of outgoing data. You use
iconv if needed to guarantee them. Any I/O operation has to be in
control of the involved character encodings.
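
As a rough illustration of converting at an I/O boundary: the post refers to iconv, which was the Ruby 1.8-era tool and is no longer part of the standard library in current Ruby. The sketch below swaps in String#encode instead, which assumes Ruby 1.9 or later. The incoming bytes are made-up sample data assumed to be ISO-8859-1:

```ruby
# Hypothetical incoming data: "café" as ISO-8859-1 bytes (0xE9 is "é").
incoming = "caf\xE9".b

# Declare what the bytes actually are, then transcode them to UTF-8.
utf8 = incoming.force_encoding("ISO-8859-1").encode("UTF-8")

utf8_bytes = utf8.bytes # "é" becomes the two bytes 0xC3 0xA9
```

The important part is the first step: the program has to know the encoding of the incoming bytes, because nothing in the bytes themselves says so.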

Some stuff in Ruby 1.8 does not play well with UTF-8. For example, you
cannot compute the length of a string in characters with String#length,
because that method counts bytes. But other things do work, like pattern
matching. For example, "." really matches a character, which may span
several bytes in UTF-8, as you point out.
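
A small sketch of the byte versus character distinction. It assumes Ruby 1.9 or later, where String#length counts characters and String#bytesize gives the byte count that String#length used to report on Ruby 1.8:

```ruby
s = "caf\u00E9" # "café": 4 characters, 5 bytes in UTF-8

byte_count = s.bytesize # the number Ruby 1.8's String#length would give
char_count = s.length   # characters, on Ruby 1.9 and later

# "." matches one whole character, even when it spans two bytes.
chars = s.scan(/./)
```

Here the last element of chars is the two-byte "é" as a single match, which is the behaviour described above for regexps.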

So if you are using regexps, you are safe in that regard. The helper
#h is really an alias of the ERb method #html_escape (it is not a
Rails helper), and that method is implemented using regexps:

    def html_escape(s)
      s.to_s.gsub(/&/, "&amp;").gsub(/"/, "&quot;").gsub(/>/, "&gt;").gsub(/</, "&lt;")
    end

Hence, it works correctly in UTF-8.
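
A quick way to try this out, assuming a current Ruby with the standard erb library. The sample string is made up for the illustration, and note that recent versions of ERB::Util.html_escape also escape the apostrophe, so their implementation is not identical to the one quoted above:

```ruby
require "erb"

# Made-up sample input mixing markup characters with non-ASCII text.
input = "<a title=\"\u65E5\u672C\u8A9E\">caf\u00E9 & cr\u00E8me</a>"

escaped = ERB::Util.html_escape(input)
same_via_alias = ERB::Util.h(input) == escaped # h is an alias of html_escape
```

Only the markup characters are rewritten; the Japanese and accented characters are passed through as they are.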