UTF-8 or encoding problems, need help!

Hi there!

I just don't know what to do; any advice is appreciated.

My DB is InnoDB with UTF-8, and I have no problems getting information
from the DB, displaying it in the browser, or saving it.

Except in one case:
I'm saving text with an umlaut. This works fine if my browser's
character encoding is set to UTF-8.
But if I switch the encoding to Western (ISO-8859-1), then copy the
string with the umlaut into the field and click save, I get the following
error:

ActiveRecord::StatementInvalid

Mysql::Error: #22001 Data too long for column '_text' at row 1: UPDATE
company_description SET creation_time = '2006-10-04 21:17:23', _text
= 'Stefan Ha�', <…>

The POST request is sent with Content-Type:
application/x-www-form-urlencoded,
so it looks like the browser sends the request in Western encoding.
The POST content differs between the case when the browser encoding is
set to UTF-8 and when it is set to Western.

It would all be quite OK if I didn't get that error. BTW, I don't get this
error under Linux, only under Windows (probably because under Linux the
Ruby version is 1.8.4 and under Windows it's 1.8.5?).

I've tried everything I could find about Unicode, and I tried to find a
way to escape the string before saving it, but no success.

Any ideas, please?

upd: the same error with Ruby 1.8.4 under Windows…

I am scratching my head here, but isn’t there a way to have the form
force a character encoding in the POST request?

i.e. you can explicitly state that the POST character data is utf-8?

That way the user can mess around with their character encodings
all they want, but whatever characters are submitted will be valid UTF-8
characters (though possibly, or even probably, not the characters they
expected).

Richard C. wrote:

I am scratching my head here, but isn’t there a way to have the form
force a character encoding in the POST request?

i.e. you can explicitly state that the POST character data is utf-8?

That way the user can mess around with their character encodings
all they want, but whatever characters are submitted will be valid UTF-8
characters (though possibly, or even probably, not the characters they
expected).

But how can I do that? I.e., how can I have the form force a character
encoding in the POST request?

On 10/4/06, Dmitry H. [email protected] wrote:

upd: the same error with Ruby 1.8.4 under Windows…

Dmitry, my gut feeling is that you have to enforce POST encoding in the
form at least, or otherwise detect when you have not received a utf-8
encoded POST data string.

I am at a loss as to how a Latin-1 string ended up bigger than a UTF-8
one, but it's possible that you might have encountered some cut&paste
artifacts. Try entering an umlaut using the character map (i.e. more
naturally).

On 10/4/06, Dmitry H. [email protected] wrote:

OK, I'll try this; anyway, I think it's bad that it's possible to pass
such data to the application…

Well, yes, but it's not Rails' fault. In fact, anyone can pass any
kind of information to any kind of web system. Your system has
to be robust enough to handle it.

Even with your best efforts to ensure everything comes across as
UTF-8, users can still force it to be something that won't display
properly, like Latin-1, Shift-JIS, or whatever. In those cases you
have to detect that you have received an invalid encoding and
either convert it to UTF-8 or send back an error message.

I just thought that a particularly clever hacker might be able
to exploit encoding confusion in multi-byte encoding
systems to get around cross-site-scripting defences. It's just a thought,
and I am thinking in general, not in a Rails context (which has
some fairly serious XSS defences).

Richard C. wrote:

On 10/4/06, Dmitry H. [email protected] wrote:

upd: the same error with Ruby 1.8.4 under Windows…

Dmitry, my gut feeling is that you have to enforce POST encoding in the
form at least, or otherwise detect when you have not received a utf-8
encoded POST data string.

I am at a loss as to how a Latin-1 string ended up bigger than a UTF-8
one, but it's possible that you might have encountered some cut&paste
artifacts. Try entering an umlaut using the character map (i.e. more
naturally).

OK, I'll try this; anyway, I think it's bad that it's possible to pass
such data to the application…

On 4-okt-2006, at 16:54, Dmitry H. wrote:

But how can I do that? I.e., how can I have the form force a character
encoding in the POST request?

That's what the accept-charset attribute on the form element is for.
However, if your page is itself explicitly UTF-8 (via output headers or
the <meta> element, or both), all forms that you POST back or GET should
automatically be in UTF-8 as well.
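For reference, a minimal sketch of the two mechanisms mentioned above (the action URL and field name are placeholders, not from the original application):

```html
<!-- Declare the page itself as UTF-8 -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

<!-- Ask the browser to submit the form data as UTF-8 -->
<form action="/company_description/save" method="post"
      accept-charset="UTF-8">
  <input type="text" name="_text">
  <input type="submit" value="Save">
</form>
```

Note that accept-charset is only a hint; some browsers of that era ignored it, which is why server-side screening is still advisable.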

Julian ‘Julik’ Tarkhanov wrote:

On 4-okt-2006, at 16:54, Dmitry H. wrote:

But how can I do that? I.e., how can I have the form force a character
encoding in the POST request?

That's what the accept-charset attribute on the form element is for.
However, if your page is itself explicitly UTF-8 (via output headers or
the <meta> element, or both), all forms that you POST back or GET should
automatically be in UTF-8 as well.

So if it's possible to fake the encoding and headers, should I check
that all the contents of the params hash are in UTF-8, say by using a
before_filter in application.rb? Wouldn't that affect the performance of
the whole application? Maybe there is a better way to avoid such errors?

Thanks

Just found a better solution :)

class ApplicationController < ActionController::Base
  require 'kconv'  # provides Kconv.toutf8

  before_filter :convert_request

  def convert_request
    convert_hash(params) #if request.post?
  end

  def convert_hash(hash)
    for k, v in hash
      case v
      when String then hash[k] = Kconv.toutf8(v).to_s
      when Array  then hash[k] = v.collect { |s| Kconv.toutf8(s).to_s }
      when Hash   then convert_hash(v)
      end
    end
  end
end
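This thread is from the Ruby 1.8 era; on modern Rubies (1.9+), where strings carry their own encodings, the same recursive screening can be sketched without Kconv. This is an alternative under that assumption, not the original approach, and `ensure_utf8` is a hypothetical helper name:

```ruby
# Recursively walk a params-like structure and make every String valid
# UTF-8. Invalid byte sequences are replaced with "?" instead of raising.
def ensure_utf8(value)
  case value
  when String
    s = value.dup.force_encoding(Encoding::UTF_8)
    s.valid_encoding? ? s : s.scrub("?")  # scrub replaces invalid bytes
  when Array
    value.map { |v| ensure_utf8(v) }
  when Hash
    value.each_with_object({}) { |(k, v), h| h[k] = ensure_utf8(v) }
  else
    value
  end
end
```

Unlike Kconv, this never guesses a source encoding; it only decides whether the bytes already form valid UTF-8 and scrubs them if not.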

Dmitry H. wrote:

Julian ‘Julik’ Tarkhanov wrote:

On 4-okt-2006, at 16:54, Dmitry H. wrote:

But how can I do that? I.e., how can I have the form force a character
encoding in the POST request?

That's what the accept-charset attribute on the form element is for.
However, if your page is itself explicitly UTF-8 (via output headers or
the <meta> element, or both), all forms that you POST back or GET should
automatically be in UTF-8 as well.

So if it's possible to fake the encoding and headers, should I check
that all the contents of the params hash are in UTF-8, say by using a
before_filter in application.rb? Wouldn't that affect the performance of
the whole application? Maybe there is a better way to avoid such errors?

Thanks

I created the following filter to check incoming requests. Is there a
better or faster way to do the same?

class ApplicationController < ActionController::Base
  require 'iconv'
  # Converting UTF-8 to UTF-8 is a no-op that fails on invalid input,
  # so this Iconv instance acts as a validator.
  ICONV = Iconv.new('UTF-8', 'UTF-8')

  before_filter :convert_request

  def convert_request
    convert_hash(params) #if request.post?
  end

  def convert_hash(hash)
    for k, v in hash
      begin
        case v
        when String then ICONV.iconv(v)
        when Array  then v.each { |s| ICONV.iconv(s) }
        when Hash   then convert_hash(v)
        end
      rescue Iconv::Failure => iconv_exception
        # Keep only the part that converted successfully
        hash[k] = iconv_exception.success
        flash[:error] = 'Request was sent in invalid encoding (not utf-8). Text was truncated.'
      end
    end
  end
end

Hi, I'm having a related UTF-8 problem I would like to share in this
topic.

When I submit Swedish characters (such as åäö) in an Ajax call
(:observe_field), the åäö characters get translated into weird
characters, which causes PostgreSQL to display the error:
PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xf6f6f6f6

Anyone know a fix? :)

Hi, thanks for the reply. The answer is (2): the user is entering the
text. I don't understand why Rails processes the characters correctly
through normal POSTs, but not when I do the Ajax observe_field call.

Anyhow, how do I specify UTF-8 in that Ajax call?

My database is UTF-8.

This is what I have right now:

In the application controller:

before_filter :set_charset

# Sets the default character set to UTF-8
def set_charset
  if request.xhr?
    @headers["Content-Type"] = "text/javascript; charset=utf-8"
  else
    @headers["Content-Type"] = "text/html; charset=utf-8"
  end
end

At the bottom of environment.rb I have:
$KCODE = 'u'
require 'jcode'

On 11/23/06, [email protected] [email protected] wrote:

Hi, I'm having a related UTF-8 problem I would like to share in this
topic.

When I submit Swedish characters (such as åäö) in an Ajax call
(:observe_field), the åäö characters get translated into weird
characters, which causes PostgreSQL to display the error:
PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xf6f6f6f6

It's pretty clear that you are inserting high-bit Latin-1 (?) characters
into your UTF-8 database. I am assuming that the Scandinavian languages
are covered by the Latin-1 (ISO-8859-1) character set.

Anyone know a fix? :)

Are you (1) typing these characters into your source file,
or (2) letting the user enter them directly from a form?

(1) I suspect you will need to escape the characters in your source
file. I don't know enough about using non-ASCII characters in Ruby
source to help you further.

(2) should work as long as you just let Rails pass them through
and screen them for being valid UTF-8. Make sure your browser
knows the page is UTF-8 (you will need to do something in your
Rails config to enforce this) and that any POSTs encode the
data as UTF-8.
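Screening incoming bytes for being valid UTF-8, as suggested above, can also be done at the byte level with a regular expression, which needs no encoding library at all. A hedged sketch following the standard UTF-8 byte-sequence table; `utf8?` is a hypothetical helper name:

```ruby
# Matches a whole byte string made only of well-formed UTF-8 sequences
# (overlong forms and UTF-16 surrogate code points are excluded).
UTF8_PATTERN = /\A(?:
    [\x00-\x7F]                          # 1-byte (ASCII)
  | [\xC2-\xDF][\x80-\xBF]               # 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, no overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3-byte, general
  | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, no surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4-byte, no overlongs
  | [\xF1-\xF3][\x80-\xBF]{3}            # 4-byte, general
  | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4-byte, up to U+10FFFF
)*\z/xn

def utf8?(bytes)
  # Match against the raw bytes so the string's encoding tag is irrelevant
  !!(bytes.b =~ UTF8_PATTERN)
end
```

A request body that fails this screen can then be rejected or re-encoded before it ever reaches the database.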