ActiveRecord to_json encoding

calvin98115 · August 10, 2009, 8:36pm

Question:
Hi, our company is using Ruby 1.8.6 with Rails 2.2.2. Does anyone
know how we can explicitly specify what encoding to use when calling
Description:
We have some multibyte characters in our database. For example, we
have a table with a name column that contains a French accented e: Café
Records. When we serialize this object using ActiveRecord’s to_xml(),
everything looks fine in the browser and with our json objects.

When we render JSON using to_json() we are seeing problems where the
accented ‘e’ character is getting mangled and causes our calling web
client to fail since it’s expecting properly UTF-8 encoded characters.

If we use the browser to submit an HTTP GET request for JSON format, save
the file and view it in binary mode in Hexadecimal representation,
this is what we get. It looks like this is using extended ASCII.

Bytes Text
43 61 66 E9 C a f (should be an accented e, but we get a weird unprintable block character)

If we save that same file from above and convert it to UTF-8, we get an
extra byte that seems to be proper UTF-8 encoding as shown below.

Bytes Text
43 61 66 C3 A9 C a f (should be accented e)

Can someone tell me how they’ve made to_json() UTF-8 compliant?
Thanks in advance, Calvin.
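
The two hex dumps above can be checked programmatically. This is only a sketch, and it assumes the String encoding API of Ruby 1.9 or later (force_encoding and valid_encoding?), which is not available on the 1.8.6 setup described in the question. The variable names are made up for illustration.

```ruby
# The two byte sequences from the hex dumps in the question.
latin1_bytes = [0x43, 0x61, 0x66, 0xE9].pack("C*")
utf8_bytes   = [0x43, 0x61, 0x66, 0xC3, 0xA9].pack("C*")

# force_encoding only relabels the bytes; valid_encoding? then checks
# whether they form well-formed UTF-8.
latin1_ok = latin1_bytes.dup.force_encoding("UTF-8").valid_encoding?
utf8_ok   = utf8_bytes.dup.force_encoding("UTF-8").valid_encoding?

puts latin1_ok  # false: a lone 0xE9 is not a complete UTF-8 sequence
puts utf8_ok    # true:  0xC3 0xA9 is the UTF-8 encoding of an accented e
```

A strict UTF-8 client rejecting the first sequence is consistent with the failure described in the question.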

calvin98115 · August 10, 2009, 10:39pm

Calvin N. wrote:

Question:
Hi, our company is using Ruby 1.8.6 with Rails 2.2.2. Does anyone
know how we can explicitly specify what encoding to use when calling

Answer:
Yes, almost certainly someone does. But you will be more likely to find
them on a Rails mailing list. This is the mailing list for Ruby, the
programming language.

If you install ruby 1.9 somewhere, it has some features you can use to
experiment with encodings. For example:

$ irb19 --simple-prompt

>> str = "Café"
=> "Café"

>> str.bytes.map { |x| "%02x" % x }
=> ["43", "61", "66", "c3", "a9"]

>> str.encode!("ISO-8859-1")
=> "Caf\xE9"

>> str.bytes.map { |x| "%02x" % x }
=> ["43", "61", "66", "e9"]

But I suggest you keep your 1.8.6 for production use.

calvin98115 · August 10, 2009, 10:55pm

Brian, I really appreciate the quick reply and the insight. Our team
was contemplating upgrading to Ruby 1.9 for all the new features but
has concerns about regressions. Can you tell me why you recommend
staying on 1.8.6 for production rather than 1.9? Also, how do you enter
the accented e in “Café” in the console (IRB)?

Thanks in advance.

calvin98115 · August 11, 2009, 12:19pm

Calvin N. wrote:

Brian, I really appreciate the quick reply and the insight. Our team
was contemplating upgrading to Ruby 1.9 for all the new features but
has concerns about regressions. Can you tell me why you recommend
staying on 1.8.6 for production rather than 1.9?

The 1.9 language has a lot of backwards-incompatible changes which mean
that third-party libraries need modifying to make them work with 1.9.
Furthermore, the execution engine is completely new (YARV) and is still
shaking out bugs. It’s not the incremental change you’d expect from the
minor 1.8->1.9 number change.

So unless you enjoy debugging other people’s libraries and/or the
platform itself, you’ll be better served by 1.8.6 for now IMO.

Also, how do you enter the accented e in “Café” in the console (IRB)?

On my keyboard, I typed Right-Alt-Gr + semicolon, followed by e.

I’m running Ubuntu Linux (Hardy in this particular case), and this UTF8
stuff works “out of the box”.

What platform are you using? It seems you can type accents into a web
page, but not on the irb command line. Did you build ruby/irb from
source, or install it from a package?

Regards,

Brian.

calvin98115 · August 11, 2009, 3:25am

On Aug 10, 2009, at 1:36 PM, Calvin N. wrote:

Question:
Hi, our company is using Ruby 1.8.6 with Rails 2.2.2. Does anyone
know how we can explicitly specify what encoding to use when calling

Rails pretty much assumes UTF-8 data everywhere. The path of least
pain is definitely to try to work exclusively with UTF-8, since that’s
mainly what Ruby 1.8.x can handle.

However, I believe the default encoding of a web page is ISO-8859-1,
unless you specify otherwise. If you served up a form, a browser sent
you some data from that form, you saved it into the database, and you
never specified an encoding or tried to transcode the content, your
data is probably in ISO-8859-1. Indeed, that seems to be the case,
from what you are showing:

If we use the browser to submit an HTTP GET request for JSON format, save
the file and view it in binary mode in Hexadecimal representation,
this is what we get. It looks like this is using extended ASCII.

Bytes Text
43 61 66 E9 C a f (should be an accented e, but we get a weird unprintable block character)

That’s ISO-8859-1 data:

$ ruby -KU -r iconv -e 'puts Iconv.conv("UTF-8", "ISO-8859-1", [0x43, 0x61, 0x66, 0xE9].pack("C*"))'
Café
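
On Ruby 1.9 and later the same conversion can be written without Iconv, which was dropped from the standard library in Ruby 2.0. A minimal sketch, assuming the bytes really are ISO-8859-1 (the variable names are illustrative only):

```ruby
# Bytes as they appear in the hex dump: "Caf" followed by 0xE9.
raw = [0x43, 0x61, 0x66, 0xE9].pack("C*")

# Label the bytes as ISO-8859-1 (no bytes change), then transcode to UTF-8.
utf8 = raw.force_encoding("ISO-8859-1").encode("UTF-8")

# Print the resulting bytes in hex.
puts utf8.bytes.map { |b| "%02X" % b }.join(" ")  # 43 61 66 C3 A9
```

force_encoding changes only the label on the string, while encode rewrites the bytes, so the order of the two calls matters.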

Thus, you need to transcode it to UTF-8 before using operations like
to_json() that assume UTF-8, using the same kind of transform I just
showed. Even better, you could transcode the existing data in your
database to UTF-8 and then mark all pages on your site as UTF-8
encoded, possibly by adding this line to the head of your HTML:

You may also want to instruct your web server to return a proper
Content-Type encoding header.
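
The advice above to transcode before serializing can be sketched as follows. This is not Rails code: it uses the standard json library and the Ruby 1.9+ String API instead of ActiveRecord's to_json, latin1_to_utf8 is a hypothetical helper, and it assumes every string in the record holds ISO-8859-1 bytes.

```ruby
require "json"

# Hypothetical helper: recursively transcode string values that are
# assumed to hold ISO-8859-1 bytes into UTF-8. Hash keys are left alone.
def latin1_to_utf8(value)
  case value
  when String
    value.dup.force_encoding("ISO-8859-1").encode("UTF-8")
  when Array
    value.map { |v| latin1_to_utf8(v) }
  when Hash
    value.each_with_object({}) { |(k, v), out| out[k] = latin1_to_utf8(v) }
  else
    value
  end
end

# A stand-in for a database row whose name column holds Latin-1 bytes.
record = { "name" => [0x43, 0x61, 0x66, 0xE9].pack("C*") + " Records" }

json = JSON.generate(latin1_to_utf8(record))
puts json.valid_encoding?  # true
```

Running this conversion on text that is already UTF-8 would double-encode it, which is one reason converting the stored data once is the cleaner fix.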

I hope that helps.

James Edward G. II