Problem with GET args and UTF-8 encoding (output of Rack::Utils.unescape() ?)

Hi folks,

Here’s my basic issue; hopefully this is clear. I’m trying to submit
some UTF-8 values in my query string, but they come out mangled on
the other end. It seems that whatever Rack::Utils.unescape() produces
gets converted to UTF-8 somewhere in the chain (I’m on Rails 3.0.7 and
Ruby 1.9.2, by the way), and that conversion mangles characters that
take two bytes in UTF-8 (a one-byte character such as the space,
“%20”, gets converted fine). I feel like I’ve almost figured this out,
but I’m still stumped. Here’s my “evidence”:

Example UTF-8 string:

“Adélaïde de Hongrie”

GET string (obviously URI encoded):

Started GET “/registers/results?filter[title][]=Ad%E9la%EFde%20de
%20Hongrie&search=&limit=4” for 127.0.0.1 at 2011-05-16 14:17:33 +0700

What Rack produces/Rails sees (in Controller):

Parameters: {“filter”=>{“title”=>[“Ad\xE9la\xEFde de Hongrie”]},
“search”=>"", “limit”=>“4”}

Error I’m getting, when I try to “do stuff” with the above string:

ArgumentError (invalid byte sequence in UTF-8):

What would actually be a valid string, with the UTF-8 bytes written as
hex escapes in the format above:

“Ad\xC3\xA9la\xC3\xAFde de Hongrie”
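The difference between the logged string and the valid one can be checked directly in Ruby. This is only a small sketch using core String methods; the helper name hex_bytes is made up for illustration and is not part of Rack or Rails.

```ruby
# Bytes exactly as they appear in the Rails log, tagged as UTF-8.
logged = "Ad\xE9la\xEFde de Hongrie".dup.force_encoding(Encoding::UTF_8)

# Bytes a UTF-8-aware client would have sent for the same title.
expected = "Ad\xC3\xA9la\xC3\xAFde de Hongrie".dup.force_encoding(Encoding::UTF_8)

# Hypothetical helper: render a string's bytes as hex pairs.
def hex_bytes(str)
  str.bytes.map { |b| format('%02X', b) }.join(' ')
end

puts logged.valid_encoding?    # false: a lone E9 byte is not valid UTF-8
puts expected.valid_encoding?  # true

# The same character, stored in two different encodings.
e_acute = "\u00E9"
puts hex_bytes(e_acute)                               # C3 A9 (UTF-8)
puts hex_bytes(e_acute.encode(Encoding::ISO_8859_1))  # E9    (Latin-1)
```

So E9 is the code point of the character (and its single byte in ISO-8859-1), while C3 A9 is how UTF-8 stores that code point.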

Or, in the “\u …” format (see anything interesting here? Something
obvious is eluding me…):

“Ad\u{E9}la\u{EF}de de Hongrie”

To be clear, this is not a form, but an ajax query. I’ve tried adding
the ‘utf8’ snowman thing manually too, but that doesn’t seem to do
anything…of course, maybe I’m doing that wrong.

Any thoughts/questions/pointing out of obvious errors or confused ways
of thinking? I’d also appreciate any pointers to Rails documentation
which describes in more detail how this stuff happens; I’ve just been
digging through the code and it’s slow going for me.

Help much appreciated!

Cheers,
Dave

Okay, I’m still not there, but I’ve realized I’ve been confusing a few
things. This Stack Overflow answer helped a lot:

I was conflating Unicode with UTF-8. But I think that’s also
essentially what happens somewhere along the way, when the ASCII-8BIT
output of Rack::Utils.unescape() gets converted to UTF-8. I suppose I
have to figure out how to override unescape() in my own initializer,
or intercept unescape()’s output and encode that properly.
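For illustration, intercepting the unescaped value and re-encoding it might look roughly like the sketch below. The method name repair_to_utf8 is hypothetical, and the sketch assumes that any bytes which are not valid UTF-8 were really sent as ISO-8859-1; that is a guess about the client, not something Rack can know. It does not touch Rack itself.

```ruby
# Hypothetical helper: return a UTF-8 string for a raw, unescaped value.
# Assumption: bytes that are not valid UTF-8 are really ISO-8859-1.
def repair_to_utf8(value, assumed_source = Encoding::ISO_8859_1)
  utf8 = value.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Reinterpret the same bytes in the assumed encoding, then transcode.
  value.dup.force_encoding(assumed_source).encode(Encoding::UTF_8)
end

raw = "Ad\xE9la\xEFde de Hongrie".b        # ASCII-8BIT, like unescape() output
fixed = repair_to_utf8(raw)

puts fixed.encoding                        # UTF-8
puts fixed.valid_encoding?                 # true
puts fixed == "Ad\u00E9la\u00EFde de Hongrie"  # true
```

This only papers over the symptom on the server; if the client can be made to send UTF-8 in the first place, no repair step is needed.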

I think I’m close to a solution, since I’m starting to understand what
all the values should be and what is happening. But any help will
still be greatly appreciated, since there is still something eluding
my understanding.

Thanks,
Dave

On 16 May 2011, at 14:47, ddellacosta wrote:

Example UTF-8 string:

“Adélaïde de Hongrie”

GET string (obviously URI encoded):

Started GET “/registers/results?filter[title][]=Ad%E9la%EFde%20de
%20Hongrie&search=&limit=4” for 127.0.0.1 at 2011-05-16 14:17:33 +0700

Who is producing this query string? They should be generating %C3%A9 if
they are UTF-8 friendly, since %E9 is just URL-speak for \xE9, which
smells like ISO-Latin-something.
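To make that concrete, here is a small Ruby sketch that percent-encodes the same title once from its UTF-8 bytes and once from its ISO-8859-1 bytes. The percent_encode method is a hand-rolled illustration, not a Rack or Rails API; only URI.decode_www_form_component comes from the standard library.

```ruby
require 'uri'

# Illustrative encoder: percent-encode every byte outside the
# URI "unreserved" set (letters, digits, "-", ".", "_", "~").
def percent_encode(str)
  str.each_byte.map do |byte|
    char = byte.chr
    if byte < 128 && char.match?(/\A[A-Za-z0-9\-._~]\z/)
      char
    else
      format('%%%02X', byte)
    end
  end.join
end

title = "Ad\u00E9la\u00EFde de Hongrie"

utf8_query   = percent_encode(title)
latin1_query = percent_encode(title.encode(Encoding::ISO_8859_1))

puts utf8_query    # Ad%C3%A9la%C3%AFde%20de%20Hongrie
puts latin1_query  # Ad%E9la%EFde%20de%20Hongrie (what the log shows)

# Decoding as UTF-8 only round-trips for the UTF-8 form.
puts URI.decode_www_form_component(utf8_query) == title           # true
puts URI.decode_www_form_component(latin1_query).valid_encoding?  # false
```

The second query string matches the one in the log, which is consistent with the client encoding the characters as single Latin-1 bytes rather than as UTF-8.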

Fred

Thanks for pointing out the obvious, Frederick (seriously, thank you).
The problem was entirely on the JavaScript/browser side: the function
that prepared the query string was using escape() rather than
encodeURIComponent(). I replaced all the calls to escape() and things
magically started to work, how about that?

Thank you, I really appreciate the help!! I can’t believe how much
time I spent looking in the wrong places…at least I learned a fair
amount about Rails internals as well as encoding issues though…haha.

Cheers,
Dave