Convert \uXXXX to character

born_in_USSR · June 27, 2010, 2:33pm

I have the string '\u041f\u0440\u0438\u0432\u0435\u0442!' and I need to
convert it to a string such as 'привет!'.
I can convert the string to '041f 0440 0438 0432 0435 0442', then convert each
code to decimal, and finally convert each code to a character with:

str.scan(/[0-9]+/).each {|x| result_str << x.to_i}

but I don’t think that is the most rational way.

born_in_USSR · June 27, 2010, 8:06pm

I recommend this article:
http://blog.grayproductions.net/articles/understanding_m17n


born_in_USSR · June 27, 2010, 8:38pm

On 06/27/2010 05:33 AM, born in USSR wrote:

I have the string '\u041f\u0440\u0438\u0432\u0435\u0442!' and I need to
convert it to a string such as 'привет!'.
I can convert the string to '041f 0440 0438 0432 0435 0442', then convert each
code to decimal, and finally convert each code to a character with:

str.scan(/[0-9]+/).each {|x| result_str<< x.to_i}

but I don’t think that is the most rational way.

irb(main):001:0> RUBY_VERSION
=> "1.9.1"
irb(main):002:0> puts '\u041f\u0440\u0438\u0432\u0435\u0442!'
\u041f\u0440\u0438\u0432\u0435\u0442!
=> nil
irb(main):003:0> puts "\u041f\u0440\u0438\u0432\u0435\u0442!"
Привет!
=> nil

Note the difference in single quotes versus double quotes.
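
To make the difference concrete, here is a small sketch (the variable names are only illustrative) comparing the two literals. It assumes Ruby 1.9 or later, where \u escapes in double-quoted strings are decoded by the parser:

```ruby
# Single quotes: backslash-u is kept as two literal characters.
single = '\u041f\u0440\u0438\u0432\u0435\u0442!'
# Double quotes: the parser decodes each \uXXXX escape into one character.
double = "\u041f\u0440\u0438\u0432\u0435\u0442!"

puts single.length   # => 37 (six 6-character escapes plus "!")
puts double.length   # => 7
puts double.encoding # => UTF-8
```

So the question is really about decoding the escapes at run time, when the text arrives as data rather than as a literal in the source.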

-Justin

born_in_USSR · June 28, 2010, 6:23am

On Jun 27, 2010, at 8:33 AM, born in USSR wrote:

I have the string '\u041f\u0440\u0438\u0432\u0435\u0442!' and I need to
convert it to a string such as 'привет!'.
I can convert the string to '041f 0440 0438 0432 0435 0442', then convert each
code to decimal, and finally convert each code to a character with:

If I understand you correctly, you can leverage Ruby’s parser to
interpret your string as a double-quoted literal:

irb> x = '\u041f\u0440\u0438\u0432\u0435\u0442!'
=> "\\u041f\\u0440\\u0438\\u0432\\u0435\\u0442!"
irb> eval("\"#{x}\"")
=> "Привет!"

Be careful with eval, though: make sure the string being evaluated
doesn’t contain any untrusted code.
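
As a harmless illustration of that risk (the input below is made up and only stands in for attacker-controlled data), interpolation inside the evaluated string runs as Ruby code:

```ruby
# Hypothetical untrusted input: single quotes keep #{...} as literal text here.
untrusted = '#{6 * 7}'

# Wrapping it in double quotes and calling eval executes the interpolation.
result = eval("\"#{untrusted}\"")

puts result  # => 42, not the literal text
```

Anything that can be written inside #{...} would run the same way, so this approach is only reasonable for strings you fully control.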

Gary W.

born_in_USSR · June 28, 2010, 3:25pm

On 28 June 2010 07:39, Markus S. wrote:

IMHO better than eval

str = '\u041f\u0440\u0438\u0432\u0435\u0442!'
# Match a literal backslash, "u", and four hex digits.
p str.gsub(/\\u(\h{4})/) {
  $1.to_i(16).chr('UTF-8')
}

What do you think of this?
I was also looking for something along the lines of String#unpack, like

p str.gsub(/\\u(\h{4})/) {
  [$1.to_i(16)].pack('U')
}

but since we scan the escapes one at a time, it is not very interesting, and
it needs an extra array, as with JSON (pack('U') is 1.8 compatible, though).
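
For reference, a self-contained sketch of both variants side by side. It assumes Ruby 1.9 or later (for \h in the regexp and for Integer#chr with an encoding argument); the variable names are only illustrative:

```ruby
str = '\u041f\u0440\u0438\u0432\u0435\u0442!'

# A literal backslash, "u", then exactly four hex digits (captured).
pattern = /\\u(\h{4})/

# Variant 1: Integer#chr with an explicit encoding.
with_chr = str.gsub(pattern) { $1.to_i(16).chr('UTF-8') }

# Variant 2: Array#pack with the 'U' (UTF-8 code point) directive.
with_pack = str.gsub(pattern) { [$1.to_i(16)].pack('U') }

puts with_chr               # => Привет!
puts with_chr == with_pack  # => true
```

Both variants only handle four-digit escapes in the BMP and treat every backslash-u sequence as an escape.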

B.D.

born_in_USSR · June 28, 2010, 6:40am

I think the JSON parser is able to decode these Unicode escapes
correctly!

The JSON parser will not decode a bare string, so you have to wrap the
string in array syntax and extract the element after parsing:

mbj@mbj ~ $ irb
irb(main):001:0> require 'json'
=> true
irb(main):002:0> x = '\u041f\u0440\u0438\u0432\u0435\u0442!'
=> "\\u041f\\u0440\\u0438\\u0432\\u0435\\u0442!"
irb(main):003:0> JSON.parse('["'+x+'"]')[0]
=> "Привет!"
irb(main):004:0>

IMHO better than eval
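
The same idea as a script, as a minimal sketch: decode_json_escapes is a hypothetical helper name, not part of the json library, and it assumes the input contains no unescaped double quotes (JSON.parse would raise JSON::ParserError otherwise). As far as I can tell, the parser also combines a \uD8xx\uDCxx surrogate pair into a single character:

```ruby
require 'json'

# Hypothetical helper: wrap the escaped text in a one-element JSON array,
# parse it, and take the element back out.
def decode_json_escapes(escaped)
  JSON.parse('["' + escaped + '"]')[0]
end

greeting = decode_json_escapes('\u041f\u0440\u0438\u0432\u0435\u0442!')
pair     = decode_json_escapes('\ud83d\ude00!')  # surrogate pair for U+1F600

puts greeting.length         # => 7
p pair.codepoints.first      # => 128512 (0x1F600)
```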

born_in_USSR · June 28, 2010, 4:44pm

On 28.06.2010 15:24, Benoit D. wrote:

On 28 June 2010 07:39, Markus S. wrote:

IMHO better than eval

str = '\u041f\u0440\u0438\u0432\u0435\u0442!'
# Match a literal backslash, "u", and four hex digits.
p str.gsub(/\\u(\h{4})/) {
  $1.to_i(16).chr('UTF-8')
}

Don’t forget that Unicode code points are not limited to the BMP and can be
up to 6 hex digits long
[Unicode - Wikipedia].

What do you do if the string contains escaped backslashes, like in
str = '\u041f\u0440'? What if it contains surrogates?
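
One possible way to cover both cases with gsub, as a sketch only: unescape_unicode is a made-up name, and the code assumes that a doubled backslash stands for one literal backslash and that characters outside the BMP arrive as UTF-16 surrogate pairs. Lone surrogates are not validated here.

```ruby
# Hypothetical helper, not from any library.
def unescape_unicode(str)
  # Alternatives, tried in order at each position:
  #   1. an escaped backslash (two backslashes)
  #   2. a high surrogate escape followed by a low surrogate escape
  #   3. a single four-digit escape
  str.gsub(/\\\\|\\u([dD][89abAB]\h{2})\\u([dD][c-fC-F]\h{2})|\\u(\h{4})/) do
    if $1
      high = $1.hex
      low  = $2.hex
      # Combine the surrogate pair into one code point above U+FFFF.
      [0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)].pack('U')
    elsif $3
      [$3.hex].pack('U')
    else
      '\\'  # escaped backslash becomes a single backslash
    end
  end
end

puts unescape_unicode('\u041f\u0440')            # => Пр
puts unescape_unicode('\\\\u041f')               # => \u041f (left undecoded)
p unescape_unicode('\ud83d\ude00').codepoints    # => [128512]
```

Because the doubled backslash is matched first, the u041f that follows it is left as plain text instead of being decoded.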

-- Matthias