A Code Point's Tale: There and Back Again

luislavena · April 30, 2011, 6:12am

This is probably obvious in the docs and I’m just missing it, but here
goes: So, I see there is str.each_codepoint, which I want to use in a
function to convert Unicode Strings to a list of Unicode code points.
But what can I do if I have a list of Unicode code points and want to
convert them back into a String?

callyworld · April 30, 2011, 6:28am

I hope this is what you are looking for:
http://ruby-unicode.rubyforge.org/doc/

callyworld · April 30, 2011, 11:00am

Hi,

On 30.04.2011 06:12, Terry M. wrote:

This is probably obvious in the docs and I’m just missing it, but here
goes: So, I see there is str.each_codepoint, which I want to use in a
function to convert Unicode Strings to a list of Unicode code points.
But what can I do if I have a list of Unicode code points and want to
convert them back into a String?

I think you can use Array#pack for that:

$ irb
ruby-1.9.2-p180 :001 > "f뀀oöbß".each_codepoint.to_a
=> [102, 45056, 111, 246, 98, 223]
ruby-1.9.2-p180 :002 > "f뀀oöbß".each_codepoint.to_a.pack("U*")
=> "f뀀oöbß"
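
For anyone who wants this as a pair of functions, here is a minimal sketch of the round trip. The method names to_codepoints and from_codepoints are made up for illustration; only String#each_codepoint and Array#pack come from Ruby itself. The string is written with \u escapes so the snippet does not depend on the source file's encoding.

```ruby
# Hypothetical helper names; they are not part of Ruby's standard library.

# String -> Array of Integer code points.
def to_codepoints(str)
  str.each_codepoint.to_a
end

# Array of Integer code points -> UTF-8 String.
# The "U*" directive packs every element as one UTF-8 encoded character.
def from_codepoints(codes)
  codes.pack("U*")
end

# Same text as the irb session above, spelled with escapes.
original = "f\uB000o\u00F6b\u00DF"

codes = to_codepoints(original)
p codes

restored = from_codepoints(codes)
puts restored == original
puts restored.encoding
```

Because the template starts with U, the packed string comes back tagged as UTF-8, so no force_encoding call should be needed afterwards.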

cheers

callyworld · May 1, 2011, 4:24am

Terry M. wrote in post #995906:

This is probably obvious in the docs and I’m just missing it,

You will never learn ruby unicode by reading the docs. Head over to
James Edward G. II’s website for some lessons:

http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings

but here
goes: So, I see there is str.each_codepoint, which I want to use in a
function to convert Unicode Strings to a list of Unicode code points.
But what can I do if I have a list of Unicode code points and want to
convert them back into a String?

# encoding: UTF-8

# That comment tells ruby to treat string literals in my
# source code, like the one below, as utf-8 encoded.

str = "\xE2\x82\xAC\xE2\x82\xAC"

codes = str.each_codepoint.to_a

p codes
puts codes.map {|code| code.chr(Encoding::UTF_8) }.join

--output:--
[8364, 8364]
€€

(You should see two euro symbols as the last line of output.)
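
The encoding argument passed to Integer#chr is doing real work in that example. A small sketch of why, assuming Encoding.default_internal is unset (the usual default); if it is set, a bare chr would use that encoding instead of raising:

```ruby
codes = [8364, 8364]

# With an explicit encoding, each code point becomes a one-character string.
euro = codes.map { |code| code.chr(Encoding::UTF_8) }.join
puts euro.encoding
puts euro.bytes.length   # each euro sign takes three bytes in UTF-8

# Without an argument, chr only accepts values up to 255
# (assuming Encoding.default_internal is nil), so 8364 is rejected.
begin
  8364.chr
rescue RangeError => e
  puts "RangeError: #{e.message}"
end

# Array#pack with "U*" builds the same string in a single call.
puts euro == codes.pack("U*")
```

Both routes give the same UTF-8 string; map/chr/join is handy when you want a different target encoding, while pack("U*") is shorter when UTF-8 is all you need.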

I don’t know where you are getting your string from, but you can always
do this:

str = "\xE2\x82\xAC\xE2\x82\xAC"
puts str.encoding

str.force_encoding("UTF-8")
puts str.encoding

codes = str.each_codepoint.to_a

p codes
puts codes.map {|code| code.chr(Encoding::UTF_8) }.join

--output:--
ASCII-8BIT
UTF-8
[8364, 8364]
€€

(You should see two euro symbols as the last line of output.)
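
Note that force_encoding only relabels the bytes; it does not convert them. The sketch below illustrates that, assuming the bytes arrive as binary data (the socket/file scenario is just an example). On Ruby 2.0 and later, where source files default to UTF-8, the literal in the example above would already report UTF-8, so the binary label is applied explicitly here.

```ruby
# Raw bytes as they might arrive from a socket or a binary file (assumed scenario).
raw = "\xE2\x82\xAC\xE2\x82\xAC".dup.force_encoding("ASCII-8BIT")
puts raw.length            # 6: one "character" per byte

# Same bytes, new label. Nothing is transcoded.
utf8 = raw.dup.force_encoding("UTF-8")
puts utf8.length           # 2: two euro signs
puts raw.bytes == utf8.bytes
p utf8.each_codepoint.to_a

# force_encoding does no validation, so it is worth checking the result.
puts utf8.valid_encoding?

# A truncated sequence is accepted by force_encoding but flagged here.
broken = "\xE2\x82".dup.force_encoding("UTF-8")
puts broken.valid_encoding?
```

Checking valid_encoding? before calling each_codepoint is a cheap safeguard, since each_codepoint raises ArgumentError on an invalid byte sequence.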

callyworld · May 1, 2011, 4:41am

7stud – wrote in post #996022:

Terry M. wrote in post #995906:

This is probably obvious in the docs and I’m just missing it,

You will never learn ruby unicode by reading the docs. Head over to
James Edward G. II’s website for some lessons:

http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings

Someone else blogged in great detail about all the intricacies of ruby
unicode and its problems, but I can’t find the link now.

callyworld · May 1, 2011, 5:22am

Maybe each_char() will work for you? Take a look at the following code.

str = "\xE2\x82\xAC\xE2\x82\xAC"
puts str.encoding

str.force_encoding("UTF-8")
puts str.encoding

chars = str.each_char.to_a
p chars

puts chars[0].encoding

puts chars.join

--output:--
ASCII-8BIT
UTF-8
["\u20AC", "\u20AC"]
UTF-8
€€

(You should see two euro symbols as the last line of output.)
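
In case it helps to see how each_char relates to each_codepoint, here is a short sketch. For a UTF-8 string the two walk the same units, so each_char will not group a base letter with a combining mark; the accented example is my own addition, not from the original question.

```ruby
str = "\u20AC\u20AC"

chars = str.each_char.to_a
codes = str.each_codepoint.to_a

# String#ord returns the code point of a one-character string,
# so mapping the characters gives back the code point list.
puts chars.map { |ch| ch.ord } == codes

# each_char splits per code point, not per user-perceived character:
accented = "e\u0301"                 # "e" followed by COMBINING ACUTE ACCENT
p accented.each_char.to_a.length     # 2
p accented.each_codepoint.to_a       # [101, 769]
```

So each_char is fine if you only need the pieces as strings, and each_codepoint if you need the integers.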

The output suggests that a string literal containing \u Unicode escapes
is given the UTF-8 encoding by default. And that seems to be the case:

str = "\u20AC\u20AC"
puts str.encoding

--output:--
UTF-8
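
As a quick cross-check, this sketch compares the \u form with the \x byte form used earlier in the thread. It only assumes that the \x string is labelled UTF-8, which is done explicitly here so the result does not depend on the Ruby version or source encoding.

```ruby
from_u = "\u20AC\u20AC"
from_x = "\xE2\x82\xAC\xE2\x82\xAC".dup.force_encoding("UTF-8")

puts from_u.encoding

# The \u escape produces exactly the UTF-8 bytes that were typed out by hand.
p from_u.bytes.map { |b| "%02X" % b }
puts from_u.bytes == from_x.bytes

# Same bytes and same encoding, so the strings compare equal.
puts from_u == from_x
```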