Using unpack on a UTF-8 string

On my system:

'€'.unpack('U*')

Produces:

=> [8364]

I would have expected this:

=> [342, 202, 254]

In fact, I could have sworn that things used to work this way… Am I
going crazy? The following seems to confirm that the string is indeed
using a UTF-8 representation internally.

'€'.collect
=> ["\342\202\254"]

I get exactly the same results whether $KCODE is set to 'NONE' or 'u'.

Cheers,
Greg

[Greg H., 2007-02-24 20.00 CET]

=> [342, 202, 254]

In fact, I could have sworn that things used to work this way… Am I
going crazy? The following seems to confirm that the string is indeed
using a UTF-8 representation internally.

'€'.collect
=> ["\342\202\254"]

I get exactly the same results whether $KCODE is set to 'NONE' or 'u'.

The Unicode code point for the euro sign is 8364. In your string that
number is encoded as the byte sequence [226, 130, 172]. That encoding is
known as UTF-8. #unpack("U*") decodes that sequence of bytes and gives
you the number.
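
To make the decoding step concrete, here is a rough sketch of what the "U*" directive does for this one character. It assumes Ruby 1.9 or later (for the "\u20AC" escape and String#bytes), and the variable names are only illustrative.

```ruby
# Assumes Ruby 1.9 or later; "\u20AC" is the euro sign, stored as UTF-8.
euro = "\u20AC"

# unpack("U*") decodes the UTF-8 bytes into Unicode code points.
code_points = euro.unpack("U*")
p code_points                      # [8364]

# The same decoding done by hand for a three-byte UTF-8 sequence:
# 1110xxxx 10xxxxxx 10xxxxxx carries 4 + 6 + 6 payload bits.
b0, b1, b2 = euro.bytes.to_a       # 226, 130, 172
manual = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
p manual                           # 8364
```

The hand-written version only covers three-byte sequences. It is there to show that 8364 really is what those three bytes carry.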

As an analogy, suppose you had the string "\254 \000\000" and called
#unpack("I") on it. The byte sequence [172, 32, 0, 0] also represents the
number 8364, but this time encoded in the internal format my computer
uses (a 32-bit little-endian integer). #unpack retrieves that number.
The fact that UTF-8 is used for encoding Unicode code points is
incidental to this.
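
The integer analogy can be tried out directly. This sketch uses the "V" directive (32-bit unsigned, little-endian) instead of "I", which is my substitution so that the result does not depend on the machine's native byte order.

```ruby
# Pack the number 8364 as a 32-bit little-endian unsigned integer.
packed = [8364].pack("V")

# Looking at the raw bytes shows how the number is laid out.
le_bytes = packed.unpack("C*")
p le_bytes                         # [172, 32, 0, 0]

# Unpacking with the matching directive recovers the number.
decoded = packed.unpack("V")
p decoded                          # [8364]
```

Same number, different byte layout: the directive passed to #unpack decides how the bytes are interpreted.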

To unpack the bytes from a string use #unpack("C*").
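
A small sketch of the byte-level view, again assuming Ruby 1.9 or later for the "\u20AC" escape. It also prints the bytes in octal, since the "\342\202\254" in the inspect output is octal notation for the same three bytes.

```ruby
euro = "\u20AC"                    # the euro sign, stored as UTF-8

# "C*" reads every byte as an unsigned 8-bit integer.
bytes = euro.unpack("C*")
p bytes                            # [226, 130, 172]

# The same bytes written in octal, as in the "\342\202\254" escapes.
octal = bytes.map { |b| b.to_s(8) }
p octal                            # ["342", "202", "254"]
```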

HTH.

On 24 Feb, 23:47, Carlos wrote:

Unicode code points is incidental to this.

To unpack the bytes from a string use #unpack("C*").

Thanks a million, Carlos. I never would have figured that out for
myself. I misunderstood the documentation for String#unpack:

C | Fixnum | extract a character as an unsigned integer
U | Integer | UTF-8 characters as unsigned integers

unpack('C*') does indeed give me what I want…

Cheers,
Greg

Greg H. wrote:

Thanks a million, Carlos. I never would have figured that out for
myself. I misunderstood the documentation for String#unpack:
C | Fixnum | extract a character as an unsigned integer
U | Integer | UTF-8 characters as unsigned integers

The problem here is the inconsistent use of "character" in the
documentation. A character is not a byte. The documentation
should be revised to use the two words only in their correct
contexts, with annotations to remind people of the distinction.
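
For what it's worth, the difference between the two words is easy to see in a quick sketch. This assumes Ruby 1.9 or later, where strings carry an encoding. On the 1.8 series, String#length counted bytes instead.

```ruby
euro = "\u20AC"                    # one character, three UTF-8 bytes

char_count = euro.length           # counts characters
byte_count = euro.bytesize         # counts bytes

p char_count                       # 1
p byte_count                       # 3
```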

Clifford H…
