Forum: Ruby UTF-8

Mark J. Reed (Guest)
on 2006-01-06 16:09
(Received via mailing list)
Okay, last I checked, strings were just treated as collections of bytes, and
any multibyte character semantics were up to the programmer to implement.  But
I just noticed that in 1.8.3, utf8string.split(//) yields an array of
strings, each containing a single UTF-8 character, irrespective of byte
count.

So are regexes in general Unicode-aware now?  Any other UTF-8 tidbits
in there I should know about?

Thanks!
Paul Battley (Guest)
on 2006-01-06 19:42
(Received via mailing list)
On 06/01/06, Mark J. Reed <mreed@thereeds.org> wrote:
> Okay, last I checked, strings were just treated as collections of bytes, and
> any multibyte character semantics were up to the programmer to implement.  But
> I just noticed that in 1.8.3, utf8string.split(//) yields an array of
> strings, each containing a single UTF-8 character, irrespective of byte
> count.
>
> So are regexes in general Unicode-aware now?

Regular expressions are UTF-8-aware if $KCODE is set to 'u' or if there
is a u specifier after the regular expression (e.g. /./u). This has been
the case since at least 1.8.2 (I don't have any other versions to hand to
check right at this moment, but I'm pretty confident that 1.8.1,
1.8.3, and 1.8.4 operate similarly).
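
A minimal sketch of the u specifier in action. The sample string is built with pack('U*') from code points so that the snippet does not depend on the source file's encoding; the variable names are only illustrative. Only the u specifier is shown, since $KCODE no longer has any effect from Ruby 1.9 onwards.

```ruby
# Three characters (U+65E5 U+672C U+8A9E), three bytes each in UTF-8.
utf8string = [0x65E5, 0x672C, 0x8A9E].pack('U*')

# With the u specifier, an empty pattern splits between characters,
# not between bytes.
chars = utf8string.split(//u)

# Likewise, . consumes one whole character per match.
matched = utf8string.scan(/./u)

puts utf8string.unpack('C*').length  # 9 (bytes)
puts chars.length                    # 3 (characters)
puts matched.length                  # 3 (characters)
```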

> Any other UTF-8 tidbits in there I should know about?

In regular expressions? You should be aware that /./u matches a whole
UTF-8-encoded code point, but character-class ranges only work on byte
values (e.g. /[\x00-\xff]/). As UTF-8 sequences are distinct (that is, the
byte sequence for one character never occurs inside a longer sequence with
a different meaning), matching is not generally a problem. When replacing,
you have to make sure that you aren't replacing part of a multibyte
sequence, or you'll end up with illegal sequences.
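
To illustrate what an illegal sequence looks like, here is a rough sketch that cuts a string in the middle of a multibyte character. It uses a byte-level cut rather than a regex replacement, purely because that is the simplest way to provoke the problem; well_formed_utf8? is a hypothetical helper name, relying on unpack('U*') raising ArgumentError for malformed input.

```ruby
# Two characters (U+65E5 U+672C), three bytes each in UTF-8.
text = [0x65E5, 0x672C].pack('U*')

# Keep only the first four bytes: one whole character plus the lead
# byte of the second one, with its continuation bytes missing.
bytes = text.unpack('C*')
cut   = bytes[0, 4].pack('C*')

# Hypothetical helper: unpack('U*') decodes the bytes as UTF-8 and
# raises ArgumentError when it meets a malformed sequence.
def well_formed_utf8?(str)
  str.unpack('U*')
  true
rescue ArgumentError
  false
end

puts well_formed_utf8?(text)  # true
puts well_formed_utf8?(cut)   # false
```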

Here's a UTF-8 regular expression trick to truncate a string safely:
string[/.{0,#{max_length}}/u]
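
Wrapped in a method, the trick might look like the sketch below. truncate_utf8 is a made-up name, and max_length is assumed to be a non-negative Integer, since it is interpolated straight into the pattern.

```ruby
# Hypothetical wrapper around the pattern above: returns at most
# max_length whole characters from the start of the string.
def truncate_utf8(string, max_length)
  string[/.{0,#{max_length}}/u]
end

# Four characters, three bytes each in UTF-8.
sample = [0x65E5, 0x672C, 0x8A9E, 0x30C6].pack('U*')
short  = truncate_utf8(sample, 2)

puts short.unpack('U*').length  # 2 (characters)
puts short.unpack('C*').length  # 6 (bytes, so no character was cut in half)
```

One caveat: without the m specifier, . does not match a newline, so the result also stops at the first line break in the string.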

There are plenty of other UTF-8 tricks to be done using pack/unpack
with 'U*', as well...
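
For example, a few of the things 'U*' makes possible, as a rough sketch with illustrative variable names: unpack('U*') turns a UTF-8 string into an array of code points, and pack('U*') turns such an array back into a string.

```ruby
# "résumé" built from code points: 6 characters, 8 bytes in UTF-8.
word = [0x72, 0xE9, 0x73, 0x75, 0x6D, 0xE9].pack('U*')

# One Integer per character.
codepoints = word.unpack('U*')

# Character count rather than byte count.
char_count = codepoints.length

# Reverse by character instead of by byte.
reversed = codepoints.reverse.pack('U*')

# Take the first three characters without splitting a multibyte one.
first_three = codepoints[0, 3].pack('U*')

puts char_count                       # 6
puts word.unpack('C*').length         # 8
puts first_three.unpack('C*').length  # 4
```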

Paul.