UTF-8

Mark_J.Reed · January 6, 2006, 4:09pm

Okay, last I checked, strings were just treated as collections of bytes, and
any multibyte character semantics were up to the programmer to implement. But
I just noticed that in 1.8.3, utf8string.split(//) yields an array of
strings, each containing a single UTF-8 character, irrespective of byte
count.

So are regexes in general Unicode-aware now? Any other UTF-8 tidbits
in there I should know about?

Thanks!

Mark_J.Reed · January 6, 2006, 7:42pm

On 06/01/06, Mark J. Reed wrote:

Okay, last I checked, strings were just treated as collections of bytes, and
any multibyte character semantics were up to the programmer to implement. But
I just noticed that in 1.8.3, utf8string.split(//) yields an array of
strings, each containing a single UTF-8 character, irrespective of byte
count.

So are regexes in general Unicode-aware now?

Regular expressions are UTF-8-aware if $KCODE is set to ‘u’ or there
is a u specifier after the regular expression (e.g. /./u). This has
been the case since at least 1.8.2 (I don’t have any other versions to
hand to check right now, but I’m fairly confident that 1.8.1, 1.8.3,
and 1.8.4 behave the same way).
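
To make that concrete, here is a minimal sketch of the u specifier at work. The sample string is my own choice, built from code points with pack('U*') so the example does not depend on the encoding of the source file; the variable names are just for illustration.

```ruby
# "héllo": five characters, but six bytes in UTF-8 (é takes two bytes).
s = [0x68, 0xE9, 0x6C, 0x6C, 0x6F].pack('U*')

# With the u specifier the regex engine works on whole UTF-8 characters.
# On 1.8, setting $KCODE = 'u' would have the same effect for plain // literals.
chars = s.split(//u)    # one array element per character
dots  = s.scan(/./u)    # /./u consumes one whole character per match

byte_count = s.unpack('C*').size   # 'C*' looks at the raw bytes instead

puts "#{chars.size} characters, #{byte_count} bytes"
```

As far as I know, on Ruby 1.9 and later the strings themselves carry an encoding, so the u specifier is no longer needed for this to work on UTF-8 strings, but it does no harm.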

Any other UTF-8 tidbits in there I should know about?

In regular expressions? You should be aware that /./u matches a whole
UTF-8 character (one code point, however many bytes it occupies), but
ranges only work on byte values (e.g. /[\x00-\xff]/). As UTF-8 is
self-synchronizing (that is, the byte sequence for one character never
appears inside the byte sequence for a different character), matching
is not generally a problem. When replacing, you have to make sure that
you aren’t replacing part of a multibyte sequence, or you’ll end up
with illegal sequences.

Here’s a UTF-8 regular expression trick to truncate a string safely to
at most max_length characters:
string[/.{0,#{max_length}}/u]
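
A rough sketch of why the regex version is the safe one, compared with cutting at a byte offset. The helper names truncate_chars and valid_utf8? are hypothetical, made up for this example; the validity check leans on unpack('U*') raising ArgumentError for a malformed sequence.

```ruby
# Hypothetical helper wrapping the one-liner above.
def truncate_chars(string, max_length)
  string[/.{0,#{max_length}}/u]
end

# Hypothetical helper: unpack('U*') raises ArgumentError on malformed UTF-8.
def valid_utf8?(string)
  string.unpack('U*')
  true
rescue ArgumentError
  false
end

s = [0x68, 0xE9, 0x6C, 0x6C, 0x6F].pack('U*')   # "héllo", 6 bytes

safe = truncate_chars(s, 2)              # "hé": two whole characters
cut  = s.unpack('C*')[0, 2].pack('C*')   # first two bytes: "h" plus half of "é"

puts valid_utf8?(safe)
puts valid_utf8?(cut)
```

One caveat: without the m specifier, . does not match a newline, so the truncation also stops at the first line break. Using /.{0,#{max_length}}/mu should avoid that if the string can span several lines.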

There are plenty of other UTF-8 tricks to be done using pack/unpack
with ‘U*’, as well…
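
A couple of the pack/unpack tricks, sketched under the assumption that the input is valid UTF-8 (unpack('U*') raises ArgumentError otherwise). The sample string and variable names are only for illustration.

```ruby
s = [0x68, 0xE9, 0x6C, 0x6C, 0x6F].pack('U*')   # "héllo"

# unpack('U*') turns a UTF-8 string into an array of code points...
codepoints = s.unpack('U*')

# ...so ordinary Array methods become character-level operations.
char_count  = codepoints.size                   # length in characters, not bytes
reversed    = codepoints.reverse.pack('U*')     # reverse by character
first_three = codepoints[0, 3].pack('U*')       # character-based slice

puts char_count
puts reversed
puts first_three
```

Working on code points keeps multibyte characters intact, though it still treats a base letter and a following combining mark as separate items.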

Paul.