Unicode in Regex

Jordan Callicoat wrote:

On Dec 3, 1:47 pm, Greg W. [email protected] wrote:

:slight_smile:

Seems like Ruby doesn’t like to use the hex \x{7HHHHHHH} variant?
Oniguruma is not in ruby 1.8 (though you can install it as a gem). It
is in 1.9.
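
For the record, a minimal sketch of what the 1.9 syntax looks like (this assumes a 1.9 build whose regexp literals accept the \u{} escape; the string and codepoint are only illustrative):

# -*- coding: utf-8 -*-
p "日本語" =~ /\u{672C}/   #=> 1   (U+672C is 本)
p "日本語" =~ /本/         #=> 1   a literal character in the pattern works too
# the 1.8 engine has no \x{...} / \u{...} style escape at all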

Oh. I always thought Oniguruma was the engine in Ruby.

Anyway – everyone, thanks for all the input. I believe I’m headed in
the right direction now, and have a better hands-on understanding of
UTF-8.

– gw

Daniel DeLorme wrote:

Heavy compared to what? Once compiled, regex are orders of magnitude
faster than jumping in and out of ruby interpreted code.

Sorry to beat a dead horse, but I just did an interesting little
experiment with 1.9:

str = "abcde"*1000
str.encoding
=> #<Encoding:ASCII-8BIT>

Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real
=> 0.010282039642334

str.force_encoding 'utf-8'
Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real
=> 1.29934501647949

arr = str.scan(/./u)
Benchmark.measure{10.times{ 1000.times{|i|arr[i]} }}.real
=> 0.00343608856201172

indexing into UTF-8 strings is expensive
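
If you really do need repeated random access on a 1.9 utf-8 string, one hedged workaround is to pay the O(n) decoding cost once, as in the numbers above, rather than on every index:

chars = str.scan(/./u)   # one O(n) pass over the utf-8 string
chars[1234]              # each access is now O(1)
# in 1.9 you can also use str.chars.to_a if your build has String#chars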

Daniel

Daniel DeLorme wrote:

Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real
=> 0.010282039642334

str.force_encoding 'utf-8'
Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real
=> 1.29934501647949

arr = str.scan(/./u)
Benchmark.measure{10.times{ 1000.times{|i|arr[i]} }}.real
=> 0.00343608856201172

indexing into UTF-8 strings is expensive

…but correct. I’d rather have correct than broken.

  • Charlie

MonkeeSage wrote:

I guess we were talking about different things then. I never meant to
imply that the regexp engine can’t match unicode characters

Since regular expressions are embedded in the very syntax of ruby, just
like arrays and hashes, IMHO that qualifies as unicode support. So yeah,
it seems like we have a semantic disagreement. :frowning:

I, like Charles (and I think most people), was referring to the
ability to index into strings by characters, find their lengths in
characters

That is certainly one way of supporting unicode but by no means the
only way. My belief is that you can do most string manipulations in a
way that obviates the need for char indexing & char length, if only you
change your mindset from “operating on individual characters” to
“operating on the string as a whole”. And since regex are a specialized
language for string manipulation, they’re also a lot faster. It’s a
little like imperative vs functional programming; if I told you about a
programming language that has no variable assignments you might think
it’s completely broken, and yet that’s how functional languages work.
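
To make that concrete, here is a small sketch of the mindset shift (the e-mail address is just an illustration; both lines work on 1.8 bytestrings holding UTF-8 because the delimiter is plain ASCII):

addr = "rené@example.org"
# index-and-slice mindset: find a position, then do arithmetic on it
domain = addr[addr.index("@") + 1 .. -1]
# whole-string mindset: let the regex describe the result directly
domain = addr[/@(.+)\z/, 1]        #=> "example.org"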

to compose and decompose composite characters, to
normalize characters, convert them to other encodings like shift-jis,
and other such things.

Converting encodings is a worthy goal but unrelated to unicode support.
As for character [de]composition that would be a very nice thing to have
if it was handled automatically (e.g. “a\314\200”=="\303\240") but if
the programmer has to worry about it then you might as well leave it to
a specialized library. Well, it’s not like ruby lets us abstract away
composite characters either in 1.8 or 1.9… I never claimed unicode
support was 100%, just good enough for most needs.
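
For reference, this is the behaviour being described; the byte escapes below are the decomposed ("a" plus U+0300) and precomposed (U+00E0) forms of à:

decomposed  = "a\314\200"    # "a" followed by COMBINING GRAVE ACCENT
precomposed = "\303\240"     # à as a single codepoint
p decomposed == precomposed  #=> false, in 1.8 and 1.9 alike; neither
                             #   normalizes for you, so NFC/NFD is left
                             #   to a specialized library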

just a difference of opinion. I don’t mind being wrong (happens a
lot! :wink:) I just don’t like being accused of spreading FUD about ruby,
which to my mind implies malice aforethought rather than a simple
mistake.

Yes, that was too harsh on my part. My apologies.

Daniel

On Dec 4, 7:58 pm, Daniel DeLorme [email protected] wrote:

characters

That is certainly one way of supporting unicode but by no means the
only way. My belief is that you can do most string manipulations in a
way that obviates the need for char indexing & char length, if only you
change your mindset from “operating on individual characters” to
“operating on the string as a whole”. And since regex are a specialized
language for string manipulation, they’re also a lot faster. It’s a
little like imperative vs functional programming; if I told you about a
programming language that has no variable assignments you might think
it’s completely broken, and yet that’s how functional languages work.

I think we’ll just have to agree to disagree. But there is one
point…

main = do
    let i_like = "I like "
    putStrLn $ i_like ++ haskell
  where
    haskell = "a functional language"

:wink:

support was 100%, just good enough for most needs.

just a difference of opinion. I don’t mind being wrong (happens a
lot! :wink:) I just don’t like being accused of spreading FUD about ruby,
which to my mind implies malice aforethought rather than a simple
mistake.

Yes, that was too harsh on my part. My apologies.

No worries. :slight_smile: I apologize as well for responding by saying you were
lying about unicode support; I see that we just have a difference of
opinion and were talking past each other.

Daniel

Regards,
Jordan

Daniel DeLorme said…

YES! And that’s exactly how it should be. Who is it that spread the
flawed idea that strings are fundamentally made of characters?

Are you being ironic?

I’d like
to slap him around a little. Fundamentally, ever since the word “string”
was applied to computing, strings were made of 8-BIT CHARS, not n-bit
characters. If only the creators of C had called that datatype “byte”
instead of “char” it would have saved us so many misunderstandings.

And look at the trouble we’re having ditching the waterfall method, all
because someone misread a paper in the 1970s or thereabouts.

You might want to spar with Tim B. from Sun who presented at RubyConf
2006, where his slides state:

“99.99999% of the time, programmers want to deal with characters not
bytes. I know of one exception: running a state machine on UTF8-encoded
text. This is done by the Expat XML parser.”

“In 2006, programmers around the world expect that, in modern languages,
strings are Unicode and string APIs provide Unicode semantics correctly
& efficiently, by default. Otherwise, they perceive this as an offense
against their language and their culture. Humanities/computing academics
often need to work outside Unicode. Few others do.”

He reviews his chat here:

ongoing by Tim Bray · Unicode and Ruby

and the slides are here:

http://www.tbray.org/talks/rubyconf2006.pdf

On Dec 5, 6:15 pm, Daniel DeLorme [email protected] wrote:

representation. If strings were fundamentally made of characters then we
believe that in 99% of those cases you would be better served with regex
than this pretend “array” disguise.

Daniel

Here is a micro-benchmark on three common string operations (split,
index, length), using bytestrings and unicode regexp, versus native
utf-8 strings in 1.9.0 (release).

$ ruby19 -v
ruby 1.9.0 (2007-10-15 patchlevel 0) [i686-linux]

$ echo && cat bench.rb
#!/usr/bin/ruby19
# -*- coding: ascii -*-

require "benchmark"
require "test/unit/assertions"
include Test::Unit::Assertions

$KCODE = "u"

$target = "!日本語!" * 100
$unichr = "本".force_encoding('utf-8')
$regchr = /[本]/u

def uni_split
  $target.split($unichr)
end
def reg_split
  $target.split($regchr)
end

def uni_index
  $target.index($unichr)
end
def reg_index
  $target =~ $regchr
end

def uni_chars
  $target.length
end
def reg_chars
  $target.unpack("U*").length
  # this is a lot slower:
  # $target.scan(/./u).length
end

$target.force_encoding("ascii")
a = reg_split
$target.force_encoding("utf-8")
b = uni_split
assert_equal(a.length, b.length)

$target.force_encoding("ascii")
a = reg_index
$target.force_encoding("utf-8")
b = uni_index
assert_equal(a-2, b)

$target.force_encoding("ascii")
a = reg_chars
$target.force_encoding("utf-8")
b = uni_chars
assert_equal(a, b)

n = 10_000
Benchmark.bm(12) { |x|
  $target.force_encoding("ascii")
  x.report("reg_split") { n.times { reg_split } }
  $target.force_encoding("utf-8")
  x.report("uni_split") { n.times { uni_split } }
  puts
  $target.force_encoding("ascii")
  x.report("reg_index") { n.times { reg_index } }
  $target.force_encoding("utf-8")
  x.report("uni_index") { n.times { uni_index } }
  puts
  $target.force_encoding("ascii")
  x.report("reg_chars") { n.times { reg_chars } }
  $target.force_encoding("utf-8")
  x.report("uni_chars") { n.times { uni_chars } }
}

====

With caches initialized and 5 prior runs, I got these numbers:

$ ruby19 bench.rb
user system total real
reg_split 2.550000 0.010000 2.560000 ( 2.799292)
uni_split 1.820000 0.020000 1.840000 ( 2.026265)

reg_index 0.040000 0.000000 0.040000 ( 0.097672)
uni_index 0.150000 0.000000 0.150000 ( 0.202700)

reg_chars 0.790000 0.010000 0.800000 ( 0.919995)
uni_chars 0.130000 0.000000 0.130000 ( 0.193307)

====

So String#=~ with a bytestring and unicode regexp is faster than
String#index on a native utf-8 string, taking roughly half the time. In
the other two cases, the opposite is true.

P.S. BTW, in case there is any confusion, bytestrings aren’t going
away; you can, as you see above, specify a magic encoding comment to
ensure that you have bytestrings by default. You can also explicitly
decode from utf-8 back to ascii, and you can get a byte enumerator (or
an array, by calling to_a on the enumerator) from String#bytes, and an
iterator from #each_byte, regardless of the encoding.
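
A minimal sketch of that byte-level API (the string is just an example; the counts assume UTF-8 source):

s = "日本語"              # 3 characters, 9 bytes of UTF-8
p s.bytes.to_a.size       #=> 9, whatever encoding the string is tagged with
n = 0
s.each_byte { |b| n += 1 }
p n                       #=> 9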

Regards,
Jordan

MonkeeSage wrote:

Here is a micro-benchmark on three common string operations (split,
index, length), using bytestrings and unicode regexp, versus native
utf-8 strings in 1.9.0 (release).

That’s nice, but split and index do not operate using integer indexing
into the string, so they are rather irrelevant to the topic at hand.
They produce the same results in ruby1.8, i.e. uni_split==reg_split and
uni_index==reg_index.

I also stated that the point of regex manipulation is to obviate the
need for methods like index and length. So a more accurate benchmark
might be something like:
reg_chars N/A N/A N/A ( N/A )
uni_chars 0.130000 0.000000 0.130000 ( 0.193307)
:wink:

Ps. BTW, in case there is any confusion, bytestrings aren’t going
away; you can, as you see above, specify a magic encoding comment to
ensure that you have bytestrings by default.

Yes, it’s still possible to access bytes but it’s not possible to run a
utf8 regex on a bytestring if it contains extended characters:

$ ruby1.9 -ve '"abc" =~ /b/u'
ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
$ ruby1.9 -ve '"日本語" =~ /本/u'
ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
-e:1:in `<main>': character encodings differ (ArgumentError)

And that kinda kills my whole approach.

Daniel

marc wrote:

Daniel DeLorme said…

MonkeeSage wrote:

Everything in ruby is a bytestring.
YES! And that’s exactly how it should be. Who is it that spread the
flawed idea that strings are fundamentally made of characters?

Are you being ironic?

Not at all. By “fundamentally” I mean the fundamental, lowest level of
representation. If strings were fundamentally made of characters then we
wouldn’t be able to access individual bytes because that’s a lower level
than the fundamental level, which is by definition impossible.

If you are using UCS2 it makes sense to consider strings as arrays of
characters because that’s what they are. But UTF8 strings do not follow
the characteristics of arrays at all. Each access into the “array” is
O(n) rather than O(1). So IMHO treating it as an array of characters is
a very leaky abstraction.
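
A hypothetical, illustration-only helper makes that O(n) cost visible: to reach the n-th character of a UTF-8 bytestring you have to walk from the start, skipping continuation bytes (10xxxxxx):

def nth_char(bytestr, n)   # assumes valid UTF-8 and a byte-wise String#[]
  bytes = bytestr.unpack("C*")
  start = 0
  n.times do
    start += 1
    start += 1 while bytes[start] && (bytes[start] & 0xC0) == 0x80
  end
  len = 1
  len += 1 while bytes[start + len] && (bytes[start + len] & 0xC0) == 0x80
  bytestr[start, len]      # 1.8 slices bytes; in 1.9 tag the string binary first
end

nth_char("日本語", 1)   #=> "本" under 1.8/-KU, after scanning 4 bytes to find it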

I agree that 99.9% of the time you want to deal with characters, and I
believe that in 99% of those cases you would be better served with regex
than this pretend “array” disguise.

Daniel

MonkeeSage wrote:

Heh, if the topic at hand is only that indexing into a string is
slower with native utf-8 strings (don’t disagree), then I guess it’s
irrelevant. :wink: Regarding the idea that you can do everything just as
efficiently with regexps as you can with native utf-8
encoding…it seems relevant.

How so? These methods work just as well in ruby1.8 which does not have
native utf8 encoding embedded in the strings. Of course, comparing a
string with a string is more efficient than comparing a string with a
regexp, but that is true regardless of whether the string has “native” utf8
encoding or not:

$ ruby1.8 -rbenchmark -KU
puts Benchmark.measure{100000.times{ "日本語".index("本") }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/) }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/u) }}.real
^D
0.225839138031006
0.304145097732544
0.313494920730591

$ ruby1.9 -rbenchmark -KU
puts Benchmark.measure{100000.times{ "日本語".index("本") }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/) }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/u) }}.real
^D
0.183344841003418
0.255104064941406
0.263553857803345

1.9 is more performant (one would hope so!) but the performance ratio
between string comparison and regex comparison does not seem affected by
the encoding at all.

Someone just posted a question today about how to printf("%20s ...",
a, ...) when "a" contains unicode (it screws up the alignment since
printf only counts byte width, not character width). There is no
elegant solution in 1.8, regexps or otherwise.

It’s not perfect in 1.9 either. "%20s" % "日本語" results in a string of
20 characters… that uses 23 columns of terminal space because the font
for Japanese uses double-width glyphs. In other words neither bytes nor
characters have an intrinsic “width” :-/
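
If one did want column counts, it would have to be a third notion, separate from both bytes and characters. A very rough sketch (the ranges below are a crude stand-in for the real East Asian Width tables):

def display_width(str)                      # str holds UTF-8 bytes
  str.unpack("U*").inject(0) do |w, cp|
    wide = (0x1100..0x115F).include?(cp) ||   # Hangul Jamo
           (0x2E80..0xA4CF).include?(cp) ||   # CJK, Kana, ...
           (0xAC00..0xD7A3).include?(cp) ||   # Hangul syllables
           (0xF900..0xFAFF).include?(cp) ||   # CJK compatibility
           (0xFF00..0xFF60).include?(cp)      # fullwidth forms
    w + (wide ? 2 : 1)
  end
end

display_width("日本語")   #=> 6 columns for 3 characters (9 bytes)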

Daniel

On Dec 5, 8:31 pm, Daniel DeLorme [email protected] wrote:

MonkeeSage wrote:

Here is a micro-benchmark on three common string operations (split,
index, length), using bytestrings and unicode regexp, versus native
utf-8 strings in 1.9.0 (release).

That’s nice, but split and index do not operate using integer indexing
into the string, so they are rather irrelevant to the topic at hand.

Heh, if the topic at hand is only that indexing into a string is
slower with native utf-8 strings (don’t disagree), then I guess it’s
irrelevant. :wink: Regarding the idea that you can do everything just as
efficiently with regexps as you can with native utf-8
encoding…it seems relevant. In other words, it goes to show a
general behavior that is benefited by a native implementation (the
same reason we’re using native hashes rather than building our own
implementations out of arrays of pairs).

They produce the same results in ruby1.8, i.e. uni_split==reg_split and
uni_index==reg_index.

Yes. My point was to show how a native implementation of unicode
strings affects performance compared to using regular expressions on
bytestrings. The behavior should be the same (hence the asserts).

I also stated that the point of regex manipulation is to obviate the
need for methods like index and length. So a more accurate benchmark
might be something like:
reg_chars N/A N/A N/A ( N/A )
uni_chars 0.130000 0.000000 0.130000 ( 0.193307)
:wink:

Someone just posted a question today about how to printf("%20s ...",
a, ...) when "a" contains unicode (it screws up the alignment since
printf only counts byte width, not character width). There is no
elegant solution in 1.8, regexps or otherwise. There are hackish
solutions (I provided one in that thread)…but the need was still
there. Another example is GtkTextView widgets from ruby-gtk2. They
deal with utf-8 in their C backend. So all the cursor functions that
deal with characters mean utf-8 characters, not bytestrings. So
without kludges, stuff doesn’t always work right.
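
For the record, one hackish 1.8-style workaround (a hypothetical helper, not the one actually posted in that thread) is to pad by hand using the character count instead of trusting %Ns:

def rjust_chars(str, width)           # UTF-8 input assumed
  chars = str.unpack("U*").length     # character count, not byte count
  (" " * [width - chars, 0].max) + str
end

puts rjust_chars("niño", 20)          # lines up by characters, unlike "%20s"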

ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
-e:1:in `<main>': character encodings differ (ArgumentError)

And that kinda kills my whole approach.

You can’t use mixed encodings (not just in regexps, not anywhere).
You’d have to use a proposed-but-not-implemented-in-1.9.0-release
command-line switch to set your encoding to ascii (or whatever), or
else use a magic comment [1] like I did above. Either that, or
explicitly put both objects in the same encoding.
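
A small sketch of both fixes (the byte escapes spell 日本語 and 本; whether your particular 1.9.0 accepts the \u{} regexp escape is an assumption here):

raw = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"   # 日本語 as plain bytes
p raw =~ /\xE6\x9C\xAC/n                       #=> 3, a byte index; both sides have no encoding

utf = raw.force_encoding("utf-8")              # or rely on a utf-8 magic comment
p utf =~ /\u{672C}/                            #=> 1, a character index; both sides are utf-8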

Daniel

Regards,
Jordan

[1] http://www.ruby-forum.com/topic/127831

Re: Unicode in Regex
Posted by Jordan Callicoat (monkeesage) on 03.12.2007 02:50

This seems to work…

$KCODE = "UTF8"
p /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF.'\ -]*?/u =~ "J�sp...it works"

=> 0


However, it looks to me like it would be more robust to use a slightly
modified version of UTF8REGEX (found in the link Jimmy posted
above)…

UTF8REGEX = /\A(?:
    [a-zA-Z.'\ -]
  | [\xC2-\xDF][\x80-\xBF]
  | \xE0[\xA0-\xBF][\x80-\xBF]
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
  | \xED[\x80-\x9F][\x80-\xBF]
  | \xF0[\x90-\xBF][\x80-\xBF]{2}
  | [\xF1-\xF3][\x80-\xBF]{3}
  | \xF4[\x80-\x8F][\x80-\xBF]{2}
)*\z/mnx

Just to avoid confusion over the meaning of ‘UTF8’ in UTF8REGEX: the n
option sets the encoding of UTF8REGEX to none!
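
A usage sketch under 1.8 (or with binary-tagged strings): the anchored pattern either consumes the whole string or fails, so a nil result means some byte sequence wasn't well-formed UTF-8 (the sample strings are just illustrations):

p !!(UTF8REGEX =~ "Jos\xC3\xA9")   #=> true,  \xC3\xA9 is a valid 2-byte sequence (é)
p !!(UTF8REGEX =~ "Jos\xE9")       #=> false, a lone latin-1 \xE9 is not valid UTF-8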

Cheers,

j. k.

On Dec 5, 11:29 pm, Daniel DeLorme [email protected] wrote:

regexp, but that is true regardless of whether the string has “native” utf8

between string comparison and regex comparison does not seem affected by
the encoding at all.

Ok, I wasn’t being clear. What I was trying to say is, yes the methods
perform the same on bytestrings – whether using regex or standard
string operations. The problem is in their behavior, not performance
considered in the abstract. In 1.9, using ascii default encoding, this
bytestring acts just like 1.8:

“ÆüËܸì”.index(“ËÜ”) #=> 3

That’s fine! Faster than a regexp, no problems. That is, unless I want
to know where the character match is (for whatever reason – take work-
necessitated interoperability with some software that required it).
For that I’d have to do something hackish and likely fragile. It’s
possible, but not desirable; whereas doing it natively gains
performance, and ruby already does all the work for you:

“ÆüËܸì”.force_encoding(“utf-8”).index(“ËÜ”.force_encoding(“utf-8”)) #=> 1

But it’s obviously more to type! That’s only because I’m using
ascii default encoding. There is, as I understand it, going to be a
way to specify default encoding from the command-line, and probably
from within ruby, rather than just the magic comments and
String#force_encoding; so this extra typing is incidental and will go
away. Actually, it goes away right now if you use utf-8 default and
use the byte api to get at the underlying bytestrings.

Daniel

It works as expected in 1.9; you just have to set the right encoding:

printf("%20s\n".force_encoding("utf-8"),
       "ni\xc3\xb1o".force_encoding("utf-8"))
#=> niño

printf("%20s\n", "niño")
#=> niño

In any case, I just don’t think there is any reason to dislike the new
string api. It adds another tool to the toolbox. It doesn’t make sense
to use it always, everywhere (like trying to make data that naturally
has the shape of an array, fit into a hash); but I see no reason to
try and cobble it together ourselves either (like building a hash api
from arrays ourselves). And with that, I’m going to sleep. Have to
think more on it tomorrow.

Peace,
Jordan