Is \d supposed to match Unicode Numbers?

luislavena · August 9, 2011, 10:28pm

I posted this as a question here:

Summarized:
The Oniguruma docs[1] seem to say that \d is supposed to match the
Unicode “Decimal_Number” category. However, in Ruby 1.9.1 and 1.9.2 it
only matches Latin 0-9 characters. Is this the correct behavior for
Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
apply to how Oniguruma is used within Ruby?

Test program:

#encoding: utf-8
require 'open-uri'
html =
open("Unicode Characters in the 'Number, Decimal Digit' Category").read
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16)
}.pack('U*')

puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…

p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]

Feel free to discuss here, or answer on Stack Overflow if you have a
solid answer and want the rep

[1] Notice of service termination (the linked Oniguruma document is no longer hosted)

phrogz · August 9, 2011, 11:38pm

Gavin K. wrote in post #1015799:

Is this the correct behavior for
Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
apply to how Oniguruma is used within Ruby?

irb(main):001:0> "0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…".scan(/\d/)
=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
irb(main):002:0> "0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…".scan(/[[:digit:]]/)
=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "٠", "١", "٢",
"٣", "٤", "٥", "٦", "٧", "٨", "٩", "۰", "۱", "۲", "۳", "۴", "۵", "۶",
"۷", "۸", "۹", "߀", "߁", "߂", "߃", "߄", "߅", "߆", "߇", "߈", "߉", "०",
"१", "२", "३", "४", "५", "६", "७", "८", "९", "০", "১", "২", "৩", "৪",
"৫", "৬", "৭", "৮", "৯", "੦", "੧", "੨"]
irb(main):003:0>

irb(main):004:0> "abcdé".scan(/\w/)
=> ["a", "b", "c", "d"]
irb(main):005:0> "abcdé".scan(/[[:alpha:]]/)
=> ["a", "b", "c", "d", "é"]

So I think it’s intentional and consistent behaviour (for some
definition of consistent):

\w and \d match only Latin letters and digits
[[:alpha:]] and [[:digit:]] match the full unicode set
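
For anyone who wants to reproduce the difference without fetching a web page, here is a minimal sketch. It builds the test string from code points, so the choice of scripts (ASCII, Arabic-Indic, Devanagari) and all variable names are my own assumptions for illustration, not part of the original test program.

```ruby
# encoding: utf-8

# Ten ASCII digits, ten Arabic-Indic digits (U+0660..U+0669) and
# ten Devanagari digits (U+0966..U+096F), packed into one UTF-8 string.
digits = [*0x30..0x39, *0x0660..0x0669, *0x0966..0x096F].pack('U*')

ascii_only   = digits.scan(/\d/)           # shorthand class: ASCII 0-9 only
posix_digits = digits.scan(/[[:digit:]]/)  # POSIX bracket: any Unicode decimal digit

puts "\\d matched #{ascii_only.length} characters"
puts "[[:digit:]] matched #{posix_digits.length} characters"

# The same split shows up between \w and [[:alpha:]].
letters     = "abcd\u00E9"                 # "abcd" followed by e-acute
word_chars  = letters.scan(/\w/)
alpha_chars = letters.scan(/[[:alpha:]]/)

p word_chars
p alpha_chars
```

With a UTF-8 string, the shorthand classes should stop at the ASCII characters while the POSIX brackets pick up the rest.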

phrogz · August 10, 2011, 12:25am

On Aug 09, 2011, at 03:38 PM, Brian C. wrote:
Gavin K. wrote in post #1015799:

Is this the correct behavior for
Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
apply to how Oniguruma is used within Ruby?

\w and \d match only Latin letters and digits
[[:alpha:]] and [[:digit:]] match the full unicode set

Definitely helpful in achieving the end goal - thanks!

Any guess as to how to reconcile this behavior with what the Oniguruma
“ONIG_SYNTAX_RUBY” document says? Looking at the section on \w, we may
have a clue:

\w  word character
    Not Unicode: alphanumeric, "_" and multibyte char.
    Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
[…]
\d  decimal digit char
    Unicode: General_Category -- Decimal_Number

Perhaps “Not Unicode” means “this is how it behaves in some non-Unicode
mode”, and “Unicode” means “this is how it behaves in some Unicode
mode”. And perhaps missing from the doc for \d is something like “Not
Unicode: 0-9”.

If that is correct then the next question for me is how to enable
Unicode-mode for Oniguruma. The /u flag on a regexp does not do it,
since:

"ab".scan(/\w/) == "ab".scan(/\w/u)
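
One way to get the Unicode behaviour without any mode switch is to name the Unicode property directly. The sketch below assumes the \p{Nd} and \p{Word} property classes, which to my knowledge Ruby's regexp engine has accepted since 1.9; the sample string and variable names are made up for the example.

```ruby
# encoding: utf-8

# "ab", then e-acute (U+00E9), then ARABIC-INDIC DIGIT THREE (U+0663).
text = "ab\u00E9\u0663"

ascii_words    = text.scan(/\w/)         # ASCII word characters only
ascii_words_u  = text.scan(/\w/u)        # the /u flag does not widen \w
unicode_words  = text.scan(/\p{Word}/)   # Letter, Mark, Number, Connector_Punctuation
unicode_digits = text.scan(/\p{Nd}/)     # General_Category Decimal_Number

p ascii_words
p ascii_words_u
p unicode_words
p unicode_digits
```

So \p{Word} and \p{Nd} appear to be the spellings that behave the way the Oniguruma document describes \w and \d in its "Unicode" case.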

phrogz · August 10, 2011, 1:27am

On 10.08.2011 00:52, Gavin K. wrote:

This discussion was from almost 2 years ago, but sadly I have not been
able to find an official Ruby 1.9 version of RE.txt.

Hmm, I almost always refer to
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
and therein it’s pretty accurate:

[…]

/\d/ - A digit character ([0-9])
[…]
For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas
/[[:digit:]]/ matches any character in the Unicode Nd category
[…]
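
In practice the distinction matters most when validating input. A small sketch, with hypothetical helper names of my own (ascii_integer? and any_script_integer? are not from the Ruby docs):

```ruby
# encoding: utf-8

# True only for strings made up entirely of ASCII digits 0-9.
def ascii_integer?(str)
  !!(str =~ /\A\d+\z/)
end

# True for strings made up entirely of decimal digits from any script.
def any_script_integer?(str)
  !!(str =~ /\A[[:digit:]]+\z/)
end

arabic = [0x0661, 0x0662, 0x0663].pack('U*')  # Arabic-Indic digits one, two, three

p ascii_integer?("123")
p ascii_integer?(arabic)
p any_script_integer?(arabic)
p any_script_integer?("12a")
```

Which helper is the right one depends on what consumes the string afterwards; code that expects ASCII digits would want the \d version.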

HTH,

Markus

phrogz · August 10, 2011, 12:54am

On Aug 09, 2011, at 02:28 PM, Gavin K. wrote:
I posted this as a question here:

Summarized:
The Oniguruma docs[1] seem to say that \d is supposed to match the
Unicode “Decimal_Number” category. However, in Ruby 1.9.1 and 1.9.2 it
only matches Latin 0-9 characters. Is this the correct behavior for
Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
apply to how Oniguruma is used within Ruby?

[1] Notice of service termination (the linked Oniguruma document is no longer hosted)

It seems that the above Oniguruma document is not directly applicable to
Ruby 1.9. See this ticket discussion[2], which includes these posts:

[Yui NARUSE] “RE.txt is for original Oniguruma, not for Ruby 1.9’s
regexp. We may need our own document.”
[Matz] “Our Oniguruma is forked one. The original Oniguruma found in
geocities.jp has not been changed.”

This discussion was from almost 2 years ago, but sadly I have not been
able to find an official Ruby 1.9 version of RE.txt.

[2] Feature #1889: Teach Onigurma Unicode 5.0 Character Properties - Ruby master - Ruby Issue Tracking System

phrogz · August 10, 2011, 5:39pm

On Aug 9, 2011, at 16:25, Markus F. wrote:

Hmm, I almost always refer to
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
and therein it’s pretty accurate:

[…]

/\d/ - A digit character ([0-9])
[…]
For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas
/[[:digit:]]/ matches any character in the Unicode Nd category
[…]

This same content is found in:

ri Regexp

phrogz · August 10, 2011, 5:22am

On Aug 9, 2011, at 5:25 PM, Markus F. wrote:

On 10.08.2011 00:52, Gavin K. wrote:

This discussion was from almost 2 years ago, but sadly I have not been
able to find an official Ruby 1.9 version of RE.txt.

Hmm, I almost always refer to
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
and therein it’s pretty accurate:
[…]
HTH,

It does help, thanks!