Is \d supposed to match Unicode Numbers?

I posted this as a question here:
http://stackoverflow.com/questions/6998713/scanning-for-unicode-numbers-in-a-string-with-d

Summarized:
The Oniguruma docs[1] seem to say that \d is supposed to match the
Unicode “Decimal_Number” category. However, in Ruby 1.9.1 and 1.9.2 it
only matches Latin 0-9 characters. Is this the correct behavior for
Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
apply to how Oniguruma is used within Ruby?

Test program:

# encoding: utf-8
require 'open-uri'
html = open("http://www.fileformat.info/info/unicode/category/Nd/list.htm").read
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')

puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…

p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]

Feel free to discuss here, or answer on Stack Overflow if you have a
solid answer and want the rep :)

[1] http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

Gavin K. wrote in post #1015799:

Is this the correct behavior for
Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
apply to how Oniguruma is used within Ruby?

irb(main):001:0> "0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…".scan(/\d/)
=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
irb(main):002:0> "0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…".scan(/[[:digit:]]/)
=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "٠", "١", "٢",
"٣", "٤", "٥", "٦", "٧", "٨", "٩", "۰", "۱", "۲", "۳", "۴", "۵", "۶",
"۷", "۸", "۹", "߀", "߁", "߂", "߃", "߄", "߅", "߆", "߇", "߈", "߉", "०",
"१", "२", "३", "४", "५", "६", "७", "८", "९", "০", "১", "২", "৩", "৪",
"৫", "৬", "৭", "৮", "৯", "੦", "੧", "੨"]
irb(main):003:0>

irb(main):004:0> "abcdé".scan(/\w/)
=> ["a", "b", "c", "d"]
irb(main):005:0> "abcdé".scan(/[[:alpha:]]/)
=> ["a", "b", "c", "d", "é"]

So I think it’s intentional and consistent behaviour (for some
definition of consistent):

  • \w and \d match only ASCII letters and digits
  • [[:alpha:]] and [[:digit:]] match the full Unicode set

On Aug 09, 2011, at 03:38 PM, Brian C. wrote:
Gavin K. wrote in post #1015799:

Is this the correct behavior for
Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
apply to how Oniguruma is used within Ruby?

  • \w and \d match only ASCII letters and digits
  • [[:alpha:]] and [[:digit:]] match the full Unicode set

Definitely helpful in achieving the end goal - thanks!

Any guess as to how to reconcile this behavior with what the Oniguruma
“ONIG_SYNTAX_RUBY” document says? Looking at the section on \w, we may
have a clue:

\w  word character
    Not Unicode: alphanumeric, "_" and multibyte char.
    Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
[…]
\d  decimal digit char
    Unicode: General_Category -- Decimal_Number

Perhaps “Not Unicode” means “this is how it behaves in some non-Unicode
mode”, and “Unicode” means “this is how it behaves in some Unicode
mode”. And perhaps missing from the doc for \d is something like “Not
Unicode: 0-9”.

If that is correct then the next question for me is how to enable
Unicode-mode for Oniguruma. The /u flag on a regexp does not do it,
since:

"ab".scan(/\w/) == "ab".scan(/\w/u)
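
One way to sidestep the "which mode" question might be to name the Unicode property directly instead of relying on the shorthand. A minimal sketch, assuming Ruby 1.9 or later and UTF-8 strings; the sample strings and variable names are made up for illustration and are not from the Oniguruma docs:

```ruby
# Hypothetical sample data: ASCII "42", then an Arabic-Indic digit
# (U+0663) and a Devanagari digit (U+0967).
digits = "42\u0663\u0967"
word   = "a\u00e9" # "a" followed by e-acute

ascii_digits   = digits.scan(/\d/)      # shorthand class
unicode_digits = digits.scan(/\p{Nd}/)  # explicit Decimal_Number property

ascii_word     = word.scan(/\w/)        # shorthand class
flagged_word   = word.scan(/\w/u)       # same pattern with the /u flag
unicode_word   = word.scan(/\p{Word}/)  # explicit Unicode word property

p ascii_digits.length, unicode_digits.length
p ascii_word == flagged_word
p unicode_word.length
```

The idea is that \p{Nd} and \p{Word} ask for the category by name, so the result does not depend on how \d and \w happen to be defined.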

On 10.08.2011 00:52, Gavin K. wrote:

This discussion was from almost 2 years ago, but sadly I have not been
able to find an official Ruby 1.9 version of RE.txt.

Hmm, I almost always refer to
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
and therein it’s pretty accurate:

[…]

  • /\d/ - A digit character ([0-9])
    […]
    For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas
    /[[:digit:]]/ matches any character in the Unicode Nd category
    […]

HTH,

  • Markus
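
For anyone wanting to see the difference described in re.rdoc on a small input, here is a rough sketch, assuming a UTF-8 string; the sample text and variable names are hypothetical:

```ruby
# Hypothetical sample: "10" in ASCII digits, then "10" in Arabic-Indic digits.
text = "10 / \u0661\u0660"

ascii_only = text.scan(/\d/)           # ASCII 0-9 only
any_script = text.scan(/[[:digit:]]/)  # any character in the Nd category
non_ascii  = any_script - ascii_only   # digits that \d skipped

puts "matched by \\d:          #{ascii_only.join}"
puts "matched by [[:digit:]]: #{any_script.join}"
puts "only by [[:digit:]]:    #{non_ascii.join}"
```

Note that matching a non-ASCII digit does not mean Ruby will parse it as a number, so this only illustrates the matching side.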

On Aug 09, 2011, at 02:28 PM, Gavin K. wrote:
I posted this as a question here:
http://stackoverflow.com/questions/6998713/scanning-for-unicode-numbers-in-a-string-with-d

Summarized:
The Oniguruma docs[1] seem to say that \d is supposed to match the
Unicode “Decimal_Number” category. However, in Ruby 1.9.1 and 1.9.2 it
only matches Latin 0-9 characters. Is this the correct behavior for
Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
apply to how Oniguruma is used within Ruby?

[1] http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

It seems that the above Oniguruma document is not directly applicable to
Ruby 1.9. See this ticket discussion[2], which includes these posts:

[Yui NARUSE] “RE.txt is for original Oniguruma, not for Ruby 1.9’s
regexp. We may need our own document.”
[Matz] “Our Oniguruma is forked one. The original Oniguruma found in
geocities.jp has not been changed.”

This discussion was from almost 2 years ago, but sadly I have not been
able to find an official Ruby 1.9 version of RE.txt.

[2] http://redmine.ruby-lang.org/issues/1889#note-28

On Aug 9, 2011, at 16:25, Markus F. wrote:

Hmm, I almost always refer to
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
and therein it’s pretty accurate:

[…]

  • /\d/ - A digit character ([0-9])
    […]
    For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas
    /[[:digit:]]/ matches any character in the Unicode Nd category
    […]

This same content is found in:

ri Regexp

On Aug 9, 2011, at 5:25 PM, Markus F. wrote:

On 10.08.2011 00:52, Gavin K. wrote:

This discussion was from almost 2 years ago, but sadly I have not been
able to find an official Ruby 1.9 version of RE.txt.

Hmm, I almost always refer to
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
and therein it’s pretty accurate:
[…]
HTH,

It does help, thanks! :)