Forum: Ruby Is \d supposed to match Unicode Numbers?

Gavin Kistner (Guest)
on 2011-08-09 22:28
(Received via mailing list)
I posted this as a question here:
http://stackoverflow.com/questions/6998713/scannin...

Summarized:
The Oniguruma docs[1] seem to say that \d is supposed to match the
Unicode "Decimal_Number" category. However, in Ruby 1.9.1 and 1.9.2 it
only matches Latin 0-9 characters. Is this the correct behavior for
Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
apply to how Oniguruma is used within Ruby?

Test program:

#encoding: utf-8
require 'open-uri'
html =
open("http://www.fileformat.info/info/unicode/category/N...
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16)
}.pack('U*')

puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…

p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
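For anyone who wants to reproduce this without fetching the page, here is a minimal offline sketch. It assumes three known blocks of decimal digits (ASCII, Arabic-Indic, Devanagari) rather than the full list scraped above; the constant names DIGIT_RANGES and SAMPLE_DIGITS are made up for illustration.

```ruby
# Build a sample of Unicode decimal digits from known code point ranges,
# instead of scraping them from a web page.
# DIGIT_RANGES and SAMPLE_DIGITS are illustrative names, not from the
# original test program.
DIGIT_RANGES = [
  0x0030..0x0039, # ASCII 0-9
  0x0660..0x0669, # Arabic-Indic digits
  0x0966..0x096F  # Devanagari digits
]

# pack('U*') turns the code points into a UTF-8 encoded string.
SAMPLE_DIGITS = DIGIT_RANGES.flat_map(&:to_a).pack('U*')

puts SAMPLE_DIGITS.encoding
#=> UTF-8

p SAMPLE_DIGITS.scan(/\d/)
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
```

The string holds 30 decimal digits, yet the \d scan returns only the ten ASCII ones, which matches the behavior reported above.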

Feel free to discuss here, or answer on Stack Overflow if you have a
solid answer and want the rep :)


[1] http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt
Brian Candler (candlerb)
on 2011-08-09 23:38
Gavin Kistner wrote in post #1015799:
> Is this the correct behavior for
> Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
> apply to how Oniguruma is used within Ruby?

irb(main):001:0>
"0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…".scan(/\d/)
=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
irb(main):002:0>
"0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…".scan(/[[:digit:]]/)
=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "٠", "١", "٢",
"٣", "٤", "٥", "٦", "٧", "٨", "٩", "۰", "۱", "۲", "۳", "۴", "۵", "۶",
"۷", "۸", "۹", "߀", "߁", "߂", "߃", "߄", "߅", "߆", "߇", "߈", "߉", "०",
"१", "२", "३", "४", "५", "६", "७", "८", "९", "০", "১", "২", "৩", "৪",
"৫", "৬", "৭", "৮", "৯", "੦", "੧", "੨"]
irb(main):003:0>

irb(main):004:0> "abcdé".scan(/\w/)
=> ["a", "b", "c", "d"]
irb(main):005:0> "abcdé".scan(/[[:alpha:]]/)
=> ["a", "b", "c", "d", "é"]

So I think it's intentional and consistent behaviour (for some
definition of consistent):

* \w and \d match only Latin letters and digits
* [[:alpha:]] and [[:digit:]] match the full unicode set
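The same comparison can be sketched as a small script. This is only an illustration: the helper name digit_matches is hypothetical, and \p{Nd} (the Unicode Decimal_Number property) is added as a third spelling that, as far as I know, matches the same characters as [[:digit:]] on UTF-8 strings.

```ruby
# encoding: utf-8

# Hypothetical helper: report which characters each digit class matches.
def digit_matches(text)
  {
    backslash_d: text.scan(/\d/),          # ASCII 0-9 only
    posix_digit: text.scan(/[[:digit:]]/), # Unicode-aware POSIX bracket
    nd_property: text.scan(/\p{Nd}/)       # Unicode Decimal_Number property
  }
end

# "7", ARABIC-INDIC DIGIT THREE (U+0663), DEVANAGARI DIGIT THREE (U+0969)
mixed = "7\u0663\u0969"

result = digit_matches(mixed)
p result[:backslash_d]      #=> ["7"]
p result[:posix_digit].size #=> 3
p result[:nd_property].size #=> 3
```

So if the goal is to pick up every decimal digit, [[:digit:]] or \p{Nd} looks like the spelling to use, and \d stays the right choice when only 0-9 should be accepted.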
Gavin Kistner (Guest)
on 2011-08-10 00:25
(Received via mailing list)
On Aug 09, 2011, at 03:38 PM, Brian Candler <b.candler@pobox.com> wrote:
> Gavin Kistner wrote in post #1015799:
>> Is this the correct behavior for
>> Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
>> apply to how Oniguruma is used within Ruby?
>
> * \w and \d match only Latin letters and digits
> * [[:alpha:]] and [[:digit:]] match the full unicode set

Definitely helpful in achieving the end goal - thanks!

Any guess as to how to reconcile this behavior with what the Oniguruma
"ONIG_SYNTAX_RUBY" document says? Looking at the section on \w, we may
have a clue:

  \w word character
    Not Unicode: alphanumeric, "_" and multibyte char.
    Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
  [..]
  \d decimal digit char
    Unicode: General_Category -- Decimal_Number

Perhaps "Not Unicode" means "this is how it behaves in some non-Unicode
mode", and "Unicode" means "this is how it behaves in some Unicode
mode". And perhaps missing from the doc for \d is something like "Not
Unicode: 0-9".

If that is correct then the next question for me is how to enable
Unicode-mode for Oniguruma. The /u flag on a regexp does not do it,
since:

  "ab".scan(/\w/) == "ab".scan(/\w/u)
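A slightly more telling check uses a non-ASCII letter, since "ab" would match either way. The sketch below is an illustration only: WORD_SAMPLE and word_matches are made-up names, and the reading that /u merely pins the regexp's encoding to UTF-8 (without widening \w) is my understanding of Ruby 1.9 and later rather than something stated in the Oniguruma doc.

```ruby
# encoding: utf-8

# "d" followed by LATIN SMALL LETTER E WITH ACUTE (U+00E9).
# WORD_SAMPLE and word_matches are illustrative names.
WORD_SAMPLE = "d\u00E9"

# Compare the \w shorthand, with and without the /u flag,
# against the Unicode-aware POSIX bracket [[:word:]].
def word_matches(text)
  {
    plain:       text.scan(/\w/),
    with_u_flag: text.scan(/\w/u),
    posix_word:  text.scan(/[[:word:]]/)
  }
end

result = word_matches(WORD_SAMPLE)
p result[:plain]       #=> ["d"]
p result[:with_u_flag] #=> ["d"]
p result[:posix_word].size #=> 2

# The /u flag fixes the regexp's encoding; it does not change what \w means.
p(/\w/u.encoding)        #=> #<Encoding:UTF-8>
p(/\w/u.fixed_encoding?) #=> true
```

In other words, the flag does not appear to be the "Unicode mode" switch the Oniguruma doc hints at; the bracket and property forms are the way to get Unicode-aware matching here.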
Gavin Kistner (Guest)
on 2011-08-10 00:54
(Received via mailing list)
On Aug 09, 2011, at 02:28 PM, Gavin Kistner <phrogz@me.com> wrote:
> I posted this as a question here:
> http://stackoverflow.com/questions/6998713/scannin...
>
> Summarized:
> The Oniguruma docs[1] seem to say that \d is supposed to match the
> Unicode "Decimal_Number" category. However, in Ruby 1.9.1 and 1.9.2 it
> only matches Latin 0-9 characters. Is this the correct behavior for
> Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
> apply to how Oniguruma is used within Ruby?
>
> [1] http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

It seems that the above Oniguruma document is not directly applicable to
Ruby 1.9. See this ticket discussion[2], which includes these posts:

[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's
regexp. We may need our own document."
[Matz] "Our Oniguruma is forked one. The original Oniguruma found in
geocities.jp has not been changed."

This discussion was from almost 2 years ago, but sadly I have not been
able to find an official Ruby 1.9 version of RE.txt.

[2]http://redmine.ruby-lang.org/issues/1889#note-28
Markus Fischer (Guest)
on 2011-08-10 01:27
(Received via mailing list)
On 10.08.2011 00:52, Gavin Kistner wrote:
> This discussion was from almost 2 years ago, but sadly I have not been
> able to find an official Ruby 1.9 version of RE.txt.

Hmm, I almost always refer to
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
and therein it's pretty accurate:

[...]
* /\d/ - A digit character ([0-9])
[...]
For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas
/[[:digit:]]/ matches any character in the Unicode Nd category
[...]

HTH,
- Markus
Gavin Kistner (Guest)
on 2011-08-10 05:22
(Received via mailing list)
On Aug 9, 2011, at 5:25 PM, Markus Fischer wrote:
> On 10.08.2011 00:52, Gavin Kistner wrote:
>> This discussion was from almost 2 years ago, but sadly I have not been
>> able to find an official Ruby 1.9 version of RE.txt.
>
> Hmm, I almost always refer to
> https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
> and therein it's pretty accurate:
> [...]
> HTH,

It does help, thanks! :)
Eric Hodel (Guest)
on 2011-08-10 17:39
(Received via mailing list)
On Aug 9, 2011, at 16:25, Markus Fischer <markus@fischer.name> wrote:

> Hmm, I almost always refer to
> https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
> and therein it's pretty accurate:
>
> [...]
> * /\d/ - A digit character ([0-9])
> [...]
> For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas
> /[[:digit:]]/ matches any character in the Unicode Nd category
> [...]

This same content is found in:

ri Regexp