Forum: Ruby Is \d supposed to match Unicode Numbers?

Posted by Gavin Kistner (Guest)
on 2011-08-09 22:28
(Received via mailing list)
I posted this as a question here:
http://stackoverflow.com/questions/6998713/scannin...

Summarized:
The Oniguruma docs[1] seem to say that \d is supposed to match the 
Unicode "Decimal_Number" category. However, in Ruby 1.9.1 and 1.9.2 it 
only matches Latin 0-9 characters. Is this the correct behavior for 
Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not 
apply to how Oniguruma is used within Ruby?

Test program:

#encoding: utf-8
require 'open-uri'
html = open("http://www.fileformat.info/info/unicode/category/N...
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')

puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…

p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]

Feel free to discuss here, or answer on Stack Overflow if you have a 
solid answer and want the rep :)


[1] http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt
Posted by Brian Candler (candlerb)
on 2011-08-09 23:38
Gavin Kistner wrote in post #1015799:
> Is this the correct behavior for
> Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
> apply to how Oniguruma is used within Ruby?

irb(main):001:0> "0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…".scan(/\d/)
=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
irb(main):002:0> "0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…".scan(/[[:digit:]]/)
=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "٠", "١", "٢", 
"٣", "٤", "٥", "٦", "٧", "٨", "٩", "۰", "۱", "۲", "۳", "۴", "۵", "۶", 
"۷", "۸", "۹", "߀", "߁", "߂", "߃", "߄", "߅", "߆", "߇", "߈", "߉", "०", 
"१", "२", "३", "४", "५", "६", "७", "८", "९", "০", "১", "২", "৩", "৪", 
"৫", "৬", "৭", "৮", "৯", "੦", "੧", "੨"]
irb(main):003:0>

irb(main):004:0> "abcdé".scan(/\w/)
=> ["a", "b", "c", "d"]
irb(main):005:0> "abcdé".scan(/[[:alpha:]]/)
=> ["a", "b", "c", "d", "é"]

So I think it's intentional and consistent behaviour (for some 
definition of consistent):

* \w and \d match only ASCII word characters and digits
* [[:alpha:]] and [[:digit:]] match the full Unicode set
Posted by Gavin Kistner (Guest)
on 2011-08-10 00:25
(Received via mailing list)
On Aug 09, 2011, at 03:38 PM, Brian Candler <b.candler@pobox.com> wrote:
> Gavin Kistner wrote in post #1015799:
>> Is this the correct behavior for
>> Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
>> apply to how Oniguruma is used within Ruby?
>
> * \w and \d match only Latin letters and digits
> * [[:alpha:]] and [[:digit:]] match the full unicode set

Definitely helpful in achieving the end goal - thanks!

Any guess as to how to reconcile this behavior with what the Oniguruma 
"ONIG_SYNTAX_RUBY" document says? Looking at the section on \w, we may 
have a clue:

  \w word character
    Not Unicode:alphanumeric, "_" and multibyte char.
    Unicode:General_Category -- (Letter|Mark|Number|Connector_Punctuation)
  [..]
  \d decimal digit char
    Unicode: General_Category -- Decimal_Number

Perhaps "Not Unicode" means "this is how it behaves in some non-Unicode 
mode", and "Unicode" means "this is how it behaves in some Unicode 
mode". And perhaps missing from the doc for \d is something like "Not 
Unicode: 0-9".

If that is correct then the next question for me is how to enable 
Unicode-mode for Oniguruma. The /u flag on a regexp does not do it, 
since:

  "ab".scan(/\w/) == "ab".scan(/\w/u)
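
A minimal sketch of the same check on a string that contains a non-ASCII letter, assuming Ruby 1.9 or later and a UTF-8 string. It compares \w with and without the /u flag against [[:word:]] and \p{Word}, which Ruby's own regexp documentation defines in terms of the Letter, Mark, Number and Connector_Punctuation categories. The variable names are illustrative only.

```ruby
# "abcdé", with é written as the single code point U+00E9
word = "abcd\u00E9"

plain    = word.scan(/\w/)          # \w without any flag
with_u   = word.scan(/\w/u)         # /u only fixes the regexp encoding to UTF-8
posix    = word.scan(/[[:word:]]/)  # POSIX bracket, Unicode-aware
property = word.scan(/\p{Word}/)    # Unicode property class

puts plain == with_u     # the /u flag does not widen \w
puts plain.length        # ASCII word characters only
puts posix.length        # includes U+00E9
puts property == posix
```

So when Unicode-aware matching is wanted, the bracket or property forms are the ones to reach for, rather than a regexp flag.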
Posted by Gavin Kistner (Guest)
on 2011-08-10 00:54
(Received via mailing list)
On Aug 09, 2011, at 02:28 PM, Gavin Kistner <phrogz@me.com> wrote:
> I posted this as a question here:
> http://stackoverflow.com/questions/6998713/scannin...
>
> Summarized:
> The Oniguruma docs[1] seem to say that \d is supposed to match the 
> Unicode "Decimal_Number" category. However, in Ruby 1.9.1 and 1.9.2 it 
> only matches Latin 0-9 characters. Is this the correct behavior for 
> Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not 
> apply to how Oniguruma is used within Ruby?
>
> [1] http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

It seems that the above Oniguruma document is not directly applicable to 
Ruby 1.9. See this ticket discussion[2], which includes these posts:

[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's 
regexp. We may need our own document."
[Matz] "Our Oniguruma is forked one. The original Oniguruma found in 
geocities.jp has not been changed."

This discussion was from almost 2 years ago, but sadly I have not been 
able to find an official Ruby 1.9 version of RE.txt.

[2]http://redmine.ruby-lang.org/issues/1889#note-28
Posted by Markus Fischer (Guest)
on 2011-08-10 01:27
(Received via mailing list)
On 10.08.2011 00:52, Gavin Kistner wrote:
> This discussion was from almost 2 years ago, but sadly I have not been
> able to find an official Ruby 1.9 version of RE.txt.

Hmm, I almost always refer to
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
and therein it's pretty accurate:

[...]
* /\d/ - A digit character ([0-9])
[...]
For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas
/[[:digit:]]/ matches any character in the Unicode Nd category
[...]
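
The documented distinction can be checked with a short sketch, assuming Ruby 1.9 or later. The sample string is built from code points (ASCII, Arabic-Indic and Devanagari digits) so that no non-ASCII literals are needed; the variable names are illustrative, and \p{Nd} is added as a third spelling of the Decimal_Number category.

```ruby
# Build a UTF-8 sample of 30 decimal digits from three scripts.
ascii      = ("0".."9").to_a.join              # U+0030..U+0039
arabic     = (0x0660..0x0669).to_a.pack("U*")  # ARABIC-INDIC DIGIT ZERO..NINE
devanagari = (0x0966..0x096F).to_a.pack("U*")  # DEVANAGARI DIGIT ZERO..NINE
sample     = ascii + arabic + devanagari

backslash_d = sample.scan(/\d/)           # ASCII 0-9 only
posix_digit = sample.scan(/[[:digit:]]/)  # any Unicode Nd character
prop_nd     = sample.scan(/\p{Nd}/)       # explicit Decimal_Number property

puts backslash_d.length
puts posix_digit.length
puts prop_nd == posix_digit
```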

HTH,
- Markus
Posted by Gavin Kistner (Guest)
on 2011-08-10 05:22
(Received via mailing list)
On Aug 9, 2011, at 5:25 PM, Markus Fischer wrote:
> On 10.08.2011 00:52, Gavin Kistner wrote:
>> This discussion was from almost 2 years ago, but sadly I have not been
>> able to find an official Ruby 1.9 version of RE.txt.
>
> Hmm, I almost always refer to
> https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
> and therein it's pretty accurate:
> [...]
> HTH,

It does help, thanks! :)
Posted by Eric Hodel (Guest)
on 2011-08-10 17:39
(Received via mailing list)
On Aug 9, 2011, at 16:25, Markus Fischer <markus@fischer.name> wrote:

> Hmm, I almost always refer to
> https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
> and therein it's pretty accurate:
>
> [...]
> * /\d/ - A digit character ([0-9])
> [...]
> For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas
> /[[:digit:]]/ matches any character in the Unicode Nd category
> [...]

This same content is found in:

ri Regexp