I posted this as a question here:
http://stackoverflow.com/questions/6998713/scannin...

Summarized: The Oniguruma docs[1] seem to say that \d is supposed to match the Unicode "Decimal_Number" category. However, in Ruby 1.9.1 and 1.9.2 it only matches Latin 0-9 characters. Is this the correct behavior for Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not apply to how Oniguruma is used within Ruby?

Test program:

    #encoding: utf-8
    require 'open-uri'
    html = open("http://www.fileformat.info/info/unicode/category/N...
    digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')

    puts digits.encoding, digits
    #=> UTF-8
    #=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…

    p RUBY_DESCRIPTION, digits.scan(/\d/)
    #=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
    #=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]

Feel free to discuss here, or answer on Stack Overflow if you have a solid answer and want the rep :)

[1] http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt
on 2011-08-09 22:28
on 2011-08-09 23:38
Gavin Kistner wrote in post #1015799:
> Is this the correct behavior for
> Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
> apply to how Oniguruma is used within Ruby?

    irb(main):001:0> "0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…".scan(/\d/)
    => ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    irb(main):002:0> "0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…".scan(/[[:digit:]]/)
    => ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "٠", "١", "٢", "٣", "٤", "٥", "٦", "٧", "٨", "٩", "۰", "۱", "۲", "۳", "۴", "۵", "۶", "۷", "۸", "۹", "߀", "߁", "߂", "߃", "߄", "߅", "߆", "߇", "߈", "߉", "०", "१", "२", "३", "४", "५", "६", "७", "८", "९", "০", "১", "২", "৩", "৪", "৫", "৬", "৭", "৮", "৯", "੦", "੧", "੨"]
    irb(main):003:0>
    irb(main):004:0> "abcdé".scan(/\w/)
    => ["a", "b", "c", "d"]
    irb(main):005:0> "abcdé".scan(/[[:alpha:]]/)
    => ["a", "b", "c", "d", "é"]

So I think it's intentional and consistent behaviour (for some definition of consistent):

* \w and \d match only Latin letters and digits
* [[:alpha:]] and [[:digit:]] match the full Unicode set
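For anyone who wants to reproduce the digit case without pasting the long string, here is a minimal sketch. The five code points and the variable names are my own picks for illustration, not from the posts above, and \p{Nd} is included on the assumption that the Unicode property syntax is available (it is in the 1.9+ regexp engine as far as I know).

```ruby
# Sample built from code points so the file needs no non-ASCII characters:
# ASCII 0 and 9, ARABIC-INDIC 0 and 9, DEVANAGARI 0.
sample = [0x30, 0x39, 0x0660, 0x0669, 0x0966].pack('U*')

ascii_only  = sample.scan(/\d/)          # \d: ASCII 0-9 only
posix_class = sample.scan(/[[:digit:]]/) # POSIX bracket: Unicode decimal digits
unicode_nd  = sample.scan(/\p{Nd}/)      # explicit Decimal_Number property

p ascii_only.length   # 2
p posix_class.length  # 5
p unicode_nd.length   # 5
```

If you need "any decimal digit in any script", [[:digit:]] or \p{Nd} is the form to reach for; \d is the safer choice when the result is later fed to something like Integer() that expects ASCII digits.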
on 2011-08-10 00:25
On Aug 09, 2011, at 03:38 PM, Brian Candler <b.candler@pobox.com> wrote:
> Gavin Kistner wrote in post #1015799:
> > Is this the correct behavior for
> > Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
> > apply to how Oniguruma is used within Ruby?
>
> * \w and \d match only Latin letters and digits
> * [[:alpha:]] and [[:digit:]] match the full unicode set

Definitely helpful in achieving the end goal - thanks!

Any guess as to how to reconcile this behavior with what the Oniguruma "ONIG_SYNTAX_RUBY" document says? Looking at the section on \w, we may have a clue:

    \w  word character
        Not Unicode: alphanumeric, "_" and multibyte char.
        Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
    [..]
    \d  decimal digit char
        Unicode: General_Category -- Decimal_Number

Perhaps "Not Unicode" means "this is how it behaves in some non-Unicode mode", and "Unicode" means "this is how it behaves in some Unicode mode". And perhaps what is missing from the doc for \d is something like "Not Unicode: 0-9".

If that is correct, then the next question for me is how to enable Unicode mode for Oniguruma. The /u flag on a regexp does not do it, since:

    "ab".scan(/\w/) == "ab".scan(/\w/u)
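A small sketch of the /u observation, reusing the "abcdé" string from Brian's irb session (written with an escape so the snippet stays ASCII). The variable names are mine, and \p{Word} is shown on the assumption that it corresponds to the Letter|Mark|Number|Connector_Punctuation definition quoted above; treat it as an illustration rather than a statement of how the engine is specified.

```ruby
word = "abcd\u00e9"  # "abcdé"

plain    = word.scan(/\w/)           # ASCII word characters only
with_u   = word.scan(/\w/u)          # /u does not widen \w
alpha    = word.scan(/[[:alpha:]]/)  # POSIX bracket includes the accented letter
property = word.scan(/\p{Word}/)     # Unicode property form

p plain == with_u   # true
p plain.length      # 4
p alpha.length      # 5
p property.length   # 5
```

As far as I can tell, /u only affects the encoding the regexp itself is treated as, so the practical route to Unicode-aware matching is the bracket or property syntax rather than a flag.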
on 2011-08-10 00:54
On Aug 09, 2011, at 02:28 PM, Gavin Kistner <phrogz@me.com> wrote:
> I posted this as a question here:
> http://stackoverflow.com/questions/6998713/scannin...
>
> Summarized: The Oniguruma docs[1] seem to say that \d is supposed to match the Unicode "Decimal_Number" category. However, in Ruby 1.9.1 and 1.9.2 it only matches Latin 0-9 characters. Is this the correct behavior for Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not apply to how Oniguruma is used within Ruby?
>
> [1] http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

It seems that the above Oniguruma document is not directly applicable to Ruby 1.9. See this ticket discussion[2], which includes these posts:

[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."

[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."

This discussion was from almost 2 years ago, but sadly I have not been able to find an official Ruby 1.9 version of RE.txt.

[2] http://redmine.ruby-lang.org/issues/1889#note-28
on 2011-08-10 01:27
On 10.08.2011 00:52, Gavin Kistner wrote:
> This discussion was from almost 2 years ago, but sadly I have not been
> able to find an official Ruby 1.9 version of RE.txt.

Hmm, I almost always refer to
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
and therein it's pretty accurate:

    [...]
    * /\d/ - A digit character ([0-9])
    [...]
    For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas
    /[[:digit:]]/ matches any character in the Unicode Nd category
    [...]

HTH,
- Markus
on 2011-08-10 05:22
On Aug 9, 2011, at 5:25 PM, Markus Fischer wrote:
> On 10.08.2011 00:52, Gavin Kistner wrote:
>> This discussion was from almost 2 years ago, but sadly I have not been
>> able to find an official Ruby 1.9 version of RE.txt.
>
> Hmm, I almost always refer to
> https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
> and therein it's pretty accurate:
> []
> HTH,

It does help, thanks! :)
on 2011-08-10 17:39
On Aug 9, 2011, at 16:25, Markus Fischer <markus@fischer.name> wrote:
> Hmm, I almost always refer to
> https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
> and therein it's pretty accurate:
>
> [...]
> * /\d/ - A digit character ([0-9])
> [...]
> For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas
> /[[:digit:]]/ matches any character in the Unicode Nd category
> [...]

This same content is found in:

    ri Regexp