Ruby 1.9.2-p0, built from the tarball on Mac OS X 10.6.4 with the Xcode
3.2.5 tools.
Consider the following string:
STR = "sarà la cortesia del gran Lombardo"
Ruby (through irb) correctly identifies this as a Unicode string; the
first word (in case something swallows it on the way to your screen)
ends with an a-grave.
I’d like to naïvely split this line into words. The obvious way to do
this is:
words = STR.split(/\W+/)
Adding the u qualifier to the regexp makes no difference.
words becomes:
=> ["sar", "la", "cortesia", "del", "gran", "Lombardo"]
The a-grave gets interpreted as a non-word (and therefore separator)
character. Apparently the regex implementation doesn’t know about
non-ASCII letter classes. Unicode word separation is hard, but Ruby is
famous enough for its Unicode support that I’d hoped it would handle it.
Changing the separator regex to something like /[- .,’;: ]+/ will do as
a workaround, if I’m willing to iterate my code till I’ve found all the
separators. But I hate non-general solutions.
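A more general variant of that workaround might be to split on everything that is not a letter, rather than listing separators by hand. The sketch below assumes the 1.9 regex engine accepts Unicode property classes (\p{L} for letters, \p{M} for combining marks, \p{N} for digits) on UTF-8 strings. The helper name unicode_words is made up for illustration, and the magic comment only matters on 1.9, where source files are not UTF-8 by default.

```ruby
# encoding: utf-8

STR = "sarà la cortesia del gran Lombardo"

# Hypothetical helper: split on runs of characters that are neither
# Unicode letters, combining marks, nor digits. Unlike \W, these
# property classes are not limited to ASCII.
def unicode_words(text)
  # A leading separator would produce an empty first element, so drop empties.
  text.split(/[^\p{L}\p{M}\p{N}]+/).reject(&:empty?)
end

words = unicode_words(STR)
p words
```

If the assumption holds, the first word should come back whole, a-grave included. Note that this treats apostrophes and hyphens as separators, which matches the hand-written character class above but is still a long way from proper Unicode word segmentation.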
Is there any way to fix this? I can rebuild Ruby if need be, though I
see nothing obvious in “./configure --help”.
— F