String#split regex \W on non-ASCII text

okkezSS · November 9, 2010, 6:44pm

Ruby 1.9.2-p0, built from the tarball on Mac OS X 10.6.4 with the Xcode
3.2.5 tools.

Consider the following string:
STR = "sarà la cortesia del gran Lombardo"

Ruby (through irb) correctly identifies this as a Unicode string; the
first word (in case something swallows it on the way to your screen)
ends with an a-grave.

I’d like to naïvely split this line into words. The obvious way to do
this is:

words = STR.split /\W+/

Adding the u qualifier to the regexp makes no difference.

words becomes
=> ["sar", "la", "cortesia", "del", "gran", "Lombardo"]

The a-grave gets interpreted as a non-word (and therefore separator)
character. Apparently the regex implementation doesn’t know about
non-ASCII letter classes. Unicode word separation is hard, but Ruby is
famous enough for its Unicode support that I’d hoped it would handle it.
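A small check of the behaviour described above, as a sketch rather than a statement about the regex engine's internals. The constant A_GRAVE and the helper matches_a_grave? are names made up for this illustration. It assumes Ruby 1.9 or later, where \w covers only ASCII letters, digits and underscore, while POSIX brackets and \p{...} properties are Unicode-aware.

```ruby
# encoding: utf-8
# A_GRAVE and matches_a_grave? are hypothetical names used only here.
A_GRAVE = "\u00E0" # the letter that ends "sarà"

# Returns true when the pattern matches the a-grave, false otherwise.
def matches_a_grave?(pattern)
  !(A_GRAVE =~ pattern).nil?
end

puts matches_a_grave?(/\w/)          # \w: ASCII letters, digits, underscore
puts matches_a_grave?(/\w/u)         # /u only fixes the regexp encoding
puts matches_a_grave?(/[[:alpha:]]/) # POSIX bracket, Unicode-aware
puts matches_a_grave?(/\p{L}/)       # Unicode "Letter" property
```

The first two checks come out false and the last two true, which matches the split result: the a-grave counts as \W, so it is treated as a separator.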

Changing the separator regex to something like /[- .,';: ]+/ will do as
a workaround, if I’m willing to iterate my code till I’ve found all the
separators. But I hate non-general solutions.

Is there any way to fix this? I can rebuild Ruby if need be, though I
see nothing obvious in “./configure --help”.

— F

fritza · November 9, 2010, 7:11pm

On Tue, Nov 9, 2010 at 7:44 PM, Fritz A. wrote:

> I’d like to naïvely split this line into words. The obvious way to do
> this is:
>
> words = STR.split /\W+/
>
> adding the u qualifier to the regexp doesn’t matter
>
> words becomes
> => ["sar", "la", "cortesia", "del", "gran", "Lombardo"]

Use the Unicode property for separators.

words = STR.split /\p{Z}/
=> ["sarà", "la", "cortesia", "del", "gran", "Lombardo"]
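One thing worth seeing before relying on this: \p{Z} covers only separator characters (spaces, line and paragraph separators), so punctuation stays glued to the neighbouring word. A minimal sketch, assuming Ruby 1.9 or later. The sample line with an added comma and the helper name split_on_separators are invented for the example, and the pattern uses + so that runs of separators collapse.

```ruby
# encoding: utf-8
# Sample text with a comma added; it is not the string from the post.
LINE_WITH_COMMA = "sar\u00E0 la cortesia, del gran Lombardo"

# Hypothetical helper: split on runs of Unicode separator characters.
def split_on_separators(text)
  text.split(/\p{Z}+/)
end

p split_on_separators(LINE_WITH_COMMA)
```

Here the third element is "cortesia," with the comma still attached, so this fits text where whitespace is the only delimiter.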

Regards,
Ammar

fritza · November 9, 2010, 8:31pm

Character properties are what I needed, thanks. I really did want
non-word separators, but you led me to \P{L}.
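For anyone landing here later, a rough sketch of the \P{L} approach, assuming Ruby 1.9 or later. The sample text (guillemets and a comma added) and the helper names words_by_split and words_by_scan are made up for the illustration. Splitting on non-letters leaves an empty first element when the text starts with a non-letter, and scanning for runs of letters sidesteps that.

```ruby
# encoding: utf-8
# Sample text with guillemets and a comma added; not the original string.
QUOTED_LINE = "\u00ABsar\u00E0\u00BB la cortesia, del gran Lombardo"

# Hypothetical helper: split on runs of non-letter characters.
def words_by_split(text)
  text.split(/\P{L}+/)
end

# Hypothetical helper: collect runs of letters directly.
def words_by_scan(text)
  text.scan(/\p{L}+/)
end

p words_by_split(QUOTED_LINE) # leading "" because the text starts with a non-letter
p words_by_scan(QUOTED_LINE)  # letters only, no empty element
```

Note that both patterns treat digits and underscores as separators, since neither is in the Letter category; whether that is wanted depends on the text.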

Thanks again.

— F

fritza · December 5, 2010, 7:54pm

On 09.11.2010 at 19:09, Ammar A. wrote: