UTF-8 regular expressions

blewis · December 24, 2005, 10:40pm

Hi,
So I read the post from a while back about packing multi-byte UTF-8
characters as octal:

r = Regexp.compile("ab\304\243cd", 0, "UTF-8")

or
r = Regexp.compile("ab#{[0x123].pack('U')}cd", 0, "UTF-8")

So this seems to be a way to list out individual multi-byte UTF-8
characters.
I was wondering if there’s then a convenient way to specify a range of
UTF-8 characters?
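
As a quick sanity check, the two forms above should build the same string, since U+0123 encodes to the bytes 0xC4 0xA3 (octal 304 243) in UTF-8. This is only a sketch and assumes Ruby 1.9 or later, where strings carry an encoding; the variable names are just for illustration:

```ruby
# Assumption: Ruby 1.9+, where every String has an encoding attached.

# U+0123 built from its code point; pack('U') yields a UTF-8 string.
packed = [0x123].pack('U')

# The same character written as two octal byte escapes, tagged as UTF-8.
octal = "\304\243".dup.force_encoding('UTF-8')

# Both spellings should describe the same two bytes.
same_bytes  = packed.bytes == [0xC4, 0xA3]
same_string = packed == octal
```

If that holds, the choice between the octal escapes and pack('U') is purely about readability.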

For instance the darn
0x2002-2003
0x2013-2014
0x2018-201E
characters?
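
To make the question concrete, here is roughly the kind of thing I am after, sketched with the same pack('U') trick. It assumes Ruby 1.9 or later and assumes that a hyphen between two multi-byte characters inside a character class is treated as a code point range when the pattern is UTF-8; I have not confirmed how older versions with the "UTF-8" argument behave. The names ranges, char_class, and r are just for illustration:

```ruby
# Assumption: Ruby 1.9+, where a regexp built from a UTF-8 string is UTF-8 aware.

# The code point ranges from the list above, as [low, high] pairs.
ranges = [[0x2002, 0x2003], [0x2013, 0x2014], [0x2018, 0x201E]]

# Turn each pair into "X-Y", where X and Y are the packed endpoint characters.
char_class = ranges.map { |lo, hi| "#{[lo].pack('U')}-#{[hi].pack('U')}" }.join

# Wrap the pieces in a single character class.
r = Regexp.new("[#{char_class}]")

# Example check: U+2019 falls inside the 0x2018-201E range.
inside  = !(r =~ [0x2019].pack('U')).nil?
# U+2015 sits just past the 0x2013-2014 range and should not match.
outside = !(r =~ [0x2015].pack('U')).nil?
```

Building the class from a list of pairs keeps the code points visible as hex numbers rather than as hard-to-read escapes.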

Thanks,
Ben