Regex \w allows non-English characters

ehudros · May 10, 2007, 5:10pm

Hi everyone…
I’m looking for a way to only allow English characters through a
simple regex.
It seems that \w (although the documentation states it is equivalent to
[a-zA-Z0-9]) still allows non-English characters (in my case Hebrew).

Has anyone come up with a solution other than specifying [abcdef…]?

Thanks!
Ehud

ehudros · May 10, 2007, 5:16pm

On 10.05.2007 17:05, Ehud wrote:

Hi everyone…
I’m looking for a way to only allow English characters through a
simple regex.
It seems that \w (although the documentation states it is equivalent to
[a-zA-Z0-9]) still allows non-English characters (in my case Hebrew).

Has anyone come up with a solution other than specifying [abcdef…]?

[a-zA-Z]

robert
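
The character class on its own matches a single letter anywhere in the input. To validate a whole string, it can be anchored at both ends. A minimal sketch, assuming Ruby 1.9 or later (for the \u escapes); the method name ascii_letters_only? is made up for illustration:

```ruby
# Hypothetical helper: true only when every character is an ASCII letter.
# \A and \z anchor the match to the whole string, so one stray character fails it.
def ascii_letters_only?(str)
  !(str =~ /\A[a-zA-Z]+\z/).nil?
end

hebrew = "\u05E9\u05DC\u05D5\u05DD"  # the Hebrew word "shalom"

puts ascii_letters_only?("hello")           # true
puts ascii_letters_only?(hebrew)            # false
puts ascii_letters_only?("hello#{hebrew}")  # false
```

Add 0-9 or _ to the class if digits or underscores should also pass.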

ehudros · May 10, 2007, 9:26pm

I’m making a guess here, but Ruby is probably looking at the Hebrew
characters as a normal range of chars, with a character encoding. Now
what encoding Hebrew uses I’m not sure, but for instance the ASCII
code for ‘a’ is 97. The code for one of the Hebrew characters is
probably 97 also. Since Ruby doesn’t really do UTF, it just sees two
characters, both with a code of 97, and lets them through.

–Kyle

ehudros · May 11, 2007, 7:16pm

On Fri, 11 May 2007 04:25:42 +0900, Kyle S. wrote:

I’m making a guess here, but Ruby is probably looking at the Hebrew
characters as a normal range of chars, with a character encoding. Now
what encoding Hebrew uses I’m not sure, but for instance the ASCII code
for ‘a’ is 97. The code for one of the Hebrew characters is probably 97
also. Since Ruby doesn’t really do UTF, it just sees two characters,
both with a code of 97, and lets them through.

Unless you’re using special fonts that do a special mapping (which is
generally no longer done these days), non-English characters are always
found in characters 128-255. Different encodings are simply different
ways of mapping these characters to different languages. 0-127 are
always the same English ASCII characters.
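
The point about 0-127 can be checked by looking at raw byte values. A small sketch, assuming Ruby 1.9 or later and UTF-8 text (where each Hebrew letter takes two bytes, both above 127); all_bytes_ascii? is an illustrative name, not a built-in:

```ruby
# Hypothetical helper: true when every byte of the string is in the ASCII range.
def all_bytes_ascii?(str)
  str.bytes.to_a.all? { |b| b < 128 }
end

shin = "\u05E9"  # Hebrew letter shin

p "a".bytes.to_a          # [97]
p shin.bytes.to_a         # [215, 169]
p all_bytes_ascii?("a")   # true
p all_bytes_ascii?(shin)  # false
```

So no byte of the Hebrew letter can collide with the code 97 used for ‘a’.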

שבת שלום
–Ken B.

ehudros · May 11, 2007, 6:47am

Hi,

At Fri, 11 May 2007 04:25:42 +0900,
Kyle S. wrote in [ruby-talk:251082]:

I’m making a guess here, but Ruby is probably looking at the Hebrew
characters as a normal range of chars, with a character encoding. Now
what encoding Hebrew uses I’m not sure, but for instance the ASCII
code for ‘a’ is 97. The code for one of the Hebrew characters is
probably 97 also. Since Ruby doesn’t really do UTF, it just sees two
characters, both with a code of 97, and lets them through.

/[[:alpha:]]/u
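
For comparison, a sketch of how that POSIX bracket class behaves. This assumes Ruby 1.9 or later, where [[:alpha:]] is Unicode-aware on UTF-8 strings; behavior under 1.8 may differ. It accepts letters from any script, so it fits when Hebrew should be allowed rather than rejected:

```ruby
shin = "\u05E9"  # Hebrew letter shin

p(shin =~ /[[:alpha:]]/u)  # 0   (a letter, so it matches at index 0)
p(shin =~ /[a-zA-Z]/)      # nil (not an ASCII letter)
p("a" =~ /[[:alpha:]]/u)   # 0
p("1" =~ /[[:alpha:]]/u)   # nil (digits are not alphabetic)
```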

ehudros · May 12, 2007, 8:09am

The meaning of \w can change if you alter the global $KCODE variable.
It’s best to specify exactly what you mean if you know exactly what
you want (e.g., follow Robert’s advice). Specifying \w says that you
want “wordful,” non-breaking characters; this includes non-English
characters, even CJK.

irb(main):001:0> s = "שבת שלום"
=> "\327\251\327\221\327\252 \327\251\327\234\327\225\327\235"
irb(main):002:0> s =~ /\w/ ? "match" : "no match"
=> "no match"
irb(main):003:0> $KCODE = "u"
=> "u"
irb(main):004:0> s =~ /\w/ ? "match" : "no match"
=> "match"
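
$KCODE no longer has any effect in Ruby 1.9 and later, where each string carries its own encoding. As a rough modern counterpart to the session above (an assumption about newer Rubies, not something tested in this thread), \w stays ASCII-only by default and the Unicode-aware behavior has to be requested explicitly, for example with \p{Word}:

```ruby
# The same Hebrew string as in the irb session, written with \u escapes.
s = "\u05E9\u05D1\u05EA \u05E9\u05DC\u05D5\u05DD"

puts(s =~ /\w/       ? "match" : "no match")  # no match: \w is [a-zA-Z0-9_] here
puts(s =~ /\p{Word}/ ? "match" : "no match")  # match: Unicode word characters
puts(s =~ /[a-zA-Z]/ ? "match" : "no match")  # no match
```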

ehudros · May 11, 2007, 9:28pm

As I said, a guess. That’s really interesting though. So had it
been for chars outside of English, I would have been on the ball…
Any chance that Ruby’s regex will work on UTF-8 (or 16 or 7 or any of
the variants)?

ehudros · May 14, 2007, 5:50am

Depends on what you mean by “work.” If you don’t set a global $KCODE
and you don’t specify an encoding as part of the regex options, all
regular expressions will work on the byte level. It appears Ruby
(1.8.x) only supports UTF-8 if you set $KCODE = "u" or pass in a u as
a regex option.

"你好" =~ /(\w)/u and $1
=> "你"

Iconv.iconv("utf-16", "utf-8", "你好") =~ /(\w)/u and $1
=> false
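
One way to handle UTF-16 input is to transcode it to UTF-8 before matching. A minimal sketch, assuming Ruby 1.9 or later and String#encode rather than Iconv (which was dropped from the standard library in Ruby 2.0); first_letter is a made-up helper name. Note also that Iconv.iconv returns an array of strings, so the second expression above is matching against an array, not a string:

```ruby
# Hypothetical helper: return the first letter of a string in any encoding
# by converting it to UTF-8 first, or nil if it contains no letter.
def first_letter(str)
  utf8 = str.encode("UTF-8")
  utf8[/\p{L}/]
end

utf16 = "\u4F60\u597D".encode("UTF-16LE")  # "ni hao" stored as UTF-16

p first_letter(utf16) == "\u4F60"  # true
p first_letter("42")               # nil
```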