How to detect if a string contains any funny characters from non English alphabets

Chris_Nowlan · July 22, 2008, 11:16pm

Hello!

I have a specific problem that maybe you can help with.

Given an input string (up to 1000 characters), how do I detect if
the string is not written in the “Roman alphabet (A-Z)”? By that I mean
that the string may be in Chinese or Korean, or may contain non-standard
English or other Western-language characters, etc.

I just need a function that will return “True” or “False”.

Thanks in advance!

-Chris

Chris_Nowlan · July 23, 2008, 1:01am

Well I looked at the API and thought that str.each_char {} might
work, but it’s not recognized in my version of Ruby (1.8.6 on Ubuntu)
so that seems to be a dead end. Here’s an ugly hack that works though:

def nonroman_test(str)
  if nonroman(str)
    puts "#{str} has nonroman characters!"
  else
    puts "#{str} does not have nonroman characters!"
  end
end

def nonroman(str)
  (/^[\w\s!@#$%^\&*()\[\],.?]*$/ =~ str) == nil
end

nonroman_test("abc")
nonroman_test("abcᴚ")

nonroman(str) returns true if the string contains any characters
besides letters, digits, underscore, whitespace, and the following:
!@#$%^&*()[],.?

You can alter the regular expression to change what it allows. Just
add any additional allowed characters before the final ] on the line
in nonroman(). Some characters may need to have a \ in front of them
to work.
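
As a rough sketch of the "add more allowed characters" idea above, the whitelist can be built from a plain string and passed through Regexp.escape, so nobody has to remember which characters need a backslash. The names ALLOWED_PUNCTUATION and allowed_only_regex are made up for this illustration, and the sketch uses \A and \z instead of ^ and $ so the whole string is checked; it is one possible variant, not the code posted above.

```ruby
# Hypothetical constant: the punctuation the original post allows.
# Single quotes keep Ruby from interpolating anything in this string.
ALLOWED_PUNCTUATION = '!@#$%^&*()[],.?'

# Hypothetical helper: builds a regex that matches only strings made
# entirely of word characters, whitespace and the allowed punctuation.
# Regexp.escape adds the backslashes that the character class needs.
def allowed_only_regex(extra = '')
  /\A[\w\s#{Regexp.escape(ALLOWED_PUNCTUATION + extra)}]*\z/
end

# Returns true if str contains anything outside the whitelist.
def nonroman(str, extra = '')
  (allowed_only_regex(extra) =~ str).nil?
end

puts nonroman("abc")        # false
puts nonroman("a-b")        # true, because "-" is not on the list
puts nonroman("a-b", "-")   # false once "-" is added as an extra
```

Passing the extra characters as an argument is just one design choice; editing the constant works equally well.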

Hope that helps!

Regards,
David Alves

Chris_Nowlan · July 23, 2008, 1:11am

Hi –

On Tue, 22 Jul 2008, Dave wrote:

> Well I looked at the API and thought that str.each_char {} might
> work, but it’s not recognized in my version of Ruby (1.8.6 on Ubuntu)

Right now, the latest release of Ruby 1.8 is 1.8.7, which is basically
a backport of many features from 1.9. The result is that the current
API docs have a very 1.9-ish flavor, and you’ll see lots of things in
there that don’t exist in 1.8.6. It’s potentially kind of confusing
since many of us are still using 1.8.6, and 1.8.7 sounds like it will
be more like 1.8.6 than like 1.9. But you can still get the 1.8.6 docs
too.
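
For anyone stuck without String#each_char, here is a possible workaround, sketched under some assumptions: each_character is a made-up helper name, the fallback relies on the old scan(/./mu) idiom for walking a UTF-8 string one character at a time, and it has not been run on 1.8.6 itself here. On a Ruby that already has each_char, the helper simply delegates to it.

```ruby
# Hypothetical helper: yields each character of str to the block.
def each_character(str)
  if str.respond_to?(:each_char)
    # Newer Rubies: use the built-in iterator.
    str.each_char { |ch| yield ch }
  else
    # Assumed fallback for Rubies without each_char: the m flag lets
    # "." match newlines, and the u flag treats the data as UTF-8.
    str.scan(/./mu) { |ch| yield ch }
  end
end

chars = []
each_character("abc") { |ch| chars << ch }
p chars
```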

> def nonroman(str)
>   (/^[\w\s!@#$%^\&*()\[\],.?]*$/ =~ str) == nil
> end

A better way might be:

def nonroman(str)
  str =~ /[^\w\s!..]/
end

(with whatever regex you use). This way, you’re testing for the first
non-roman character, rather than testing all the characters. It
returns nil or an integer (the index of the first match); change as
needed if you specifically need true/false.
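
If a strict true/false is needed, one way to adapt the negated-class approach is sketched here. The method name contains_disallowed? and the short character class are placeholders for this illustration only; substitute whatever set of allowed characters you actually use.

```ruby
# Hypothetical name. The class below allows only word characters,
# whitespace and "!" and is a placeholder, not a complete whitelist.
def contains_disallowed?(str)
  # =~ gives nil or a match index; turn that into true or false.
  !(str =~ /[^\w\s!]/).nil?
end

puts contains_disallowed?("hello world!")   # false
puts contains_disallowed?("hello, world")   # true, "," is not in the class
```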

Also, don’t forget that ^ and $ are line anchors, not string anchors.
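
A small illustration of why the anchors matter, using two made-up method names and a deliberately tiny character class: with ^ and $ a multi-line string passes as soon as any one line matches, whereas \A and \z require the whole string to match.

```ruby
# Hypothetical helpers for comparison only.
def lowercase_by_line?(str)
  # ^ and $ anchor at line boundaries, so a single matching line is enough.
  !(str =~ /^[a-z]*$/).nil?
end

def lowercase_by_string?(str)
  # \A and \z anchor at the very start and very end of the string.
  !(str =~ /\A[a-z]*\z/).nil?
end

input = "abc\n123"
puts lowercase_by_line?(input)     # true, the first line "abc" matches
puts lowercase_by_string?(input)   # false, "123" is part of the string
```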

David
