Separate Chinese and English! with Ruby

xain · May 7, 2007, 9:39am

Don’t get me wrong: I just want to know how to separate the English
words from a string with Ruby.
There are strings (UTF-8 encoded) that record people’s names,
like:

摩根·弗里曼 Morgan Freeman
布鲁斯·威利斯 Bruce Willis
李小明 Lee xiao ming

These strings contain a Chinese name (with no spaces between characters),
followed by a space and then an English name,

or
Frank Darabont
just an English name.

Would you give me an idea of how to separate out the Chinese characters (if
any)?

xain · May 7, 2007, 11:16am

On May 7, 2:39 pm, Nanyang Z. wrote:

or
Frank Darabont
Just an English name.

Would you give me an idea how to separate these Chinese characters(if
any)?


a = File.open('a.txt')
a.each {|x| puts x.split(' ', 2) }
Output:
摩根·弗里曼
Morgan Freeman
布鲁斯·威利斯
Bruce Willis
李小明
Lee xiao ming

xain · May 7, 2007, 11:21am

On May 7, 4:12 pm, akbarhome wrote:

布鲁斯·威利斯 Bruce Willis

李小明
Lee xiao ming

Sorry. Fixed version:
# Ruby 1.8: x[0] is the first byte of the line as an integer;
# UTF-8 bytes above 127 belong to non-ASCII characters.
a.each {|x|
  if x[0].to_i > 128 then
    puts x.split(' ', 2)
  else
    puts x
  end
}

This code is quick and dirty.
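
The snippet above relies on Ruby 1.8, where `x[0]` is a byte value. On Ruby 1.9 and later `x[0]` is a one-character String, so the same idea might be sketched as below. The method name `split_leading_non_ascii` is made up for illustration, and the sketch assumes the non-English part comes first and contains no spaces.

```ruby
# Split "Chinese-name English-name" lines; leave lines that start with
# an ASCII character untouched. Returns an array of one or two strings.
def split_leading_non_ascii(line)
  line = line.chomp
  # Test the first character itself instead of comparing a byte with 128.
  if line[0] && !line[0].ascii_only?
    line.split(' ', 2)
  else
    [line]
  end
end

p split_leading_non_ascii("摩根·弗里曼 Morgan Freeman")
p split_leading_non_ascii("Frank Darabont")
```

Like the original, this treats every non-ASCII first character the same way, so a name starting with an accented Latin letter is split too.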

xain · May 7, 2007, 12:04pm

On 2007-05-07 16:39:12 +0900 (Mon, May), Nanyang Z. wrote:

or
Frank Darabont
Just an English name.

Would you give me an idea how to separate these Chinese characters(if
any)?

Maybe a regexp similar to
/^([^qazwsxedcrfvtgbyhnujmikolpQAZWSXEDCRFVTGBYHNUJMIKOLP ]+)/
would help?

Does [a-zA-Z] include Chinese characters? In a Polish locale it includes
Polish non-ASCII characters, so I guess it might include Chinese ones.

I guess you want to split a given string into words (separated by spaces),
and then check whether the first word starts with or includes at least one
Chinese character.
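
A minimal sketch of that last idea, assuming Ruby 1.9 or later, where regular expressions understand the `\p{Han}` script property on UTF-8 strings. The method name `split_han_prefix` is hypothetical. In these Ruby versions `[a-zA-Z]` matches only the ASCII letters, which the last line illustrates.

```ruby
# Returns [chinese_part, rest]. chinese_part is nil when the first
# space-separated word contains no Han character.
def split_han_prefix(name)
  name = name.strip
  first, rest = name.split(' ', 2)
  if first && first =~ /\p{Han}/
    [first, rest]
  else
    [nil, name]
  end
end

p split_han_prefix("布鲁斯·威利斯 Bruce Willis")
p split_han_prefix("Frank Darabont")
p("中" =~ /[a-zA-Z]/)  # a Han character is not in the ASCII letter class
```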

xain · May 7, 2007, 12:17pm

Akbar H. wrote:

Sorry. Fixed version:
a.each {|x|
  if x[0].to_i > 128 then
    puts x.split(' ', 2)
  else
    puts x
  end
}

This code is quick and dirty.

Thanks.
But I was wrong. The strings contain more than just Chinese and English
characters. Now I see characters like Ô, é, á… If x starts with one of
these, x[0] > 128 just as it is for Chinese, but I only want to separate the Chinese.

So do you know exactly what range of values Chinese characters will
return? Or can you tell me where I can find this kind of information?

xain · May 7, 2007, 12:21pm

On 5/7/07, Nanyang Z. wrote:

Try something like this.

t = str.split(//).partition {|x| x=~/[a-z]|[A-Z]/ }
p t[0].join
p t[1].join

Harry

xain · May 7, 2007, 1:18pm

On 5/7/07, Nanyang Z. wrote:

so do you know what exactly range of the value Chinese Characters will
return? or you can tell me where I can find this kind of information.

Or this

str.split(//).partition {|x| x.length == 1 }

Harry

xain · May 7, 2007, 1:35pm

On May 7, 5:17 pm, Nanyang Z. wrote:

so do you know what exactly range of the value Chinese Characters will
return? or you can tell me where I can find this kind of information.

These:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197946

should get you done.

ustr
=> +"摩根·弗里曼"
irb(main):027:0> ustr[0]
=> U+6469
irb(main):028:0> format "%X", ustr[0].to_i.to_s
=> "6469"
irb(main):029:0>

xain · May 7, 2007, 2:33pm

On May 7, 2007, at 8:35 PM, akbarhome wrote:

=> U+6469
irb(main):028:0> format "%X", ustr[0].to_i.to_s
=> "6469"
irb(main):029:0>

You could identify the encoding, or just convert the text to Unicode, then check
whether the characters fall into a given Unicode range; that will identify them.
One shortcut is checking for leading zeros in the Unicode character’s
code.
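
Getting the code points out of a UTF-8 string does not require an extra library: `String#unpack` with the `U*` directive decodes UTF-8 into integers, and it is also available on Ruby 1.8. The sketch below only decodes and prints; the helper names `code_points` and `hex_code_points` are made up for illustration.

```ruby
# Decode a UTF-8 string into an array of Unicode code points (integers).
def code_points(utf8_string)
  utf8_string.unpack('U*')
end

# Same, formatted in the usual U+XXXX notation.
def hex_code_points(utf8_string)
  code_points(utf8_string).map { |cp| format('U+%04X', cp) }
end

p hex_code_points("摩根")
p hex_code_points("Ôk")
```

Once the code points are integers, checking them against a range is an ordinary numeric comparison.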

xain · May 7, 2007, 2:22pm

Harry K. wrote:

On 5/7/07, Nanyang Z. wrote:

Try something like this.

t = str.split(//).partition {|x| x=~/[a-z]|[A-Z]/ }
p t[0].join
p t[1].join

Harry

Thanks, KaKuEKi, but:
!!! The code below was tested in the Ruby on Rails console !!!

str1 = "中文 English Words"
=> "中文 English Words"
str2 = "Ôkami: chi"
=> "Ôkami: chi"
t = str2.split(//).partition { |x| x=~/[a-z]|[A-Z]/}
=> [["k", "a", "m", "i", "c", "h", "i"], ["Ô", ":", " "]]
p t[0].join
"kamichi" ########## I want all non-Chinese characters to remain.
=> nil
t = str1.split(//).partition { |x| x=~/[a-z]|[A-Z]/}
=> [["E", "n", "g", "l", "i", "s", "h", "W", "o", "r", "d", "s"], ["中",
"文", " ", " "]]
p t[0].join
"EnglishWords" ####### no space
=> nil

Harry K. wrote:

Or this

str.split(//).partition {|x| x.length == 1 }

Harry

This time the spaces are kept:

t = str1.split(//).partition {|x| x.length == 1 }
=> [[" ", "E", "n", "g", "l", "i", "s", "h", " ", "W", "o", "r", "d",
"s"], ["中", "文"]]
t[0].join
=> " English Words"
t = str2.split(//).partition {|x| x.length == 1 }
=> [["k", "a", "m", "i", ":", " ", "c", "h", "i"], ["Ô"]]
t[0].join
=> "kami: chi"

I think "Ô" is treated just like the Chinese characters, so it is hard to take it
out.
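
The `x.length == 1` test separates single-byte from multi-byte characters, which is why "Ô" lands with the Chinese ones. A possible alternative, assuming Ruby 1.9 or later, is to partition on the `\p{Han}` script property instead of on byte length. The method name `partition_han` is hypothetical.

```ruby
# Returns [han_characters, everything_else], each joined back into a String.
def partition_han(str)
  han, other = str.each_char.partition { |ch| ch =~ /\p{Han}/ }
  [han.join, other.join]
end

p partition_han("中文 English Words")
p partition_han("Ôkami: chi")
```

Spaces, punctuation and accented Latin letters all stay in the second element, since none of them belong to the Han script.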

xain · May 7, 2007, 2:34pm

Akbar H. wrote:

These:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197946
CJK Unicode Tables

should get you done.

str1 = "中文 English Words"
=> "中文 English Words"
str1[0]
=> 228
str2 = "Ôkami: chi"
=> "Ôkami: chi"
str2[0]
=> 195
str3 = "English Words"
=> "English Words"
str3[0]
=> 69

If only I knew at which numbers Chinese characters start and end…

xain · May 7, 2007, 2:43pm

John J. wrote:

On May 7, 2007, at 8:35 PM, akbarhome wrote:

=> U+6469
irb(main):028:0> format "%X", ustr[0].to_i.to_s
=> "6469"
irb(main):029:0>

You could identify the encoding or just make it unicode, then check
if the characters fall into a range in unicode, that will identify them.
One shortcut is checking for leading zeros in the unicode character’s
code.

John J., thank you for your explanation.
Now I get akbarhome’s idea. So I need to download the unicode lib here:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197946
Then convert the strings into Unicode, and then compare the characters
with the CJK Unicode Table from here:

Yes, it must work!

But look at this:

str1 = "中文 English Words"
=> "中文 English Words"
str1[0]
=> 228
str2 = "Ôkami: chi"
=> "Ôkami: chi"
str2[0]
=> 195
str3 = "English Words"
=> "English Words"
str3[0]
=> 69

Maybe there are numbers that are right for Chinese.
If only I knew at which numbers Chinese characters start and end, there
would be a much simpler solution.
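
The numbers 228, 195 and 69 are the first bytes of the UTF-8 encoding, and the first byte of a UTF-8 sequence only says how many bytes the character occupies. The sketch below maps a lead byte to that length; the method name `utf8_sequence_length` is made up for illustration. Characters from U+4E00 to U+9FFF encode with lead bytes 0xE4 to 0xE9, but other three-byte characters share those lead bytes, so the first byte alone cannot single out Chinese.

```ruby
# Number of bytes in the UTF-8 sequence that starts with lead_byte,
# or nil for a continuation byte or an invalid lead byte.
def utf8_sequence_length(lead_byte)
  case lead_byte
  when 0x00..0x7F then 1  # ASCII
  when 0xC2..0xDF then 2  # U+0080..U+07FF, e.g. accented Latin letters
  when 0xE0..0xEF then 3  # U+0800..U+FFFF, includes most Chinese characters
  when 0xF0..0xF4 then 4  # U+10000 and above
  end
end

[228, 195, 69].each do |byte|
  puts "#{byte}: #{utf8_sequence_length(byte).inspect} byte(s)"
end
```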

xain · May 7, 2007, 3:19pm

On May 7, 2007, at 9:43 PM, Nanyang Z. wrote:

Then covert the strings into unicode, and then compare the characters

Yes, that’s pretty much how Unicode is supposed to work.
In theory you could even take a sample range of characters to guess the
document language.
The problem is that Unicode allows multilanguage documents, which in
some cases is difficult because of fonts and systems’ implementations.
But yes, you’re on the right track now (IMHO).

And yes, the overhead will be greater, but that’s just a fact of
Unicode and large character sets like Chinese and Japanese.
You will also want to check which Chinese!
Chinese is split into two (politically safe) names: Traditional and
Simplified.
If you were doing Japanese text, separating English or other Western
languages wouldn’t be so easy, since Japanese essentially includes a
number of other languages’ character sets in its Unicode set and in
everyday usage.

xain · May 7, 2007, 6:26pm

John J. wrote:

And yes, the overhead will be greater, but that’s just a fact of
unicode and large character sets like chinese and japanese.
You will also want to check which chinese!
Chinese is split into two (politically safe) names : Traditional and
Simplified.
If you were doing Japanese text, separating English or other western
languages wouldn’t be so easy, since Japanese essentially includes a
number of other languages’ character sets in its unicode set and in
everyday usage.

You are right. And never mind the characters, there is also a different set of
punctuation marks!

So you don’t think there is a doc about the range of numbers string[0]
returns for a specific language?

I wonder what those numbers mean…

xain · May 7, 2007, 7:10pm

On May 8, 2007, at 1:26 AM, Nanyang Z. wrote:


There is a doc.
Go to
www.unicode.org
There should be a PDF (many, actually).
I don’t know if the two main Chinese sets are encoded as different
ranges or simply declared in some way.
In general, in Unicode a character is the same character even when it
appears in a different language.

xain · May 7, 2007, 9:13pm

I believe the range is (in hex) 3400 to 97A5

Cheers

Gary
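
For comparison, the Unicode code charts list the CJK Unified Ideographs block as U+4E00 to U+9FFF and CJK Unified Ideographs Extension A as U+3400 to U+4DBF. A minimal range check over those two blocks might look like the sketch below. The method name `cjk_ideograph?` is hypothetical, and later extension blocks and the compatibility ideographs are deliberately left out. Note that 鲁 from the sample names is U+9C81, which lies above 97A5.

```ruby
# True when code_point (an Integer) falls in one of the two blocks below.
def cjk_ideograph?(code_point)
  (0x3400..0x4DBF).include?(code_point) ||  # CJK Unified Ideographs Extension A
    (0x4E00..0x9FFF).include?(code_point)   # CJK Unified Ideographs
end

"布鲁斯·威利斯 Bruce".unpack('U*').each do |cp|
  puts format('U+%04X %s', cp, cjk_ideograph?(cp))
end
```

The middle dot used inside transliterated names (U+00B7) is outside both blocks, so a per-character filter would need to handle it separately.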

xain · May 8, 2007, 12:05am

NZ,
Here is another English site on Unicode that may be easier to understand (it
was for me):
Unicode and multilingual support in HTML, fonts, Web browsers and other applications

There must surely be some docs in Chinese somewhere.
I know that here in Japan there are many books on the subject (in
Japanese), since computer science in Japan deals with it a lot.
I’ve been interested in this subject myself, but it is a big one.
Unicode.org published the print version of 5.0 and I have browsed the
book in the bookstore; it is worth checking out. Maybe a nearby
university library would have it too.

It certainly seems like a point where a compiled language, such as C, would be
helpful.
Most interpreted languages are only now reaching partial Unicode support
because of the overhead of processing many languages, the
sheer volume of material to deal with, and the various algorithms
necessary for languages whose writing depends on context (Arabic,
Hebrew, Indic languages, etc.).

Perhaps Perl, Ruby, Python and PHP should get hooks from Apple
and Microsoft to help these languages be more productive by using
their implementations.

xain · May 7, 2007, 10:43pm

NZ,
You might want to check the RubyGems gem unihan
At the command line type:
gem list --remote uni
and it will show up.
then
gem install unihan --include-dependencies

I haven’t checked it out yet, but after installing it, check the
documentation.
It seems to be an API to the Unihan online database.
Could be quite useful.

John J.

xain · May 8, 2007, 8:46am

John J. wrote:

NZ
another English site on Unicode that may be easier to understand (it was
for me)
Unicode and multilingual support in HTML, fonts, Web browsers and other applications

May I also suggest this plain-English introduction? I’m quoting:
The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html
J-P

xain · May 8, 2007, 8:44am

There is documentation:

ri String#[]

Although it is a little vague about what “character code” means. By
default (in ruby 1.8.x) the number returned by some_string[i] is a
fixnum in the range [0,255] – even for UTF-8 encoded strings. Ruby
will just treat the string as a string of 8-bit bytes and give you
back whatever byte you asked for.

irb(main):001:0> s = "大智若愚"
=> "\345\244\247\346\231\272\350\213\245\346\204\232"
irb(main):002:0> s[0]
=> 229
irb(main):003:0> s.length
=> 12
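
On Ruby 1.9 and later, strings carry an encoding and the byte view and the character view are separate methods. The sketch below puts them side by side for the same string; the method name `describe` is made up for illustration, and it assumes the string is valid UTF-8.

```ruby
# Byte-level and character-level views of the same UTF-8 string.
def describe(str)
  {
    first_byte: str.bytes.first,            # what s[0] returned on Ruby 1.8
    byte_count: str.bytesize,               # what s.length returned on Ruby 1.8
    char_count: str.length,                 # characters on Ruby 1.9+
    first_char: str[0],                     # a one-character String on Ruby 1.9+
    first_code_point: str.codepoints.first  # Unicode code point as an Integer
  }
end

p describe("大智若愚")
```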