Separate Chinese and English! with Ruby

On Mon, 07 May 2007 22:18:36 +0900, John J.
[email protected] wrote:

If you were doing Japanese text, separating English or other western
languages wouldn’t be so easy, since Japanese essentially includes a
number of other languages’ character sets in its Unicode range and in
everyday usage.

If the goal is to separate the western languages from the Japanese
Kanji and Kana, then it appears to not be too bad when using a lib
like this:

http://raa.ruby-lang.org/project/moji/

http://gimite.net/gimite/rubymess/moji.html

Zev

Nooo! Those are the first BYTES of the UTF-8 encoding of the
punctuation that you listed. MANY Unicode characters (when encoded in
UTF-8) can start with those bytes, so if you remove them from a given
string, you’re going to get back a poorly encoded UTF-8 string, which
is definitely not what you want.
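
(To make the pitfall concrete, a minimal sketch, assuming Ruby 1.8 where
String#[] returns a byte rather than a character:)

# Many distinct characters share the same UTF-8 lead byte:
%w{一 中 人}.map { |ch| ch[0] }   # => [228, 228, 228]

# So stripping bytes by value mangles the string rather than removing characters:
"中文 abc".delete(228.chr)   # drops only 中's lead byte; the result is no longer valid UTF-8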

If you want to split on those separators, then why not do so
explicitly?

# fill up c as you've done below

"asdf；asdfasdf".split(/#{c.join('|')}/)
=> ["asdf", "asdfasdf"]

John J. wrote:

I don’t know if the two main chinese sets are encoded as different
ranges or simply declared in some way.
In general in Unicode a character is the same character even when it
appears in a different language.

Many characters of these two sets of Chinese (in fact, including the
Chinese characters used in Japanese and Korean…) are the same. Aren’t
they encoded to the same codes when they are identical?

Gary Thomas wrote:

I believe the range is (in hex) 3400 to 97A5
You must mean Unicode range.
CJK Unicode Tables

John J. wrote:

You might want to check the RubyGems gem unihan
… hmmmmm… if only I could find out what it does…
John J. wrote:

Unicode and multilingual support in HTML, fonts, Web browsers and other applications

I’ve been interested in this subject myself, but it is a big one.

It is indeed an interesting subject.

Today I tried this (!!! in the RoR console !!!):

c=%w{“ ”。 ， ！ ＜ ｛ ； ‘ ！ ＠ ＃ ＄ ％ … ＊ （ ） 一 俿 倀 凿 勿 叿 哿 囿 姿 寿 崁 忄忿 恘 扉 掵 曆 桶 檗 泗 濗 瀖 燿 狧 珗 痿 眀 秊 竗 篿 紀 翹 退 釽 鎷 閈 阀 韗 饧 骠 鶆 龥}
=> ["“", "”。", "，", "！", "＜", "｛", "；", "‘", "！", "＠", "＃", "＄", "％",
"…", "＊", "（", "）", "一", "俿", "倀", "凿", "勿", "叿", "哿", "囿", "姿", "寿",
"崁", "忄忿", "恘", "扉", "掵", "曆", "桶", "檗", "泗", "濗", "瀖", "燿", "狧", "珗",
"痿", "眀", "秊", "竗", "篿", "紀", "翹", "退", "釽", "鎷", "閈", "阀", "韗", "饧",
"骠", "鶆", "龥"]
c.collect.map{|o| o[0]}
=> [226, 226, 239, 239, 239, 239, 239, 226, 239, 239, 239, 239, 239,
226, 239, 239, 239, 228, 228, 229, 229, 229, 229, 229, 229, 229, 229,
229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231, 231,
231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233, 233,
233, 233, 233]
c.collect.map{|o| o[0]}.sort
=> [226, 226, 226, 226, 228, 228, 229, 229, 229, 229, 229, 229, 229,
229, 229, 229, 230, 230, 230, 230, 230, 230, 230, 230, 231, 231, 231,
231, 231, 231, 231, 231, 231, 231, 231, 233, 233, 233, 233, 233, 233,
233, 233, 233, 233, 239, 239, 239, 239, 239, 239, 239, 239, 239, 239,
239, 239, 239]
c.collect.map{|o| o[0]}.sort.uniq
=> [226, 228, 229, 230, 231, 233, 239]

The punctuation marks are the ones commonly used in China.
The Chinese characters were randomly picked from

(from all six pages.)

Maybe 226 to 239 is the range I need.

On Tue, 08 May 2007 17:22:11 +0900, John J.
[email protected] wrote:

Many characters of these two sets of Chinese (in fact, including the
Chinese characters used in Japanese and Korean…) are the same. Aren’t
they encoded to the same codes when they are identical?

Yes. There is lots of overlap. So there is not always a clean separation
line. But, the Japanese and Korean phonetic characters will be in a
range. You might never use all the kanji/hanzi chinese characters, and a
few are Japanese only (very few).
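
(A crude sketch of that idea, using the phonetic blocks Hiragana
U+3040..U+309F, Katakana U+30A0..U+30FF and Hangul syllables
U+AC00..U+D7A3; the method name and the precedence order are just my
own choices, not anything standard:)

def guess_cjk_language(str)
  cps = str.unpack("U*")
  return :japanese if cps.any? { |cp| (0x3040..0x30FF).include?(cp) }   # kana present
  return :korean   if cps.any? { |cp| (0xAC00..0xD7A3).include?(cp) }   # hangul present
  return :chinese  if cps.any? { |cp| (0x4E00..0x9FA5).include?(cp) }   # only hanzi
  :other
end

guess_cjk_language("日本語のテキスト")   # => :japanese
guess_cjk_language("中文文本")           # => :chinese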

Here is an API that might help “guess” if the text is Japanese, Korean
or
Chinese:

http://raa.ruby-lang.org/project/libguess-ruby/

http://www.honeyplanet.jp/download.html#libguess

Cheers,
Zev

On May 8, 2007, at 3:54 PM, Nanyang Z. wrote:

Aren’t they encoded to the same codes when they are identical?

Yes. There is lots of overlap. So there is not always a clean
separation line. But, the Japanese and Korean phonetic characters
will be in a range. You might never use all the kanji/hanzi chinese
characters, and a few are Japanese only (very few).

Gary Thomas wrote:

I believe the range is (in hex) 3400 to 97A5
You must mean Unicode range.
CJK Unicode Tables

Yes that’s exactly what he means.

John J. wrote:

You might want to check the RubyGems gem unihan
… hmmmmm… if only I could find out what it does…

I took a look at it. It’s the database of characters, sort of. It is
a big text file list. Not a proper gem at all actually. The same db
file can be downloaded from Unicode.org separately. It doesn’t
contain the actual characters, just their codes and some comments and
groupings.
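
(If you ever want to read that file from Ruby: it is just tab-separated
lines of the form "U+4E00<TAB>kDefinition<TAB>...". A sketch, with the
local filename "Unihan.txt" as an assumption:)

# Sketch: read code point / field / value triples from the Unihan data file.
File.foreach("Unihan.txt") do |line|
  next if line =~ /\A#/ || line =~ /\A\s*\z/    # skip comments and blank lines
  codepoint, field, value = line.chomp.split("\t", 3)
  puts "#{codepoint} #{field}: #{value}" if field == "kDefinition"
end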


If you have access to a Macintosh, the character palette is pretty
helpful for exploring CJK character ranges as subgroupings within the
range.

On 2007-05-08 17:47:58 +0900 (Tue, May), Nanyang Z. wrote:

“中文・另一些中文 western words” #Chinese characters may be separated by
punctuations, or/and space like:
“中文 å‰æœ‰ç©ºæ ¼ western words”
Almost all Chinese phrases are at the beginning of the strings.
But some may contain numbers, like:
“2007年的日记 diary of 2007”
or some time English or alphabets are used as part of Chinese
phrases,like:
“BB日记 diary of my baby”

[…]

As you point out, [226, 228, 229, 230, 231, 233, 239] are not safe to
identify Chinese. Is there any other easy way to identify Chinese
characters?

Just a random idea: maybe, if there is a problem with finding Chinese
characters, you can define the range of non-Chinese (defined for this
purpose as western) characters instead?
Maybe just finding words composed of only Latin and Common scripts would
be enough?
Or do you plan to have Chinese-Japanese pairs? You mentioned
‘western words’, and your examples were in English…
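
(A rough sketch of that idea, assuming whitespace-separated tokens and
treating "western" as plain ASCII for simplicity; accented letters such
as Ô or é would need the Latin-1/Latin Extended ranges added as well:)

# Classify a whitespace-separated token as "western" if it is pure printable ASCII.
def western_token?(tok)
  tok =~ /\A[\x20-\x7E]+\z/ ? true : false
end

"2007年的日记 diary of 2007".split(" ").partition { |t| western_token?(t) }
# => [["diary", "of", "2007"], ["2007年的日记"]]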

On 08/05/07, Nanyang Z. [email protected] wrote:


I still want to use this strategy, but as you point out,
[226, 228, 229, 230, 231, 233, 239] are not safe to identify Chinese.
Is there any other easy way to identify Chinese characters?

I guess this should give you what you want:

irb(main):001:0> s = "大智若愚 asdfaf sdgs"
=> "\345\244\247\346\231\272\350\213\245\346\204\232 asdfaf sdgs"
irb(main):002:0> s.unpack "U*"
=> [22823, 26234, 33509, 24858, 32, 97, 115, 100, 102, 97, 102, 32,
115, 100, 103, 115]
irb(main):003:0>

The "U*" specifier should give you a list of Unicode code points (see
the high numbers for the Chinese characters). You can use one of the
Unicode links mentioned earlier to find the code point ranges for East
Asian (EA) scripts. 32 is a space, so you can find the last EA
character and the first space following it, and pack the two parts
back into strings.
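
(A sketch of that recipe, using the CJK Unified Ideographs block
U+4E00..U+9FA5 plus CJK punctuation U+3000..U+303F as the "EA" ranges;
the method names here are made up for illustration:)

CJK_RANGES = [0x3000..0x303F, 0x4E00..0x9FA5]   # CJK punctuation + unified ideographs

def cjk?(cp)
  CJK_RANGES.any? { |r| r.include?(cp) }
end

def split_cjk(str)
  cps = str.unpack("U*")
  last = nil
  cps.each_with_index { |cp, i| last = i if cjk?(cp) }   # index of the last CJK code point
  return ["", str] if last.nil?
  cut = last + 1
  cut += 1 while cut < cps.size && cps[cut] != 32        # advance to the first space after it
  [cps[0...cut].pack("U*"), cps[(cut + 1)..-1].to_a.pack("U*")]
end

split_cjk("2007年的日记 diary of 2007")
# => ["2007年的日记", "diary of 2007"]

Note that anything glued to the CJK run without a space (as in "BB日记")
stays with the Chinese part, which matches the examples earlier in the
thread.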

Thanks

Michal

Eden Li wrote:

[…] for UTF-8 encoded strings. Ruby
will just treat the string as a string of 8-bit bytes and give you
back whatever byte you asked for.

irb(main):001:0> s = "大智若愚"
=> "\345\244\247\346\231\272\350\213\245\346\204\232"
irb(main):002:0> s[0]
=> 229
irb(main):003:0> s.length
=> 12

In the RoR console, I can see the string I put in:

s = "大智若愚"
=> "大智若愚"
s[0]
=> 229
s.length
=> 12
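
(Both results above are byte-oriented under Ruby 1.8. A small sketch of
a few ways to get character-level answers, assuming a UTF-8 string and
the standard jcode library:)

s = "大智若愚"
s.unpack("U*").length   # => 4 code points
s.scan(/./u).length     # => 4 (the /u flag makes the regexp UTF-8-aware)

$KCODE = 'u'
require 'jcode'
s.jlength               # => 4 characters (plain String#length still says 12)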

Zev B. wrote:

If the goal is to separate the western languages from the Japanese
Kanji and Kana, then it appears to not be too bad when using a lib
like this:

http://raa.ruby-lang.org/project/moji/

moji

Thanks, Zev, but my current problem is about Chinese.
I am trying to figure out a way to separate the Chinese part from a
string mixed with other characters.
By other characters I mean letters from English and/or other
languages, like Ô, é, á… (may I call them western words?)

A string may contain no Chinese:
"String without Chinese": I don’t need to do anything about it, other
than identify such strings.
"中文 Western Words" #Chinese characters + space + western words.
"中文・另一些中文 western words" #Chinese characters may be separated by
punctuation and/or spaces, like:
"中文 前有空格 western words"
Almost all Chinese phrases are at the beginning of the strings.
But some may contain numbers, like:
"2007年的日记 diary of 2007"
Or sometimes English letters are used as part of the Chinese phrase,
like:
"BB日记 diary of my baby"

Eden Li wrote:

Nooo! Those are the first BYTES of the UTF-8 encoding of the
punctuation that you listed.

Finally, I know what those numbers are. Thanks.

so if you remove them from a given string, you’re going to get back a poorly encoded UTF-8 string

In fact, I wanted to use those numbers to test whether a character is
Chinese or not (if ‘character[0]’ fell within [226, 228, 229, 230,
231, 233, 239], then it was likely to be Chinese). (Now I know that
may be wrong.)
Then, depending on that judgment, if a part of the string (the string
would be split by spaces into parts at the beginning) contained more
than X%, say 60%, of this kind of character, I would mark that part as
a Chinese phrase and take it out of the string.

I still want to use this strategy, but as you point out,
[226, 228, 229, 230, 231, 233, 239] are not safe to identify Chinese.
Is there any other easy way to identify Chinese characters?

If you want to split on those separators, then why not do so
explicitly?

# fill up c as you've done below

"asdf；asdfasdf".split(/#{c.join('|')}/)
=> ["asdf", "asdfasdf"]

I don’t get it. What does this code do?

On May 9, 2007, at 12:22 AM, Nanyang Z. wrote:

Chinese characters run from 4e00 to 9fa5 in the Unicode table, and CJK
symbols and punctuation range from 3000 to 303f.

I just used my strategy combined with this new approach (unpack "U*")
to identify Chinese. It picked out 100% of the Chinese phrases from
the strings. (1000 strings were tested.)

All of you that have replied and helped, thank you! Enjoy!

NZ, could you share your final combined code? It might be useful to
anyone working with CJK, and since Ruby originates in Japan, a lot of
people might find it useful. You might consider making a little gem
out of it.

Michal S. wrote:

I guess this should give you what you want:

irb(main):001:0> s = "大智若愚 asdfaf sdgs"
=> "\345\244\247\346\231\272\350\213\245\346\204\232 asdfaf sdgs"
irb(main):002:0> s.unpack "U*"
=> [22823, 26234, 33509, 24858, 32, 97, 115, 100, 102, 97, 102, 32,
115, 100, 103, 115]

Michal, Thanks!
Chinese characters run from 4e00 to 9fa5 in the Unicode table, and CJK
symbols and punctuation range from 3000 to 303f.
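
(As an aside, with $KCODE set to ‘u’, as the Rails environment
typically does, that same range can also be written directly in a
regexp character class; a sketch, assuming a UTF-8 source file:)

# 一 is U+4E00 and 龥 is U+9FA5, so this class covers the CJK Unified Ideographs block:
"2007年的日记 diary of 2007" =~ /[一-龥]/u
# => 4 (the byte offset where the first Chinese character starts)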

I just used my strategy combined with this new approach (unpack "U*")
to identify Chinese. It picked out 100% of the Chinese phrases from
the strings. (1000 strings were tested.)

All of you that have replied and helped, thank you! Enjoy!

On May 8, 4:47 pm, Nanyang Z. [email protected] wrote:

If you want to split on those separators, then why not do so
explicitly?

# fill up c as you've done below

"asdf；asdfasdf".split(/#{c.join('|')}/)
=> ["asdf", "asdfasdf"]

I don’t get it. What does this code do?

This code just splits the string at any separator listed in c (no
matter how long it is, byte-wise). I was guessing at what you were
trying to do, but I understand now. It looks like you’ve gotten all
you need now :-)

Chinese characters run from 4e00 to 9fa5 in the Unicode table, and CJK
symbols and punctuation range from 3000 to 303f.

There are also a few other ranges, but I’m not sure how popular they
are (from Unicode Blocks):
CJK Compatibility Forms: U+FE30..U+FE4F (32)
CJK Compatibility Ideographs: U+F900..U+FAFF (467)
CJK Compatibility: U+3300..U+33FF (256)
CJK Unified Ideographs Extension A: U+3400..U+4DBF (6582)
CJK Unified Ideographs Extension B: U+20000..U+2A6DF (42711)
CJK Compatibility Ideographs Supplement: U+2F800..U+2FA1F (542)
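
(If those blocks matter for your data, the range test is easy to
extend; a sketch, with the supplementary-plane blocks included only if
your strings can actually contain them:)

CJK_BLOCKS = [
  0x3000..0x303F,    # CJK Symbols and Punctuation
  0x3300..0x33FF,    # CJK Compatibility
  0x3400..0x4DBF,    # CJK Unified Ideographs Extension A
  0x4E00..0x9FA5,    # CJK Unified Ideographs
  0xF900..0xFAFF,    # CJK Compatibility Ideographs
  0xFE30..0xFE4F,    # CJK Compatibility Forms
  0x20000..0x2A6DF,  # CJK Unified Ideographs Extension B
  0x2F800..0x2FA1F   # CJK Compatibility Ideographs Supplement
]

def cjk_codepoint?(cp)
  CJK_BLOCKS.any? { |r| r.include?(cp) }
end

"中文 abc".unpack("U*").select { |cp| cjk_codepoint?(cp) }.size   # => 2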

NZ, could you share your final combined code? It might be useful to
anyone working with CJK, and since Ruby originates in Japan, a lot of
people might find it useful. You might consider making a little gem
out of it.

I don’t think it will be very helpful, because it only solves a very
specific problem.
But anyway, I paste it here. Maybe it could inspire somebody… who
knows…

#This !!!RoR!!! snippet is used to separate the Chinese phrase from
#strings with the format described below:
#These strings may contain no Chinese:
#"a string without Chinese"
#or Chinese characters + space + western words: "中文 Western Words".
#"中文・另一些中文 western words" #Chinese characters may be separated by
#punctuation and/or spaces, like:
#"中文 前有空格 western words"
#Almost all Chinese phrases are at the beginning of the strings.
#But some may contain numbers, like:
#"2007年的日记 diary of 2007"
#or sometimes English letters are used as part of the Chinese phrase,
#like:
#"BB日记 diary of my baby"

#usage:
#separate_chinese("a string without Chinese") => "|||a string without Chinese"
#separate_chinese("2007年的日记 diary of 2007") => "2007年的日记|||diary of 2007"
#chinese_str, other_str = separate_chinese("中文 前有空格 western words").split("|||")
#chinese_str => "中文 前有空格"
#other_str => "western words"

class Foo < ActiveRecord::Base
  def self.separate_chinese(n)
    ns = n.split(" ")
    i = ns.size
    ns.reverse.each do |p|
      i -= 1
      return ns.values_at(0..i).join(" ") + "|||" +
             ns.values_at((i + 1)..(ns.size - 1)).join(" ") if is_chinese(p)
    end
    "|||" << n
  end

  def self.is_chinese(n)
    cs = n.unpack("U*")
    chinese_character_num = 0
    cs.each do |unicode|
      # compare the character's Unicode code point to test if it is a Chinese character
      # 19968-40869: Chinese characters (U+4E00..U+9FA5)
      # 12288-12351: CJK symbols and punctuation (U+3000..U+303F)
      # Note: as Eden Li has mentioned, there are a few more ranges that could be
      # used in a Chinese document.
      chinese_character_num += 1 if (unicode >= 19968 and unicode <= 40869) or (unicode >= 12288 and unicode <= 12351)
    end
    # if more than 29% of the characters a phrase contains are Chinese, it is a
    # Chinese phrase; the value 29% serves well for my purpose, but use whatever you like.
    return true if chinese_character_num.to_f / cs.size > 0.29
    nil
  end
end
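
(For what it’s worth, the two methods don’t actually need ActiveRecord.
If this ever became the little gem John J. suggested, a plain-module
version might look like the sketch below; the module name and the
two-string return value are just illustrative choices:)

module ChineseSeparator
  CJK = [0x4E00..0x9FA5, 0x3000..0x303F]

  # true if more than 29% of the code points fall in the CJK ranges
  def self.chinese?(phrase)
    cps = phrase.unpack("U*")
    cjk = cps.select { |cp| CJK.any? { |r| r.include?(cp) } }.size
    cjk.to_f / cps.size > 0.29
  end

  # returns [chinese_part, rest] instead of joining with "|||"
  def self.separate(str)
    words = str.split(" ")
    i = words.size
    words.reverse.each do |w|
      i -= 1
      return [words[0..i].join(" "), words[(i + 1)..-1].join(" ")] if chinese?(w)
    end
    ["", str]
  end
end

ChineseSeparator.separate("2007年的日记 diary of 2007")
# => ["2007年的日记", "diary of 2007"]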