Don’t get me wrong; I just want to know how to separate English
words from a string with Ruby.
There are strings (UTF-8 encoded) that record people’s names,
like:
On 2007-05-07 16:39:12 +0900 (Mon, May), Nanyang Z. wrote:
or
Frank Darabont
Just an English name.
Would you give me an idea how to separate these Chinese characters (if
any)?
Maybe a regexp similar to
/^([^a-zA-Z ]+)/
would help?
Does [a-zA-Z] include Chinese characters? In a Polish locale it includes
Polish non-ASCII characters, so I guess it might include Chinese ones.
I guess you want to split a given string into words (separated by spaces),
and then check whether the first word starts with, or includes, at least
one Chinese character.
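The split-and-check idea could look like this (a minimal sketch in modern Ruby; split_by_script is a hypothetical helper name, and the sample string is made up — in UTF-8, Chinese characters are encoded with bytes >= 0x80):

```ruby
# Split on whitespace, then partition words by whether every byte
# is plain ASCII (< 128). Non-ASCII words land in the second group.
def split_by_script(str)
  str.split(/\s+/).partition { |w| w.bytes.all? { |b| b < 128 } }
end

english, chinese = split_by_script("Frank Darabont 摩根")
# english => ["Frank", "Darabont"], chinese => ["摩根"]
```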
A quick byte test: if x[0].to_i > 128, then it is not plain English.
For example, in irb, with a UTF-8 string ustr whose first character is U+6469:
irb(main):028:0> format "%X", ustr[0].to_i
=> "6469"
irb(main):029:0>
You could identify the encoding or just make it Unicode, then check
whether the characters fall into a Unicode range; that will identify them.
One shortcut is checking for leading zeros in the Unicode character’s
code.
Posted via http://www.ruby-forum.com/.
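The byte test above can be sketched in modern Ruby (a sketch only; in Ruby 1.8, str[0] returned the first byte as a Fixnum, and getbyte(0) is the 1.9+ equivalent — non_ascii_start? is a hypothetical helper name):

```ruby
# Any UTF-8 lead byte >= 0x80 (128) means the first character is
# outside the plain ASCII range, so the word is not plain English.
def non_ascii_start?(word)
  word.getbyte(0) >= 128
end

non_ascii_start?("摩根")    # => true
non_ascii_start?("English") # => false
```

This only tells you "not ASCII", not "Chinese" — for that you need the codepoint-range check discussed later in the thread.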
John J., thank you for your explanation.
Now I get akbarhome’s idea. So I need to download the unicode lib here: http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197946
Then convert the strings into Unicode, and then compare the characters
with the CJK Unicode table from here:
Maybe there are number ranges that are right for Chinese;
if only I knew where Chinese characters start and end, there
would be a much simpler solution.
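For what it’s worth, the main CJK Unified Ideographs block runs U+4E00..U+9FFF, so a sketch of the range check could look like this (chinese_char? is a hypothetical helper; extension blocks exist beyond this range, so it only covers common characters):

```ruby
# Common CJK Unified Ideographs block (extensions not included).
CJK = 0x4E00..0x9FFF

def chinese_char?(ch)
  CJK.cover?(ch.ord)
end

"摩根".chars.all? { |c| chinese_char?(c) }  # => true
chinese_char?("F")                          # => false
```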
Yes, that’s pretty much how Unicode is supposed to work.
In theory you could even take a sample range of characters to guess the
document’s language.
The problem is that Unicode allows multilanguage documents, which in
some cases is difficult because of fonts and systems’ implementations.
But yes you’re on the right track now (IMHO).
And yes, the overhead will be greater, but that’s just a fact of
unicode and large character sets like chinese and japanese.
You will also want to check which Chinese!
Chinese is split into two (politically safe) names: Traditional and
Simplified.
If you were doing Japanese text, separating English or other western
languages wouldn’t be so easy, since Japanese essentially includes a
number of other languages’ character sets in its unicode set and in
everyday usage.
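The mixing described above can be seen with a small script classifier (a sketch only — the ranges are the standard Hiragana, Katakana, and CJK Unified Ideographs blocks, script_of is a hypothetical helper, and the sample string is made up):

```ruby
# Classify a character by which Unicode block its codepoint falls in.
def script_of(ch)
  case ch.ord
  when 0x3040..0x309F then :hiragana
  when 0x30A0..0x30FF then :katakana
  when 0x4E00..0x9FFF then :kanji
  else :other
  end
end

"日本語のテスト".chars.map { |c| script_of(c) }
# => [:kanji, :kanji, :kanji, :hiragana, :katakana, :katakana, :katakana]
```

A single Japanese sentence routinely mixes all three blocks, which is why a simple "one range per language" test breaks down there.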
You are right. And besides the characters themselves, there is a
different set of punctuation!
So, you don’t think there is a doc about the number range string[0]
returns for a given language?
There is a doc:
go to www.unicode.org.
There should be a PDF (many, actually).
I don’t know if the two main Chinese sets are encoded as different
ranges or simply declared in some way.
In general in Unicode a character is the same character even when it
appears in a different language.
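As a small check of that point: Simplified and Traditional forms are not stored in separate per-language ranges — both sit inside the same CJK Unified Ideographs block (the pair 汉/漢 below is just a sample character pair, "Chinese" written in Simplified and Traditional form):

```ruby
# Both the Simplified and the Traditional form of the same word
# fall inside the single CJK Unified Ideographs block.
cjk = 0x4E00..0x9FFF
simplified  = "汉".ord  # 0x6C49
traditional = "漢".ord  # 0x6F22
cjk.cover?(simplified) && cjk.cover?(traditional)  # => true
```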
NZ
Another English site on Unicode may be easier to understand (it
was for me).
There must surely be some docs in Chinese somewhere.
I know here in Japan there are many books on the subject. (in
Japanese) Since computer science in Japan does deal with it a lot.
I’ve been interested in this subject myself, but it is a big one.
Unicode.org published the print version of 5.0, and I have browsed the
book in the bookstore; it is worth checking out. Maybe a nearby
university library would have it also.
It certainly seems like a point where a compiled language such as C
would be helpful.
Most interpreted languages are only reaching partial unicode support
now because of the overhead of processing many languages and the
sheer volume of material to deal with, AND the various algorithms
necessary for languages whose writing depends on context. (arabic,
hebrew, indic languages, etc…)
Perhaps Perl and Ruby and Python and PHP should get hooks from Apple
and Microsoft to help these languages be more productive by using
their implementations.
NZ,
You might want to check the RubyGems gem unihan
At the command line type:
gem list --remote uni
and it will show up.
then
gem install unihan --include-dependencies
I haven’t checked it out yet, but after installing it, check the
documentation.
It seems to be an API to the Unihan online database.
Could be quite useful.
May I also suggest this plain-English introduction; I’m quoting the title:
The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!) http://www.joelonsoftware.com/articles/Unicode.html
J-P
Although it is a little vague about what “character code” means: by
default (in Ruby 1.8.x), the number returned by some_string[i] is a
Fixnum in the range [0,255] – even for UTF-8 encoded strings. Ruby
will just treat the string as a string of 8-bit bytes and give you
back whatever byte you asked for.
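The byte/codepoint distinction can be illustrated with modern Ruby (1.9+), where strings are encoding-aware: bytes yields the raw UTF-8 bytes, while unpack("U*") yields Unicode codepoints.

```ruby
s = "摩"  # U+6469, which UTF-8 encodes as three bytes
s.bytes.map { |b| format("%02X", b) }       # => ["E6", "91", "A9"]
s.unpack("U*").map { |c| format("%X", c) }  # => ["6469"]
```

Under Ruby 1.8, s[0] would have returned 0xE6 (230) — the first byte, not the character — which is exactly why the thread needed an external unicode lib.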