Detect whether a Unicode string is Japanese

bobm · September 29, 2008, 4:19pm

How can I get a tally of how many characters in a Unicode string are
Japanese (hiragana, katakana, kanji)? When I unpack a string, each
character comes out like \xE3\x81\x95, but I am trying to check if it’s
in the range 3040-309F (Hiragana) and I don’t understand how to convert
between the 3-byte representation and that range…

bobm · September 30, 2008, 7:17pm

On Monday 29 September 2008 16:18:53 Bob Marley wrote:

How can I get a tally of how many characters in a Unicode string are
Japanese (hiragana, katakana, kanji)? When I unpack a string, each
character comes out like \xE3\x81\x95, but I am trying to check if it’s
in the range 3040-309F (Hiragana) and I don’t understand how to convert
between the 3-byte representation and that range…

You can look up the Unicode mapping on Google, but you will have to write
a new function for each possible encoding (UTF-8, UTF-16LE…).
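
For the UTF-8 case, such a function could look roughly like the sketch below. It only handles the 3-byte form from the question (lead byte 1110xxxx followed by two 10xxxxxx continuation bytes), and decode_utf8_3byte is a made-up helper name, not something Ruby provides. The last lines show that the built-in String#unpack with the "U*" directive gives the same number without hand-written bit twiddling.

```ruby
# Decode one 3-byte UTF-8 sequence into its Unicode code point.
# decode_utf8_3byte is a hypothetical helper name for this sketch;
# it does no validation and assumes exactly three bytes.
def decode_utf8_3byte(bytes)
  b0, b1, b2 = bytes
  # Lead byte carries 4 payload bits, each continuation byte carries 6.
  ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
end

code_point = decode_utf8_3byte([0xE3, 0x81, 0x95])
puts format("U+%04X", code_point)   # prints U+3055

# Check against the Hiragana block from the question (3040-309F).
in_hiragana = (0x3040..0x309F).cover?(code_point)
puts in_hiragana                    # prints true

# The same result using the built-in UTF-8 unpack directive.
unpacked = "\xE3\x81\x95".unpack("U*")
puts unpacked.first == code_point   # prints true
```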

Or, with Ruby 1.9, you can iterate over the string by characters (not bytes)
and use the .ord method to get each character's Unicode code point:

mystr.each_char do |ch|
  puts ch.ord
end

Jan