Japanese / Chinese characters

dubstep · January 10, 2011, 5:32pm

Is there a way to test if a string contains any Japanese or Chinese
characters? Is that possible?

Thanks,

Luis

luisfgnr · January 10, 2011, 5:40pm

If you know the encoding of the input string (preferably Unicode), you
can test whether the code points of its characters fall within the
ranges assigned to the Han character sets.

So yeah, it is possible, though not trivial. You might want to look for
libraries that achieve the same thing.
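
A minimal sketch of that range test, assuming Ruby 1.9 or later and a
UTF-8 string. The names CJK_RANGES and contains_cjk? are made up for
illustration, and the list of blocks is not exhaustive (the rarer
extension blocks are left out):

```ruby
# -*- coding: utf-8 -*-
# Unicode blocks commonly used for Chinese and Japanese text.
# Illustrative only: rarer extension blocks are omitted.
CJK_RANGES = [
  0x3040..0x309F, # Hiragana
  0x30A0..0x30FF, # Katakana
  0x3400..0x4DBF, # CJK Unified Ideographs Extension A
  0x4E00..0x9FFF  # CJK Unified Ideographs (Han)
]

# Returns true if any code point of the string falls in one of the ranges.
def contains_cjk?(str)
  str.codepoints.any? { |cp| CJK_RANGES.any? { |range| range.include?(cp) } }
end

p contains_cjk?("hello")    # => false
p contains_cjk?("裏字幕組") # => true
```

Note that Han ideographs are shared between Chinese and Japanese, so a
check like this says "CJK text is present" rather than which language
it is.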

luisfgnr · January 10, 2011, 5:53pm

I found this:

irb(main):003:0> p "裏字幕組".unpack("U*")
[35023, 23383, 24149, 32068]

So, I can unpack it and check whether each value falls within the range
you talked about, right? If so, now I just need to find the ranges for
the Chinese and Japanese characters…

Isn’t this a heavy operation? I have lots of sentences to test, none
longer than 512 characters.
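
One way to settle the cost question is to measure it. A rough sketch,
assuming the unpack approach and checking only the main CJK Unified
Ideographs block (U+4E00..U+9FFF); han_by_unpack? is a hypothetical
helper name, and the timing will depend on the machine and Ruby version:

```ruby
require 'benchmark'

# Hypothetical helper: true if any unpacked code point lies in the main
# CJK Unified Ideographs block (U+4E00..U+9FFF).
def han_by_unpack?(str)
  str.unpack("U*").any? { |cp| cp >= 0x4E00 && cp <= 0x9FFF }
end

# Worst case for a 512-character sentence: no Han character at all,
# so every code point has to be examined.
sample = "a" * 512

elapsed = Benchmark.realtime do
  10_000.times { han_by_unpack?(sample) }
end
puts "10,000 checks of 512 characters: #{elapsed} s"
```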

luisfgnr · January 10, 2011, 6:10pm

Yeah, that is one way of doing it.

With respect to the speed issue, the range boundaries that define the
Han characters (or any character range, for that matter) fall on round
hexadecimal values, so they have significance at the bit level. You
could use bit operations for speed (though it is possible that in Ruby
you would not achieve the speed increase that you might get with C or
Java).
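
A sketch of what such a bit-level test could look like, assuming the
input is an integer code point such as the values returned by
unpack("U*"). The helper names are hypothetical, and whether this beats
a plain range comparison in Ruby is untested:

```ruby
# Hypothetical helpers; both take an integer code point.

# Main CJK Unified Ideographs block, U+4E00..U+9FFF: the high byte
# (code point shifted right by 8 bits) is between 0x4E and 0x9F.
def han_by_shift?(cp)
  high = cp >> 8
  high >= 0x4E && high <= 0x9F
end

# Hiragana and Katakana, U+3040..U+30FF: high byte is 0x30 and the
# low byte is at least 0x40.
def kana_by_shift?(cp)
  (cp >> 8) == 0x30 && (cp & 0xFF) >= 0x40
end

p han_by_shift?(35023)   # => true  (0x88CF, first value in the irb output)
p han_by_shift?(0x41)    # => false (ASCII "A")
p kana_by_shift?(0x3042) # => true  (Hiragana "a")
```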

You might also want to look into specifying Unicode ranges in your
regexes. I remember that the Java regular expression library had
shortcuts for specifying localised characters (like Han characters). I
don't think the Ruby regex API has these shortcuts, but in the end it is
just a Unicode range.
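
For what it is worth, the regex engine in Ruby 1.9 and later does accept
Unicode script properties such as \p{Han}, so the shortcut is available
there. A minimal sketch, assuming Ruby 1.9+ and UTF-8 strings;
CJK_PATTERN and cjk_by_regex? are names made up for this example:

```ruby
# -*- coding: utf-8 -*-
# Matches one character whose script is Han, Hiragana, or Katakana.
CJK_PATTERN = /[\p{Han}\p{Hiragana}\p{Katakana}]/

# The same idea spelled out as an explicit range (main Han block only).
HAN_RANGE_PATTERN = /[\u4E00-\u9FFF]/

# Returns true if the string contains at least one matching character.
def cjk_by_regex?(str)
  !!(str =~ CJK_PATTERN)
end

p cjk_by_regex?("hello")    # => false
p cjk_by_regex?("裏字幕組") # => true
```

Since the matching runs inside the regex engine rather than in a Ruby
block, this may also help with the speed concern, though that is worth
measuring rather than assuming.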