Forum: Ruby Encounter troubles with Regex in Chinese text splitting

meng.yan (Guest)
on 2005-12-03 06:43
(Received via mailing list)
Hi All,
  I'm a Ruby newbie. I'm writing a program to process a big chunk of
Chinese text. The first step is to split the chunk of text into a list
of sentences. In Chinese, characters are written one after another
without any natural boundary marker like the space in English. Sentences
are separated by one of three special characters (。？!). So at first
glance, I thought it was a simple task:

# $chunk stores the text body
$sentenses = $chunk.split(/。|？|!/)
# now $sentenses holds the list of sentences.

  But when I checked the result, I found that some of the sentences were
not split correctly. For instance, here is a sentence:
"你没病，他呢？" (meaning "You are not sick, how about him?"). In
GB2312, "病，" is encoded as (hex) b2a1 a3ac, and "。" happens to be
encoded as (hex) a1a3. So the String#split method finds a "。" in the
middle of the sentence and incorrectly splits there.

  Certainly this is because String#split (and the Ruby regex engine) is
byte-oriented rather than truly character-oriented, which is a frequent
problem in the i18n domain. Is there any way in Ruby to split Chinese
text correctly?
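
The byte collision described above can be reproduced in a few lines. This
is only a minimal sketch, assuming Ruby 1.9 or later, where each string
carries its own encoding. The byte values are the ones quoted in the post;
the variable names are made up for illustration.

```ruby
# Byte values quoted in the post (GB2312): "病，" = b2a1 a3ac, "。" = a1a3.
two_chars = [0xb2, 0xa1, 0xa3, 0xac].pack("C*")  # "病，" as raw bytes
full_stop = [0xa1, 0xa3].pack("C*")              # "。" as raw bytes

# Byte-oriented view: both strings are plain binary, so the search can
# match the second byte of "病" followed by the first byte of "，".
byte_hit = two_chars.index(full_stop)

# Character-oriented view: tag copies of the same bytes as GB2312 and
# search again. A match must now begin on a character boundary.
gb_text  = two_chars.dup.force_encoding("GB2312")
gb_stop  = full_stop.dup.force_encoding("GB2312")
char_hit = gb_text.index(gb_stop)

puts "byte-oriented match at offset: #{byte_hit.inspect}"
puts "character-oriented match:      #{char_hit.inspect}"
```

The first search should report a hit at byte offset 1, in the middle of
"病，", which is the false split seen above. The second should find
nothing, because the two bytes never line up with a whole character.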

  Thanks in advance.

  myan
phasis68 (Guest)
on 2005-12-03 07:15
(Received via mailing list)
Hi,

Try the script with $KCODE = "E"

Hope this helps,

Park Heesob
meng.yan (Guest)
on 2005-12-03 14:11
(Received via mailing list)
Hi Park,
   It works. Thank you very much!

   Could you please tell me why it works, and where I can find the
relevant documentation?

   Thank you.

   myan
phasis68 (Guest)
on 2005-12-03 15:12
(Received via mailing list)
Hi,
$KCODE is the character coding system Ruby handles. If the first
character of $KCODE is `e' or `E', Ruby handles EUC. If it is `s' or
`S', Ruby handles Shift_JIS. If it is `u' or `U', Ruby handles UTF-8.
If it is `n' or `N', Ruby doesn't handle multi-byte characters. The
default value is "NONE".
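
The likely reason "E" helps in this case is that GB2312 text is normally
stored in its EUC form (EUC-CN), so an EUC-aware regex engine steps
through it two bytes at a time. $KCODE belongs to Ruby 1.8; from Ruby 1.9
on it no longer has any effect, and the encoding is attached to each
string and regex instead. The following is a rough sketch of the same
character-aware split on such a Ruby, assuming a UTF-8 source file. The
helper name split_sentences and the second sample sentence are
hypothetical, not taken from the thread.

```ruby
# Assumes Ruby 1.9 or later and a UTF-8 source file.
# split_sentences is a hypothetical helper name, not from the thread.
def split_sentences(gb_text)
  # Encode the pattern source as GB2312 so the regex shares the text's
  # encoding and matches whole characters instead of raw bytes.
  terminators = Regexp.new("。|？|!".encode("GB2312"))
  gb_text.split(terminators)
end

# Sample input: the sentence from the original post plus one more
# (made-up) sentence, transcoded to GB2312 to mimic text read from a
# GB2312 file.
chunk = "你没病，他呢？我很好。".encode("GB2312")

# Convert the pieces back to UTF-8 only for display.
sentences = split_sentences(chunk).map { |s| s.encode("UTF-8") }
puts sentences
```

With this approach "病，" stays intact, because the pattern can only match
at character boundaries of the GB2312 text.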

Regards,

Park Heesob