Forum: Ruby Multibyte regexps...

Horacio Sanson (Guest)
on 2005-12-21 11:02
(Received via mailing list)
I am having some issues with regular expressions when working with
Japanese strings.

Using ruby-1.8.3 on Windows XP home (Japanese version) I have this test:

irb(main):271:0> s = "鞄"
=> "\212\223"
irb(main):272:0> l = "行"
=> "\215s"
irb(main):273:0> l =~ /s/
=> 1
irb(main):274:0> puts "#{$`}<<#{$&}>>#{$'}"
E<s>>
=> nil
irb(main):275:0> "#{$`}<<#{$&}>>#{$'}"
=> "\215<<s>>"
irb(main):276:0> s =~ /l/
=> nil


As you can see, comparing two totally different characters (kanji)
gives me a match. Reversing the match gives nil.


How can I get Ruby to match things correctly?

regards,
Horacio
Chintan Trivedi (Guest)
on 2005-12-21 12:33
(Received via mailing list)
l =~ /s/ ?

That looks for the literal character "s" in the string l, not for the
value stored in the variable s.
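
The difference between a literal pattern and the contents of a variable can be sketched as follows. The values and names here (NEEDLE, HAYSTACK, literal_match, value_match) are made up for illustration and are plain ASCII, so the result does not depend on any encoding setting.

```ruby
# Hypothetical example values; the names are not from the thread.
NEEDLE   = "abc"
HAYSTACK = "xsabcx"

# /s/ is a literal pattern: it finds the character "s", not the variable s.
def literal_match(text)
  text =~ /s/
end

# Interpolate the value to search for it. Regexp.escape keeps any
# regexp metacharacters in the value from being interpreted.
def value_match(text, value)
  text =~ /#{Regexp.escape(value)}/
end

p literal_match(HAYSTACK)        # index of the literal "s"
p value_match(HAYSTACK, NEEDLE)  # index where the value of NEEDLE starts
```

This explains why the pattern /s/ never looked at the variable s, but not why a kanji appeared to contain an "s"; that part is an encoding problem.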



Yukihiro Matsumoto (Guest)
on 2005-12-21 13:51
(Received via mailing list)
Hi,

In message "Re: Multibyte regexps..."
    on Wed, 21 Dec 2005 18:59:59 +0900, Horacio Sanson
<hsanson@moegi.waseda.jp> writes:

|I am having some issues with regular expressions when working with Japanese
|strings.
|
|Using ruby-1.8.3 on Windows XP home (Japanese version) I have this test:
|
|irb(main):271:0> s = "鞄"
|=> "\212\223"
|irb(main):272:0> l = "行"
|=> "\215s"
|irb(main):273:0> l =~ /s/
|=> 1
|irb(main):274:0> puts "#{$`}<<#{$&}>>#{$'}"
|E<s>>
|=> nil
|irb(main):275:0> "#{$`}<<#{$&}>>#{$'}"
|=> "\215<<s>>"
|irb(main):276:0> s =~ /l/
|=> nil

The encoding seems to be Shift_JIS. You have to specify the encoding
before doing regular expression matching. Put the s option after every
regular expression.

$KCODE="sjis"  # to make p work right
p s = "鞄"
p l = "行"
p l =~ /s/s    # trailing s option: match as Shift_JIS
puts "#{$`}<<#{$&}>>#{$'}"
p "#{$`}<<#{$&}>>#{$'}"
p s =~ /l/s

							matz.
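
Why the unexpected match happens can be sketched at the byte level. Note the assumption: this sketch targets Ruby 1.9 and later, where $KCODE no longer has any effect and each String carries its own encoding, so it is not the 1.8.3 setup from the thread. RAW, match_as_bytes and match_as_sjis are hypothetical names. The bytes are the ones irb printed for l above ("\215s", that is 0x8D followed by 0x73).

```ruby
# The two bytes irb printed for l: 0x8D followed by 0x73 (ASCII "s").
RAW = "\x8Ds"

# Viewed as raw bytes, the second byte is indistinguishable from "s".
def match_as_bytes(raw)
  raw.dup.force_encoding("ASCII-8BIT") =~ /s/
end

# Tagged as Windows-31J (Ruby's name for the Microsoft variant of
# Shift_JIS), the pair is one character, so a lone "s" is not found.
def match_as_sjis(raw)
  raw.dup.force_encoding("Windows-31J") =~ /s/
end

p match_as_bytes(RAW)                             # => 1
p match_as_sjis(RAW)                              # => nil
p RAW.dup.force_encoding("Windows-31J").length    # => 1
```

The first result mirrors the "=> 1" from the original irb session: without encoding information the matcher sees the trail byte of the kanji as an ordinary "s".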
Horacio Sanson (Guest)
on 2005-12-26 02:32
(Received via mailing list)
Thanks a lot... this seems to work OK.

Where can I find documentation about this $KCODE global variable and
the "s" that goes after each regexp? What exactly does the s mean?

Do I have to put it only on regexps with Japanese characters, or on
every regexp? I tried both and saw no difference.

When using Regexp.new to construct the regular expression, how can I
add the s option?

Sorry for so many questions, but I can't seem to find any docs about
these options.


Horacio
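
On the Regexp.new question, a hedged note: as far as I recall from the 1.8 documentation (not tested here), Regexp.new accepts a third lang argument, so Regexp.new("s", nil, "s") would be the counterpart of /s/s. The sketch below is instead for Ruby 1.9 and later, which is an assumption about the reader's environment and not the 1.8.3 setup in this thread; there a Regexp takes its encoding from the pattern string. sjis_regexp is a hypothetical helper name.

```ruby
# Hypothetical helper: build a Regexp that matches `text` literally, with
# the pattern encoded as Windows-31J (Ruby's name for the Microsoft
# variant of Shift_JIS).
def sjis_regexp(text)
  Regexp.new(Regexp.escape(text.encode("Windows-31J")))
end

re    = sjis_regexp("\u884C")                          # U+884C, bytes 0x8D 0x73
same  = "\x8Ds".dup.force_encoding("Windows-31J")      # the same character
other = "\x8A\x93".dup.force_encoding("Windows-31J")   # a different character

p re.encoding    # the encoding the pattern was built with
p(re =~ same)    # matches at index 0
p(re =~ other)   # nil: no match
```

Because the pattern and the strings carry the same encoding, the match is done character by character and not byte by byte.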

Horacio Sanson (Guest)
on 2005-12-26 03:53
(Received via mailing list)
I found some documentation about this. Thanks.

Just one question: it seems there are two different things I can do to
let Regexps handle multibyte Shift_JIS strings. One is to set the
$KCODE global variable to "sjis", and the other is to use the "s"
modifier when constructing the regular expression.

The question is: do I use only one of the two methods, or should I use
the "s" modifier even if I set $KCODE to "sjis"?

My testing tells me that setting the $KCODE global variable alone is
enough to get Shift_JIS strings and Regexps to work correctly, but I
just want to make sure.

thanks,
Horacio
