How can I monitor a Regexp?

Because a regular expression can behave differently depending on its
kcode (e.g. the behavior of \w), I decided that all my code should
specify the kcode explicitly (e.g. /\w+/n instead of /\w+/). So I tried
to set up some hooks to monitor the creation of each Regexp and raise an
exception if the kcode is missing. Like this:

class Regexp
  alias old_initialize initialize
  def initialize(*args)
    old_initialize(*args)
    raise "NO KCODE!" if kcode.nil?
  end
end

And it works fine if I use Regexp.new, but in the majority of cases the
regexp is expressed as a literal and initialize is NOT EXECUTED:

Regexp.new("foobar")
RuntimeError: NO KCODE!
/foobar/
=> /foobar/
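
The same difference between Regexp.new and a literal can be checked without kcode. A minimal sketch, assuming a Ruby recent enough to have Module#prepend; `RegexpInitSpy` and `$regexp_init_calls` are made-up names for illustration, not anything from the post above:

```ruby
# Hypothetical global counter: counts calls that reach Regexp#initialize.
$regexp_init_calls = 0

# Hypothetical spy module, prepended so that `super` reaches the original
# Regexp#initialize.
module RegexpInitSpy
  def initialize(*args, **kwargs)
    $regexp_init_calls += 1
    super(*args, **kwargs)
  end
end

Regexp.prepend(RegexpInitSpy)

built = Regexp.new("foobar")              # goes through Class#new and initialize
calls_after_new = $regexp_init_calls

literal = /foobar/                        # built by the parser, no method call
calls_after_literal = $regexp_init_calls

puts "#{built.inspect}: #{calls_after_new} call(s) after Regexp.new"
puts "#{literal.inspect}: #{calls_after_literal} call(s) after the literal"
```

If the counter does not move for the literal, the hook in initialize never sees it, which matches the irb session above.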

So I tried an alternate approach and set the hook on the =~ operator,
but I hit the same problem; the method override is completely ignored:

class String; def =~(o); raise "S"; end; end
class Regexp; def =~(o); raise "R"; end; end
"bar" =~ /bar/ #=> 0
/foo/ =~ "foo" #=> 0

So, does anyone have any idea how I can tackle this problem?

ruby -v

==> ruby 1.8.4 (2005-12-24) [i486-linux]

class String; def =~(o); raise "S"; end; end
class Regexp; def =~(o); raise "R"; end; end

r = /x/
r =~ 'a'

==> RuntimeError: R

    from (irb):2:in `=~'
    from (irb):4

'a' =~ r

==> RuntimeError: S

    from (irb):1:in `=~'
    from (irb):5

On 2/28/07, Daniel DeLorme wrote:

   raise "NO KCODE!" if kcode.nil?

So I tried an alternate approach and set the hook into the =~ operator, but same
problem; the method override is completely ignored:
class String; def =~(o); raise "S"; end; end
class Regexp; def =~(o); raise "R"; end; end
"bar" =~ /bar/ #=> 0
/foo/ =~ "foo" #=> 0

So, does anyone have any idea how I can tackle this problem?

Yes. Well, no: I had one, but the prospects look bleak now. Look at this:

robert@swserver:/home/svn 11:49:44
555/56 > ruby -r profile -e 'puts /a/'
(?-mix:a)
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
  0.00     0.00      0.00        2     0.00     0.00  IO#write
  0.00     0.00      0.00        1     0.00     0.00  Regexp#to_s
  0.00     0.00      0.00        1     0.00     0.00  Kernel.puts
  0.00     0.01      0.00        1     0.00    10.00  #toplevel
robert@swserver:/home/svn 11:49:50
556/57 > ruby -r profile -e 'puts Regexp.new("a")'
(?-mix:a)
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
  0.00     0.00      0.00        2     0.00     0.00  IO#write
  0.00     0.00      0.00        1     0.00     0.00  Kernel.puts
  0.00     0.00      0.00        1     0.00     0.00  Regexp#initialize
  0.00     0.00      0.00        1     0.00     0.00  Class#new
  0.00     0.00      0.00        1     0.00     0.00  Regexp#to_s
  0.00     0.01      0.00        1     0.00    10.00  #toplevel

I just do not see any way to intercept this at the Ruby level; you would
need to hack Ruby itself.
Maybe someone cleverer than me has an idea?
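
One possibility that stays at the Ruby level is to check the source text instead of hooking the runtime. The following is only a sketch, assuming the Ripper library that ships with Rubies newer than the 1.8.4 used in this thread; `regexps_without_kcode` is a hypothetical helper name, and it lexes a single source string rather than a whole project:

```ruby
require 'ripper'

# Hypothetical helper: returns [line, closing_token] pairs for every regexp
# literal in +source+ whose closing token carries none of the kcode flags
# n, e, s or u. The closing token is the delimiter plus any flags, e.g. "/i".
def regexps_without_kcode(source)
  offenders = []
  Ripper.lex(source).each do |(line, _col), type, token|
    next unless type == :on_regexp_end
    offenders << [line, token] unless token =~ /[nesu]/
  end
  offenders
end

sample = <<~'RUBY'
  a = /\w+/n
  b = /foo/
  c = /bar/i
RUBY

p regexps_without_kcode(sample)  # => [[2, "/"], [3, "/i"]]
```

This only sees literals written in the scanned source; regexps built with Regexp.new would still need the initialize hook.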

Cheers
Robert

Jan F. wrote:

    from (irb):4

'a' =~ r

==> RuntimeError: S

    from (irb):1:in `=~'
    from (irb):5

Very interesting. If you assign the regexp to a variable you get
the overridden methods. I guess there’s some voodoo optimization
at work when you use =~ on a regexp literal?

Daniel

Daniel DeLorme wrote:

Because a regular expression can have different behaviors depending on
its kcode (e.g. behavior of \w) I decided that all my code should
specify the kcode explicitly (e.g. /\w+/n instead of /\w+/).

As an addendum, I was wondering why \w matches extended characters in
UTF-8. If extended characters are considered "word" characters, does
that mean they are valid in identifiers? So I tried:

$KCODE='u'
=> "u"
def 日本語
  "nihongo"
end
=> nil
日本語
=> "nihongo"

wow. O_O

Daniel