MetaRegexp: experimental extensions to Regexp (requesting feedback)

ammar · October 2, 2010, 6:05pm

Fellow Rubyists,

I put together a couple of Regexp extension experiments (aka hacks)
that I would like to run by those how might be interested and have the
time. The code is at:

GitHub - ammar/meta_re: An experimental regular expression "preprocessor"

There’s a small README that explains what this does with some
examples. There are also a few tests, but nothing I would call
exhaustive. They mainly test that the plumbing, well, plumbs.

I hope someone finds this useful, or at the very least amusing. I also
hope to hear from others if they think this would be useful as a gem.

Regards,
Ammar

ammar · October 3, 2010, 12:38am

On 10/2/10, Ammar A. [email protected] wrote:

exhaustive. They mainly test that the plumbing, well, plumbs.

I hope someone finds this useful, or at the very least amusing. I also
hope to hear from others if they think this would be useful as a gem.

Wow, that’s pretty cool, actually. I like it. A couple of criticism
(meant as constructive):

I’m a little concerned about what will happen with this regexp:
/(?:foo){1,10000}/. Currently, Regexp handles this just fine, but with
your lib I fear there might be a performance problem.

Your Regexp parser looks to me to be a little naive. I gave it only
the briefest of scannings, but it looks like you’re not handling
special characters escaped by [ and ], nor some of the other types of
ways of using backslashes, like \C-x, for example. I’m not entirely
clear what happens if there are two or more backslashes in a row. (A
character is only escaped by backslashes if preceded by an odd number
of them.)

I wrote a more complete (at least I hope so) regexp parser (but also
specialized for my purposes) as part of my lib sequence. See the
group_anchors method in this file:

github.com

coatl/sequence/blob/master/lib/sequence/stringlike.rb

# Copyright (C) 2006,2008, 2011  Caleb Clausen
# Distributed under the terms of Ruby's license.
require 'sequence/subseq'

class Sequence
  module StringLike
    # Return an empty object used for returning a sequence of elements.
    # The only method required of this object is << (append to the sequence).
    #usually [] or ""
    def new_data; '' end

    def data_class; String end
    
    def like; StringLike end

    #-------------------------------------
    FFS_4BITTABLE=[nil,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0]
    def ffs
      holding{
        begin!

This file has been truncated. show original

It would be nice to have a Regexp parser as a standalone library,
actually.

ammar · October 3, 2010, 6:35am

On Sun, Oct 3, 2010 at 1:37 AM, Caleb C. [email protected] wrote:

Wow, that’s pretty cool, actually. I like it. A couple of criticism
(meant as constructive):

Thanks Caleb. Your criticisms and comments were indeed constructive
and very welcome. Much appreciated.

I’m a little concerned about what will happen with this regexp:
/(?:foo){1,10000}/. Currently, Regexp handles this just fine, but with
your lib I fear there might be a performance problem.

One problem I saw with large repetitions under 1.8.x is exceeding the
maximum number of allowed grouped sub-expressions, which is hard set
at >= 254 (or ((1<<8)-1) to quote the code). So the maximum number of
grouped repetitions under 1.8 will always be 253. I initially thought
it was a buffer exhaustion issue, but after checking, it turns out
there’s room for up to 64K under 1.8. In the case of
/(?:foo){1,10000)/, the group limit was being reached when the regex
was only at about 1.7K. With 1.9.x there doesn’t appear to be a limit,
or at least I didn’t hit it. I stopped trying at 250_000. Yet another
great reason to move to 1.9.

As for performance, I haven’t done any testing, but expect expressions
with greedy parts that require a lot of backtracking (already poorer
performers with a potential for exponential time) to get worse when
they get expanded. A benchmark is in order for sure. I would like to
also gauge how helpful atomic groups (?>) can be in situations like
this.

Your Regexp parser looks to me to be a little naive. I gave it only
the briefest of scannings, but it looks like you’re not handling
special characters escaped by [ and ], nor some of the other types of
ways of using backslashes, like \C-x, for example. I’m not entirely
clear what happens if there are two or more backslashes in a row. (A
character is only escaped by backslashes if preceded by an odd number
of them.)

You are right, the parser is naive and incomplete. It doesn’t handle
several cases, including character classes. I didn’t even consider
other escape sequences like control/meta characters or even \x hex
values. Thanks for pointing that omission out.

I’m not sure about the backslashes. What would you expect from this:

re = MetaRegexp.new ‘\d\d\\d\\d’
=> /\d\d\d\d/

The standard Regexp produces the same results:

re = Regexp.new ‘\d\d\\d\\d’
=> /\d\d\d\d/

Am I missing something? I will double check.

I wrote a more complete (at least I hope so) regexp parser (but also
specialized for my purposes) as part of my lib sequence. See the
group_anchors method in this file:
sequence/lib/sequence/stringlike.rb at master · coatl/sequence · GitHub

It would be nice to have a Regexp parser as a standalone library, actually.

Cool. Thanks for sharing that. Even after a quick read I learned a
couple of interesting things. I like the way you use split and the
handling of the escape sequences.

Also, I really like the idea of a standalone regex parser. Hmmm…
very tempting.

ammar · October 3, 2010, 9:06pm

On 10/2/10, Ammar A. [email protected] wrote:

your lib I fear there might be a performance problem.
great reason to move to 1.9.
Seems like performance is limited by the constraints of the underlying
regexp engine, then. In other words, no problem. Good. Maybe I
shouldn’t have worried.

As for performance, I haven’t done any testing, but expect expressions
with greedy parts that require a lot of backtracking (already poorer
performers with a potential for exponential time) to get worse when
they get expanded. A benchmark is in order for sure. I would like to
also gauge how helpful atomic groups (?>) can be in situations like
this.

I would think that there would actually be no difference in the
pathological backtracking cases, or maybe only a small one because the
expanded regexp will blow out the data cache more quickly.

Am I missing something? I will double check.
When you create a regexp from a string literal, there are actually two
levels of backslash interpretation. The string literal itself does
backslash interpretation first, then the regexp engine does a second
level. So, your examples above are not working quite the way you seem
to think…

So, the string literal ‘\d\d\\d\\d’ is equivalent to
“\d\d\\d\\d”… in other words, you’re still only testing the
cases of one and two backslashes before something else. A better
string to test with is ‘\d\\d\\\d\\\\d’. Doubling all the
backslashes defeats the first level of interpretation.

I just tried your parser out against that string, at it does appear to
be doing the right thing. I didn’t inspect your code closely enough
the first time to be confident of it, but after looking at it again, I
think you are handling multiple backslashes correctly.