Request for feedback, suggestions, contributions

I’m attempting to write a module whose classes make it easier to
construct and work with regular expressions. The two somewhat
self-explanatory files (main and test) are given below. I would be
grateful for feedback, especially on better ways of doing things. Also
note that I am developing on JRuby, so JRuby regexp syntax is currently
better supported than Ruby syntax.

Check out the second (test) file for simple examples of use.

Thanks,
Ken

File ‘rex.rb’

=begin rdoc
'rex.rb' is a file that provides classes intended to make it easier to
develop and use regular expressions. A primary feature is that it allows
one to easily construct larger regular expressions out of smaller regular
expressions. The other main feature is that it provides (or will provide)
many functions that make it easier to apply regular expressions in useful
ways. I also believe that, though it is more verbose than standard
Regexps, it provides much more readable code when constructing complex
regular expressions.

rex is not intended to be comprehensive; I don't have time for that. My
hope is that it will be useful for the 95% of 'common case' re's.
=end

CHARACTERS = {
  :dot => ".",
  :tab => "\t",
  :vtab => "\v",
  :newline => "\n",
  :return => "\r",
  :backspace => "\b",
  :form_feed => "\f",
  :bell => "\a",
  :esc => "\e",
  # Single quotes keep the backslash, so these stay Regexp shorthands.
  :word_char => '\w',
  :non_word_char => '\W',
  :whitespace_char => '\s',
  :non_whitespace_char => '\S',
  :digit_char => '\d',
  :non_digit_char => '\D'
}

class Rex

  attr_writer :is_group

=begin rdoc
Create a new Rex pattern with string as the pattern that will be passed to
Regexp. This is used by other Rex functions; you can also use it to create
a 'raw' pattern.
=end
  def initialize(string)
    @pat = string
    @is_group = false
    @regexp = Regexp.new(@pat)
  end

  def index(string, start=0)
    return string.index(@regexp, start)
  end

=begin rdoc
Yields each match in the string in succession.
=end
  def each(string)
    start = 0
    while true
      i = string.index(@regexp, start)
      print "MATCHED #{@regexp.inspect} AT #{i}!\n"
      if i == nil; break; end
      md = $~
      yield md
      # Step past a zero-length match so it is not found again.
      if md.end(0) == md.begin(0)
        start = md.end(0) + 1
      else
        start = md.end(0)
      end
    end
  end

=begin rdoc
Same as =~ on the corresponding Regexp
=end
  def =~(string)
    return @regexp =~ string
  end

=begin rdoc
Returns the pattern associated with this Rex instance. This is the string
that is passed to Regexp to create a new Regexp.
=end
  def pat
    return @pat
  end

  def group
    if @is_group
      return self
    else
      result = Rex.new("(?:#{@pat})")
      result.is_group = true
      return result
    end
  end

=begin rdoc
Regular expression concatenation; Lit.new("ab") + Lit.new("cd") will
produce a Rex that has the same meaning as the Regexp /abcd/ (though the
pattern will be different).
=end
  def +(other)
    return Rex.new(self.group.pat + other.group.pat)
  end

=begin rdoc
Used to define a named group. If rex is a Rex instance with an internal
pattern pat, then rex['name'] produces a new Rex with pattern
(?<name>pat).
=end
  def [](name)
    result = Rex.new("(?<#{name}>#{@pat})")
    result.is_group = true
    return result
  end

  # def +(other)
  #   r1 = self
  #   r1 = r1.wrap_if_not("+")
  #   other = other.wrap_if_not("+")
  #   r = Regexp.new(r1 + other)
  #   r.operator = "+"
  #   return r
  # end

=begin rdoc
Regular expression alternation. Lit.new("ab") | Lit.new("cd") will
produce a Rex that has the same meaning as the Regexp /ab|cd/ (though the
pattern will be different).
=end
  def |(other)
    return Rex.new(self.group.pat + "|" + other.group.pat)
  end

=begin rdoc
Same as the corresponding match method in Regexp.
=end
  def match(string)
    return @regexp.match(string)
  end

=begin rdoc
Returns a new Rex that is an optional version of this one;
Lit.new('a').optional has the same effect as the Regexp /a?/
=end
  def optional
    return Rex.new(self.group.pat + "?")
  end

  # Invoke on a Rex to indicate it is naturally grouped, i.e. it does not
  # need to be surrounded with parens before being put into another Rex.
  def natural_group # :nodoc:
    @is_group = true
    return self
  end

=begin rdoc
Defines regular expression repetitions. Lit.new('a').n(3) is the same as
/a{3,}/, while Lit.new('a').n(3..7) is the same as /a{3,7}/. Use 0 or 1
to achieve the same effect as the * and + Regexp operators. Tri-period
ranges of the form 3...8 are allowed, and have the meaning one would
expect, i.e. that range gives the same result as 3..7.
=end
  def n(repetitions)
    if repetitions.is_a?(Integer)
      return Rex.new(self.group.pat +
                     "{#{repetitions},}").natural_group
    elsif repetitions.is_a?(Range)
      ending = repetitions.end
      if repetitions.exclude_end?
        ending -= 1
      end
      return Rex.new(self.group.pat +
                     "{#{repetitions.begin},#{ending}}").natural_group
    end
  end

=begin rdoc
Same as method n, but nongreedy.
=end
  def n?(repetitions)
    if repetitions.is_a?(Integer)
      return Rex.new(self.group.pat +
                     "{#{repetitions},}?").natural_group
    elsif repetitions.is_a?(Range)
      ending = repetitions.end
      if repetitions.exclude_end?
        ending -= 1
      end
      return Rex.new(self.group.pat +
                     "{#{repetitions.begin},#{ending}}?").natural_group
    end
  end

  def to_s
    return @pat
  end
end

=begin rdoc
Create a new literal that will match exactly that string. This handles
Regexp escaping for you, so you do not need to worry about handling
characters with special meanings in Regexp.
=end
class Lit < Rex
  def initialize(string)
    @pat = Regexp.escape(string)
    @regexp = Regexp.new(@pat)
    @is_group = false
  end
end

class Chars < Rex
=begin rdoc
Creates a character class that matches those characters given in include,
except for those given in exclude. Each of include and exclude should be
one of:

  • A string, in which case it defines the set of characters to be
    included or excluded.
  • A double-dot (x..y) range, which will define a range of characters
    to be included or excluded.
  • A list of strings and ranges, which have the same meanings as above
    and are combined to produce the set of characters to be included or
    excluded.
  • A symbol, which is used to denote one of the special character
    classes.

Note that Chars defines no special characters.
include:: The set of characters to be included in the class. Include may
          be nil or the empty string, if you don't want to include
          characters in the class.
exclude:: The set of characters to be excluded from the class. Defaults
          to nil.
=end
  def initialize(include, exclude=nil)

    def list_to_chars(list)
      chars = ""
      list.each {|e|
        if e.is_a?(String)
          chars << Regexp.escape(e)
        elsif e.is_a?(Range)
          chars << Regexp.escape(e.begin) << "-" <<
                   Regexp.escape(e.end)
        elsif e.is_a?(Symbol)
          chars << "[:" << e.to_s << ":]"
        end
      }
      return chars
    end

    if include == nil or include == ""
      include = nil
    elsif include.is_a?(Array)
      include = list_to_chars(include)
    else
      include = list_to_chars([include])
    end

    if exclude.is_a?(Array)
      exclude = list_to_chars(exclude)
    elsif exclude != nil
      exclude = list_to_chars([exclude])
    end

    if exclude == nil
      chars = ("[#{include}]")
    elsif include == nil
      chars = "[^#{exclude}]"
    else
      chars = ("[#{include}&&[^#{exclude}]]")
    end

    @pat = chars
    @regexp = Regexp.new(@pat)
    @is_group = true

  end
end
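
To show why + and | wrap their operands in non-capturing groups, here is a minimal standalone sketch that works on raw pattern strings with plain Regexp only. The helpers grouped, concat and alt are made up for this illustration and are not part of rex.rb; they only mimic what Rex#group, Rex#+ and Rex#| appear to do.

```ruby
# Hypothetical helpers (not part of rex.rb) that mimic Rex#group, Rex#+
# and Rex#| on raw pattern strings.
def grouped(pat)
  "(?:#{pat})"
end

def concat(a, b)
  grouped(a) + grouped(b)
end

def alt(a, b)
  grouped(a) + "|" + grouped(b)
end

# Plain string concatenation lets "e" bind only to the second alternative.
naive   = Regexp.new("ab|cd" + "e")
# Wrapping each operand keeps the alternation together as one unit.
wrapped = Regexp.new(concat(alt("ab", "cd"), "e"))

puts wrapped.source           # (?:(?:ab)|(?:cd))(?:e)
puts naive.match("abe")[0]    # ab
puts wrapped.match("abe")[0]  # abe
```

The price is a more cluttered pattern string, which is why the docs above say the result has the same meaning as the hand-written Regexp even though the pattern differs.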

File ‘rex_test.rb’

$:.unshift File.join(File.dirname(__FILE__),'..','lib')

require 'test/unit'
require 'rex'

class RexTest < Test::Unit::TestCase
  def test_simple
    posint = Rex.new('[0-9]+')
    posfloat = posint + (Lit.new('.') + posint).optional
    float = (Lit.new('+')|Lit.new('-')).optional + posfloat
    complex = float['re'] + (Lit.new('+')|Lit.new('-')) +
              posfloat['im'] + Lit.new('i')
    print complex
    assert_equal(0, posint =~ "123")
    assert_equal(0, posfloat =~ "123.45")
    assert_equal(0, posfloat =~ "123")
    assert_equal("3.45", complex.match(" 3.45-2i")['re'])
  end

  def test_repetitions
    assert_equal("(?:a){3,}", Lit.new('a').n(3).pat)
    assert_equal("(?:a){3,5}", Lit.new('a').n(3..5).pat)
    assert_equal("(?:a){3,4}", Lit.new('a').n(3...5).pat)
    assert_equal("(?:a){3,}?", Lit.new('a').n?(3).pat)
    assert_equal("(?:a){3,5}?", Lit.new('a').n?(3..5).pat)
    assert_equal("(?:a){3,4}?", Lit.new('a').n?(3...5).pat)
  end

  def test_char_class
    assert_equal("[abc]", Chars.new("abc").pat)
    assert_equal("[^abc]", Chars.new(nil, "abc").pat)
    assert_equal("[abc&&[^de]]", Chars.new("abc", "de").pat)
    assert_equal("[abct-z&&[^n-u]]", Chars.new(["abc", "t".."z"],
                                               "n".."u").pat)
    assert_equal("[[:alnum:]]", Chars.new(:alnum).pat)
  end

  def test_index
    assert_equal(3, Rex.new("a").index("bcda"))
    assert_equal(3, Lit.new("a").index("bcda"))
  end

  def test_each
    pat = Lit.new('a').n(1)
    s = "aababbaaababb"
    result = []
    pat.each(s) {|md|
      result << md[0]
    }
    assert_equal(["aa", "a", "aaa", "a"], result)
  end
end
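
For anyone who wants to try the match iteration from test_each without loading rex.rb, the following is a rough standalone sketch of the same idea using plain Regexp. The name each_match is hypothetical, and the sketch uses Regexp#match with a start position rather than String#index and $~, so it is an approximation of Rex#each rather than a copy of it.

```ruby
# Hypothetical standalone helper (not part of rex.rb): yields the
# MatchData of each successive match of regexp in string.
def each_match(regexp, string)
  start = 0
  while start <= string.length
    md = regexp.match(string, start)
    break if md.nil?
    yield md
    # A zero-length match would be found again at the same position,
    # so step one character past it.
    start = md.end(0) == md.begin(0) ? md.end(0) + 1 : md.end(0)
  end
end

found = []
each_match(/a+/, "aababbaaababb") { |md| found << md[0] }
p found    # ["aa", "a", "aaa", "a"]

offsets = []
each_match(/(?=b)/, "abab") { |md| offsets << md.begin(0) }
p offsets  # [1, 3]
```

The second call is the case worth checking: a lookahead matches the empty string, so without the extra step the loop would keep reporting the same position.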