How would you design regexps in the integer domain?

andreas_l · May 5, 2008, 10:25am

I’m trying to come up with a clean way to specify regexps in the integer
domain. I.e. instead of describing a pattern of characters (as in normal
regexps) one describes patterns of integers (“17 followed by 3 or 15”
rather than “‘a’ followed by ‘b’ or ‘c’”).

My main objective is to make the syntax intuitive to Ruby users. I have
been toying around with a few different approaches, but I’m not sure if
any meet the goal. How would you design the syntax for regexps in the
integer domain?

The syntax does not need to support especially many operators:

Kleene star (‘*’ in character regexps)
“At least once” (‘+’ in character regexps)
Alternation (‘|’ in character regexps, i.e. “a|b” being “‘a’ or ‘b’”)

Below are some of the approaches one could take.

== String representation

Many Ruby users have used character regexps defined as strings, so it
would seem like a good idea to make integer regexps as similar as
possible. A delimiter would be needed though, since “17” could otherwise
mean “1 followed by 7” or just “17”. One delimiter could for instance be
a blank space. The following is how the pattern “Any number of (17 or, 1
followed by 5) followed by 4711” could look.

IntRegexp.new(‘(17|1 5)*4711’)

== Method combination

In this approach the regexp is gradually built up by invoking methods.
One major question is what the methods should be named.

The following is how the above example could look.

first_part = IntRegexp.new(17).or(IntRegexp.new(1).followed_by 5)
first_part.any_number_of_times.followed_by 4711

Or with some different method names

first_part = IntRegexp.new(17) | (IntRegexp.new(1) + 5)
first_part.any_times + 4711

I’m unsure what sort of method names would strike a good balance between
verbosity and readability.

== Blocks

TextualRegexp[1] provides a block-based interface for specifying normal
regexps. Perhaps something similar could be done for integer regexps?
The following is the syntax of TextualRegexp but with integers.

regexp do
at_least_zero do
any :of do
integer 17
group{ integer 1; integer 5 }
end
end
integer 4711
end

There’s a fair amount of redundancy since there will never be any thing
other than “integer” specified.

[1] http://rubyforge.org/projects/texrex/

andreas_l · May 5, 2008, 11:52am

Robert K. wrote:

2008/5/5 Andreas L. [email protected]:

I’m trying to come up with a clean way to specify regexps in the integer
domain. I.e. instead of describing a pattern of characters (as in normal
regexps) one describes patterns of integers (“17 followed by 3 or 15”
rather than “‘a’ followed by ‘b’ or ‘c’”).

Why do you need this, i.e. what advantages do you expect for a
particular “integer regexp” over classic regular expressions?

The two types of regexps are not really comparable since they work in
different domains. I.e. classic regular expressions describe patterns in
arrays of characters rather than in arrays of integers. One could for
instance use an integer regexp to decide whether an array begins with 17
and ends with 8.

Specifically this is for specifying deterministic finite automatons,
with an integer alphabet, for an upcoming constraint in Gecode/R[1].

IntRegexp.new(‘(17|1 5)*4711’)

You can as well do /(17|1 5)*4711/, can’t you?

True, assuming that one then convert it back to its source to reparse.
It would probably make it even more familiar.

[1] http://gecoder.rubyforge.org/

andreas_l · May 5, 2008, 10:45am

2008/5/5 Andreas L. [email protected]:

I’m trying to come up with a clean way to specify regexps in the integer
domain. I.e. instead of describing a pattern of characters (as in normal
regexps) one describes patterns of integers (“17 followed by 3 or 15”
rather than “‘a’ followed by ‘b’ or ‘c’”).

Why do you need this, i.e. what advantages do you expect for a
particular “integer regexp” over classic regular expressions?

Below are some of the approaches one could take.

IntRegexp.new(‘(17|1 5)*4711’)

You can as well do /(17|1 5)*4711/, can’t you?

== Method combination

In this approach the regexp is gradually built up by invoking methods.
One major question is what the methods should be named.

The following is how the above example could look.

first_part = IntRegexp.new(17).or(IntRegexp.new(1).followed_by 5)
first_part.any_number_of_times.followed_by 4711

This looks ugly.

TextualRegexp[1] provides a block-based interface for specifying normal
regexps. Perhaps something similar could be done for integer regexps?

The change would be easy, i.e. you just need to allow anything where
so far only raw strings were allowed and internally convert it via
#to_s.

end

There’s a fair amount of redundancy since there will never be any thing
other than “integer” specified.

[1] http://rubyforge.org/projects/texrex/

Apparently the project has gone and I could not prevent it since Ari
did not give me access to it. However, I posted an early version
here:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/263785

Kind regards

robert

andreas_l · May 5, 2008, 12:57pm

2008/5/5 Andreas L. [email protected]:

The two types of regexps are not really comparable since they work in
different domains. I.e. classic regular expressions describe patterns in
arrays of characters rather than in arrays of integers. One could for
instance use an integer regexp to decide whether an array begins with 17
and ends with 8.

Ah! That bit of information was missing from the original posting.

Specifically this is for specifying deterministic finite automatons,
with an integer alphabet, for an upcoming constraint in Gecode/R[1].

I see. Then of course it’s a different story. Unless you translate
the array of integers into a string and apply a textual regexp. But
you can still do that internally, i.e. shield that away as an
implementation detail.

I’d probably choose the approach we had taken with texrex because it
is easy to implement and has good usability (albeit it’s a bit
verbose). I would base the decision ultimately on the users of such a
package, i.e. if they are familiar with regular expressions then the
string based approach might be the better choice (short and concise)
while for others the wordy approach might be better.

Kind regards

robert

andreas_l · May 5, 2008, 4:20pm

Eivind E. wrote:

def repeat(minimum, maximum, repeated_expression)
def or(*expressions)

So you’d convert your example “(17|1 5)*” to repeat(0, nil, or(17, [1, 5]))

I like that idea, especially using the arrays to group integers. It
seems like a rather economic way to do it. I think “any” would fit
better than “or” though (since it can take more than two arguments). The
example in its entirety would then be

[repeat(0, nil, any(17, [1, 5])), 4711]

I would also consider swapping the position of repeated_expression to
the first argument of repeat, since it seems like a more natural order
to me (“repeat expression between 0 and 4 times”). I’m a bit split on
using nil for infinity, but that is minor.

I will need to explore the idea, but it feels like a good fit to me.
Thank you.

Actually, as I thought of after I had written the above, I have
written code that does part of this already, and is generic - it’s
available in types.rb (Index of /~eivind/ruby/types/).
If you want it, I think I can extend that to support repeats tonight.
Feel free to use code from it under any license that absolves me of
legal blame (ie, the only reason I don’t put it in the public domain
is because that expose me to legal liability.)

Thanks for the offer, but there’s no need to add additional support. The
important part is the idea, which you have provided.

andreas_l · May 5, 2008, 5:17pm

On May 5, 2:24 am, Andreas L. [email protected] wrote:

I’m trying to come up with a clean way to specify regexps in the integer
domain. I.e. instead of describing a pattern of characters (as in normal
regexps) one describes patterns of integers (“17 followed by 3 or 15”
rather than “‘a’ followed by ‘b’ or ‘c’”).

You could specify it as a PEG[1], possibly using Treetop[2] in some
way.

The benefit of the PEG (re-usable sub-expressions) may not outweigh
the multi-line nature. However, you could do something like LPEG[3]
does. Instead of using standard PEG syntax come up with a custom
syntax via operators that allows you to describe an entire grammar in
a single Ruby expression. I’m personally not a fan of the syntax LPEG
uses; you could come up with a better one, perhaps using terse method
names instead of operators in some cases. (Perhaps
“pattern.atleast(5)” instead of LPEG’s “pattern^5”.)

[1] Parsing expression grammar - Wikipedia
[2] http://treetop.rubyforge.org/
[3] http://www.inf.puc-rio.br/~roberto/lpeg.html

andreas_l · May 5, 2008, 5:51pm

On May 5, 9:10 am, Phrogz [email protected] wrote:

The benefit of the PEG (re-usable sub-expressions) may not outweigh
the multi-line nature.

To expand on this a little bit more, and use an LPEG-like custom
grammar entirely in Ruby, using PEG instead of regexp would allow you
to do something like:

header = IPEG[17,34,15,12,89,127]
break = IPEG[13,10]
close = IPEG[0]

body = (IPEG[12,43,13,break,57,75] | IPEG[18,13,31])
message_style_1 = header + body + close
message_style_2 = header + IPEG[108,103] + break + body + close
message = message_style_1 | message_style_2

result = message.match( my_array_of_numbers )

As you can see, this allows you to break up complex expressions into
named parts. This makes it both easier to understand, and DRYs up the
expression.

In the above made up syntax:

“IPEG” stands for “Integer Parsing Expression Grammar”
The IPEG.[] notation might represent a literal sequence of
integers
The + operator would combine sequences
The | operator provides alternation

andreas_l · May 5, 2008, 12:55pm

On Mon, May 5, 2008 at 11:51 AM, Andreas L. [email protected]
wrote:

The two types of regexps are not really comparable since they work in
different domains. I.e. classic regular expressions describe patterns in
arrays of characters rather than in arrays of integers. One could for
instance use an integer regexp to decide whether an array begins with 17
and ends with 8.

Specifically this is for specifying deterministic finite automatons,
with an integer alphabet, for an upcoming constraint in Gecode/R[1].

I’d implement this using a generic repeat operator instead of
hardcoding * and +, and using a generic comparator instead of
restricting to Integers.

The declaration for this would be something like

Use repeat(0, nil, …) for *

Use repeat(1, nil, …) for +

Use repeat(0, 1, …) for ?

def repeat(minimum, maximum, repeated_expression)
def or(*expressions)

So you’d convert your example “(17|1 5)*” to repeat(0, nil, or(17, [1,
5]))

Actually, as I thought of after I had written the above, I have
written code that does part of this already, and is generic - it’s
available in types.rb (Index of /~eivind/ruby/types/).
If you want it, I think I can extend that to support repeats tonight.
Feel free to use code from it under any license that absolves me of
legal blame (ie, the only reason I don’t put it in the public domain
is because that expose me to legal liability.)

Eivind.

andreas_l · May 5, 2008, 6:48pm

On May 5, 2008, at 2:24 AM, Andreas L. wrote:

if
any meet the goal. How would you design the syntax for regexps in the
integer domain?

The syntax does not need to support especially many operators:

Kleene star (‘*’ in character regexps)

“At least once” (‘+’ in character regexps)

Alternation (‘|’ in character regexps, i.e. “a|b” being “‘a’ or
‘b’”)

Below are some of the approaches one could take.

i personally would not bother making a new syntax, rather i’d simply
use the exiting one and add a new atom builder:

cfp:~ > cat a.rb
N = 0.chr

def N n
“#{ n }#{ N }”
end

module Enumerable
def nmatch pattern
match = pattern.match( select{|i| Fixnum === i}.join(N) )
if match
returned = []
captured = match.to_a.map{|capture| capture.split N}
returned << captured.first.map{|s| Float(s).to_i}
captured[1…-1].each do |list|
returned << Float(list.first).to_i
end
returned
else
nil
end
end
end

pattern = %r/#{ N 42 }#{ N 7 }(#{ N 17 }){2}(#{ N 1 }){1,}/

p [42, 7, 17, 17, 1, 1, 1].nmatch( pattern ).to_a

cfp:~ > ruby a.rb
[[42, 7, 17, 17, 1, 1], 17, 1]

a @ http://codeforpeople.com/

andreas_l · May 5, 2008, 7:05pm

ara.t.howard [email protected] wrote:

we can deny everything, except that we have the possibility of being
better. simply reflect on that.

which suppose their is a norm somewhere, ie. something able to measure
what is worst what is best…

andreas_l · May 6, 2008, 6:13am

ara.t.howard wrote:

match = pattern.match( select{|i| Fixnum === i}.join(N) )

   match = pattern.match( grep(Fixnum).join(N) )

is a nice alternative.

andreas_l · May 6, 2008, 6:19am

On May 5, 2008, at 10:13 PM, Joel VanderWerf wrote:

is a nice alternative.

it sure is! never considered using a trequals object with grep -
makes perfect sense too!

cheers.

a @ http://codeforpeople.com/

andreas_l · May 5, 2008, 7:12pm

On May 5, 2008, at 11:05 AM, Une Bévue wrote:

which suppose their is a norm somewhere, ie. something able to measure
what is worst what is best…

i don’t think the implication is an absolute one. whether relative or
absolute i think it would be rare the to find a being that did not
know this about themselves at some, possibly deeply hidden, level.

a @ http://codeforpeople.com/