Regex - Exclude Multiple Characters and Global Scanning

blacknight · June 21, 2008, 6:51am

Hihi,

I have 2 problems.

--------------Question 1-----------------------
Firstly, a Ruby question. I’m confused about how to match a single
regular expression multiple times in a single string. For instance,

'llgllallo'.match(/(ll.)/)[0] #-> 'llg'
'llgllallo'.match(/(ll.)/)[1] #-> 'llg'
'llgllallo'.match(/(ll.)/)[2] #-> nil

How do I access all 3 matches? String#scan will work, but that gives me

'llgllallo'.scan(/(ll.)/) #=> [["llg"], ["lla"], ["llo"]]

But I need the offsets, and this info isn’t given to me.

--------------Question 2-----------------------
Now an old gap in my regex understanding. How do I exclude a sequence of
consecutive characters? I want something like [^abc], except that 'aba' or
'bbc' is OK, just not 'abc'. Summing this up:

reg = /something/
'abc'.match(reg) #-> no match
'cba'.match(reg) #-> match

And then I want to be able to do OR operations too, like not ‘abc’ and
not ‘bbc’, but that is probably another step of complexity.

I don’t suppose there is any way to pass a block to the regex to use in
a specific place? That would be cool, though maybe not possible given
optimisations in regex?

Thanks in advance,
ben

blacknight · June 21, 2008, 10:02am

Hi –

On Sat, 21 Jun 2008, Ben W. wrote:

'llgllallo'.match(/(ll.)/)[2] #-> nil

How do I access all 3 matches? String#scan will work, but that gives me

'llgllallo'.scan(/(ll.)/) #=> [["llg"], ["lla"], ["llo"]]

But I need the offsets, and this info isn’t given to me.

You could do:

irb(main):029:0> offsets = []
=> []
irb(main):030:0> str.scan(/ll./) { offsets << $~.offset(0)[1] }
=> "llgllallo"
irb(main):031:0> offsets
=> [3, 6, 9]

(Pending someone coming up with something slicker. I don’t like the
temp variable particularly, but anyway.)
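
As a rough sketch of another route to the same information, the loop below collects whole MatchData objects by calling Regexp#match repeatedly from a moving start position, so both start and end offsets are available. It assumes Ruby 1.9 or later (where match accepts a position argument), and all_matches is a hypothetical helper name, not a core method.

```ruby
# Hypothetical helper: returns every non-overlapping match as a MatchData.
# Assumes Ruby 1.9+, where Regexp#match accepts a start position.
def all_matches(str, pattern)
  matches = []
  pos = 0
  while pos <= str.length && (md = pattern.match(str, pos))
    matches << md
    # Continue after this match; step one further on an empty match
    # so the loop cannot get stuck at the same position.
    pos = md.end(0) > md.begin(0) ? md.end(0) : md.end(0) + 1
  end
  matches
end

str = 'llgllallo'
mds = all_matches(str, /ll./)
starts = mds.map { |md| md.begin(0) }   # start offset of each match
spans  = mds.map { |md| md.offset(0) }  # [start, end] pair of each match
```

Here starts holds the position where each match begins, while the scan version above records where each match ends.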

--------------Question 2-----------------------
Now an old gap in my regex understanding. How do I exclude on
consecutive characters? I want something like [^abc], except aba or bbc
is ok, just not ‘abc’. Summing this up:

[^abc] means: match one character that is not ‘a’, not ‘b’, and not
‘c’. I don’t think that’s what you mean.

reg = /something/
'abc'.match(reg) #-> no match
'cba'.match(reg) #-> match

And then I want to be able to do OR operations too, like not ‘abc’ and
not ‘bbc’, but that is probably another step of complexity.

You can use (?!), which is negative lookahead.

irb(main):033:0> reg = /(?!abc)[abc]{3}/
=> /(?!abc)[abc]{3}/

So that means: three of a, b, c, as long as we’re not looking at
“abc” when we start looking for those three characters.

irb(main):034:0> reg.match("abc")
=> nil
irb(main):035:0> reg.match("abb")
=> #<MatchData:0x69de8>
irb(main):036:0> reg.match("cba")
=> #<MatchData:0x63de4>
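
For the "not 'abc' and not 'bbc'" case, one possible sketch is to put an alternation inside the same negative lookahead. The second pattern is an assumed variant, not something from the thread: it rejects a whole string that contains either sequence anywhere, by checking the lookahead before every character it consumes.

```ruby
# Reject either 'abc' or 'bbc' at the point where the three characters start.
reg = /(?!abc|bbc)[abc]{3}/

r_abc = reg.match('abc')   # no match: lookahead sees 'abc'
r_bbc = reg.match('bbc')   # no match: lookahead sees 'bbc'
r_cba = reg.match('cba')   # matches 'cba'

# Assumed variant: the whole string must be free of 'abc' and 'bbc'.
# Each character is consumed only if neither sequence starts at it.
clean = /\A(?:(?!abc|bbc).)*\z/m

c_bad  = clean.match('xxabcxx')  # no match: 'abc' appears in the middle
c_good = clean.match('aba')      # matches the whole string
```

Note that the first pattern only guards the position where the match starts, so it can still find a match later in a longer string that happens to contain 'abc'.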

I don’t suppose there is any way to pass a block to the regex to use in
a specific place? That would be cool, though maybe not possible given
optimisations in regex?

Blocks get passed to methods, not objects, and regexes are objects.
Some of the methods that use regexes also take blocks, like scan, sub,
and gsub. I’m not sure what you mean about the specific place, though.

David

blacknight · June 22, 2008, 3:53am

David A. Black wrote:

You could do:

irb(main):029:0> offsets = []
=> []
irb(main):030:0> str.scan(/ll./) { offsets << $~.offset(0)[1] }
=> "llgllallo"
irb(main):031:0> offsets
=> [3, 6, 9]

(Pending someone coming up with something slicker. I don’t like the
temp variable particularly, but anyway.)

That will work, thanks. It would seem intuitive to me that scan (or a
method like it) would iterate over MatchData objects, but anyway. Thanks.

--------------Question 2-----------------------
Now an old gap in my regex understanding. How do I exclude on
consecutive characters? I want something like [^abc], except aba or bbc
is ok, just not ‘abc’. Summing this up:

[^abc] means: match one character that is not ‘a’, not ‘b’, and not
‘c’. I don’t think that’s what you mean.

reg = /something/
'abc'.match(reg) #-> no match
'cba'.match(reg) #-> match

And then I want to be able to do OR operations too, like not ‘abc’ and
not ‘bbc’, but that is probably another step of complexity.

You can use (?!), which is negative lookahead.

irb(main):033:0> reg = /(?!abc)[abc]{3}/
=> /(?!abc)[abc]{3}/

So that means: three of a, b, c, as long as we’re not looking at
“abc” when we start looking for those three characters.

irb(main):034:0> reg.match("abc")
=> nil
irb(main):035:0> reg.match("abb")
=> #<MatchData:0x69de8>
irb(main):036:0> reg.match("cba")
=> #<MatchData:0x63de4>

That is exactly what I meant. I was unaware of the negative lookahead
operator. Thanks!

I don’t suppose there is any way to pass a block to the regex to use in
a specific place? That would be cool, though maybe not possible given
optimisations in regex?

Blocks get passed to methods, not objects, and regexes are objects.
Some of the methods that use regexes also take blocks, like scan, sub,
and gsub. I’m not sure what you mean about the specific place, though.

I did not explain my question very well, sorry. I meant it would be cool
if you could pass a block that became part of the regex itself. For
instance, instead of /(?!abc)/ you could somehow tell it
{|s| s != 'abc'}

Just an idea, doesn’t really matter now you’ve fixed my problem.
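
Ruby's regex syntax has no hook for a block, but the idea can be approximated by matching a broader pattern and letting a block filter the candidates. A minimal sketch, where scan_where is a hypothetical helper name and not part of Ruby:

```ruby
# Hypothetical helper (not part of Ruby): scan for a broad pattern, then
# keep only the matched strings for which the block returns true.
def scan_where(str, pattern)
  kept = []
  str.scan(pattern) do
    text = Regexp.last_match[0]     # full text of the current match
    kept << text if yield(text)
  end
  kept
end

result = scan_where('abcabbcba', /[abc]{3}/) { |s| s != 'abc' }
```

This is not equivalent to a lookahead: scan first picks non-overlapping candidates ('abc', 'abb', 'cba' here) and the block only filters them afterwards, so result keeps 'abb' and 'cba'.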

Thanks,
ben


blacknight · June 23, 2008, 3:35am

From: Ben W.

David A. Black wrote:

> irb(main):029:0> offsets = []
> => []
> irb(main):030:0> str.scan(/ll./) { offsets << $~.offset(0)[1] }
> => "llgllallo"
> irb(main):031:0> offsets
> => [3, 6, 9]

That will work, thanks. It would seem intuitive to me that scan (or a
method like it) would iterate over MatchData objects, but

$~ is a MatchData object.
You could wrap David's hint in a method if you want something similar to #scan,

e.g.:

class String
  def mapscan(pattern)
    atemp = []
    scan(pattern) { atemp << yield($~) }
    atemp
  end
end
#=> nil

s
#=> "llgllallo"

s.mapscan(/ll./) { |md| [md[0], md.offset(0)] }
#=> [["llg", [0, 3]], ["lla", [3, 6]], ["llo", [6, 9]]]
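
If reopening String is not wanted, a commonly used idiom gets at the same MatchData objects by turning scan into an enumerator. This is a sketch of that alternative, not what the post above does; it relies on scan setting the last-match data before each iteration.

```ruby
s = 'llgllallo'

# Enumerate scan's iterations and capture the MatchData set for each one.
mds = s.to_enum(:scan, /ll./).map { Regexp.last_match }

# Same shape of result as the mapscan call: matched text plus [start, end].
pairs = mds.map { |md| [md[0], md.offset(0)] }
```

The trade-off is mostly one of style: mapscan reads more naturally at the call site, while the enumerator form leaves the core class untouched.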

>> --------------Question 2-----------------------
>> Now an old gap in my regex understanding. How do I exclude on
>> consecutive characters? I want something like [^abc], except aba or bbc
>> is ok, just not ‘abc’. Summing this up:
>
> [^abc] means: match one character that is not ‘a’, not ‘b’, and not
> ‘c’. I don’t think that’s what you mean.

>> reg = /something/
>> 'abc'.match(reg) #-> no match
>> 'cba'.match(reg) #-> match

>> And then I want to be able to do OR operations too, like not ‘abc’ and
>> not ‘bbc’, but that is probably another step of complexity.

> You can use (?!), which is negative lookahead.

> irb(main):033:0> reg = /(?!abc)[abc]{3}/
> => /(?!abc)[abc]{3}/

> So that means: three of a, b, c, as long as we’re not looking at
> “abc” when we start looking for those three characters.

> irb(main):034:0> reg.match("abc")
> => nil
> irb(main):035:0> reg.match("abb")
> => #<MatchData:0x69de8>
> irb(main):036:0> reg.match("cba")
> => #<MatchData:0x63de4>

That is exactly what I meant. I was unaware of the negative lookahead
operator. Thanks!

If you want to compare sequences, you can build one complete reference
sequence for your case and test everything against it, so you do not end
up creating many regex patterns.

e.g.:

SEQALPHA = ("a".."z").to_a.join
#=> "abcdefghijklmnopqrstuvwxyz"
SEQALPHA.match "abc"
#=> #<MatchData:0x2906288>
SEQALPHA.match "def"
#=> #<MatchData:0x28ff870>
SEQALPHA.match "xyz"
#=> #<MatchData:0x28fb4a0>
SEQALPHA.match "bac"
#=> nil
SEQALPHA.match "cba"
#=> nil
SEQALPHA.match "yyy"
#=> nil

Negating it for your case is simple:

not SEQALPHA.match "bac"
#=> true
not SEQALPHA.match "abc"
#=> false

Now, using mapscan above, you can do:
SEQALPHA.mapscan(/abc|xyz/) { |md| [md[0], md.offset(0)] }
#=> [["abc", [0, 3]], ["xyz", [23, 26]]]

BTW, index is faster if you just want a simple string comparison.
SEQALPHA.index "abc"
#=> 0
SEQALPHA.index "def"
#=> 3
SEQALPHA.index "xxx"
#=> nil
SEQALPHA.index "bac"
#=> nil
SEQALPHA.index /abc/
#=> 0
SEQALPHA.index /def/
#=> 3
SEQALPHA.index /efd/
#=> nil

again, negating is simple

not SEQALPHA.index /def/
#=> false
not SEQALPHA.index /fde/
#=> true
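
When only a yes/no answer is needed, String#include? is probably the most direct form of the same check. A small sketch, using a local variable named seqalpha here in place of the constant above:

```ruby
# Local stand-in for the SEQALPHA constant used in the examples above.
seqalpha = ('a'..'z').to_a.join

in_order  = seqalpha.include?('def')   # substring present in the alphabet
out_order = seqalpha.include?('fde')   # not a run of consecutive letters
negated   = !seqalpha.include?('fde')  # negation without index or match
```

Unlike index, include? returns true or false directly, so there is no nil to negate.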

hth.
kind regards -botp