How about Enumerable#find_pattern?

Ignore the name, I don’t really know what it’s best to call it.
Basically I’ve
been thinking it might be useful if Enumerable had a method to
efficiently
search for a pattern in an Enumerable object. Existing methods like
#find and
#grep work only on matching a single object.

This method might be implemented using the Knuth-Morris-Pratt algorithm
(Knuth–Morris–Pratt algorithm - Wikipedia). It just
seems to
me that it’s a useful feature for Enumerable objects to have. A brute
force
search is easy using #each_cons, and of course users can write their own
KMP-based searches, it just strikes me as something that has general
utility.
There’d need to be one to return all matches and another to return only
the
first. The method should return the position of the start of the match.
Comments?

Alex

Hi –

On Sun, 10 Sep 2006, A. S. Bradbury wrote:

Ignore the name, I don’t really know what it’s best to call it. Basically I’ve
been thinking it might be useful if Enumerable had a method to efficiently
search for a pattern in an Enumerable object. Existing methods like #find and
#grep work only on matching a single object.

#find, yes, but #grep does something similar to what I understand you
to be describing:

%w{a b c}.grep(/./)
=> [“a”, “b”, “c”]
irb(main):002:0> str = “a\nb\nc”
=> “a\nb\nc”
irb(main):003:0> str.grep(/./)
=> [“a\n”, “b\n”, “c”]

David

On Sunday 10 September 2006 13:03, [email protected] wrote:

#find, yes, but #grep does something similar to what I understand you
to be describing:

%w{a b c}.grep(/./)
=> [“a”, “b”, “c”]
irb(main):002:0> str = “a\nb\nc”
=> “a\nb\nc”
irb(main):003:0> str.grep(/./)
=> [“a\n”, “b\n”, “c”]

I’m thinking this:
foo=[1,2,3,4,5,6,5,4,3,2,1]
foo.find_pattern(5,6,5) #=> 4 (position of the start of this pattern)

Alex

> > I'm thinking this: > foo=[1,2,3,4,5,6,5,4,3,2,1] > foo.find_pattern(5,6,5) #=> 4 (position of the start of this pattern)

something a little bit more elaborated than this

class Array
def pattern( *args )
l = args.length
c = dup
while c.first(l) != c do
c.shift
return nil if c.length < l
end
length - c.length
end
end

could be put into your toolbox
HTH
Robert

Alex


Deux choses sont infinies : l’univers et la bêtise humaine ; en ce qui
concerne l’univers, je n’en ai pas acquis la certitude absolue.

  • Albert Einstein

On Sunday 10 September 2006 15:04, Robert D. wrote:

end
That is one possible implementation, and this isn’t something I’m having
a
problem implementing. I’m just asking whether it would be appropriate
for
there to be a pattern matcher for Enumerable objects in core implemented
using an efficient algorithm such as Knuth-Morris-Pratt. To me it just
seems
useful to be able to efficiently search Enumerable objects without
having to
implement it myself. It’s a very small addition that just makes
Enumerable
objects that much more useful.

Alex

On 9/10/06, A. S. Bradbury [email protected] wrote:

length - c.length
end
end

‘while c.first(l) != c do’ should of course be:
while c.first(l) != args do

Alex

There is at least one guy reading my posts!!!
Naturally, strange I tested my code, must have been a funny test :frowning:
Thx
Robert


Deux choses sont infinies : l’univers et la bêtise humaine ; en ce qui
concerne l’univers, je n’en ai pas acquis la certitude absolue.

  • Albert Einstein

On Sunday 10 September 2006 15:04, Robert D. wrote:

end
‘while c.first(l) != c do’ should of course be:
while c.first(l) != args do

Alex

Well, here’s a (not very good) implementation, kind of ported from some
python
code.

module Enumerable

def find_pattern(*pat)
match_length=0
match_pos=0
shifts=compute_table(pat)
self.each do |obj|
while (match_length >=0) and !(pat[match_length]===obj)
match_pos+=shifts[match_length]
match_length-=shifts[match_length]
end
match_length+=1
if match_length == pat.length
return match_pos
end
end
return nil # failed match
end

private
def compute_table(pat)
shifts=Array.new(pat.size)
shift = 1
0.upto pat.size do |i|
a=pat[i-1]
b=pat[i-shift-1]
while (shift < i) and (pat[i-1] != pat[i-shift-1])
shift += shifts[i-shift-1]
end
shifts[i]=shift
end
return shifts
end
end

a=“aaaabbaabbab”.split(//)
a.find_pattern ‘a’, ‘b’, ‘b’ #=> 3
a.find_pattern ‘c’ #=> nil
a.find_pattern /a|b/, ‘a’, ‘b’ #=> 2

The idea is to allow efficient searching for patterns in the elements of
any
Enumerable object, the example above is contrived.

Alex

On 9/12/06, A. S. Bradbury [email protected] wrote:

  while (match_length >=0) and !(pat[match_length]===obj)

  shifts[i]=shift

The idea is to allow efficient searching for patterns in the elements of any
Enumerable object, the example above is contrived.

Nice, but it can fall down if the pattern contains a regexp which
matches more than one element:

a = “aaaabbabbab”.split(//)
a.find_pattern /aab/, ‘b’ #=> nil

There’s probably a way to fix that, but I’m not sure I can see how.

I did a pretty slavish translation to ruby of the KMP algorithm as
given in the Wikipedia article. It involved turning the enumerable
into either an array or a string so that it could be indexed instead
of using each. That didn’t allow regexps at all though.

I’ve been working on hacking that to work with regexps, but that would
really only work if the enumerable was a string anyway, so it would
probably be better to do a specialized implementation in String.


Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

IPMS/USA Region 12 Coordinator
http://ipmsr12.denhaven2.com/

Visit the Project Mercury Wiki Site
http://www.mercuryspacecraft.com/

On Tuesday 12 September 2006 19:50, Rick DeNatale wrote:

Nice, but it can fall down if the pattern contains a regexp which
matches more than one element:

a = “aaaabbabbab”.split(//)
a.find_pattern /aab/, ‘b’ #=> nil

There’s probably a way to fix that, but I’m not sure I can see how.

I’d actually argue that’s expected behaviour and you’re just using it
wrong(!). /aab/ simply doesn’t match any element, so it should fail.
The
main problem with my current implementation is it doesn’t really skip
ahead
when it should do, it just continues to the next iteration. This is
where
it’s easier when you’re using an index and a while loop (and the object
responds_to []), but we can’t assume that for an Enumerable-compatible
implementation.

The Enumerable mixin provides #to_a, but relying on this seems like a
poor
solution…

I did a pretty slavish translation to ruby of the KMP algorithm as
given in the Wikipedia article. It involved turning the enumerable
into either an array or a string so that it could be indexed instead
of using each. That didn’t allow regexps at all though.

I’ve been working on hacking that to work with regexps, but that would
really only work if the enumerable was a string anyway, so it would
probably be better to do a specialized implementation in String.

The string approach really doesn’t seem like the right way to go about
this,
if I understand you correctly. Not for the problem I’m trying to solve
at
least. The idea of #find_pattern is that it will work for any arbitrary
object.

Perhaps you should be able to specify a comparison function in a block
upon
calling find_pattern.

Alex

Alex

On 9/12/06, A. S. Bradbury [email protected] wrote:

wrong(!). /aab/ simply doesn’t match any element, so it should fail.
Well, I’d expect that if you allow regexps that you should allow
regexps! But see below.

The
main problem with my current implementation is it doesn’t really skip ahead
when it should do, it just continues to the next iteration. This is where
it’s easier when you’re using an index and a while loop (and the object
responds_to []), but we can’t assume that for an Enumerable-compatible
implementation.

The Enumerable mixin provides #to_a, but relying on this seems like a poor
solution…

Not sure why.

if I understand you correctly. Not for the problem I’m trying to solve at
least. The idea of #find_pattern is that it will work for any arbitrary
object.

But then regexps won’t work except for a string.

Also, strings are a funny kind of enumerable here, since by default,
they just yield themselves in each, unless they include newlines. I’d
think that the general use of find_pattern in a string would be to
search for the pattern in the string, not as a pattern of the lines in
the string.

Perhaps you should be able to specify a comparison function in a block upon
calling find_pattern.

Of course this would require careful interface design to expand the
generality while preserving the efficiency.


Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

On Wednesday 13 September 2006 21:03, Rick DeNatale wrote:

I’d actually argue that’s expected behaviour and you’re just using it
wrong(!). /aab/ simply doesn’t match any element, so it should fail.

Well, I’d expect that if you allow regexps that you should allow
regexps! But see below.

But why would it be expected behaviour for a regexp to try to match
across
multiple items? The regexp is applied to a single item of the Enumerable
object, surely that’s the only sensible way to deal with this?.

%w[This is an example].find_pattern /an example/ #=> false.
Surely you wouldn’t expect the above snippet to automagically match?

The Enumerable mixin provides #to_a, but relying on this seems like a
poor solution…

Not sure why.

I’m just thinking that then you have to build the array in memory. It’s
never
going to be a big deal unless you have really large arrays. It just
seems
cleaner to seek a solution that examines a single member of the
Enumerable
object at a time.

The string approach really doesn’t seem like the right way to go about
this, if I understand you correctly. Not for the problem I’m trying to
solve at least. The idea of #find_pattern is that it will work for any
arbitrary object.

But then regexps won’t work except for a string.

Well no, #find_pattern isn’t meant to be tied to regexp matching. You
just
could use it for that. I’m not sure I’m really following you that well
on
this point.

Also, strings are a funny kind of enumerable here, since by default,
they just yield themselves in each, unless they include newlines. I’d
think that the general use of find_pattern in a string would be to
search for the pattern in the string, not as a pattern of the lines in
the string.

Well, you’d use it in a way that is meaningful for your purposes. I’m
basically using this to find patterns in an array of tokens. If my
tokenization rules were simple, perhaps a regexp could be compiled, but
it’s
really not possible in this situation.

Alex