String generalization

Hello all,

I am having the following problem: I have to implement a method
which accepts a string and returns a generalized form of the string.
Some examples:

‘12345’ → \d+
‘ABCDE’ → [A-Z]+
[email protected]’ → [a-z]+.[a-z]+@([a-z]+.)+[a-z]
‘123-45-678-90’ → (\d±)+\d+
‘item:’ → [a-z]+:
‘Peter, SZINEK’: → [A-Z][a-z]+, [A-Z]+
http://www.google.com’ → [a-z]+://([a-z]+.)+[a-z]
‘jd£;d:L348kddd3’ → .*

So, given an example, the function should generate a generic regexp
which is then used to match the instances of the same class (i.e. based
on an e-mail, you create a regexp which can be used to match other
emails).
Of course this problem is not solvable in general. You would need more
examples (both positive and negative ones, as it is proven that just
using positive examples you can not generalize a pattern (generate a
regular grammar desribibg it) and then machine learn/induce a
grammar/etc based on that. Unfortunately i have only one positive
example. So i do not need a perfect solution (which is not possible
anyway) just a mostly working one.

I have implemented it in java, and it works pretty well but it is ugly
as hell (as any java code dealing with stuff where you need regexps,
slicing, maps etc). Any ideas for a nice Ruby code solving this? Then i
would call it via JRuby (and killed by the other colleagues using java
but not Ruby ;-).

thx,
Peter

As you said for yourself this is quite a general problem. So my code is
not to show off a clever automaton, it is completely stupid, but rather to
give you the idea how to hide it :wink:
Please note that it is not even complete (end handling has to be done) I
just wanted to give you ideas.

Wow! That’s what i like about Ruby. My solution in java (although more
complete,
and dealing with special cases etc) is about 10x bigger in size and much
more
ugly/obscure/unmaintainable etc. Thx for the code!

Just an idea: Maybe you should make some more effort in the definition of
the problem and post it to Ruby Q.!
Really? (I am new here, begun to learn Ruby 2 weeks ago) - i have seen
Ruby Q. but did not know that anybody can post problems… (And if it
gets solved, this way “outsorce” a task from his full-time job tasklist
:wink:

Best wishes,
Peter

On 4/6/06, Peter S. [email protected] wrote:

‘123-45-678-90’ → (\d±)+\d+
examples (both positive and negative ones, as it is proven that just
but not Ruby ;-).

thx,
Peter

As you said for yourself this is quite a general problem. So my code
is
not to show off a clever automaton, it is completely stupid, but rather
to
give you the idea how to hide it :wink:
Please note that it is not even complete (end handling has to be done) I
just wanted to give you ideas.

Just an idea: Maybe you should make some more effort in the definition
of
the problem and post it to Ruby Q.!
----------------------------------------- 8<

cat gen1.rb; ruby gen1.rb
#!/usr/bin/env ruby

class Generalize < Regexp

Do not use a Hash, order is crucial!

Generalisations = [ ‘\w’, ‘\d’, ‘@’, ‘.’, ‘.’ ]

    def initialize( str )
            @rep = generalize(str)

            super @rep
    end

    def to_s
            s = super
            "<" + s + ">:" + self.class.to_s + "\"" + @rep +"\""
    end

    private
    #
    # This is the real tough part which might become ugly, but at 

least
it is private :slight_smile:
def generalize( str )
r = “”
old = nil
star= “”
str.each_byte do
|b|
c = b.chr
# Much more intelligence to be put here, maybe
also
more context than the
# single string we have at our disposal.
Generalisations.each do
| lit |
reg = Regexp.new( lit )
if reg === c then
if lit == old then
star = “+”
else
if old then
r << old
r << star
star = “”
end
old = lit
end
break
end
end
end
r
end

end

r = Generalize.new(“[email protected]”)
puts r.to_s
<(?-mix:\w+.\w+@\w+.)>:Generalize"\w+.\w+@\w+."
------------------------------- >8

Cheers
Robert

Deux choses sont infinies : l’univers et la bêtise humaine ; en ce qui
concerne l’univers, je n’en ai pas acquis la certitude absolue.

  • Albert Einstein

On 4/6/06, Peter S. [email protected] wrote:

SNIP


Really? (I am new here, begun to learn Ruby 2 weeks ago) - i have seen

Ruby Q. but did not know that anybody can post problems… (And if it
gets solved, this way “outsorce” a task from his full-time job tasklist
:wink:

Best wishes,
Peter

Well James reads this list frequently, he will tell us, I guess :wink:
But my idea is the following.

  • Give some constraints about how different sets of example strings
    would
    influence each other.
  • Define a test suite f.e.

I think along these lines
If the program has only one string “acccd” it will generalize to “.+”
if you give a second one
[ “addff”, “” ] it will generalize to “."
if you give the following data
[ “129”, “0x1ef” ], [“add”, “”] it shall create “\d+|0x[0-9a-f]+”, ".

etc. etc.

Believe me this is more than challanging and maybe a real good
definition
of the problem is more
difficult than the solution.

Cheers
Robert


Deux choses sont infinies : l’univers et la bêtise humaine ; en ce qui
concerne l’univers, je n’en ai pas acquis la certitude absolue.

  • Albert Einstein

Peter S. schrieb:

‘item:’ → [a-z]+:
‘Peter, SZINEK’: → [A-Z][a-z]+, [A-Z]+
http://www.google.com’ → [a-z]+://([a-z]+.)+[a-z]
‘jd£;d:L348kddd3’ → .*

Peter, thanks for the quiz. I needed it right now :slight_smile:

I’m not sure I understand the rules. I had expected:

‘//’ → /+
‘jd£;d:L348kddd3’ → [a-z]+£;[a-z]:[A-Z]\d+[a-z]+\d

Anyway, here’s a first implementation for the use cases you’ve shown:

def generalize( str )
str.gsub( /\d+|[a-z]+|[A-Z]+|./ ) do |match|
case match
when /\d{2,}/: “\d+”
when /\d/: “\d”
when /[a-z]{2,}/: “[a-z]+”
when /[a-z]/: “[a-z]”
when /[A-Z]{2,}/: “[A-Z]+”
when /[A-Z]/: “[A-Z]”
when /./: “\.”
end
end.gsub( /(.+)\1+/, “(\1)+” )
end

You notice some repetitions, so you can generalize it further. It isn’t
perfect, but maybe gets you going.

Regards,
Pit

On Apr 6, 2006, at 8:08 AM, Peter S. wrote:

Really? (I am new here, begun to learn Ruby 2 weeks ago) - i have
seen Ruby Q. but did not know that anybody can post problems…
(And if it gets solved, this way “outsorce” a task from his full-
time job tasklist :wink:

Sure can. Ruby Q. is a resource. Use away. :wink:

However, I should warn that I’m getting stricter and stricter about
making most submitters solve the problem before we run it. Have to
make sure it’s a reasonable problem.

James Edward G. II

etc. etc.
Uh-oh. I definitely like the idea to do something like this (i.e. to
define
a quiz based on the above problem). But imho the
problem as you described it is in some way a machine learning problem -
i.e. a quite hard one. I am really new here, so i am not sure about the
usual level of the Ruby Q. puzzles - but this seems to me a master
thesis level problem rather than a quiz question :wink:

bw,
Peter

cat gen1.rb; ruby gen1.rb
#!/usr/bin/env ruby

class Generalize < Regexp

The only part i am missing is the one that gave me te most headache in
the java version - the second step of generalization.
Example:

r = Generalize.new(“[email protected]”)
You would return:
\w+.\w+@\w+.\w+.\w+.\w+.\w+ (first step)

But i want

\w+.\w+@(\w+.)+\w+ (second step)

instead…

i.e. to find the repeating strings in the pattern and + them.
I mean it’s ok to find the repeating strings, since you can do this with
one back-referencing regexp. But to decide which of the repeating parts
to + etc. is a real PITB… At least for me it was.

bw,
Peter

On 4/6/06, James Edward G. II [email protected] wrote:

On Apr 6, 2006, at 8:08 AM, Peter S. wrote:

However, I should warn that I’m getting stricter and stricter about
making most submitters solve the problem before we run it. Have to
make sure it’s a reasonable problem.

And you do well, doing so.
But I think, coming up with a reasonably understandable specification of
the
problem and submitting it to Ruby Q. has its own rewards.
(a) By understanding the problem, one might be very close to the
solution.
Actually it is a Conditio Sine Qua Non, but sometimes we are not
aware
of it.
(b) Discussing the thing on-list before submission will get some useful
feadback.
(c) If you are out of ruby-quizzes (and that it may never happen) it
might
be good to have some ideas on the shelf.
One shall not be disappointed if you reject it, but as I stated above it
would be for a good, well explained reason, so it should not be taken
personally.

Robert


Deux choses sont infinies : l’univers et la bêtise humaine ; en ce qui
concerne l’univers, je n’en ai pas acquis la certitude absolue.

  • Albert Einstein

Hi,

Peter, thanks for the quiz. I needed it right now :slight_smile:
Actually it was not ment as quiz (its a task at hand at my workplace),
but i like the mentionality of the Ruby ML that everybody started to
solve it as a quiz :wink: Good work James!

I’m not sure I understand the rules. I had expected:

‘//’ -> /+
Yeah, you re right with this one.
‘jd£;d:L348kddd3’ -> [a-z]+£;[a-z]:[A-Z]\d+[a-z]+\d
The idea is, that if the input string is a mixture of this and that
without some ‘sensible’ pattern, it should be just generalized to .*
Well, it is not very straightforward to pin down whether a string is a
mixture or not, and when there is a sensible pattern in it. (And what is
sensible in this context) ATM i am using some crazy heuristics wich
works very nicely in the practice, but no one knows (not even me) what
it is really doing :wink: It just works.

To put it another way: a human user can tell that ‘jd£;d:L348kddd3’ is
some crap and not a repeating pattern of something (like a tel #, an
email, a name) in any way. So the rule is, that if you, as a human user
think it is a crap rather than something meaningful, the computer should
say this as well ;-). And you produced a chunk of a code which can be
wired into the first machine passing the Turing test…

bw,
Peter

On Thu, 06 Apr 2006 14:11:10 +0200, Peter S. [email protected] wrote:

‘item:’ → [a-z]+:
‘Peter, SZINEK’: → [A-Z][a-z]+, [A-Z]+
http://www.google.com’ → [a-z]+://([a-z]+.)+[a-z]
‘jd£;d:L348kddd3’ → .*

Here is my shot at the problem:

class String

from strict to less strict

PATTERNS_TO_GENERALIZE = [
/\d/, /\d+/,
/[a-z]/, /[a-z]+/,
/[A-Z]/, /[A-Z]+/,
/\w/, /\w+/
]
def generalize_to_regexp
parts = split(/(\W+)/).map { |x|
case x
when “”
nil
when /\W+/
Regexp.escape(x)
else
PATTERNS_TO_GENERALIZE.find { |p|
(p =~ x) && x == $&
}.source
end
}.compact
parts2 = [] # will contain the final parts
size = 0 # size of the window we are currently checking
parts.size.times { |idx|
size += 1
next if size < 4 # window to small
if size % 2 == 0
if parts[idx] != parts[idx-2] || parts[idx-1] != parts[idx-3]
if size > 4
parts2 << “(#{parts[idx-3]}#{parts[idx-2]})+”
size = 2
else # size == 4
parts2 << parts[idx-3]
size -= 1
end
end
end
}
# handle the remaining window
if size >= 4
size %= 2
parts2 << “(#{parts[-2-size]}#{parts[-1-size]})+”
end
parts2.concat(parts[-size, size]) if size > 0
Regexp.new(parts2.join(“”))
end

end

if $0 == FILE
[‘12345’, ‘ABCDE’, ‘[email protected]’,
‘123-45-678-90’, ‘item:’, ‘Peter, SZINEK’,
http://www.google.com’, ‘jd£;d:L348kddd3’].each { |s|
puts “#{s} => #{s.generalize_to_regexp.inspect}”
}
end

The part that finds the repeating patterns is a bit ugly, but it should
work. Here are the results:

$ ruby generalize_rx.rb
12345 => /\d+/
ABCDE => /[A-Z]+/
[email protected] => /[a-z]+.[a-z]+@([a-z]+.)+[a-z]+/
123-45-678-90 => /(\d+-)+\d+/
item: => /[a-z]+:/
Peter, SZINEK => /\w+,\ [A-Z]+/
http://www.google.com => /[a-z]+://([a-z]+.)+[a-z]+/
jd£;d:L348kddd3 => /[a-z]+\243;[a-z]:\w+/

Dominik

The part that finds the repeating patterns is a bit ugly, but it should
work. Here are the results:
You did not see the java version… Now THAT’S ugly :wink:

$ ruby generalize_rx.rb
12345 => /\d+/
ABCDE => /[A-Z]+/
[email protected] => /[a-z]+.[a-z]+@([a-z]+.)+[a-z]+/
123-45-678-90 => /(\d+-)+\d+/
item: => /[a-z]+:/
Peter, SZINEK => /\w+,\ [A-Z]+/
http://www.google.com => /[a-z]+://([a-z]+.)+[a-z]+/
jd£;d:L348kddd3 => /[a-z]+\243;[a-z]:\w+/

Wow! Thanks man! Nice solution. Ruby is awesome.

Cheers,
Peter

On Thu, 06 Apr 2006 19:16:44 +0200, Robert D.
[email protected]
wrote:

ABCDE => /[A-Z]+/
/\d+[a-z]+/, no?
split(/(\W+)/) splits the string into alphanumerical and
non-alphanumerical parts:

irb(main):001:0> “12a”.split(/(\W+)/)
=> [“12a”]
irb(main):002:0> “.12a.iu-23…4234”.split(/(\W+)/)
=> [“”, “.”, “12a”, “.”, “iu”, “-”, “23”, “…”, “4234”]

Then my implementation escapes the non-alphanumeric parts and tries to
find a pattern for the alphanumeric parts (from PATTERNS_TO_GENERALIZE).
The first pattern that matches “12a” is /\w+/ so it gets choosen.

If you want /\d+[a-z]+/ you can either add more patterns to
PATTERNS_TO_GENERALIZE, or try to further split the alphanumeric parts.

But what would you do with “1b23xy3sa”? Should that be /\w+/ or
/\d[a-z]\d+[a-z]+\d[a-z]+/?

Dominik

On 4/6/06, Dominik B. [email protected] wrote:

But what would you do with “1b23xy3sa”? Should that be /\w+/ or
/\d[a-z]\d+[a-z]+\d[a-z]+/?

Dominik

That is exactly the question one has to ask himself.
Given the original post it can be generalized to /.*/.
To generalize one string does not make any sense, to generalize sets of
sets
of strings each set defining an entity, that makes sense…
… and is increadibly difficult.

Cheers
Robert


Deux choses sont infinies : l’univers et la bêtise humaine ; en ce qui
concerne l’univers, je n’en ai pas acquis la certitude absolue.

  • Albert Einstein

On 4/6/06, Dominik B. [email protected] wrote:

[email protected] => /[a-z]+.[a-z]+@([a-z]+.)+[a-z]+/
123-45-678-90 => /(\d+-)+\d+/
item: => /[a-z]+:/
Peter, SZINEK => /\w+,\ [A-Z]+/
http://www.google.com => /[a-z]+://([a-z]+.)+[a-z]+/
jd£;d:L348kddd3 => /[a-z]+\243;[a-z]:\w+/

Well that is quite impressive, but I could not figure out why
“12a”.you_know_the_name_of_the_method gives /\w+/, it clearly should
give
/\d+[a-z]+/, no?

Cheers
Robert

Dominik


Deux choses sont infinies : l’univers et la bêtise humaine ; en ce qui
concerne l’univers, je n’en ai pas acquis la certitude absolue.

  • Albert Einstein