Forum: Ruby String generalization

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
35594c037eba2fa48f7129d5fded828b?d=identicon&s=25 Peter Szinek (Guest)
on 2006-04-06 14:12
(Received via mailing list)
Hello all,

I am having the following problem: I have to implement a method
which accepts a string and returns a  generalized form of the string.
Some examples:

'12345' -> \d+
'ABCDE' -> [A-Z]+
'john.smith@my.super.server.rb' -> [a-z]+\.[a-z]+@([a-z]+\.)+[a-z]
'123-45-678-90' -> (\d+-)+\d+
'item:' -> [a-z]+:
'Peter, SZINEK': -> [A-Z][a-z]+, [A-Z]+
'http://www.google.com' -> [a-z]+://([a-z]+\.)+[a-z]
'jd£;d:L348kddd3' -> .*

So, given an example, the function should generate a generic regexp
which is then used to match the instances of the same class (i.e. based
on an e-mail, you create a regexp which can be used to match other
emails).
Of course this problem is not solvable in general. You would need more
examples (both positive and negative ones, as it is proven that just
using positive examples you can not generalize a pattern (generate a
regular grammar desribibg it) and then machine learn/induce a
grammar/etc based on that. Unfortunately i have only one positive
example. So i do not need a perfect solution (which is not possible
anyway) just a mostly working one.

I have implemented it in java, and it works pretty well but it is ugly
as hell (as any java code dealing with stuff where you need regexps,
slicing, maps etc). Any ideas for a nice Ruby code solving this? Then i
would call it via JRuby (and killed by the other colleagues using java
but not Ruby ;-).

thx,
Peter
703fbc991fd63e0e1db54dca9ea31b53?d=identicon&s=25 Robert Dober (Guest)
on 2006-04-06 14:59
(Received via mailing list)
On 4/6/06, Peter Szinek <peter@rt.sk> wrote:
> '123-45-678-90' -> (\d+-)+\d+
> examples (both positive and negative ones, as it is proven that just
> but not Ruby ;-).
>
> thx,
> Peter
>
>
As you said for yourself this is quite a *general* problem. So my code
is
not to show off a clever automaton, it is completely stupid, but rather
to
give you the idea how to hide it ;)
Please note that it is not even complete (end handling has to be done) I
just wanted to give you ideas.

Just an idea: Maybe you should make some more effort in the definition
of
the problem and post it to Ruby Quiz!
----------------------------------------- 8<
---------------------------------------------
 cat gen1.rb; ruby gen1.rb
#!/usr/bin/env ruby
#

class Generalize < Regexp

# Do not use a Hash, order is crucial!
Generalisations = [ '\w', '\d', '@', '\.', '.' ]

        def initialize( str )
                @rep = generalize(str)

                super @rep
        end

        def to_s
                s = super
                "<" + s + ">:" + self.class.to_s + "\"" + @rep +"\""
        end

        private
        #
        # This is the real tough part which might become ugly, but at
least
it is private :)
        def generalize( str )
                r = ""
                old = nil
                star= ""
                str.each_byte do
                        |b|
                        c = b.chr
                        # Much more intelligence to be put here, maybe
also
more context than the
                        # single string we have at our disposal.
                        Generalisations.each do
                                | lit |
                                reg = Regexp.new( lit )
                                if reg === c then
                                        if lit == old then
                                                star = "+"
                                        else
                                                if old then
                                                        r << old
                                                        r << star
                                                        star = ""
                                                end
                                                old = lit
                                        end
                                        break
                                end
                        end
                end
                r
        end


end

r = Generalize.new("robert.dober@gmail.com")
puts r.to_s
<(?-mix:\w+\.\w+@\w+\.)>:Generalize"\w+\.\w+@\w+\."
------------------------------- >8
-----------------------------------------
Cheers
Robert
--
Deux choses sont infinies : l'univers et la bêtise humaine ; en ce qui
concerne l'univers, je n'en ai pas acquis la certitude absolue.

- Albert Einstein
35594c037eba2fa48f7129d5fded828b?d=identicon&s=25 Peter Szinek (Guest)
on 2006-04-06 15:11
(Received via mailing list)
> As you said for yourself this is quite a *general* problem. So my code is
> not to show off a clever automaton, it is completely stupid, but rather to
> give you the idea how to hide it ;)
> Please note that it is not even complete (end handling has to be done) I
> just wanted to give you ideas.

Wow! That's what i like about Ruby. My solution in java (although more
complete,
and dealing with special cases etc) is about 10x bigger in size and much
more
ugly/obscure/unmaintainable etc. Thx for the code!

> Just an idea: Maybe you should make some more effort in the definition of
> the problem and post it to Ruby Quiz!
Really? (I am new here, begun to learn Ruby 2 weeks ago) - i have seen
Ruby Quiz but did not know that anybody can post problems... (And if it
gets solved, this way "outsorce" a task from his full-time job tasklist
;-)

Best wishes,
Peter
703fbc991fd63e0e1db54dca9ea31b53?d=identicon&s=25 Robert Dober (Guest)
on 2006-04-06 15:20
(Received via mailing list)
On 4/6/06, Peter Szinek <peter@rt.sk> wrote:
>
> SNIP


----------------

Really? (I am new here, begun to learn Ruby 2 weeks ago) - i have seen
> Ruby Quiz but did not know that anybody can post problems... (And if it
> gets solved, this way "outsorce" a task from his full-time job tasklist
> ;-)
>
> Best wishes,
> Peter
>
> Well James reads this list frequently, he will tell us, I guess ;)
But my idea is the following.

* Give some constraints about how different sets of example strings
would
influence each other.
* Define a test suite f.e.

I think along these lines
If the program has only one string "acccd" it will generalize to ".+"
if you give a second one
[ "addff", "" ] it will generalize to ".*"
if you give the following data
[  "129", "0x1ef" ], ["add", ""] it shall create "\d+|0x[0-9a-f]+", ".*"
etc. etc.

Believe me this is more than challanging and maybe a *real* good
definition
of the problem is more
difficult than the solution.

Cheers
Robert



--
Deux choses sont infinies : l'univers et la bêtise humaine ; en ce qui
concerne l'univers, je n'en ai pas acquis la certitude absolue.

- Albert Einstein
93d566cc26b230c553c197c4cd8ac6e4?d=identicon&s=25 Pit Capitain (Guest)
on 2006-04-06 15:39
(Received via mailing list)
Peter Szinek schrieb:
> 'item:' -> [a-z]+:
> 'Peter, SZINEK': -> [A-Z][a-z]+, [A-Z]+
> 'http://www.google.com' -> [a-z]+://([a-z]+\.)+[a-z]
> 'jd£;d:L348kddd3' -> .*

Peter, thanks for the quiz. I needed it right now :-)

I'm not sure I understand the rules. I had expected:

   '//' -> /+
   'jd£;d:L348kddd3' -> [a-z]+£;[a-z]:[A-Z]\d+[a-z]+\d

Anyway, here's a first implementation for the use cases you've shown:

   def generalize( str )
     str.gsub( /\d+|[a-z]+|[A-Z]+|\./ ) do |match|
       case match
       when /\d{2,}/: "\\d+"
       when /\d/: "\\d"
       when /[a-z]{2,}/: "[a-z]+"
       when /[a-z]/: "[a-z]"
       when /[A-Z]{2,}/: "[A-Z]+"
       when /[A-Z]/: "[A-Z]"
       when /\./: "\\."
       end
     end.gsub( /(.+)\1+/, "(\\1)+" )
   end

You notice some repetitions, so you can generalize it further. It isn't
perfect, but maybe gets you going.

Regards,
Pit
35594c037eba2fa48f7129d5fded828b?d=identicon&s=25 Peter Szinek (Guest)
on 2006-04-06 15:48
(Received via mailing list)
> etc. etc.
Uh-oh. I definitely like the idea to do something like this (i.e. to
define
a quiz based on the above problem). But imho the
problem as you described it is in some way a machine learning problem -
i.e. a quite hard one. I am really new here, so i am not sure about the
usual level of the Ruby Quiz puzzles - but this seems to me a master
thesis level problem rather than a quiz question ;-)

bw,
Peter
4299e35bacef054df40583da2d51edea?d=identicon&s=25 James Gray (bbazzarrakk)
on 2006-04-06 15:51
(Received via mailing list)
On Apr 6, 2006, at 8:08 AM, Peter Szinek wrote:

> Really? (I am new here, begun to learn Ruby 2 weeks ago) - i have
> seen Ruby Quiz but did not know that anybody can post problems...
> (And if it gets solved, this way "outsorce" a task from his full-
> time job tasklist ;-)

Sure can.  Ruby Quiz is a resource.  Use away.  ;)

However, I should warn that I'm getting stricter and stricter about
making most submitters solve the problem before we run it.  Have to
make sure it's a reasonable problem.

James Edward Gray II
35594c037eba2fa48f7129d5fded828b?d=identicon&s=25 Peter Szinek (Guest)
on 2006-04-06 16:10
(Received via mailing list)
>  cat gen1.rb; ruby gen1.rb
> #!/usr/bin/env ruby
> #
>
> class Generalize < Regexp
>
The only part i am missing is the one that gave me te most headache in
the java version - the second step of generalization.
Example:

r = Generalize.new("a.b@c.d.e.f.g")
You would return:
\w+.\w+@\w+.\w+.\w+.\w+.\w+ (first step)

But i want

\w+.\w+@(\w+\.)+\w+ (second step)

instead...

i.e. to find  the repeating strings in the pattern and + them.
I mean it's ok to find the repeating strings, since you can do this with
one back-referencing regexp. But to decide which of the repeating parts
to + etc. is a real PITB... At least for me it was.

bw,
Peter
35594c037eba2fa48f7129d5fded828b?d=identicon&s=25 Peter Szinek (Guest)
on 2006-04-06 16:13
(Received via mailing list)
Hi,

> Peter, thanks for the quiz. I needed it right now :-)
Actually it was not ment as quiz (its a task at hand at my workplace),
but i like the mentionality of the Ruby ML that everybody started to
solve it as a quiz ;-) Good work James!
>
> I'm not sure I understand the rules. I had expected:
>
>   '//' -> /+
Yeah, you re right with this one.
>   'jd£;d:L348kddd3' -> [a-z]+£;[a-z]:[A-Z]\d+[a-z]+\d
The idea is, that if the input string is a mixture of this and that
without some 'sensible' pattern, it should be just generalized to .*
Well, it is not very straightforward to pin down whether a string is a
mixture or not, and when there is a sensible pattern in it. (And what is
sensible in this context) ATM i am using some crazy heuristics wich
works very nicely in the practice, but no one knows (not even me) what
it is really doing ;-) It just works.

To put it another way: a human user can tell that 'jd£;d:L348kddd3' is
some crap and not a repeating pattern of something (like a tel #, an
email, a name) in any way. So the rule is, that if you, as a human user
think it is a crap rather than something meaningful, the computer should
say this as well ;-). And you produced a chunk of a code which can be
wired into the first machine passing the Turing test...

bw,
Peter
703fbc991fd63e0e1db54dca9ea31b53?d=identicon&s=25 Robert Dober (Guest)
on 2006-04-06 16:26
(Received via mailing list)
On 4/6/06, James Edward Gray II <james@grayproductions.net> wrote:
>
> On Apr 6, 2006, at 8:08 AM, Peter Szinek wrote:
>
>
> However, I should warn that I'm getting stricter and stricter about
> making most submitters solve the problem before we run it.  Have to
> make sure it's a reasonable problem.


And you do well, doing so.
But I think, coming up with a reasonably understandable specification of
the
problem and submitting it to Ruby Quiz has its own rewards.
 (a) By understanding the problem, one might be very close to the
solution.
      Actually it is a Conditio Sine Qua Non, but sometimes we are not
aware
of it.
 (b) Discussing the thing on-list before submission will get some useful
feadback.
 (c) If you are out of ruby-quizzes (and that it may never happen) it
might
be good to have some ideas on the shelf.
One shall not be disappointed if you reject it, but as I stated above it
would be for a good, well explained reason, so it should not be taken
personally.

Robert

>
>
>
>


--
Deux choses sont infinies : l'univers et la bêtise humaine ; en ce qui
concerne l'univers, je n'en ai pas acquis la certitude absolue.

- Albert Einstein
18ca239ffade6df0b839d26062f173fb?d=identicon&s=25 Dominik Bathon (Guest)
on 2006-04-06 17:30
(Received via mailing list)
On Thu, 06 Apr 2006 14:11:10 +0200, Peter Szinek <peter@rt.sk> wrote:

> 'item:' -> [a-z]+:
> 'Peter, SZINEK': -> [A-Z][a-z]+, [A-Z]+
> 'http://www.google.com' -> [a-z]+://([a-z]+\.)+[a-z]
> 'jd£;d:L348kddd3' -> .*

Here is my shot at the problem:

class String

   # from strict to less strict
   PATTERNS_TO_GENERALIZE = [
     /\d/, /\d+/,
     /[a-z]/, /[a-z]+/,
     /[A-Z]/, /[A-Z]+/,
     /\w/, /\w+/
   ]
   def generalize_to_regexp
     parts = split(/(\W+)/).map { |x|
       case x
       when ""
         nil
       when /\W+/
         Regexp.escape(x)
       else
         PATTERNS_TO_GENERALIZE.find { |p|
           (p =~ x) && x == $&
         }.source
       end
     }.compact
     parts2 = [] # will contain the final parts
     size = 0 # size of the window we are currently checking
     parts.size.times { |idx|
       size += 1
       next if size < 4 # window to small
       if size % 2 == 0
         if parts[idx] != parts[idx-2] || parts[idx-1] != parts[idx-3]
           if size > 4
             parts2 << "(#{parts[idx-3]}#{parts[idx-2]})+"
             size = 2
           else # size == 4
             parts2 << parts[idx-3]
             size -= 1
           end
         end
       end
     }
     # handle the remaining window
     if size >= 4
       size %= 2
       parts2 << "(#{parts[-2-size]}#{parts[-1-size]})+"
     end
     parts2.concat(parts[-size, size]) if size > 0
     Regexp.new(parts2.join(""))
   end

end

if $0 == __FILE__
   ['12345', 'ABCDE', 'john.smith@my.super.server.rb',
    '123-45-678-90', 'item:', 'Peter, SZINEK',
    'http://www.google.com', 'jd£;d:L348kddd3'].each { |s|
     puts "#{s} => #{s.generalize_to_regexp.inspect}"
   }
end


The part that finds the repeating patterns is a bit ugly, but it should
work. Here are the results:

$ ruby generalize_rx.rb
12345 => /\d+/
ABCDE => /[A-Z]+/
john.smith@my.super.server.rb => /[a-z]+\.[a-z]+@([a-z]+\.)+[a-z]+/
123-45-678-90 => /(\d+\-)+\d+/
item: => /[a-z]+:/
Peter, SZINEK => /\w+,\ [A-Z]+/
http://www.google.com => /[a-z]+:\/\/([a-z]+\.)+[a-z]+/
jd£;d:L348kddd3 => /[a-z]+\243;[a-z]:\w+/


Dominik
35594c037eba2fa48f7129d5fded828b?d=identicon&s=25 Peter Szinek (Guest)
on 2006-04-06 17:42
(Received via mailing list)
> The part that finds the repeating patterns is a bit ugly, but it should
> work. Here are the results:
You did not see the java version... Now THAT'S ugly ;-)
>
> $ ruby generalize_rx.rb
> 12345 => /\d+/
> ABCDE => /[A-Z]+/
> john.smith@my.super.server.rb => /[a-z]+\.[a-z]+@([a-z]+\.)+[a-z]+/
> 123-45-678-90 => /(\d+\-)+\d+/
> item: => /[a-z]+:/
> Peter, SZINEK => /\w+,\ [A-Z]+/
> http://www.google.com => /[a-z]+:\/\/([a-z]+\.)+[a-z]+/
> jd£;d:L348kddd3 => /[a-z]+\243;[a-z]:\w+/

Wow! Thanks man! Nice solution. Ruby is awesome.

Cheers,
Peter
703fbc991fd63e0e1db54dca9ea31b53?d=identicon&s=25 Robert Dober (Guest)
on 2006-04-06 19:18
(Received via mailing list)
On 4/6/06, Dominik Bathon <dbatml@gmx.de> wrote:
> john.smith@my.super.server.rb => /[a-z]+\.[a-z]+@([a-z]+\.)+[a-z]+/
> 123-45-678-90 => /(\d+\-)+\d+/
> item: => /[a-z]+:/
> Peter, SZINEK => /\w+,\ [A-Z]+/
> http://www.google.com => /[a-z]+:\/\/([a-z]+\.)+[a-z]+/
> jd£;d:L348kddd3 => /[a-z]+\243;[a-z]:\w+/


Well that is quite impressive, but I could not figure out why
"12a".you_know_the_name_of_the_method gives /\w+/, it clearly should
give
/\d+[a-z]+/, no?

Cheers
Robert

Dominik
>
>


--
Deux choses sont infinies : l'univers et la bêtise humaine ; en ce qui
concerne l'univers, je n'en ai pas acquis la certitude absolue.

- Albert Einstein
18ca239ffade6df0b839d26062f173fb?d=identicon&s=25 Dominik Bathon (Guest)
on 2006-04-06 22:38
(Received via mailing list)
On Thu, 06 Apr 2006 19:16:44 +0200, Robert Dober
<robert.dober@gmail.com>
wrote:

>> ABCDE => /[A-Z]+/
> /\d+[a-z]+/, no?
split(/(\W+)/) splits the string into alphanumerical and
non-alphanumerical parts:

irb(main):001:0> "12a".split(/(\W+)/)
=> ["12a"]
irb(main):002:0> ".12a.iu-23..4234".split(/(\W+)/)
=> ["", ".", "12a", ".", "iu", "-", "23", "..", "4234"]

Then my implementation escapes the non-alphanumeric parts and tries to
find a pattern for the alphanumeric parts (from PATTERNS_TO_GENERALIZE).
The first pattern that matches "12a" is /\w+/ so it gets choosen.

If you want /\d+[a-z]+/ you can either add more patterns to
PATTERNS_TO_GENERALIZE, or try to further split the alphanumeric parts.

But what would you do with "1b23xy3sa"? Should that be /\w+/ or
/\d[a-z]\d+[a-z]+\d[a-z]+/?


Dominik
703fbc991fd63e0e1db54dca9ea31b53?d=identicon&s=25 Robert Dober (Guest)
on 2006-04-07 15:38
(Received via mailing list)
On 4/6/06, Dominik Bathon <dbatml@gmx.de> wrote:
>
> But what would you do with "1b23xy3sa"? Should that be /\w+/ or
> /\d[a-z]\d+[a-z]+\d[a-z]+/?
>
>
> Dominik


That is exactly the question one has to ask himself.
Given the original post it can be generalized to /.*/.
To generalize one string does not make any sense, to generalize sets of
sets
of strings each set defining an entity, that makes sense...
... and is increadibly difficult.

Cheers
Robert




--
Deux choses sont infinies : l'univers et la bêtise humaine ; en ce qui
concerne l'univers, je n'en ai pas acquis la certitude absolue.

- Albert Einstein
This topic is locked and can not be replied to.