Forum: Ruby Optimization anyone

hsanson (Guest)
on 2005-11-22 11:40
(Received via mailing list)
I have this little script that takes a list of keyword sets. Each set
has only two keywords, and for each set the script creates a regular
expression like this:

Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")

Then I match it against a string that contains a long text fetched from
a web page.

Here is a more complete version in pseudo-code:

#########################################
long_text = get_web_page(url)

keyword_hash = load_keyword_array_from_database

keyword_hash.each_pair { |id, value|

    key1 = value[0]
    key2 = value[1]

    r = Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")
    return id if long_text =~ r
}

return -1
###########################################


Now this code works perfectly. The problem is that the keyword_hash has
more than 300 elements, and running this code can take between 50 and
120 seconds. Since I am processing more than 1000 pages with this code,
it takes forever.


I solved this problem by replacing the regular expression match with:

   r1 =  Regexp.new("#{key1}\.*#{key2}")
   r2 =  Regexp.new("#{key2}\.*#{key1}")

   return id if long_text =~ r1 or long_text =~ r2


I simply moved the "or" outside the regular expression, and the time
dropped from 50~120 seconds to 0.40 seconds per page.


Using the Benchmark class and running some tests, I got:

                user       system        total         real
normal:    27.688000     0.015000    27.703000  ( 27.765000)
fast:       0.469000     0.000000     0.484000  (  0.954000)
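
A comparison along these lines can be sketched with Ruby's standard Benchmark module. This is only a rough sketch: the sample text, the keywords "foo" and "bar", and the iteration count are made-up stand-ins for the real page and database data, and the timings will depend heavily on the Ruby version and regexp engine in use.

```ruby
require "benchmark"

# Hypothetical stand-ins for the fetched page and one keyword pair.
long_text = ("lorem ipsum dolor sit amet\n" * 2_000) + "foo and then bar\n"
key1 = Regexp.escape("bar")
key2 = Regexp.escape("foo")

# "normal": a single regexp with the alternation inside it.
combined = Regexp.new("#{key1}.*#{key2}|#{key2}.*#{key1}")

# "fast": two regexps, with the "or" done in Ruby.
r1 = Regexp.new("#{key1}.*#{key2}")
r2 = Regexp.new("#{key2}.*#{key1}")

n = 200 # arbitrary repeat count, chosen only to get measurable times
Benchmark.bm(8) do |bm|
  bm.report("normal:") { n.times { long_text =~ combined } }
  bm.report("fast:")   { n.times { long_text =~ r1 or long_text =~ r2 } }
end
```

Both variants should agree on whether the text matches; only the time taken is expected to differ.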


The speed difference is huge.

Is this expected when using regular expressions??


regards,
Horacio
decoux (Guest)
on 2005-11-22 11:52
(Received via mailing list)
>>>>> "H" == Horacio Sanson <hsanson@moegi.waseda.jp> writes:

H> Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")

 vs

H>    r1 =  Regexp.new("#{key1}\.*#{key2}")
H>    r2 =  Regexp.new("#{key2}\.*#{key1}")

H> Is this expected when using regular expressions??

 Yes, Ruby has some optimizations. For example, with the regexp
 /abc.*def/:

svg% ruby -rjj -e '/abc.*def/.dump'
Regexp /abc.*def/
  0	exactn "abc" (3)
  1	anychar_repeat
  2	exactn "def" (3)
  3	end
must : abc
optimize : exactn
svg%

 It calls the regexp engine (which is slow) only when it has found the
 substring "abc" in the string.
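
 The effect of that "must" check can be pictured in plain Ruby. This is only an illustration of the idea, not how the engine implements it internally; the helper name match_with_prefilter is hypothetical.

```ruby
# Rough emulation of the "must" optimization: run the (slow) regexp
# only when the required literal substring is present in the text.
def match_with_prefilter(text, must, regexp)
  return nil unless text.include?(must) # cheap substring search first
  text =~ regexp                        # full regexp match only if needed
end

match_with_prefilter("xx abc yy def", "abc", /abc.*def/)
match_with_prefilter("no keywords here", "abc", /abc.*def/)
```

 The second call never reaches the regexp at all, which is where the time is saved on long texts that do not contain the keyword.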

 Now if you use /abc.*def|def.*abc/, you break this optimization:


svg% ruby -rjj -e '/abc.*def|def.*abc/.dump'
Regexp /abc.*def|def.*abc/
  0	on_failure_jump		==>   5
  1	exactn "abc" (3)
  2	anychar_repeat
  3	exactn "def" (3)
  4	jump			==>   8
  5	exactn "def" (3)
  6	anychar_repeat
  7	exactn "abc" (3)
  8	end
svg%


 It must call the stupid (:-)) regexp engine for each line.


Guy Decoux
bob.news (Guest)
on 2005-11-22 12:20
(Received via mailing list)
Horacio Sanson wrote:
> Is this expected when using regular expressions??

One obvious optimization is to create all regexps during
load_keyword_array_from_database() and not during iteration of the hash.
That way you only have to do it once and can reuse those regexps for
all the pages you check.
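
A minimal sketch of that idea, assuming the keywords are plain strings (hence the Regexp.escape calls). The sample data and the find_keyword_id helper are hypothetical; in the original script the hash comes from load_keyword_array_from_database.

```ruby
# Hypothetical keyword data standing in for the database contents.
keyword_hash = {
  1 => ["ruby", "regexp"],
  2 => ["foo", "bar"],
}

# Build both regexps for every keyword set once, up front.
compiled = keyword_hash.map do |id, (key1, key2)|
  k1 = Regexp.escape(key1)
  k2 = Regexp.escape(key2)
  [id, Regexp.new("#{k1}.*#{k2}"), Regexp.new("#{k2}.*#{k1}")]
end

# Return the id of the first keyword set that matches, or -1 if none do.
def find_keyword_id(long_text, compiled)
  compiled.each do |id, r1, r2|
    return id if long_text =~ r1 || long_text =~ r2
  end
  -1
end

# The same precompiled list is reused for every page.
pages = ["a regexp question about ruby", "nothing relevant here"]
pages.each { |page| puts find_keyword_id(page, compiled) }
```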

Another possible optimization is to take your approach of splitting the
regexps a bit further and create two regexps, one for each keyword, and
return the id if both match.  This only works correctly if (i) keywords
don't overlap or (ii) you can use \b to ensure matching on word
boundaries.
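
A sketch of variant (ii), assuming the keywords are plain words so that \b makes sense around them; the both_keywords? helper is a hypothetical name. Note that, unlike the `.*` patterns, this does not require both keywords to appear on the same line, since `.` does not match a newline by default.

```ruby
# True when both keywords occur somewhere in the text as whole words.
def both_keywords?(long_text, key1, key2)
  r1 = /\b#{Regexp.escape(key1)}\b/
  r2 = /\b#{Regexp.escape(key2)}\b/
  long_text.match?(r1) && long_text.match?(r2)
end

both_keywords?("foo and then bar", "foo", "bar")  # both present as words
both_keywords?("foobar", "foo", "bar")            # no word boundaries between them
```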

Kind regards

    robert
christian.leskowsky (Guest)
on 2005-11-22 15:17
(Received via mailing list)
There's a whole section in Mastering Regular Expressions that goes into
the differences between regexp engines.

Summary: it makes a big difference how you set up your patterns!



--

'There was an owl lived in an oak.
The more he heard, the less he spoke.
The less he spoke, the more he heard.'

Christian Leskowsky
christian.leskowsky@gmail.com
chneukirchen (Guest)
on 2005-11-22 17:34
(Received via mailing list)
ts <decoux@moulon.inra.fr> writes:

>
>  yes, ruby has some optimizations. For example with the regexp /abc.*def/

Do you know how oniguruma does that, per chance?
decoux (Guest)
on 2005-11-22 17:42
(Received via mailing list)
>>>>> "C" == Christian Neukirchen <chneukirchen@gmail.com> writes:

C> Do you know how oniguruma does that, per chance?

 You can compile Oniguruma with debugging options:

moulon% ./ruby -e '/abc.*def/'
<list:814d418>
   <string:814d3e8>abc
   <qualifier:814d478>{0,-1}
      <anychar:814d448>
   <string:814d4d8>def
optimize: EXACT_BM
  anchor: []
  sub anchor: []

exact: [abc]: length: 3
code length: 11
[exact3:abc] [anychar*-peek-next:d] [exact3:def] [end]
moulon%


Guy Decoux