Optimizing for speed - Array#each


#1

I ran the profiler on my program and it said that 51% of the time is
spent
in Array#each. Is there any way to speed this up (like using
Array#each_index perhaps)?

And is there a quicker way to do a .gsub if I’m using just strings and
not a
regexp?

By the way, the program compares 2 databases and determines which
records
have been inserted and deleted, and each database has about 15,000
records
to compare.

Mike S.


#2

Mike S. wrote:

I ran the profiler on my program and it said that 51% of the time is
spent
in Array#each. Is there any way to speed this up (like using
Array#each_index perhaps)?

And is there a quicker way to do a .gsub if I’m using just strings and
not a
regexp?

By the way, the program compares 2 databases and determines which
records
have been inserted and deleted, and each database has about 15,000
records
to compare.

Mike S.

No. I don’t think you can speed up Array#each at all. What’s far more
promising is improving your algorithm in a way that it has to run less
loops. You most likely have one or even several O(n^x) algorithms with x

= 2 in your app. Maybe you’re not even aware. E.g. if you have an each
with a select in the loop you’re already in O(n^2) realm.
The database comparison probably can be handled best with hashtables of
the primary keys.

Regards
Stefan


#3

On 5/23/07, Stefan R. removed_email_address@domain.invalid wrote:

No. I don’t think you can speed up Array#each at all. What’s far more
promising is improving your algorithm in a way that it has to run less
loops. You most likely have one or even several O(n^x) algorithms with x

= 2 in your app. Maybe you’re not even aware. E.g. if you have an each
with a select in the loop you’re already in O(n^2) realm.
The database comparison probably can be handled best with hashtables of
the primary keys.

I was going to suggest something even more primitive – query each
database table in some sorted fashion and dump it as a CSV file. Use
diff to compare CSV files. Obviously this won’t work for tables with
blobs, but it might be a fun experiment.

Failing that, how can the database do the work for you? Let it find
the records you’re interested in (it will do it faster). Put
modification timestamps in your tables to reduce the number of records
you need to search. Or use triggers to log all transactions to a
separate table so you can run those same transactions on the other
database.

Just some thoughts.

Blessings,
TwP


#4

One of the things I’m doing is a lot of File.join’s, and I have to do a
gsub
( /// , “\” ) after each one (because I’m running under Windows).

I can’t use sets because I have duplicate records. There is no unique
record
key, so for each record (which I use as a key in a hash), I have an
array
listing the files it appears in (which can include the same file more
than
once).

Thanks for the help!

Mike S.


#5

On 24.05.2007 00:15, Stefan R. wrote:

Mike S. wrote:

I ran the profiler on my program and it said that 51% of the time is
spent
in Array#each. Is there any way to speed this up (like using
Array#each_index perhaps)?

Note that the profiler output is often irritating because time for each
also includes all the time spend in the block - and while #each is
invoked only once the block is invoked n times.

And is there a quicker way to do a .gsub if I’m using just strings and
not a regexp?

Quicker than what? Are you using the block form of gsub? Are you using
gsub! or gsub? Can you show some code?

By the way, the program compares 2 databases and determines which
records
have been inserted and deleted, and each database has about 15,000
records
to compare.

15,000 does not sound like much. What is the runtime of the program
without profile?

And another idea: could be that you are using inefficient methods to
test for existence. Finding something in an Array is always O(n) unless
you have additional constraints; i.e. array content is sorted in which
case you can use binary search which is O(log(n)). It seems you want to
be doing set operations - so why don’t you try to use Set for which
member tests are far more efficient (O(1) because it’s using hashing
internally).

Kind regards

robert


#6

Mike S. wrote:

One of the things I’m doing is a lot of File.join’s, and I have to do a
gsub
( /// , “\” ) after each one (because I’m running under Windows).

Are you using the strings generated by File.join outside of ruby?

If not you can simply use them without gsub.

Stefan


#7

On May 23, 2007, at 14:59 , Mike S. wrote:

I ran the profiler on my program and it said that 51% of the time
is spent
in Array#each. Is there any way to speed this up (like using
Array#each_index perhaps)?

The time spent on Array#each itself is minimal and can’t really be
sped up much. What you’re really seeing is the time spent in the
block’s code. If you have a bunch of each calls all over, you can
refactor each one to call out to a method:

ary.each { … } => def x; …; end; ary.each { x }

That way you can get better accounting on which block is taking up
the most time.

And is there a quicker way to do a .gsub if I’m using just strings
and not a
regexp?

Not by much. I will say (based on a subsequent email of yours on this
thread) that you shouldn’t have to be doing these gsubs just because
you’re on windows. IIRC, ruby should open files with either form of
path separator.

By the way, the program compares 2 databases and determines which
records
have been inserted and deleted, and each database has about 15,000
records
to compare.

15k records to compare isn’t that much, honestly.

One thing I would suggest is for you to use an alternative profiler.
shugo’s prof or my zenprofile gives much more accurate profiles where
lots and lots of method calls are involved. The beauty of the default
profiler is how small and easy it is to write. The pain is that it is
biased towards # of calls, not time spent.


#8

2007/5/24, Mike S. removed_email_address@domain.invalid:

One of the things I’m doing is a lot of File.join’s, and I have to do a gsub
( /// , “\” ) after each one (because I’m running under Windows).

I can’t use sets because I have duplicate records. There is no unique record
key, so for each record (which I use as a key in a hash), I have an array
listing the files it appears in (which can include the same file more than
once).

Can you be a bit more specific on what exactly you are doing?

robert


#9

On Thu, 24 May 2007, Mike S. wrote:

I ran the profiler on my program and it said that 51% of the time is spent
in Array#each. Is there any way to speed this up (like using
Array#each_index perhaps)?

Mostly what the profiler is telling you is “don’t do that”.

Which sometimes isn’t very helpful.

The thing to do rather than ask “is there something faster than
Array#each?” rather ask is there a way I can rewrite my algorithm so I
don’t called Array#each so much?

I tend to not look at those entries in the profiler output and skip
downwards until I find the first method I wrote and optimize that.

And is there a quicker way to do a .gsub if I’m using just strings and not a
regexp?

I don’t know exactly what you’re doing but String []= does a lot more
than I would naively think.

------------------------------------------------------------- String#[]=
str[fixnum] = fixnum
str[fixnum] = new_str
str[fixnum, fixnum] = new_str
str[range] = aString
str[regexp] = new_str
str[regexp, fixnum] = new_str
str[other_str] = new_str

  Element Assignment---Replaces some or all of the content of _str_.
  The portion of the string affected is determined using the same
  criteria as +String#[]+. If the replacement string is not the same
  length as the text it is replacing, the string will be adjusted
  accordingly. If the regular expression or string is used as the
  index doesn't match a position in the string, +IndexError+ is
  raised. If the regular expression form is used, the optional 

second
+Fixnum+ allows you to specify which portion of the match to
replace (effectively using the +MatchData+ indexing rules. The
forms that take a +Fixnum+ will raise an +IndexError+ if the value
is out of range; the +Range+ form will raise a +RangeError+, and
the +Regexp+ and +String+ forms will silently ignore the
assignment.

By the way, the program compares 2 databases and determines which records
have been inserted and deleted, and each database has about 15,000 records
to compare.

Have a look at
http://rubygarden.org:3000/Ruby/page/show/RubyOptimization
for more info.

If you are still stuck, and it doesn’t reveal any confidential
details, send the top 50 lines profiler output to the group and ask
for suggestions.

John C. Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : removed_email_address@domain.invalid
New Zealand