Indexed arrays, delete_if, and performance

leveille · November 1, 2008, 7:12am

Dear all,

My application uses a calendar which I’m populating using an array of
indexed RollingEvent objects. A RollingEvent is one of a series,
multiplied from its original Event’s repeat options (daily, weekly,
fortnightly etc). The idea is that when a user saves an Event, the app
uses a loop to generate and store all the RollingEvents for the coming
year.

So a new Event is added, add_rolling_event is called within the loop
(I’ve omitted most of the params for readability):

def add_rolling_event(id, title, calendar_date)
@events[calendar_date.to_date] ||= []
@events[calendar_date.to_date] << RollingEvent.new(id, title,
calendar_date)
end

The indexing works well, as in the calendar view I can quickly pull out
the RollingEvent objects for that specific day. For the moment, whenever
an Event is edited the entire @events array is reset and a routine loops
through all Events to re-generate RollingEvents.

I’m trying to speed up the process by just deleting the RollingEvents
pertinent to the Event that’s being edited, then calling
add_rolling_event for the Event in question.

I have 3 questions:

Is using an indexed array actually faster than picking out events on
the day using @events.select { |e| e.calendar_date == date }? It would
be good to uncomplicate this if possible.
It seems that if I were to use delete_if on an indexed array, I’d
have to loop through the array of RollingEvent objects for each indexed
day. Am I right in assuming that this is uber-inefficient? Is there a
better way to do this?
Currently @events (in a model called Event_Roster) and RollingEvents
are all non-database-backed for performance, and stored in a memcached
session. My concern has always been that - hypothetically - one could
have 10 daily Events, resulting in 3650 RollingEvents. This doesn’t seem
to be a scalable solution, and the read/writes to and from the database
would be costly. Would you advise writing all RollingEvents to a
database, and is the performance hit negligable?

Thanks!

leveille · November 1, 2008, 11:36am

On 01.11.2008 07:11, Jason L. wrote:

(I’ve omitted most of the params for readability):

def add_rolling_event(id, title, calendar_date)
@events[calendar_date.to_date] ||= []
@events[calendar_date.to_date] << RollingEvent.new(id, title,
calendar_date)
end

You can make that a tad more efficient by doing

def add_rolling_event(id, title, calendar_date)
(@events[calendar_date.to_date] ||= []) <<
RollingEvent.new(id, title, calendar_date)
end

Is using an indexed array actually faster than picking out events on
the day using @events.select { |e| e.calendar_date == date }? It would
be good to uncomplicate this if possible.

What do you mean by “indexed array”? I am not sure what your #to_date
returns so you might be using a Hash or an Array. Ultimately when it
comes to “faster” you need to implement different variants and benchmark
them. Fortunately this is pretty easily done in Ruby.

It seems that if I were to use delete_if on an indexed array, I’d
have to loop through the array of RollingEvent objects for each indexed
day. Am I right in assuming that this is uber-inefficient? Is there a
better way to do this?

Well, it seems you have to access paths to an event: lookup by date and
lookup by event id. I do not know whether there are other access paths
(e.g. search for event name or such) but assuming for the moment that
there are just the two a solution with two Hashes seems most appropriate

class Calendar
def initialize
@ev_by_date = Hash.new {|h,k| h[k] = []}
@ev_by_id = Hash.new {|h,k| h[k] = []}
end

def add_rolling_event(id, title, calendar_date)
ev = RollingEvent.new(id, title, calendar_date)
@ev_by_date[calendar_date] = ev
@ev_by_id[id] = ev
end

def delete_by_id(id)
@ev_by_id.delete(id).each do |ev|
@ev_by_date[ev.date].delete(ev)
end
end

def delete_by_date(date)
@ev_by_date.delete(date).each do |ev|
@ev_by_id[ev.id].delete(ev)
end
end
end

Currently @events (in a model called Event_Roster) and RollingEvents
are all non-database-backed for performance, and stored in a memcached
session. My concern has always been that - hypothetically - one could
have 10 daily Events, resulting in 3650 RollingEvents. This doesn’t seem
to be a scalable solution, and the read/writes to and from the database
would be costly. Would you advise writing all RollingEvents to a
database, and is the performance hit negligable?

There are a lot other factors that you need to consider when thinking
about the database. 3600 events is certainly a small number, even when
read from a flat file. You rather want a DB if your application needs
to run on multiple machines concurrently and you want to have a single
repository etc. As long as you can foresee that there will not be
millions of entries I would probably not complicate the application by
adding another component.

Kind regards

robert

leveille · November 1, 2008, 9:22pm

You can make that a tad more efficient by doing

def add_rolling_event(id, title, calendar_date)
(@events[calendar_date.to_date] ||= []) <<
RollingEvent.new(id, title, calendar_date)
end

Gotcha

What do you mean by “indexed array”? I am not sure what your #to_date
returns so you might be using a Hash or an Array. Ultimately when it
comes to “faster” you need to implement different variants and benchmark
them. Fortunately this is pretty easily done in Ruby.

Sorry, my terminology needs work - #to_date returns a date value so I’m
using a Hash. I’ll certainly benchmark the variants.

Well, it seems you have to access paths to an event: lookup by date and
lookup by event id. I do not know whether there are other access paths
(e.g. search for event name or such) but assuming for the moment that
there are just the two a solution with two Hashes seems most appropriate

That’s correct! And by highlighting the access paths you’ve helped me
narrow it down to one of the challenges I’ve come across. I have the
event id which I could use to delete an event from the array indexed by
event hash - but this doesn’t delete the event indexed by id hash (or
does it? I’ll give it a go).

There are a lot other factors that you need to consider when thinking
about the database. 3600 events is certainly a small number, even when
read from a flat file. You rather want a DB if your application needs
to run on multiple machines concurrently and you want to have a single
repository etc. As long as you can foresee that there will not be
millions of entries I would probably not complicate the application by
adding another component.

That’s good advice. Thanks Robert!

leveille · November 3, 2008, 1:01am

Please look carefully at the code. Although untested it should do
exactly this: delete from the other Hash. Note that the return value of
Hash#delete is the deleted element. I probably should have added a
safety check to ensure nil does not cause errors. So you’d probably
rather do

def delete_by_id(id)
dlt = @ev_by_id.delete(id) and dlt.each do |ev|
@ev_by_date[ev.date].delete(ev)
end
end

Ah yes! Pounded off a reply before I looked, sorry - thanks for the
safety check too, that certainly came in handy. The one difference in my
final implementation is this:

def delete_events_by_id(id)
dlt = @titles.delete(id.to_i) and dlt.each do |e|
@events[e.date].delete_if { |i| i.id == e.id }
end
end

The reason being (I think!) the object passed in for deletion in
@ev_by_date[ev.date].delete(ev) does not match up with the object in
@ev_by_date as they’re two different Hashes - but a compare based on the
id works. Does that sound right?

Thanks Robert, you’ve taught me a bunch about Hashes!

leveille · November 3, 2008, 9:13am

2008/11/3 Jason L. [email protected]:

end

Ah yes! Pounded off a reply before I looked, sorry - thanks for the
safety check too, that certainly came in handy.

And it’s needed if someone passes in a wrong id or date (“wrong”
meaning, there is no data for it).

@ev_by_date[ev.date].delete(ev) does not match up with the object in
@ev_by_date as they’re two different Hashes - but a compare based on the
id works. Does that sound right?

I do not know the rest of your code but in my version the same
instance was put into both Hashes. If you think about it, this is
what you want, i.e. regardless of whether you look up the event by id
or date you want to have the same instance - otherwise you would have
to manipulate two instances all the time, which makes code more
complex and also wastes memory. And in that case Array#delete is
sufficient:

irb(main):001:0> o = Object.new
=> #Object:0x7ff9c69c
irb(main):002:0> a = [o]
=> [#Object:0x7ff9c69c]
irb(main):003:0> a.delete o
=> #Object:0x7ff9c69c
irb(main):004:0> a
=> []

Even for multiple occurrences:

irb(main):005:0> a = [o,o]
=> [#Object:0x7ff9c69c, #Object:0x7ff9c69c]
irb(main):006:0> a.delete o
=> #Object:0x7ff9c69c
irb(main):007:0> a
=> []

But delete_if is ok of course as well. If you need more efficiency you
can use Set instead of Array as Hash values but then you should use
#delete because this is more efficient for Set (O(1) vs. O(n) for
delete_if).

Thanks Robert, you’ve taught me a bunch about Hashes!

You’re welcome!

Kind regards

robert

leveille · November 1, 2008, 11:10pm

On 01.11.2008 21:21, Jason L. wrote:

Well, it seems you have to access paths to an event: lookup by date and
lookup by event id. I do not know whether there are other access paths
(e.g. search for event name or such) but assuming for the moment that
there are just the two a solution with two Hashes seems most appropriate

That’s correct! And by highlighting the access paths you’ve helped me
narrow it down to one of the challenges I’ve come across. I have the
event id which I could use to delete an event from the array indexed by
event hash - but this doesn’t delete the event indexed by id hash (or
does it? I’ll give it a go).

Please look carefully at the code. Although untested it should do
exactly this: delete from the other Hash. Note that the return value of
Hash#delete is the deleted element. I probably should have added a
safety check to ensure nil does not cause errors. So you’d probably
rather do

def delete_by_id(id)
dlt = @ev_by_id.delete(id) and dlt.each do |ev|
@ev_by_date[ev.date].delete(ev)
end
end

Kind regards

robert