Forum: Ruby Indexed arrays, delete_if, and performance

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
6ad1811a45f7d53b3f3b6f26275037eb?d=identicon&s=25 Jason Leong (javakins)
on 2008-11-01 07:12
Dear all,

My application uses a calendar which I'm populating using an array of
indexed RollingEvent objects. A RollingEvent is one of a series,
multiplied from its original Event's repeat options (daily, weekly,
fortnightly etc). The idea is that when a user saves an Event, the app
uses a loop to generate and store all the RollingEvents for the coming
year.

So a new Event is added, add_rolling_event is called within the loop
(I've omitted most of the params for readability):

 def add_rolling_event(id, title, calendar_date)
    @events[calendar_date.to_date] ||= []
    @events[calendar_date.to_date] << RollingEvent.new(id, title,
calendar_date)
 end

The indexing works well, as in the calendar view I can quickly pull out
the RollingEvent objects for that specific day. For the moment, whenever
an Event is edited the entire @events array is reset and a routine loops
through all Events to re-generate RollingEvents.

I'm trying to speed up the process by just deleting the RollingEvents
pertinent to the Event that's being edited, then calling
add_rolling_event for the Event in question.

I have 3 questions:

1. Is using an indexed array actually faster than picking out events on
the day using @events.select { |e| e.calendar_date == date }? It would
be good to uncomplicate this if possible.

2. It seems that if I were to use delete_if on an indexed array, I'd
have to loop through the array of RollingEvent objects for each indexed
day. Am I right in assuming that this is uber-inefficient? Is there a
better way to do this?

3. Currently @events (in a model called Event_Roster) and RollingEvents
are all non-database-backed for performance, and stored in a memcached
session. My concern has always been that - hypothetically - one could
have 10 daily Events, resulting in 3650 RollingEvents. This doesn't seem
to be a scalable solution, and the read/writes to and from the database
would be costly. Would you advise writing all RollingEvents to a
database, and is the performance hit negligable?

Thanks!
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2008-11-01 11:36
(Received via mailing list)
On 01.11.2008 07:11, Jason Leong wrote:
> (I've omitted most of the params for readability):
>
>  def add_rolling_event(id, title, calendar_date)
>     @events[calendar_date.to_date] ||= []
>     @events[calendar_date.to_date] << RollingEvent.new(id, title,
> calendar_date)
>  end

You can make that a tad more efficient by doing

def add_rolling_event(id, title, calendar_date)
   (@events[calendar_date.to_date] ||= []) <<
     RollingEvent.new(id, title, calendar_date)
end

>
> 1. Is using an indexed array actually faster than picking out events on
> the day using @events.select { |e| e.calendar_date == date }? It would
> be good to uncomplicate this if possible.

What do you mean by "indexed array"?  I am not sure what your #to_date
returns so you might be using a Hash or an Array.  Ultimately when it
comes to "faster" you need to implement different variants and benchmark
them.  Fortunately this is pretty easily done in Ruby.

> 2. It seems that if I were to use delete_if on an indexed array, I'd
> have to loop through the array of RollingEvent objects for each indexed
> day. Am I right in assuming that this is uber-inefficient? Is there a
> better way to do this?

Well, it seems you have to access paths to an event: lookup by date and
lookup by event id.  I do not know whether there are other access paths
(e.g. search for event name or such) but assuming for the moment that
there are just the two a solution with two Hashes seems most appropriate

class Calendar
   def initialize
     @ev_by_date = Hash.new {|h,k| h[k] = []}
     @ev_by_id = Hash.new {|h,k| h[k] = []}
   end

   def add_rolling_event(id, title, calendar_date)
     ev = RollingEvent.new(id, title, calendar_date)
     @ev_by_date[calendar_date] = ev
     @ev_by_id[id] = ev
   end

   def delete_by_id(id)
     @ev_by_id.delete(id).each do |ev|
       @ev_by_date[ev.date].delete(ev)
     end
   end

   def delete_by_date(date)
     @ev_by_date.delete(date).each do |ev|
       @ev_by_id[ev.id].delete(ev)
     end
   end
end

> 3. Currently @events (in a model called Event_Roster) and RollingEvents
> are all non-database-backed for performance, and stored in a memcached
> session. My concern has always been that - hypothetically - one could
> have 10 daily Events, resulting in 3650 RollingEvents. This doesn't seem
> to be a scalable solution, and the read/writes to and from the database
> would be costly. Would you advise writing all RollingEvents to a
> database, and is the performance hit negligable?

There are a lot other factors that you need to consider when thinking
about the database.  3600 events is certainly a small number, even when
read from a flat file.  You rather want a DB if your application needs
to run on multiple machines concurrently and you want to have a single
repository etc.  As long as you can foresee that there will not be
millions of entries I would probably not complicate the application by
adding another component.

Kind regards

  robert
6ad1811a45f7d53b3f3b6f26275037eb?d=identicon&s=25 Jason Leong (javakins)
on 2008-11-01 21:22
> You can make that a tad more efficient by doing
>
> def add_rolling_event(id, title, calendar_date)
>    (@events[calendar_date.to_date] ||= []) <<
>      RollingEvent.new(id, title, calendar_date)
> end
>

Gotcha :-)

>
> What do you mean by "indexed array"?  I am not sure what your #to_date
> returns so you might be using a Hash or an Array.  Ultimately when it
> comes to "faster" you need to implement different variants and benchmark
> them.  Fortunately this is pretty easily done in Ruby.

Sorry, my terminology needs work - #to_date returns a date value so I'm
using a Hash. I'll certainly benchmark the variants.

> Well, it seems you have to access paths to an event: lookup by date and
> lookup by event id.  I do not know whether there are other access paths
> (e.g. search for event name or such) but assuming for the moment that
> there are just the two a solution with two Hashes seems most appropriate
>

That's correct! And by highlighting the access paths you've helped me
narrow it down to one of the challenges I've come across. I have the
event id which I could use to delete an event from the array indexed by
event hash - but this doesn't delete the event indexed by id hash (or
does it? I'll give it a go).


> There are a lot other factors that you need to consider when thinking
> about the database.  3600 events is certainly a small number, even when
> read from a flat file.  You rather want a DB if your application needs
> to run on multiple machines concurrently and you want to have a single
> repository etc.  As long as you can foresee that there will not be
> millions of entries I would probably not complicate the application by
> adding another component.

That's good advice. Thanks Robert!
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2008-11-01 23:10
(Received via mailing list)
On 01.11.2008 21:21, Jason Leong wrote:

>> Well, it seems you have to access paths to an event: lookup by date and
>> lookup by event id.  I do not know whether there are other access paths
>> (e.g. search for event name or such) but assuming for the moment that
>> there are just the two a solution with two Hashes seems most appropriate
>
> That's correct! And by highlighting the access paths you've helped me
> narrow it down to one of the challenges I've come across. I have the
> event id which I could use to delete an event from the array indexed by
> event hash - but this doesn't delete the event indexed by id hash (or
> does it? I'll give it a go).

Please look carefully at the code.  Although untested it should do
exactly this: delete from the other Hash.  Note that the return value of
Hash#delete is the deleted element.  I probably should have added a
safety check to ensure nil does not cause errors.  So you'd probably
rather do

   def delete_by_id(id)
     dlt = @ev_by_id.delete(id) and dlt.each do |ev|
       @ev_by_date[ev.date].delete(ev)
     end
   end


Kind regards

  robert
6ad1811a45f7d53b3f3b6f26275037eb?d=identicon&s=25 Jason Leong (javakins)
on 2008-11-03 01:01
 > Please look carefully at the code.  Although untested it should do
> exactly this: delete from the other Hash.  Note that the return value of
> Hash#delete is the deleted element.  I probably should have added a
> safety check to ensure nil does not cause errors.  So you'd probably
> rather do
>
>    def delete_by_id(id)
>      dlt = @ev_by_id.delete(id) and dlt.each do |ev|
>        @ev_by_date[ev.date].delete(ev)
>      end
>    end

Ah yes! Pounded off a reply before I looked, sorry - thanks for the
safety check too, that certainly came in handy. The one difference in my
final implementation is this:

  def delete_events_by_id(id)
    dlt = @titles.delete(id.to_i) and dlt.each do |e|
      @events[e.date].delete_if { |i| i.id == e.id }
    end
  end

The reason being (I think!) the object passed in for deletion in
@ev_by_date[ev.date].delete(ev) does not match up with the object in
@ev_by_date as they're two different Hashes - but a compare based on the
id works. Does that sound right?

Thanks Robert, you've taught me a bunch about Hashes!
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2008-11-03 09:13
(Received via mailing list)
2008/11/3 Jason Leong <jason@pocketsmith.com>:
>>    end
>
> Ah yes! Pounded off a reply before I looked, sorry - thanks for the
> safety check too, that certainly came in handy.

And it's needed if someone passes in a wrong id or date ("wrong"
meaning, there is no data for it).

> @ev_by_date[ev.date].delete(ev) does not match up with the object in
> @ev_by_date as they're two different Hashes - but a compare based on the
> id works. Does that sound right?

I do not know the rest of your code but in my version the same
instance was put into both Hashes.  If you think about it, this is
what you want, i.e. regardless of whether you look up the event by id
or date you want to have the same instance - otherwise you would have
to manipulate two instances all the time, which makes code more
complex and also wastes memory. And in that case Array#delete is
sufficient:

irb(main):001:0> o = Object.new
=> #<Object:0x7ff9c69c>
irb(main):002:0> a = [o]
=> [#<Object:0x7ff9c69c>]
irb(main):003:0> a.delete o
=> #<Object:0x7ff9c69c>
irb(main):004:0> a
=> []

Even for multiple occurrences:

irb(main):005:0> a = [o,o]
=> [#<Object:0x7ff9c69c>, #<Object:0x7ff9c69c>]
irb(main):006:0> a.delete o
=> #<Object:0x7ff9c69c>
irb(main):007:0> a
=> []

But delete_if is ok of course as well. If you need more efficiency you
can use Set instead of Array as Hash values but then you should use
#delete because this is more efficient for Set (O(1) vs. O(n) for
delete_if).

> Thanks Robert, you've taught me a bunch about Hashes!

You're welcome!

Kind regards

robert
This topic is locked and can not be replied to.