Efficient Ruby equivalent to Excel's "SUMIF"?

I’m getting some relatively slow performance when doing calculations
against data, and I’m wondering whether there’s a more efficient way to
do this.

I’m taking a multidimensional array and then totalling a “column” based
on multiple criteria. It looks like a much longer version of this:

data = [
[ ‘Ref’, ‘Count’, ‘Locn’ ],
[ ‘Part1’, ‘3’, ‘Here’ ],
[ ‘Part2’, ‘1’, ‘Here’ ],
[ ‘Part1’, ‘2’, ‘There’ ],
]

The way I’m doing this at present is to take the main key: “Ref”
and unique it, then loop through all those values and select chunks of
the data which match. In order to accelerate the process I’m also
diminishing the source array each time so by the time I get to the end
the process is much faster.

Something like this:


#Make a 2D array of unique refs (+header) to store the output
data = [
[ ‘Ref’, ‘Count’, ‘Locn’ ],
[ ‘Part1’, ‘3’, ‘Here’ ],
[ ‘Part2’, ‘1’, ‘Here’ ],
[ ‘Part1’, ‘2’, ‘There’ ],
]

#Set up a 2D array to hold the output, with 1 of each Ref + a header
output = [ data.transpose[0].uniq ].transpose

output.map.with_index { |ar,idx|
if idx == 0 #Headers
[ ar[0], ‘Total’, ‘Here’, ‘There’ ]
else
#Get the current set to look at
current_data = data.select { |a| a[0] =~ /#{ ar[0] }/i }
#Remove it from the main pool
data -= current_data
#Populate the current row with totals
b = [
ar[0],
nil,
current_data.select { |a| a[2] == ‘Here’ }.inject(0) { |sum,a| sum

  • a[1].to_i },
    current_data.select { |a| a[2] == ‘There’ }.inject(0) { |sum,a|
    sum + a[1].to_i }
    ]
    #Add a total
    b[1] = b[2] + b[3]
    #This one’s done
    b
    end
    }

I’m doing this with around 9000 rows of data using over 2000 unique
references. The way I’m doing it at the moment seems quite slow. Is
there a faster way to do this?

Thanks for reading this.

I think part of the problem is you’re using arrays quite heavily. You
can
do it in one pass if you pick a Hash that you later convert to a more
appropriate structure, or simply define your data as a class and
instantiate relevant objects, etc.

Anyway, here’s an example:

data = [
[ ‘Ref’, ‘Count’, ‘Locn’ ],
[ ‘Part1’, ‘3’, ‘Here’ ],
[ ‘Part2’, ‘1’, ‘Here’ ],
[ ‘Part1’, ‘2’, ‘There’ ],
]

desired_output = [
[“Ref”, “Total”, “Here”, “There”],
[“Part1”, 5, 3, 2],
[“Part2”, 1, 1, 0]
]

bump off the first element

headers = data.shift

references = {}

data.each do |reference, count, location|

if the references hash hasn’t “seen” this reference yet,

create a hash to keep track of totals and locations.

references[reference] ||= { :total => 0, :locations => {} }
r = references[reference]

contribute to the total for this reference.

r[:total] += count.to_i

if this location hasn’t been seen for this reference yet,

start it off at zero. could also use Hash.new(0) in the

above definition of references[reference]

r[:locations][location] ||= 0

now contribute to the location’s count

r[:locations][location] += count.to_i
end

references now looks something like:

{

“Part1” => { :total => 5,

:locations => { “Here” => 3, “There” => 2 } },

“Part2” => { :total => 1,

:locations => { “Here” => 1 } }

}

This corresponds to desired_output, just in a different way.

You can convert this to the original desired array if you wish, by

iterating over references, or you can pick a more appropriate Hash

structure and just use that.

Thanks Adam, that’s a great way of looking at this!
I’ve been using arrays quite a lot in Ruby since they’re so much easier
to use than in VBA. When your only tool is a hammer…
I knew the logical answer was some sort of incremental system, but I
hadn’t been able to visualise a way to implement it. Hashes look like a
great solution.

Nice thread it is.

i made you this:

selects = [“Here”, “There”]

data = [
[ ‘Ref’, ‘Count’, ‘Locn’ ],
[ ‘Part1’, ‘3’, ‘Here’ ],
[ ‘Part2’, ‘1’, ‘Here’ ],
[ ‘Part1’, ‘2’, ‘There’ ],
]

data[1…-1].group_by(&:first).map {|k,d|
[k,d.inject(0){|s,v|s+v[1].to_i}] + selects.map {|s|
d.select{|v|v[2]==s}.inject(0){|s,v|s+v[1].to_i} }
}

#=> [[“Part1”, 5, 3, 2], [“Part2”, 1, 1, 0]]

I incorporated a mixture of these suggestions into my own data
construct, and it completes the task in a couple of seconds now rather
than in minutes. I’ll have to see if there’s anywhere else I could use
this kind of approach. Thanks for the help :slight_smile:

Ooh, pretty. Nice one-liner Hans. I can just about follow it.
I hadn’t seen “group_by” before. Looks like hashes are definitely the
right direction to be working towards for this.