Comparing two arrays, too slow

I don't have access to the database, but every day I get a CSV dump of
all live customer listings. There are around 15,000 live listings.
A record contains 3 fields: id, name, description.
Every day some new accounts are created while others are deleted.
Each day I want to compare yesterday's dump with today's dump and print
the full record of each terminated record and each new record.
I am achieving this with the script below, but the two lookup loops at
the end make it run very slowly.

Is there a more elegant way to compare two
arrays/hashes/sets/FasterCSV tables while carrying a few fields along
for the ride?

WORKING CODE:
require 'rubygems'
require 'fastercsv'

sundays_dump = FasterCSV.read("./sunday.csv")
mondays_dump = FasterCSV.read("./monday.csv")

sundays_ids = sundays_dump.collect {|row| row[1]}
mondays_ids = mondays_dump.collect {|row| row[1]}

newaccount_ids = mondays_ids - sundays_ids
terminated_ids = sundays_ids - mondays_ids

sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if terminated_ids.include? row[1]}
mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if newaccount_ids.include? row[1]}

TEST DATA:
sunday.csv
id,name,otherstuff
1,larry,bald
2,curly,tall
3,moe,meanie

monday.csv
id,name,otherstuff
1,larry,bald
4,shemp,greasy

OUTPUT:
delete,2,curly
delete,3,moe
create,4,shemp

On Sat, 7 Aug 2010, joe chesak wrote:

sundays_ids = sundays_dump.collect {|row| row[1]}
mondays_ids = mondays_dump.collect {|row| row[1]}

newaccount_ids = mondays_ids - sundays_ids
terminated_ids = sundays_ids - mondays_ids

sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if terminated_ids.include? row[1]}
mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if newaccount_ids.include? row[1]}

Rather than go through both sets of data, I’d do something like this:

terminated_ids.each do |row|
  puts "delete,#{sundays_dump[row][0]},#{sundays_dump[row][1]}"
end

Same thing for the new account (substituting monday, of course)

This way you’re only outputting the changed records.

Matt

On 8/6/10, joe chesak [email protected] wrote:

Here's an idea... although Matthew's suggestion is really the better
way to go (though I think his code won't quite work... you'd have to
save away array indexes somewhere to make that approach effective).

Anyway, it should be much faster if you use Sets instead of arrays for
terminated_ids and newaccount_ids. Like this:

require 'rubygems'
require 'fastercsv'
require 'set'

sundays_dump = FasterCSV.read("./sunday.csv")
mondays_dump = FasterCSV.read("./monday.csv")

sundays_ids = sundays_dump.collect {|row| row[1]}
mondays_ids = mondays_dump.collect {|row| row[1]}

newaccount_ids = Set[*mondays_ids - sundays_ids]
terminated_ids = Set[*sundays_ids - mondays_ids]

sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if terminated_ids.include? row[1]}
mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if newaccount_ids.include? row[1]}

Turning sundays_ids and mondays_ids into Sets might make it a tad
faster yet, but I think the above code has already achieved the lion's
share of the potential speedup.

Another option: sort both files by id, then walk through them together.
If the current id on both list A and list B is the same, skip forward on
both. If the current id on list A is less than the current id on list B,
then that id exists only in A (so report this, and skip forward on A).
And vice versa.

The advantage of this is that it works with huge files - you can read
them one line at a time instead of reading them all into RAM. And there
are tools which can do an external sort efficiently, if they’re not
already sorted that way.
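
As a minimal sketch of that walk (in-memory rows are used here for
illustration; with large files you would read the sorted CSVs line by
line instead, and `diff_sorted` is just a hypothetical name):

```ruby
# Walk two lists of rows, each sorted by id (row[0]), in a single pass,
# reporting records that appear in only one list. Ids are compared as
# strings, which matches a plain lexical sort of the files.
def diff_sorted(old_rows, new_rows)
  changes = []
  i = j = 0
  while i < old_rows.size && j < new_rows.size
    old_id, new_id = old_rows[i][0], new_rows[j][0]
    if old_id == new_id        # in both dumps: unchanged, advance both
      i += 1
      j += 1
    elsif old_id < new_id      # only in the old dump: terminated
      changes << (['delete'] + old_rows[i]).join(',')
      i += 1
    else                       # only in the new dump: new account
      changes << (['create'] + new_rows[j]).join(',')
      j += 1
    end
  end
  # Whatever remains in either list is unmatched by definition.
  old_rows[i..-1].each { |row| changes << (['delete'] + row).join(',') }
  new_rows[j..-1].each { |row| changes << (['create'] + row).join(',') }
  changes
end

sunday = [%w[1 larry bald], %w[2 curly tall], %w[3 moe meanie]]
monday = [%w[1 larry bald], %w[4 shemp greasy]]
puts diff_sorted(sunday, monday)
```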

To avoid coding this, just use the unix shell command ‘join’ to do the
work for you (which you can invoke from IO.popen if you like). Just
beware that it’s very fussy about having the files correctly sorted
first.

$ join -t, -j1 -v1 sunday.csv monday.csv
2,curly,tall
3,moe,meanie
$ join -t, -j1 -v2 sunday.csv monday.csv
4,shemp,greasy
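
For completeness, a sketch of driving join from Ruby via IO.popen, as
mentioned above. `external_diff` is a hypothetical helper name, and the
sample data written here leaves out the header line, which join would
otherwise treat as just another record:

```ruby
require 'tempfile'

# Hypothetical wrapper around the two `join` commands shown above.
# Both files must already be sorted on the id column; -v1/-v2 print the
# lines from file 1 / file 2 that have no match in the other file.
def external_diff(old_path, new_path)
  deleted = IO.popen(['join', '-t,', '-j1', '-v1', old_path, new_path], &:readlines)
  created = IO.popen(['join', '-t,', '-j1', '-v2', old_path, new_path], &:readlines)
  deleted.map { |l| "delete,#{l.chomp}" } + created.map { |l| "create,#{l.chomp}" }
end

# Sample data written to temp files (no header rows, see above).
sunday = Tempfile.new('sunday')
sunday.write("1,larry,bald\n2,curly,tall\n3,moe,meanie\n")
sunday.close
monday = Tempfile.new('monday')
monday.write("1,larry,bald\n4,shemp,greasy\n")
monday.close

puts external_diff(sunday.path, monday.path)
```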

Guys! Thanks for the great responses and clever perspectives on this
problem. I tried out all of them and got 3 of 4 working on my real
dataset.

Caleb: I had actually tried to apply Set to this problem and failed,
because I thought I should be able to point it at one column of a
multi-column CSV file... thinking of a .csv file almost as a
full-fledged database table. Your minor tweak on my code shows just how
easy it is to choose between Array and Set.

Brian: That’s a raw and fast solution. I will keep this in mind if my
datasets get too large.

Kirk: Thanks for coding that all the way through. Great idea to build
the id list and the data hash in one sweep.

Joe Chesak

On Fri, Aug 6, 2010 at 9:51 AM, joe chesak [email protected] wrote:

sundays_ids = sundays_dump.collect {|row| row[1]}
mondays_ids = mondays_dump.collect {|row| row[1]}

newaccount_ids = mondays_ids - sundays_ids
terminated_ids = sundays_ids - mondays_ids

sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if terminated_ids.include? row[1]}
mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if newaccount_ids.include? row[1]}

The answer to "it's too slow" is often "don't do that, then."

If comparing arrays is too slow, don't compare arrays. Instead, do
something that's fast. Looking up data in a Hash is fast. And since
your data appears to be keyed by unique ids (though your variable names
suggest you are collecting ids, while the array lookups actually grab
the names, which is confusing and might be a bug), it is easy to stuff
that data into a hash for later lookup.


require 'rubygems'
require 'fastercsv'

def extract_data(dumpfile)
  dump = FasterCSV.read(dumpfile)
  ids = []
  data = {}
  dump.each do |row|
    ids << row[0]
    data[row[0]] = row
  end
  [ids, data]
end

sunday_ids, sunday_data = extract_data("./sunday.csv")
monday_ids, monday_data = extract_data("./monday.csv")

newaccount_ids = monday_ids - sunday_ids
terminated_ids = sunday_ids - monday_ids

terminated_ids.each do |id|
  row = sunday_data[id]
  puts "delete,#{id},#{row[1]}"
end

newaccount_ids.each do |id|
  row = monday_data[id]
  puts "create,#{id},#{row[1]}"
end

If this were my script, I'd further clean it up so that all of the
logic lives in a simple library. Then either create a generic
executable that takes the file names as arguments, or write scripts
with hard-coded file names that just call into the library with those
names. It'd be much cleaner, unless you are absolutely sure this code
will never be more than a one-off hack.

Kirk H.
