Array comparison

cpeterson · October 30, 2008, 3:24am

I can easily write a program to compare the contents of arrays, of
course. Ruby’s great that way. In a matter of a minute or so, I could
write a program that compares small numbers of items in a list with
small
numbers of items in another list and give me output that consists of
things that appear in both, or those that don’t appear in both, or those
that appear in one and not the other.

I find myself contemplating doing much the same thing, but with lists
that contain millions of entries. I tend to guess that loading each
list
into an array and running a direct comparison of them:

array_1 = [millions of things]
array_2 = [millions of things]
array_3 = array1 & array2

. . . would fill up RAM in a hurry and drag system performance on a
typical desktop computer to a standstill. What sort of approach would
the expert Ruby hackers suggest for achieving much the same ends without
taking all week and risking a stack overflow?

cpeterson · October 30, 2008, 8:48am

2008/10/30 Chad P. [email protected]:

array_1 = [millions of things]
array_2 = [millions of things]
array_3 = array1 & array2

. . . would fill up RAM in a hurry and drag system performance on a
typical desktop computer to a standstill. What sort of approach would
the expert Ruby hackers suggest for achieving much the same ends without
taking all week and risking a stack overflow?

First of all I would use Set which is far more efficient for these
amounts. You need to make sure that your items in the Set implement
#hash and #eql? properly (see requirements for Hash keys).

Then, I would write a simple program using Sets and see what happens
before I even start to speculate whether it would work or not. This
program is written in a matter of minutes and if it works you are done

if not, you may be able to tweak the simple solution to work (e.g.
for example by making sure that there is only one instance of string
values that are frequently seen in the input).

Btw, if you compare two “lists” you only need to keep one in memory to
make some of the checks.

Kind regards

robert

cpeterson · October 30, 2008, 8:56am

2008/10/30 Robert K. [email protected]:

First of all I would use Set which is far more efficient for these
amounts.

+1

If the performance still isn’t adequate, I’d use a database.

Regards,
Pit

cpeterson · October 30, 2008, 10:09am

Chad P. wrote:

I tend to guess that loading each list
into an array and running a direct comparison of them:

array_1 = [millions of things]
array_2 = [millions of things]
array_3 = array1 & array2

. . . would fill up RAM in a hurry and drag system performance on a
typical desktop computer to a standstill. What sort of approach would
the expert Ruby hackers suggest for achieving much the same ends without
taking all week and risking a stack overflow?

If these millions of things are being read from disk, then sort them
first (*). Then you can do a sort-merge to check which are in one but
not the other.

Roughly: Read the first item from both and compare them. If they’re
equal, munch them both. If the item from A is smaller than the item in
B, then munch the item from A (since it exists in A but not B), and vice
versa. An efficient implementation might read a few thousands items from
both A and B into local buffers first.

The Sedgewick “Algorithms” book has chapters on external sorting and
searching.

HTH,

Brian.

(*) Of course, the problem then becomes one of sorting millions of
things, but this is a well-solved problem. The unix “sort” command-line
tool will happily sort files which are much larger than available RAM,
by dividing into smaller chunks and sort-merging them. Just take care to
set LC_ALL properly; otherwise sort behaves in a very strange way.
LC_ALL=C seems to work best for me.

Once the inputs are sorted, an ‘&’ or ‘|’ operation implemented as above
will generate an output which is already sorted.

The unix ‘join’ command can do these operations for you, too.

cpeterson · October 30, 2008, 7:36pm

On 30.10.2008 18:35, William J. wrote:

that appear in one and not the other.

hb = {}

— output —
4000000
2528918
2527239
1597752
12.64 seconds
1597752
5.063 seconds

The comparison is a bit unfair since you include the build time for the
structure in case of Hash but not for the Array. You should at least
also compare just the intersection time. And while you’re at it you can
also add Set to the mix.

Kind regards

robert

cpeterson · October 30, 2008, 6:41pm

On Oct 30, 2:47 am, Robert K. [email protected] wrote:

the expert Ruby hackers suggest for achieving much the same ends without
taking all week and risking a stack overflow?

Array intersection is pretty efficient in Ruby.

N = 4_000_000
a = Array.new(N){ rand N }.uniq
b = Array.new(N){ rand N }.uniq
p N, a.size, b.size

time = Time.now
ha = {}
a.each{|x| ha[x] = true }
hb = {}
b.each{|x| hb[x] = true }
intersect = []
ha.each_key{|x| intersect << x if hb.include? x}
puts intersect.size
print Time.now - time, " seconds\n"

time = Time.now
puts (a & b).size
print Time.now - time, " seconds\n"

— output —
4000000
2528918
2527239
1597752
12.64 seconds
1597752
5.063 seconds

cpeterson · October 31, 2008, 11:27am

-----Original Message-----
From: Robert K. [mailto:[email protected]]
Sent: Friday, October 31, 2008 12:04 AM
To: ruby-talk ML
Subject: Re: array comparison

The comparison is a bit unfair since you include the build
time for the structure in case of Hash but not for the Array.
You should at least also compare just the intersection time.
And while you’re at it you can also add Set to the mix.

Interesting discussion.

I wrote a small program to compare Array intersection to Set
intersection.
I noticed that for smaller sets (~10 ** 3), Array is much faster that
Set, sometimes upto 5x.
As the set size grows, they seem to converge (~10 ** 7, 100 million)

Program to compare timings for intersections

require ‘set’

n = 10000
while (n <= 10 ** 7) do
c = Array.new(n) {rand n}.uniq
d = Array.new(n) {rand n}.uniq

Native array intersection

t0 = Time.now
cd = c & d
t1 = Time.now
printf(“Array | n = %d : %d intersects found in %f
seconds\n”,n,cd.size,t1-t0)

Convert array to Set objects and perform intersection

c = c.to_set
d = d.to_set
t2 = Time.now
cd = c & d
t3 = Time.now
printf(“Set | n = %d : %d intersects found in %f
seconds\n”,n,cd.size,t3-t2)
printf(“Ratio of Set/Array intersect time =
%.2f\n”,(t3-t2)/(t1-t0))

n = 2 * n
end

== OUTPUT == [Dell Latitude 620 , Win XP SP2, 2 GB RAM, Intel Core2 CPU
@ 1.66 GHz, normal loading]

D:\Shourya\DEV\RUBY\MISC>ruby intersect.rb
Array | n = 10000 : 3980 intersects found in 0.000000 seconds
Set | n = 10000 : 3980 intersects found in 0.016000 seconds
Ratio of Set/Array intersect time = Inf
Array | n = 20000 : 7956 intersects found in 0.000000 seconds
Set | n = 20000 : 7956 intersects found in 0.032000 seconds
Ratio of Set/Array intersect time = Inf
Array | n = 40000 : 15940 intersects found in 0.016000 seconds
Set | n = 40000 : 15940 intersects found in 0.078000 seconds
Ratio of Set/Array intersect time = 4.88
Array | n = 80000 : 31835 intersects found in 0.047000 seconds
Set | n = 80000 : 31835 intersects found in 0.125000 seconds
Ratio of Set/Array intersect time = 2.66
Array | n = 160000 : 63959 intersects found in 0.125000 seconds
Set | n = 160000 : 63959 intersects found in 0.281000 seconds
Ratio of Set/Array intersect time = 2.25
Array | n = 320000 : 128062 intersects found in 0.391000 seconds
Set | n = 320000 : 128062 intersects found in 0.562000 seconds
Ratio of Set/Array intersect time = 1.44
Array | n = 640000 : 255466 intersects found in 0.672000 seconds
Set | n = 640000 : 255466 intersects found in 1.156000 seconds
Ratio of Set/Array intersect time = 1.72
Array | n = 1280000 : 511590 intersects found in 1.905000 seconds
Set | n = 1280000 : 511590 intersects found in 2.640000 seconds
Ratio of Set/Array intersect time = 1.39
Array | n = 2560000 : 1023031 intersects found in 4.124000 seconds
Set | n = 2560000 : 1023031 intersects found in 5.296000 seconds
Ratio of Set/Array intersect time = 1.28
Array | n = 5120000 : 2047355 intersects found in 8.857000 seconds
Set | n = 5120000 : 2047355 intersects found in 9.589000 seconds
Ratio of Set/Array intersect time = 1.08

cpeterson · October 31, 2008, 6:25pm

On 31.10.2008 11:21, Sarcar, Shourya C (GE Healthcare) wrote:

require ‘set’
printf("Array | n = %d : %d intersects found in %f
printf(“Ratio of Set/Array intersect time =
%.2f\n”,(t3-t2)/(t1-t0))

n = 2 * n
end

Some things are noteworthy about your program:

It has a large intersection, i.e. roughly 50% on average. This may
be different for other scenarios.
You use Fixnums which compare pretty fast, you can expect to reach
break even earlier when using other types.

I have taken the liberty to extend your test program a bit changing
those two parameters mentioned above and you can see significant
differences.

Kind regards

robert