Need some Ruby magic


#1

I’d like to sort collections randomly. This is what I tried first:

my_array.sort { |a,b| rand(2) }

but the results weren’t very random. I then tried the following:

class Array

def random_copy
b = Array.new
while b.length < self.length
random_value = self[rand self.length]
b.push random_value unless b.include? random_value
end
b
end
end

This works but I’m sure there’s some Ruby magic for doing this better.

thanks


#2

hammed wrote:

I’d like to sort collections randomly. This is what I tried first:

my_array.sort { |a,b| rand(2) }

my_array.sort_by { rand }


#3

Hi –

On Fri, 2 Dec 2005, Hammed M. wrote:

while b.length < self.length
  random_value = self[rand self.length]
  b.push random_value unless b.include? random_value
end
b

end
end

This works but I’m sure there’s some Ruby magic for doing this better.

I think the most common idiom is:

array.sort_by { rand }

David


David A. Black
removed_email_address@domain.invalid

“Ruby for Rails”, forthcoming from Manning Publications, April 2006!


#4

Maybe there should be Array#randomize…


#5

excellent! thanks.


#6

Take a look @ the Facets project. There are extension methods for array
to
have a number of randomized behaviors.

j.

On 12/1/05, Vincent F. removed_email_address@domain.invalid wrote:

Maybe there should be Array#randomize…


“Remember. Understand. Believe. Yield! -> http://ruby-lang.org

Jeff W.


#7

In article removed_email_address@domain.invalid,
Jim W. removed_email_address@domain.invalid wrote:

hammed wrote:

I’d like to sort collections randomly. This is what I tried first:

my_array.sort { |a,b| rand(2) }

my_array.sort_by { rand }

A quick test seems to indicate that this works with the ruby 1.8.2
implementation on my Mac. However, there is no guarantee that this call
will select each of the n! permutations with equal probability.

Whether it does will depend on the exact algorithm used by ‘sort’. For
instance, if sort uses the (inefficient for large arrays) bubble sort,
the last key will end up staying there in half the cases.

Reinder


#8

On Dec 3, 2005, at 9:32 AM, Reinder V. wrote:

However, there is no guarantee that this call
will select each of the n! permutations with equal probability.

How about just doing this?

class Array
def shuffle
newarr = self
self.length.times do |x|
y = rand(self.length)
newarr[x], newarr[y] = newarr[y], newarr[x]
end
newarr
end
end

shuffled_array = my_array.shuffle

–Steve


#9

reinder wrote:

my_array.sort_by { rand }

A quick test seems to indicate that this works with the ruby 1.8.2
implementation on my Mac. However, there is no guarantee that this call
will select each of the n! permutations with equal probability.

Actually, it will. sort_by uses a Schwartzian transform to do the
sorting. That means that rand is only called once for each element of
the array. As long as rand gives a decent distribution of random
numbers, the permutation will be random too.

Here’s a little experiment:


N = (ARGV.shift || 1000).to_i
A = %w(a b c)
Perms = %w(abc acb bac bca cab cba)
Score = Hash.new { |h, k| h[k] = 0 }
N.times do
sorted = A.dup.sort_by { rand }
Score[sorted.join("")] += 1
end

Score.keys.sort.each do |key|
puts “#{key}: #{Score[key]}”
end

Gives:

$ ruby sortdist.rb 100000
abc: 16693
acb: 16688
bac: 16752
bca: 16590
cab: 16475
cba: 16802

Thats a pretty good distribution for the number of iterations.

– Jim W.


#10

On Sun, 4 Dec 2005, Mauricio [iso-8859-1] Fernández wrote:

the array. As long as rand gives a decent distribution of random
more) pair of elements of the original array are a bit more probable.

In praxis, the bias this introduces is often so small that the mere
succinctness of sort_by{ rand } more than makes up for it.

does

%w( a b c d ).sort{ rand <=> rand }

help?

-a


#11

On Dec 4, 2005, at 7:59 AM, Brian B. wrote:

How frequently would rand return a same value? (In theory it would be
never but does anyone know the reality?)

According to the Pickaxe, Kernel.rand is an implementation of the
Mersenne Twister algorithm. If that is still the case, then in
reality it will begin repeating exactly the same numbers after
2^19937-1 (approx. 4e6001) pseudo-random numbers, as that’s the
period of the MT. The period is the point at which the exact same
series of PRNs will begin again. For example, if it started out as
1, 2, 3, …, then after 2^19937-1, it’d start again at 1, 2, 3.

But, that’s not what you really care about.

Numbers will certainly repeat within the period, since the period is
so much larger than our integer representation. For example, a 32
bit integer can only represent about 4e9. So, the likelihood of
getting any specific integer, in the 32 bit case, is 1/(2^32). This
is an extreme example:

5.times { p rand(2) }
5.times { p rand(1000) }

The bottom line is that the PRNs should be approximately uniformly
distributed. Apply that to your specific range, and you have your
answer.

–Steve


#12

On Dec 4, 2005, at 8:54 AM, Stephen W. wrote:

Numbers will certainly repeat within the period, since the period
is so much larger than our integer representation. For example, a
32 bit integer can only represent about 4e9. So, the likelihood of
getting any specific integer, in the 32 bit case, is 1/(2^32).

I should add to this that you can approximate when you’ll get
repeats, and it’s likely to be much sooner than you’d guess. For
more information, see the Birthday Problem.

–Steve


#13

end
end

shuffled_array = my_array.shuffle

“Just” doing that

  1. changes the receiver (use new_arr = self.dup instead)
  2. definitely has a bias. Look e.g. at an array of length three. You
    will
    first swap the first element with 3 possibilities, then the second
    and so
    on; you’re going through 33 possibilities. There are 3! possible
    permutations, but alas, 3! == 6 is not a perfect divisor of 3
    3 ==
    27,
    thus you have a bias.

#14

On Sun, Dec 04, 2005 at 08:48:11AM +0900, Jim W. wrote:

numbers, the permutation will be random too.
sort_by{ rand } is actually biased, since sort_by will preserve the
relative order of the elements for which rand() returned the same value.

%w[a b c d].sort_by{ 0 } # => [“a”, “b”, “c”,
“d”]
i = 0
%w[a b c d].sort_by{ 10 - (i += 1) / 2 } # => [“d”, “b”, “c”,
“a”]

This means that permutations preserving the relative order of one (or
more) pair of elements of the original array are a bit more probable.

In praxis, the bias this introduces is often so small that the mere
succinctness of sort_by{ rand } more than makes up for it.


#15

On Dec 4, 2005, at 12:32 PM, Kero wrote:

  1. changes the receiver (use new_arr = self.dup instead)
  2. definitely has a bias. Look e.g. at an array of length three.
    You will
    first swap the first element with 3 possibilities, then the
    second and so
    on; you’re going through 33 possibilities. There are 3! possible
    permutations, but alas, 3! == 6 is not a perfect divisor of 3
    3
    == 27,
    thus you have a bias.

Good points, thanks. How about this?

class Array
def shuffle
newarr = self.dup
2.times do
self.length.times do |x|
y = rand(self.length)
newarr[x], newarr[y] = newarr[y], newarr[x]
end
end
newarr
end
end

test code lifted from Jim W.'s earlier post

N = (ARGV.shift || 1000).to_i
A = %w(a b c)
Perms = %w(abc acb bac bca cab cba)
Score = Hash.new { |h, k| h[k] = 0 }
N.times do
sorted = A.shuffle
Score[sorted.join("")] += 1
end

Score.keys.sort.each do |key|
puts “#{key}: #{Score[key]}”
end

[~/Code/private/code/snippets] 382% ./shuffle.rb 100000
abc: 16596
acb: 16394
bac: 16700
bca: 16684
cab: 16841
cba: 16785

0:07.87 seconds, 96.4% CPU

This is really just meant to be a simple hack. I think a better
algorithm could assign a random index to each element, then sort on
those indices.

–Steve


#16

On Sun, Dec 04, 2005 at 10:21:02AM +0900, removed_email_address@domain.invalid wrote:

On Sun, 4 Dec 2005, Mauricio [iso-8859-1] Fernández wrote:

sort_by{ rand } is actually biased, since sort_by will preserve the
relative order of the elements for which rand() returned the same value.

I did some numbers and found the following probability estimates:

array size P(#rand() collision)
1000 5.54558e-11
10000 5.55056e-09
1000000 5.55096e-05
10000000 5.53574e-03

A collision implies that the associated pair of elements has got the
same
ordering in the output array, instead of 50-50 chances of it being
reversed.

More info, including the Ruby code I used, available at
http://eigenclass.org/hiki.rb?sort_by+rand+is+biased

does

%w( a b c d ).sort{ rand <=> rand }

help?

I’m not totally sure it’s really unbiased, but at first sight it looks
good.

When called with no arguments, rand will use

static double
genrand_real(void)
{
unsigned long a=genrand_int32()>>5, b=genrand_int32()>>6;
return(a67108864.0+b)(1.0/9007199254740992.0);
}

so it would seem that two consecutive calls to rand() cannot return the
same value, but I still have to think about the effect of qsort.

It is however somewhat slower than sort_by{ rand }.

[100, 1000, 10000, 100000].each do |n|
GC.start
t0 = Time.new
(1…n).sort_by{rand}
a = (t1 = Time.new) - t0 # => 0.000425, 0.003002,
0.054392, 0.939692
(1…n).sort{rand <=> rand}
b = Time.new - t1 # => 0.001161, 0.026275,
0.289807, 3.791833
“%4.2f” % [b / a] # => “2.73”, “8.75”, “5.33”,
“4.04”
end
RUBY_VERSION # => “1.8.3”
RUBY_RELEASE_DATE # => “2005-09-24”


#17

sort_by{ rand } is actually biased, since sort_by will preserve the
relative order of the elements for which rand() returned the same value.

How frequently would rand return a same value? (In theory it would be
never but does anyone know the reality?)


#18

Mauricio Fernández wrote:

does

%w( a b c d ).sort{ rand <=> rand }

help?

It is however somewhat slower than sort_by{ rand }.

It’s considerably slower. Changing it to sort { rand(10000) <=> 5000 }
shows a 30% speedup on 10k element Arrays on my system. Don’t know why
I chose those numbers, I suppose bigger numbers reduce the “0”
comparison, considering { rand(3) <=> 1 }.

Anyway, I put together a small benchmark to compare my (revised)
shuffle, to sort, and sort_by:

#!/usr/bin/env ruby

class Array
def shuffle
newarr = self.dup
2.times do
self.length.times do |x|
y = rand(self.length)
newarr[x], newarr[y] = newarr[y], newarr[x]
end
end
newarr
end
end

printf(" size shuffle sort sort_by\n")

[10, 100, 1000, 10000].each do |n|

populate an Array size n

a = Array.new(n) { |i| i }

my method

start = Time.new
100.times do
b = a.shuffle
end
a_time = Time.new - start

Array’s sort

start = Time.new
100.times do
b = a.sort { rand(10000) <=> 5000 }
end
b_time = Time.new - start

Enumerable’s sort_by

start = Time.new
100.times do
b = a.sort_by { rand }
end
c_time = Time.new - start

results

printf("%6d %5.2f %5.2f %5.2f\n", n, a_time, b_time, c_time)

end

Which, on my P4/3.0GHz (Win32) emits this:

size shuffle sort sort_by
10 0.00 0.02 0.00
100 0.16 0.13 0.03
1000 1.61 2.03 0.44
10000 16.31 28.14 4.78

–Steve


#19

On 12/4/05, Reinder V. removed_email_address@domain.invalid wrote:

A quick test seems to indicate that this works with the ruby 1.8.2
implementation on my Mac. However, there is no guarantee that this call
will select each of the n! permutations with equal probability.

Whether it does will depend on the exact algorithm used by ‘sort’. For
instance, if sort uses the (inefficient for large arrays) bubble sort,
the last key will end up staying there in half the cases.

Actually, sort_by caches every value passed and uses those for
comparison (aka a Schwartzian Transform), so it would essentially
return the same results as sorting a list of random numbers,
regardless of algorithm.

Sam


#20

repeats, and it’s likely to be much sooner than you’d guess. For
more information, see the Birthday Problem.

–Steve

I liked the Birthday Problem. Thanks.