Need some Ruby magic

hammed · December 6, 2005, 11:41am

On Tue, Dec 06, 2005 at 08:23:52AM +0900, [email protected] wrote:

the big difference is that a value will have rand called multiple times for

That might be the very reason why it is ultimately much more biased than
sort_by{ rand }… I’m keeping the latter.
If the limited precision of rand() became a problem, I could always use
arr.sort_by{ [rand, rand] } where collisions happen with a probability
around 6.16298e-19 for a 10_000_000 element array, or more generally
Array.new(N){rand} (but then we’d have to look into MT for another
possible source of residual bias).

it - in otherwords an element compareds differently every time. this is
bound to help shuffle more uniformly it seems:

We cannot trust our intuition in these cases; in the following example,
the
tenth element is more likely to stay where it was (look at pos[9]):

a = (1…100); pos = Array.new(100, 0)

t = Time.new; 100000.times{ pos[a.sort{rand <=> rand}.index(10)] += 1 }
Time.new - t

=> 223.891608

pos

=> [1176, 1208, 1083, 1153, 1191, 1130, 1098, 1133, 1189, 1438, 1150,

1043,
========

1121, 1067, 1076, 1069, 1078, 1113, 1129, 1143,1115, 1074, 1081, 1069,

1057,

[…]

t = Time.new; 10000000.times{ pos[a.sort{rand <=> rand}.index(10)] += 1
}
Time.new - t

=> 22781.664168

pos

=> [105944, 105861, 105546, 105681, 104059, 101196, 99080, 100675,

110851,

132223, 101670, 96018, 96147, 96688, 97709, 99051, 97823, 97345,

98799,

[…]

The bias can be observed easily with smaller arrays:

a = (1…20); pos = Array.new(20, 0)
t = Time.new
100000.times{ pos[a.sort{rand <=> rand}.index(3)] += 1 }
Time.new - t # => 20.848851
pos

=> [5084, 5694, 6561, 4326, 4205, 4792, 4913, 4827, 4791, 4571, 4560,

4707,

4863, 4934, 5112, 5011, 5071, 5326, 5482, 5170]

a = (1…10); pos = Array.new(10, 0)
t = Time.new
1_000_000.times{ pos[a.sort{rand <=> rand}.index(5)] += 1 }
Time.new - t # => 77.438912
pos

=> [113836, 103238, 99442, 88682, 101306, 83294, 93296, 100830,

100744, 115332]

Compare the latter to

a = (1…10); pos = Array.new(10, 0)
t = Time.new
1_000_000.times{ pos[a.sort_by{rand}.index(5)] += 1 }
Time.new - t # => 32.163499
pos

=> [99822, 100020, 100328, 100040, 99603, 100098, 100195, 100023,

99875, 99996]

hammed · December 6, 2005, 12:26am

On Tue, 6 Dec 2005, Mauricio [iso-8859-1] Fernández wrote:

1000000 5.55096e-05

%w( a b c d ).sort{ rand <=> rand }

help?

I’m not totally sure it’s really unbiased, but at first sight it looks good.

the big difference is that a value will have rand called multiple times
for it

in otherwords an element compareds differently every time. this is
bound to
help shuffle more uniformly it seems:

harp:~ > cat a.rb
%w( a b c d ).sort{|a,b| ra = rand; rb = rand; p a => ra; p b => rb;
ra <=> rb }

harp:~ > ruby a.rb
{“a”=>0.181439380218727}
{“c”=>0.579403886629126}
{“c”=>0.713513989939958}
{“d”=>0.573687508540169}
{“a”=>0.851534740906118}
{“d”=>0.00156074151427565}
{“b”=>0.803591007691647}
{“a”=>0.494066110872563}
{“b”=>0.834676789626288}
{“c”=>0.792106134460754}

so, note that “a” was sorted using a different value each time.

can anyone who remembers stats and knows the sort algorithm weigh in?

-a

hammed · December 6, 2005, 3:34pm

||
|| On Tue, Dec 06, 2005 at 07:32:34PM +0900, Uwe S. wrote:
|| > Hi, I’m new to ruby, but I use Python for some years now.
|| > So I suggest the following ‘pythonic’ solution
|| […]
|| > Is there a shorter or more elegant way to do this
|| > ‘decorate-sort-undecorate’ in ruby ?
||
|| arr.sort_by{rand}

that is not equivalent to my solution.
the python equivalent to yours is

arr.sort(key = lambda x: random())

||
|| > In Python it is quite easy by using list comprehensions…
|| What does that look like in Python?

from random import random

def shuffle(a):
b = [ (random(), i) for i in a]
b.sort()
return [ x[1] for x in b ]

print shuffle(range(10))

|| Is it shorter than the above one? }
||
|| –
|| Mauricio F.

greetings, Uwe

||

hammed · December 6, 2005, 4:44pm

Mauricio Fernández [email protected] writes:

same value, but I still have to think about the effect of qsort.
And that is a good thing? How can the sequence be “random” if it
never contains the same two numbers in row? (Remember Enigma…)

hammed · December 6, 2005, 11:33am

||

|| > More info, including the Ruby code I used, available at
|| > eigenclass.org
|| >
|| >> does
|| >>
|| >> %w( a b c d ).sort{ rand <=> rand }
|| >>
|| >> help?

Hi, I’m new to ruby, but I use Python for some years now.
So I suggest the following ‘pythonic’ solution

def shuffle(a)

  b=[]
  # decorate
  a.each { |x| b << [rand,x]; }

  # sort
  b.sort! { |x,y| x[0] <=> y[0] }

  # undecorate
  b.collect { |x| x[1] }

end

a=%w{a b c d e f}
puts shuffle a

Is there a shorter or more elegant way to do this
‘decorate-sort-undecorate’ in ruby ?
In Python it is quite easy by using list comprehensions…

Greetings, Uwe

hammed · December 6, 2005, 5:04pm

|| > || arr.sort_by{rand}
|| > that is not equivalent to my solution.
|| > the python equivalent to yours is
|| > arr.sort(key = lambda x: random())

|| > || > In Python it is quite easy by using list comprehensions…
|| > || What does that look like in Python?
|| >
|| > from random import random
|| >
|| > def shuffle(a):
|| > b = [ (random(), i) for i in a]
|| > b.sort()
|| > return [ x[1] for x in b ]
|| >
|| > print shuffle(range(10))

|| I would have thought those two functions do the same thing
|| with the exception that shuffle does not sort in place.
|| I’m probably missing something obvious. Could you explain
|| the difference between the two versions?

If you use random as sorting key, the result of a<=>b
will differ each time you evaluate a<=>b.
In my case I attach a random number to each item,
then I sort my list according to this number and remove
this ‘decoration’ afterwards.

Pythonprogrammers call this pattern “decorate-sort-undecorate”.

As I do not now anything about the ruby interanls of sort,
I prefer my version.

Greetings, Uwe.

hammed · December 6, 2005, 5:12pm

Hi,

From: “Uwe S.” [email protected]

arr.sort(key = lambda x: random())
No, Array#sort_by does the ‘decorate-sort-undecorate’, aka.
Schwartzian Transform.

I don’t read Python very well, but I think the ruby equivalent
to ‘arr.sort(key = lambda x: random())’ might be:

arr.sort { rand <=> rand }

But that is different from arr.sort_by {rand}

Regards,

Bill

hammed · December 6, 2005, 2:37pm

On Tue, Dec 06, 2005 at 07:32:34PM +0900, Uwe S. wrote:

Hi, I’m new to ruby, but I use Python for some years now.
So I suggest the following ‘pythonic’ solution
[…]
Is there a shorter or more elegant way to do this
‘decorate-sort-undecorate’ in ruby ?

arr.sort_by{rand}

In Python it is quite easy by using list comprehensions…

What does that look like in Python?
Is it shorter than the above one? }

hammed · December 6, 2005, 6:26pm

On Wed, Dec 07, 2005 at 12:41:39AM +0900, Christian N. wrote:

so it would seem that two consecutive calls to rand() cannot return the
same value, but I still have to think about the effect of qsort.

And that is a good thing? How can the sequence be “random” if it
never contains the same two numbers in row? (Remember Enigma…)

Nah, I misread genrand_real; the above shows it is possible.

hammed · December 6, 2005, 6:58pm

Uwe S. ha scritto:

|| > def shuffle(a):

If you use random as sorting key, the result of a<=>b
will differ each time you evaluate a<=>b.
In my case I attach a random number to each item,
then I sort my list according to this number and remove
this ‘decoration’ afterwards.

Pythonprogrammers call this pattern “decorate-sort-undecorate”.

Are you sure of what you say? AFAIK
sort(key=lambda x: somefunc…))

will do decorate/sort/undecorate under the hood, behaving like ruby’s
#sort_by and like your version.
OTOH
sort(cmp=lambda x,y: something… ) will behave like ruby’s #sort.

hammed · December 6, 2005, 2:25pm

Mauricio Fernández wrote:

On Tue, Dec 06, 2005 at 08:23:52AM +0900, [email protected] wrote:

the big difference is that a value will have rand called multiple times for

That might be the very reason why it is ultimately much more biased than
sort_by{ rand }… I’m keeping the latter.

Indeed,

ar = [0, 1, 2]; result = Hash.new(0)
100000.times {result[ar.sort {rand <=> rand}] += 1}
p result

=> {[2, 1, 0]=>25235,

   [2, 0, 1]=>12285,
   [1, 0, 2]=>12426,
   [0, 1, 2]=>25111,
   [1, 2, 0]=>12468,
   [0, 2, 1]=>12475}

Michael

hammed · December 6, 2005, 8:53pm

Ok, I think I’ve beat sort_by{ rand } on large Arrays , by about 30%. I
ended up with this:

 h = Hash.new
 self.each { |v| h[rand(1000000000)] = v }
 h.keys.sort.collect { |k| h[k] }

Explanation…

Hash works better than an Array of Arrays
[[rand,x[0]],[rand,x[1]],…] because Array#<=> seems to be slow.
Got some speedup by avoiding Float#<=>, hence the Integer version of
rand.
On rand(1000000000); chose 1billion rather arbitrarily to keep things
out of the range of BigNum. Oddly, using 2^31-1 caused BigNums to show
up. Ideally we’d use the largest FixNum, but I don’t know how to
retrieve that.
Bias should be minimal for most applications, even at rand(1Billion).
Can it be improved further???

Find my benchmark results, and script below.

–Steve

                   user     system      total        real

shuffle (10) 0.016000 0.000000 0.016000 ( 0.016000)
shuffle2 (10) 0.000000 0.000000 0.000000 ( 0.000000)
shuffle3 (10) 0.000000 0.000000 0.000000 ( 0.000000)
sort (10) 0.000000 0.000000 0.000000 ( 0.000000)
sort_by (10) 0.000000 0.000000 0.000000 ( 0.000000)

                   user     system      total        real

shuffle (100) 0.156000 0.000000 0.156000 ( 0.156000)
shuffle2 (100) 0.047000 0.000000 0.047000 ( 0.046000)
shuffle3 (100) 0.031000 0.000000 0.031000 ( 0.032000)
sort (100) 0.125000 0.000000 0.125000 ( 0.125000)
sort_by (100) 0.031000 0.000000 0.031000 ( 0.031000)

                   user     system      total        real

shuffle (1000) 1.625000 0.000000 1.625000 ( 1.625000)
shuffle2 (1000) 0.656000 0.000000 0.656000 ( 0.672000)
shuffle3 (1000) 0.266000 0.000000 0.266000 ( 0.281000)
sort (1000) 2.094000 0.000000 2.094000 ( 2.093000)
sort_by (1000) 0.453000 0.000000 0.453000 ( 0.453000)

                   user     system      total        real

shuffle (10000) 16.469000 0.032000 16.501000 ( 16.578000)
shuffle2 (10000) 8.156000 0.000000 8.156000 ( 8.203000)
shuffle3 (10000) 3.344000 0.047000 3.391000 ( 3.391000)
sort (10000) 25.453000 0.000000 25.453000 ( 25.593000)
sort_by (10000) 4.765000 0.000000 4.765000 ( 4.796000)

#!/usr/bin/env ruby

require ‘benchmark’

class Array

def shuffle
newarr = self.dup
2.times do
self.length.times do |x|
y = rand(self.length)
newarr[x], newarr[y] = newarr[y], newarr[x]
end
end
newarr
end

def shuffle2
tmparr = self.collect { |x| [rand(1000000000),x] }
tmparr.sort.collect { |x| x[1] }
end

def shuffle3
h = Hash.new
self.each { |v| h[rand(1000000000)] = v }
h.keys.sort.collect { |k| h[k] }
end

end

[10, 100, 1000, 10000].each do |n|

populate an Array size n

a = Array.new(n) { |i| i }

run our benchmarks

Benchmark.bm(16) do |bb|

 # my method
 bb.report("shuffle ("+n.to_s+")") do
   100.times do
     a.shuffle
   end
 end

 # decorate, sort, undecorate
 bb.report("shuffle2 ("+n.to_s+")") do
   100.times do
     a.shuffle2
   end
 end

 # decorate, sort, undecorate with a hash
 bb.report("shuffle3 ("+n.to_s+")") do
   100.times do
     a.shuffle3
   end
 end

 # Array's sort
 bb.report("sort ("+n.to_s+")") do
   100.times do
     a.sort { rand(10000) <=> 5000 }
   end
 end

 # Enumerable's sort_by
 bb.report("sort_by ("+n.to_s+")") do
   100.times do
     a.sort_by { rand }
   end
 end

 printf("\n")

end

hammed · December 6, 2005, 8:57pm

Stephen W. wrote:

On rand(1000000000); chose 1billion rather arbitrarily to keep things
out of the range of BigNum. Oddly, using 2^31-1 caused BigNums to show
up. Ideally we’d use the largest FixNum, but I don’t know how to
retrieve that.

2**30 - 1 is reliably a fixnum (though I guess it might not be the
biggest one on some platforms).

hammed · December 6, 2005, 5:08pm

“U” == Uwe S. [email protected] writes:

U> If you use random as sorting key, the result of a<=>b
U> will differ each time you evaluate a<=>b.
U> In my case I attach a random number to each item,
U> then I sort my list according to this number and remove
U> this ‘decoration’ afterwards.

run

ri Enum#sort_by

and see what it do.

Guy Decoux

hammed · December 6, 2005, 9:01pm

Stephen W. wrote:

h = Hash.new
self.each { |v| h[rand(1000000000)] = v }
h.keys.sort.collect { |k| h[k] }

OOPS!

I forgot to deal with collisions in the hash!

–Steve

hammed · December 7, 2005, 4:01pm

Why not
a.dup.sort { 2*rand(12345) <=> 12345 }

? that way, we’d have NO colisions ever.

hammed · December 6, 2005, 9:05pm

Stephen W. wrote:

I forgot to deal with collisions in the hash!

Here’s another shot… still faster than sort_by{rand} on my system, and
now should be unbiased:

def shuffle3
h = Hash.new
self.each do |v|
k = rand(1000000000) until !h.has_key?(k)
h[k] = v
end
end

shuffle3 (10000) 3.625000 0.016000 3.641000 ( 3.656000)
sort_by (10000) 5.250000 0.000000 5.250000 ( 5.312000)

–Steve

hammed · December 6, 2005, 5:37pm

||
|| >>>>> “U” == Uwe S. [email protected] writes:
||
|| U> If you use random as sorting key, the result of a<=>b
|| U> will differ each time you evaluate a<=>b.
|| U> In my case I attach a random number to each item,
|| U> then I sort my list according to this number and remove
|| U> this ‘decoration’ afterwards.
||
|| run
||
|| ri Enum#sort_by
||
|| and see what it do.

Thanks. As I told you before I’m new to ruby

Greetings, Uwe

hammed · December 7, 2005, 4:54pm

On 07/12/05, Michael U. [email protected] wrote:

a.sort {2*rand(12345) <=> 2 * rand(12345) + 1}

[snip]

and then he should do us a favour and write

a.sort { [-1, 1][rand(2)] }
instead, which gives the same skewed results but at least does not do
it as complicated.

Brian

–
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/

hammed · December 7, 2005, 5:31pm

On Dec 7, 2005, at 6:57 AM, [email protected] wrote:

Why not
a.dup.sort { 2*rand(12345) <=> 12345 }

Because it’s the slowest version I’ve tested. See the rest of the
thread for my benchmarks. In order of speed, we have ‘shuffle3’,
‘sort_by {rand}’, and ‘shuffle4’.

–Steve