Need some Ruby magic

On Tue, Dec 06, 2005 at 08:23:52AM +0900, [email protected] wrote:

the big difference is that a value will have rand called multiple times for

That might be the very reason why it is ultimately much more biased than
sort_by{ rand }… I’m keeping the latter.
If the limited precision of rand() became a problem, I could always use
arr.sort_by{ [rand, rand] } where collisions happen with a probability
around 6.16298e-19 for a 10_000_000 element array, or more generally
Array.new(N){rand} (but then we’d have to look into MT for another
possible source of residual bias).

it - in otherwords an element compareds differently every time. this is
bound to help shuffle more uniformly it seems:

We cannot trust our intuition in these cases; in the following example,
the
tenth element is more likely to stay where it was (look at pos[9]):

a = (1…100); pos = Array.new(100, 0)

t = Time.new; 100000.times{ pos[a.sort{rand <=> rand}.index(10)] += 1 }
Time.new - t

=> 223.891608

pos

=> [1176, 1208, 1083, 1153, 1191, 1130, 1098, 1133, 1189, 1438, 1150,

1043,
========

1121, 1067, 1076, 1069, 1078, 1113, 1129, 1143,1115, 1074, 1081, 1069,

1057,

[…]

t = Time.new; 10000000.times{ pos[a.sort{rand <=> rand}.index(10)] += 1
}
Time.new - t

=> 22781.664168

pos

=> [105944, 105861, 105546, 105681, 104059, 101196, 99080, 100675,

110851,

132223, 101670, 96018, 96147, 96688, 97709, 99051, 97823, 97345,

98799,

[…]

The bias can be observed easily with smaller arrays:

a = (1…20); pos = Array.new(20, 0)
t = Time.new
100000.times{ pos[a.sort{rand <=> rand}.index(3)] += 1 }
Time.new - t # => 20.848851
pos

=> [5084, 5694, 6561, 4326, 4205, 4792, 4913, 4827, 4791, 4571, 4560,

4707,

4863, 4934, 5112, 5011, 5071, 5326, 5482, 5170]

a = (1…10); pos = Array.new(10, 0)
t = Time.new
1_000_000.times{ pos[a.sort{rand <=> rand}.index(5)] += 1 }
Time.new - t # => 77.438912
pos

=> [113836, 103238, 99442, 88682, 101306, 83294, 93296, 100830,

100744, 115332]

Compare the latter to

a = (1…10); pos = Array.new(10, 0)
t = Time.new
1_000_000.times{ pos[a.sort_by{rand}.index(5)] += 1 }
Time.new - t # => 32.163499
pos

=> [99822, 100020, 100328, 100040, 99603, 100098, 100195, 100023,

99875, 99996]

On Tue, 6 Dec 2005, Mauricio [iso-8859-1] Fernández wrote:

1000000 5.55096e-05

%w( a b c d ).sort{ rand <=> rand }

help?

I’m not totally sure it’s really unbiased, but at first sight it looks good.

the big difference is that a value will have rand called multiple times
for it

  • in otherwords an element compareds differently every time. this is
    bound to
    help shuffle more uniformly it seems:

    harp:~ > cat a.rb
    %w( a b c d ).sort{|a,b| ra = rand; rb = rand; p a => ra; p b => rb;
    ra <=> rb }

    harp:~ > ruby a.rb
    {“a”=>0.181439380218727}
    {“c”=>0.579403886629126}
    {“c”=>0.713513989939958}
    {“d”=>0.573687508540169}
    {“a”=>0.851534740906118}
    {“d”=>0.00156074151427565}
    {“b”=>0.803591007691647}
    {“a”=>0.494066110872563}
    {“b”=>0.834676789626288}
    {“c”=>0.792106134460754}

so, note that “a” was sorted using a different value each time.

can anyone who remembers stats and knows the sort algorithm weigh in?

-a

||
|| On Tue, Dec 06, 2005 at 07:32:34PM +0900, Uwe S. wrote:
|| > Hi, I’m new to ruby, but I use Python for some years now.
|| > So I suggest the following ‘pythonic’ solution :wink:
|| […]
|| > Is there a shorter or more elegant way to do this
|| > ‘decorate-sort-undecorate’ in ruby ?
||
|| arr.sort_by{rand}

that is not equivalent to my solution.
the python equivalent to yours is

arr.sort(key = lambda x: random())

||
|| > In Python it is quite easy by using list comprehensions…
|| What does that look like in Python?

from random import random

def shuffle(a):
b = [ (random(), i) for i in a]
b.sort()
return [ x[1] for x in b ]

print shuffle(range(10))

|| Is it shorter than the above one? }:slight_smile:
||
|| –
|| Mauricio F.

greetings, Uwe

||

Mauricio Fernández [email protected] writes:

same value, but I still have to think about the effect of qsort.
And that is a good thing? How can the sequence be “random” if it
never contains the same two numbers in row? (Remember Enigma…)

||

|| > More info, including the Ruby code I used, available at
|| > eigenclass.org
|| >
|| >> does
|| >>
|| >> %w( a b c d ).sort{ rand <=> rand }
|| >>
|| >> help?

Hi, I’m new to ruby, but I use Python for some years now.
So I suggest the following ‘pythonic’ solution :wink:

def shuffle(a)

  b=[]
  # decorate
  a.each { |x| b << [rand,x]; }

  # sort
  b.sort! { |x,y| x[0] <=> y[0] }

  # undecorate
  b.collect { |x| x[1] }

end

a=%w{a b c d e f}
puts shuffle a

Is there a shorter or more elegant way to do this
‘decorate-sort-undecorate’ in ruby ?
In Python it is quite easy by using list comprehensions…

Greetings, Uwe

|| > || arr.sort_by{rand}
|| > that is not equivalent to my solution.
|| > the python equivalent to yours is
|| > arr.sort(key = lambda x: random())

|| > || > In Python it is quite easy by using list comprehensions…
|| > || What does that look like in Python?
|| >
|| > from random import random
|| >
|| > def shuffle(a):
|| > b = [ (random(), i) for i in a]
|| > b.sort()
|| > return [ x[1] for x in b ]
|| >
|| > print shuffle(range(10))

|| I would have thought those two functions do the same thing
|| with the exception that shuffle does not sort in place.
|| I’m probably missing something obvious. Could you explain
|| the difference between the two versions?

If you use random as sorting key, the result of a<=>b
will differ each time you evaluate a<=>b.
In my case I attach a random number to each item,
then I sort my list according to this number and remove
this ‘decoration’ afterwards.

Pythonprogrammers call this pattern “decorate-sort-undecorate”.

As I do not now anything about the ruby interanls of sort,
I prefer my version.

Greetings, Uwe.

Hi,

From: “Uwe S.” [email protected]

arr.sort(key = lambda x: random())
No, Array#sort_by does the ‘decorate-sort-undecorate’, aka.
Schwartzian Transform.

I don’t read Python very well, but I think the ruby equivalent
to ‘arr.sort(key = lambda x: random())’ might be:

arr.sort { rand <=> rand }

But that is different from arr.sort_by {rand}

Regards,

Bill

On Tue, Dec 06, 2005 at 07:32:34PM +0900, Uwe S. wrote:

Hi, I’m new to ruby, but I use Python for some years now.
So I suggest the following ‘pythonic’ solution :wink:
[…]
Is there a shorter or more elegant way to do this
‘decorate-sort-undecorate’ in ruby ?

arr.sort_by{rand}

In Python it is quite easy by using list comprehensions…

What does that look like in Python?
Is it shorter than the above one? }:slight_smile:

On Wed, Dec 07, 2005 at 12:41:39AM +0900, Christian N. wrote:

so it would seem that two consecutive calls to rand() cannot return the
same value, but I still have to think about the effect of qsort.

And that is a good thing? How can the sequence be “random” if it
never contains the same two numbers in row? (Remember Enigma…)

Nah, I misread genrand_real; the above shows it is possible.

Uwe S. ha scritto:

|| > def shuffle(a):

If you use random as sorting key, the result of a<=>b
will differ each time you evaluate a<=>b.
In my case I attach a random number to each item,
then I sort my list according to this number and remove
this ‘decoration’ afterwards.

Pythonprogrammers call this pattern “decorate-sort-undecorate”.

Are you sure of what you say? AFAIK
sort(key=lambda x: somefunc…))

will do decorate/sort/undecorate under the hood, behaving like ruby’s
#sort_by and like your version.
OTOH
sort(cmp=lambda x,y: something… ) will behave like ruby’s #sort.

Mauricio Fernández wrote:

On Tue, Dec 06, 2005 at 08:23:52AM +0900, [email protected] wrote:

the big difference is that a value will have rand called multiple times for

That might be the very reason why it is ultimately much more biased than
sort_by{ rand }… I’m keeping the latter.

Indeed,

ar = [0, 1, 2]; result = Hash.new(0)
100000.times {result[ar.sort {rand <=> rand}] += 1}
p result

=> {[2, 1, 0]=>25235,

   [2, 0, 1]=>12285,
   [1, 0, 2]=>12426,
   [0, 1, 2]=>25111,
   [1, 2, 0]=>12468,
   [0, 2, 1]=>12475}

Michael

Ok, I think I’ve beat sort_by{ rand } on large Arrays , by about 30%. I
ended up with this:

 h = Hash.new
 self.each { |v| h[rand(1000000000)] = v }
 h.keys.sort.collect { |k| h[k] }

Explanation…

  • Hash works better than an Array of Arrays
    [[rand,x[0]],[rand,x[1]],…] because Array#<=> seems to be slow.

  • Got some speedup by avoiding Float#<=>, hence the Integer version of
    rand.

  • On rand(1000000000); chose 1billion rather arbitrarily to keep things
    out of the range of BigNum. Oddly, using 2^31-1 caused BigNums to show
    up. Ideally we’d use the largest FixNum, but I don’t know how to
    retrieve that.

  • Bias should be minimal for most applications, even at rand(1Billion).

  • Can it be improved further???

Find my benchmark results, and script below.

–Steve

                   user     system      total        real

shuffle (10) 0.016000 0.000000 0.016000 ( 0.016000)
shuffle2 (10) 0.000000 0.000000 0.000000 ( 0.000000)
shuffle3 (10) 0.000000 0.000000 0.000000 ( 0.000000)
sort (10) 0.000000 0.000000 0.000000 ( 0.000000)
sort_by (10) 0.000000 0.000000 0.000000 ( 0.000000)

                   user     system      total        real

shuffle (100) 0.156000 0.000000 0.156000 ( 0.156000)
shuffle2 (100) 0.047000 0.000000 0.047000 ( 0.046000)
shuffle3 (100) 0.031000 0.000000 0.031000 ( 0.032000)
sort (100) 0.125000 0.000000 0.125000 ( 0.125000)
sort_by (100) 0.031000 0.000000 0.031000 ( 0.031000)

                   user     system      total        real

shuffle (1000) 1.625000 0.000000 1.625000 ( 1.625000)
shuffle2 (1000) 0.656000 0.000000 0.656000 ( 0.672000)
shuffle3 (1000) 0.266000 0.000000 0.266000 ( 0.281000)
sort (1000) 2.094000 0.000000 2.094000 ( 2.093000)
sort_by (1000) 0.453000 0.000000 0.453000 ( 0.453000)

                   user     system      total        real

shuffle (10000) 16.469000 0.032000 16.501000 ( 16.578000)
shuffle2 (10000) 8.156000 0.000000 8.156000 ( 8.203000)
shuffle3 (10000) 3.344000 0.047000 3.391000 ( 3.391000)
sort (10000) 25.453000 0.000000 25.453000 ( 25.593000)
sort_by (10000) 4.765000 0.000000 4.765000 ( 4.796000)

#!/usr/bin/env ruby

require ‘benchmark’

class Array

def shuffle
newarr = self.dup
2.times do
self.length.times do |x|
y = rand(self.length)
newarr[x], newarr[y] = newarr[y], newarr[x]
end
end
newarr
end

def shuffle2
tmparr = self.collect { |x| [rand(1000000000),x] }
tmparr.sort.collect { |x| x[1] }
end

def shuffle3
h = Hash.new
self.each { |v| h[rand(1000000000)] = v }
h.keys.sort.collect { |k| h[k] }
end

end

[10, 100, 1000, 10000].each do |n|

populate an Array size n

a = Array.new(n) { |i| i }

run our benchmarks

Benchmark.bm(16) do |bb|

 # my method
 bb.report("shuffle ("+n.to_s+")") do
   100.times do
     a.shuffle
   end
 end

 # decorate, sort, undecorate
 bb.report("shuffle2 ("+n.to_s+")") do
   100.times do
     a.shuffle2
   end
 end

 # decorate, sort, undecorate with a hash
 bb.report("shuffle3 ("+n.to_s+")") do
   100.times do
     a.shuffle3
   end
 end

 # Array's sort
 bb.report("sort ("+n.to_s+")") do
   100.times do
     a.sort { rand(10000) <=> 5000 }
   end
 end

 # Enumerable's sort_by
 bb.report("sort_by ("+n.to_s+")") do
   100.times do
     a.sort_by { rand }
   end
 end

 printf("\n")

end

end

Stephen W. wrote:

  • On rand(1000000000); chose 1billion rather arbitrarily to keep things
    out of the range of BigNum. Oddly, using 2^31-1 caused BigNums to show
    up. Ideally we’d use the largest FixNum, but I don’t know how to
    retrieve that.

2**30 - 1 is reliably a fixnum (though I guess it might not be the
biggest one on some platforms).

“U” == Uwe S. [email protected] writes:

U> If you use random as sorting key, the result of a<=>b
U> will differ each time you evaluate a<=>b.
U> In my case I attach a random number to each item,
U> then I sort my list according to this number and remove
U> this ‘decoration’ afterwards.

run

ri Enum#sort_by

and see what it do.

Guy Decoux

Stephen W. wrote:

h = Hash.new
self.each { |v| h[rand(1000000000)] = v }
h.keys.sort.collect { |k| h[k] }

OOPS!

I forgot to deal with collisions in the hash! :frowning:

–Steve

Why not
a.dup.sort { 2*rand(12345) <=> 12345 }

? that way, we’d have NO colisions ever.

Stephen W. wrote:

I forgot to deal with collisions in the hash! :frowning:

Here’s another shot… still faster than sort_by{rand} on my system, and
now should be unbiased:

def shuffle3
h = Hash.new
self.each do |v|
k = rand(1000000000) until !h.has_key?(k)
h[k] = v
end
end

shuffle3 (10000) 3.625000 0.016000 3.641000 ( 3.656000)
sort_by (10000) 5.250000 0.000000 5.250000 ( 5.312000)

–Steve

||
|| >>>>> “U” == Uwe S. [email protected] writes:
||
|| U> If you use random as sorting key, the result of a<=>b
|| U> will differ each time you evaluate a<=>b.
|| U> In my case I attach a random number to each item,
|| U> then I sort my list according to this number and remove
|| U> this ‘decoration’ afterwards.
||
|| run
||
|| ri Enum#sort_by
||
|| and see what it do.

Thanks. As I told you before I’m new to ruby :wink:

Greetings, Uwe

On 07/12/05, Michael U. [email protected] wrote:

a.sort {2*rand(12345) <=> 2 * rand(12345) + 1}

[snip]

and then he should do us a favour and write

a.sort { [-1, 1][rand(2)] }
instead, which gives the same skewed results but at least does not do
it as complicated.

Brian


http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/

On Dec 7, 2005, at 6:57 AM, [email protected] wrote:

Why not
a.dup.sort { 2*rand(12345) <=> 12345 }

Because it’s the slowest version I’ve tested. See the rest of the
thread for my benchmarks. In order of speed, we have ‘shuffle3’,
‘sort_by {rand}’, and ‘shuffle4’.

–Steve