Sorting by rand?

unknown · April 30, 2007, 4:01am

Browsing something at a bookstore recently, I saw an example of
an array shuffle allegedly implemented with
x.sort { rand }
or something to that effect.

This seems sort of unlikely, to me, to be a correct shuffling
algorithm. I’m not even sure it’s safe at all. I know that
doing the same thing to C qsort implementations occasionally
causes one to blow up spectacularly, because its assumption
that the comparison between any two items is constant gets
broken.

So, two vaguely related questions:

Is there a reasonably sane way to do this, should I ever find
myself wanting one?
Am I right in guessing that handing garbage to Array#sort
is probably crazy?

-s

unknown · April 30, 2007, 4:18am

On Apr 29, 2007, at 10:01 PM, Peter S. wrote:

So, two vaguely related questions:

Is there a reasonably sane way to do this, should I ever find
myself wanting one?

Short answer: array.sort_by { rand }

Longer answer: http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-
talk/168963

Gary W.

unknown · April 30, 2007, 4:29am

In message [email protected], Gary W.
writes:

On Apr 29, 2007, at 10:01 PM, Peter S. wrote:

So, two vaguely related questions:

Is there a reasonably sane way to do this, should I ever find
myself wanting one?

Short answer: array.sort_by { rand }

Longer answer: http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-
talk/168963

Well, okay. It works in practice. Is it guaranteed to work, and is
it actually reasonable to expect the output to be randomly shuffled?

Just for comparison, I tried:

l = a.length
a.each_index { |x| a[x] = a[rand(l)] }

I’m pretty sure I got the generic shuffle algorithm right (Knuth FTW),
and if so, it produces quality random results in about 1/4 the time.

-s

unknown · April 30, 2007, 4:38am

On 4/29/07, Peter S. [email protected] wrote:

Browsing something at a bookstore recently, I saw an example of
an array shuffle allegedly implemented with
x.sort { rand }
or something to that effect.

Am I right in guessing that handing garbage to Array#sort
is probably crazy?

I believe that the "something to that effect was probably:

 x.sort_by {rand}

sort_by takes a block which normally takes one argument. sort_by
calls this block once for each element of the enumeration and
associates that value with the element, it then sorts those values and
returns an array of the asssociated elements in the order of the
values. So for example:

(1…50).sort_by{|e| -e}
==>[50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36,
35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19,
18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

The block without the argument just assignes a random value to each
element.

Perfectly safe as far as I can see.

x.sort {rand}

doesn’t really work, whether or not it’s safe. Rand without an
argument returns a float >= 0.0 and < 1.0, the block for sort is
expected to return -1, 0, or 1. Depending on the implementaiton of the
particular sort method, you might get different results. For arrays,
it looks like you’ll get the same results with repeated calls:

irb(main):020:0> (1…10).to_a.sort {rand}
=> [10, 6, 1, 7, 3, 8, 5, 9, 4, 2]
irb(main):021:0> (1…10).to_a.sort {rand}
=> [10, 6, 1, 7, 3, 8, 5, 9, 4, 2]

Whereas sort_by seems to do what’s expected.

irb(main):022:0> (1…10).to_a.sort_by {rand}
=> [5, 9, 2, 1, 7, 6, 4, 10, 8, 3]
irb(main):023:0> (1…10).to_a.sort_by {rand}
=> [4, 8, 5, 6, 2, 3, 10, 7, 1, 9]

–
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

unknown · April 30, 2007, 4:47am

On 4/29/07, Peter S. [email protected] wrote:

talk/168963

Well, okay. It works in practice. Is it guaranteed to work, and is
it actually reasonable to expect the output to be randomly shuffled?

See my longer explanation of sort_by which crossed in the e-mail.

Just for comparison, I tried:

l = a.length
a.each_index { |x| a[x] = a[rand(l)] }

I’m pretty sure I got the generic shuffle algorithm right (Knuth FTW),
and if so, it produces quality random results in about 1/4 the time.

Looks suspicious to me, you are replacing each element of x in turn
with a randomly selected element, which means that you are likely to
end up with a different set of elements in the end. One of the
requirements of shuffling an array would seem to be that

shuffle(a).sort == a.sort

Something like:

a.each_index {|x| y=rand; a[x], a[y] = a[y], a[x]}

would seem to be a better approach, but a.sort_by {rand} does the same
thing more concisely.

–
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

unknown · April 30, 2007, 4:46am

On Apr 29, 2007, at 10:28 PM, Peter S. wrote:

l = a.length
a.each_index { |x| a[x] = a[rand(l)] }

You are trashing your original array.

irb(main):001:0> a = [1,2,3,4]
=> [1, 2, 3, 4]
irb(main):002:0> l = a.length
=> 4
irb(main):003:0> a.each_index { |x| a[x] = a[rand(l)] }
=> [3, 3, 3, 4]

unknown · April 30, 2007, 4:56am

Peter S. wrote:

l = a.length
a.each_index { |x| a[x] = a[rand(l)] }

I’m pretty sure I got the generic shuffle algorithm right (Knuth FTW),
and if so, it produces quality random results in about 1/4 the time.

-s

The above probably has a typo or something.

This one’s from Cormen, Leiserson, Rivest, and Stein. It swaps each
element with a random element in the rest of the list (allowing it to be
swapped with itself sometimes). I don’t have the book in front of me,
but I think it’s something like:

def randomize!(ary)
for i in 0…(ary.length)
j = i+rand(ary.length - i)
# swap ary[i] and ary[j]
tmp = ary[i]
ary[i] = ary[j]
ary[j] = tmp
end
ary
end

I think I can prove that it’s a valid sort, but I don’t trust my own
logic that much. I tested it with 1000 items and it seemed to have the
right number of each item.

a = [0, 1, 2, 3, 4]
=> [0, 1, 2, 3, 4]

randomized = []
=> []

1000.times do randomized << randomize!(a.dup)
end
=> 1000

count = 0
=> 0

randomized.each {|ary| count += 1 if ary == [3, 1, 2, 0, 4]}
count
=> 8

So for one of the 120 permutations of [0, 1, 2, 3, 4], I counted the
occurrences of one of them. There were 8. On average, if there are 120
permutations, there will be 8 of each:

1000/120
=> 8

Good enough for me, unless I was just really lucky. That happens with
random numbers, you know.

Dan

unknown · April 30, 2007, 5:04am

In message [email protected], Gary
Wright writes:

On Apr 29, 2007, at 10:28 PM, Peter S. wrote:

l = a.length
a.each_index { |x| a[x] = a[rand(l)] }

You are trashing your original array.

True.

That said, with the apparent speed difference, it’s probably worth
copying and then shuffling.

-s

unknown · April 30, 2007, 5:03am

In message
[email protected], “Rick
DeNatale” write
s:

I believe that the "something to that effect was probably:
x.sort_by {rand}

Yes.

Perfectly safe as far as I can see.

Ahh! Yes. Profiling it reveals this; it only calls rand once per
element. Sane, and efficient.

x.sort {rand}

I think I’ve seen also sort {rand <=> rand}, which is crazy.

-s

unknown · April 30, 2007, 5:09am

In message
[email protected], “Rick
DeNatale” write
s:

Looks suspicious to me, you are replacing each element of x in turn
with a randomly selected element, which means that you are likely to
end up with a different set of elements in the end.

DOH!

a.each_index {|x| y=rand; a[x], a[y] = a[y], a[x]}

would seem to be a better approach, but a.sort_by {rand} does the same
thing more concisely.

Yes.

However, the sort is doing a lot more comparisons and swaps. It’s
O(n*log(n)), while the straight shuffle is O(n). On an array of
10,000 items, shuffling uses 20,000 assignments. Sorting by rand
used 132,000 comparisons, and that would imply a fair number of
swaps. Total time difference is about a factor of three.

I’m willing to trade a little conciseness for that.

-s

unknown · April 30, 2007, 8:40am

In message [email protected], “M. Edward (Ed) Borasky”
writes:

Welll … I do this in Excel all the time – add a column of random
numbers to the right of the (CSV) file I want to shuffle, sort it about
five times, hitting F9 before each sort to change the random numbers,
and then deleting the random column and saving the file. Translate that
to Ruby and I think you’ve got it.

It seems expensive.

-s

unknown · April 30, 2007, 8:11am

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Peter S. wrote:

broken.

So, two vaguely related questions:

Is there a reasonably sane way to do this, should I ever find
myself wanting one?

Am I right in guessing that handing garbage to Array#sort
is probably crazy?

-s

Welll … I do this in Excel all the time – add a column of random
numbers to the right of the (CSV) file I want to shuffle, sort it about
five times, hitting F9 before each sort to change the random numbers,
and then deleting the random column and saving the file. Translate that
to Ruby and I think you’ve got it.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGNYhUaZUb+jwczfoRAgn3AKDJnG8yQYVTWkoPTw9YNT42/6M/OACg1NFg
76A4qzYkZjIIKBA/YAPblDM=
=KVsa
-----END PGP SIGNATURE-----

unknown · April 30, 2007, 9:11am

In message [email protected], Alex Y. writes:

Peter S. wrote:

In message [email protected], “M. Edward (Ed) Borasky” writes:

Welll … I do this in Excel all the time – add a column of random
numbers to the right of the (CSV) file I want to shuffle, sort it about
five times, hitting F9 before each sort to change the random numbers,
and then deleting the random column and saving the file. Translate that
to Ruby and I think you’ve got it.

It seems expensive.

That’s almost exactly what sort_by{rand} does…

Right.

And so far as I can tell, sort_by{rand} is inefficient compared to
a canonical shuffle algorithm. That’d make sense; it’s going to do a
lot
of compares, and some fair number of them (I’d guess half, or close to
it?) result in swaps. A shuffle does N swaps, or 2N assignments. A
sort might do more. Sorting a list of 10,000 items, sorted by rand,
performed about 128,000 comparisons. If half of them result in swaps,
that’s 64k swaps.

I can’t think of any way in which sort_by{rand} is going to be “more
random”
than a correct shuffle algorithm (yes, I know the one I posted the first
time
is not that correct algorithm; I didn’t do swaps, just assignment).

I also can’t see an immediate reason to do the random numbers sort more
than once; I guess to reduce the chances of collisions preserving order.

-s

unknown · April 30, 2007, 9:03am

Peter S. wrote:

In message [email protected], “M. Edward (Ed) Borasky” writes:

Welll … I do this in Excel all the time – add a column of random
numbers to the right of the (CSV) file I want to shuffle, sort it about
five times, hitting F9 before each sort to change the random numbers,
and then deleting the random column and saving the file. Translate that
to Ruby and I think you’ve got it.

It seems expensive.

That’s almost exactly what sort_by{rand} does…

unknown · April 30, 2007, 9:25am

Peter S. wrote:

It seems expensive.

That’s almost exactly what sort_by{rand} does…

Right.

And so far as I can tell, sort_by{rand} is inefficient compared to
a canonical shuffle algorithm.
That’s true, but keep in mind that sort_by{rand} is code that’s built
into the interpreter, so it’s still gonna be pretty fast. I tested this
on my system a few weeks ago, and rerunning the benchmark program, the
traditional sort only becomes faster when the arrays in question have
over 5,000 elements. After all, it’s a better algorithm, but it’s
written in ruby and not C.

I also can’t see an immediate reason to do the random numbers sort more
than once; I guess to reduce the chances of collisions preserving order.
If the numbers that rand() generates are good enough, the chance of
collisions preserving order should be the same as the chance of the next
iteration of randomizing putting them back in the same order, shouldn’t
it?

Dan

unknown · April 30, 2007, 9:34am

In message [email protected], Dan Z. writes:

That’s true, but keep in mind that sort_by{rand} is code that’s built
into the interpreter, so it’s still gonna be pretty fast. I tested this
on my system a few weeks ago, and rerunning the benchmark program, the
traditional sort only becomes faster when the arrays in question have
over 5,000 elements. After all, it’s a better algorithm, but it’s
written in ruby and not C.

Hmm. On my system, at only 250 elements, the shuffle comes out ahead.
Worth noting that the margin is very thin; .11 vs. .12 seconds user,
although oddly, the system time is .05 vs. .15 seconds. Might be
some underlying quirk there. Still, that’s a “large” set for many
programs. On the other hand, by, say, 20k, the factor is 3-1 or more.

If the numbers that rand() generates are good enough, the chance of
collisions preserving order should be the same as the chance of the next
iteration of randomizing putting them back in the same order, shouldn’t it?

I think not quite. Imagine a hypothetical pair of entities. They have
a
1/bignum chance of a collision, which preserves their relative sorting.

If you do it twice, the probability that the same pair of entities
collided
both times is 1/bignum^2. They might otherwise end up in the same
order, or
not, but there’s only a 1/bignum^2 chance that they ended up in the same
order due to a collision.

-s

unknown · April 30, 2007, 9:47am

In message [email protected], Dan Z. writes:

You’re right. I guess I’m just inclined to discount a such a small
difference in orderings. And the difference in our benchmarks may have
been due to the fact that I’m using a different algorithm (for each
element i, swap i with an element from i…end).

Hmm. I don’t think that makes a difference, although I’m not sure.
It may affect the results.

Or perhaps because I
have more method calls, as I made a method Array#randomize! (that calls
Array#swap! n times)

That might well do it; method lookup is comparatively expensive, I
think.

… Of course, the entire thing is probably silly, in that I doubt most
programs spend much time unsorting arrays. I was mostly just curious
about the question of what happens if you try to provide an un-fixed
sort key. For sort_by, they’re cached; sort { rand <=> rand } might
be unpredictable, though.

-s

unknown · April 30, 2007, 9:42am

Peter S. wrote:

order due to a collision.

-s

You’re right. I guess I’m just inclined to discount a such a small
difference in orderings. And the difference in our benchmarks may have
been due to the fact that I’m using a different algorithm (for each
element i, swap i with an element from i…end). Or perhaps because I
have more method calls, as I made a method Array#randomize! (that calls
Array#swap! n times)

Dan

unknown · April 30, 2007, 10:28am

On Monday 30 April 2007 02:55, Dan Z. wrote:

l = a.length
a.each_index { |x| a[x] = a[rand(l)] }

a.each_index { |x| r= rand(l); a[x], a[r] = a[r], a[x] }

unknown · April 30, 2007, 10:35am

In message [email protected], Marcin
Raczkowski writes:

On Monday 30 April 2007 02:55, Dan Z. wrote:

l = a.length
a.each_index { |x| a[x] = a[rand(l)] }

a.each_index { |x| r= rand(l); a[x], a[r] = a[r], a[x] }

Yeah. I can’t even play the “only been using this language three
days” card, that was just plain dumb. It’s examples like that that have
made unit testing the best thing that ever happened to me. I remembered
that the key component was putting a random thing in each spot…

Thinking about it, I think that Knuth proved that it doesn’t matter
whether you select “any random element” or “any random element from the
rest
of the list” when shuffling. At least, I think that was in ACP.
I could be wrong; I probably read it fifteen years ago.

-s