Is it more convenient to have a random_each in ruby stdlib?

Quoting Matthew S. [email protected]:

On Jun 19, 2006, at 9:08, Kroeger, Simon (ext) wrote:

I think it's a little bit too simple to be put in the standard lib.

a.sort_by{rand}.each {|e| print e, ' '}

But it turns out to not be that simple. This can give a biased sort depending on the underlying sort algorithm, and it's not as efficient as it could be, as Alex observes. Like a lot of simple-sounding problems, this one's already been solved; see:

perfect shuffle

The linked Haskell implementation also gives a very thorough
discussion, including a demonstration of how a stable sorting
algorithm will yield biased results.

Note that the discussion there does not apply to the implementation quoted, but rather to this implementation:

a.sort_by{rand(a.size)}.each {|e| print e, ' '}

which I think we can all see the problem with. Absent any collisions in the values chosen by rand each time through the block, the short implementation quoted above is unbiased. It's still less effecient than a Fisher-Yates shuffle (at least in big-O notation; as a practical matter, I'd need to see some benchmarks), but "biased" is not a critique that can be fairly leveled against this implementation. (Unless you're willing to accept that your F-Y shuffle is "biased" too, because, for example, rand(7) doesn't return 0 with /exactly/ one-seventh probability, but rather with a probability that's merely very, very close to one-seventh.)
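A quick, throwaway way to eyeball that last point about rand(7) (an illustrative sketch only, not code from the thread):

# Count how often rand(7) returns each value; 0 should come up
# roughly, but not provably exactly, one-seventh of the time.
counts = Hash.new(0)
700_000.times { counts[rand(7)] += 1 }
p counts
p counts[0] / 700_000.0   # ~0.1428..., very close to 1/7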

On Jun 20, 2006, at 4:08, Daniel M. wrote:

Absent any collisions in the values chosen by rand each time through the block, the short implementation quoted above is unbiased.
You don’t provide anything to back up this assertion. What makes
this implementation unbiased? What makes the other implementation
biased? What’s the crucial difference between the two?

(technical note - #sort_by calculates the key assigned to each
element only once. In other words, each element of ‘a’ is assigned
one and only one random key, and is then sorted according to that
key. See the docs for more info.)
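One way to convince yourself of that note (a minimal sketch, not taken from the docs):

# #sort_by evaluates its key block exactly once per element,
# so each element gets a single random key for the whole sort.
calls = 0
[3, 1, 2].sort_by { |e| calls += 1; rand }
p calls   # => 3, one key per element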

The only difference between the two implementations is the range of the random numbers - the first implementation uses a float in the range [0, 1.0), the second an integer in the range (0...a.size). For sorting purposes, though, the particular range of the keys is irrelevant; all that matters is that the keys are comparable (you could use strings if you felt like it, too).

This is proven very simply: given a finite set of N keys in the float range [0, 1.0), they can be mapped to a finite set of keys in the integer range (0...N) without changing their sort order: simply sort the float keys, and then enumerate them. The result is a set of integer keys with the identical sort order as the float keys. The reverse mapping is just as simple.
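A small sketch of that rank-mapping argument (assuming, as is overwhelmingly likely, that no two float keys collide):

elements   = %w(a b c d e)
float_keys = elements.map { rand }                  # one key per element, in [0, 1.0)
rank       = float_keys.sort.each_with_index.to_h   # float key => its integer rank

by_floats = elements.zip(float_keys).sort_by { |_, k| k }.map(&:first)
by_ranks  = elements.zip(float_keys).sort_by { |_, k| rank[k] }.map(&:first)
p by_floats == by_ranks   # => true: same order under either key set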

In other words: the implementations are identical with respect to the
sort. The sort is the only re-ordering performed, and so is
responsible for any bias in the resulting shuffle. And therefore, if
there’s a bias in one implementation, there’s a bias with the other
one as well.

Regarding whether or not bias actually exists, I didn’t say that the
sorting implementation is definitely going to be biased, merely that
implementations which rely on a sort are potentially biased
depending on the underlying sorting algorithm.

This was precisely the problem in the Haskell paper - it assumed a stable sort, which introduces bias. I believe Ruby uses quicksort, which isn't stable, and so the method isn't biased. However, Ruby makes no guarantees about the underlying sorting algorithm; were it ever to change, it would potentially introduce bias into the shuffle. One such change might be to introduce a sort which has locally-stable conditions, such as a quicksort that optimises using insertion sort on small partitions.
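To see the kind of bias a stable sort would introduce, here's a rough simulation (not Ruby's actual sort): stability is imitated by breaking key ties on the original index, so tied elements never swap.

counts = Hash.new(0)
100_000.times do
  keys = [rand(2), rand(2)]   # colliding integer keys, as with rand(a.size)
  shuffled = %w(x y).each_with_index.sort_by { |_, i| [keys[i], i] }.map(&:first)
  counts[shuffled.first] += 1
end
p counts   # roughly {"x"=>75000, "y"=>25000}: 'x' stays first ~3/4 of the time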

If (as we’re assuming) an unbiased shuffle is important, then it’s
dangerous to rely on non-guaranteed implementation details to ensure
that.

It's still less effecient than a Fisher-Yates shuffle (at least in big-O notation; as a practical matter, I'd need to see some benchmarks)

Acceptable benchmarks always depend on your intended use, but here are some simple wall-clock ones on some large arrays (large arrays because that's the case where performance might matter). The results are that f_y takes ~75% of the time of sort in the first case, and ~60% of the time in the second case, which is nice, but not nearly what the complexity predicts (is it ever?). Makes me wonder what the difference would be if they were in C.

matthew smillie

a = (1..100000).map { rand(100000) }

                   user     system      total        real
in-place f_y   1.230000   0.010000   1.240000 (  1.320484)
 f_y_shuffle   1.370000   0.010000   1.380000 (  1.471718)
sort_shuffle   1.870000   0.040000   1.910000 (  2.004320)

a = (1..1000000).map { rand(1000000) }

                   user     system      total        real
in-place f_y  12.670000   0.080000  12.750000 ( 13.202975)
 f_y_shuffle  13.980000   0.150000  14.130000 ( 15.091745)
sort_shuffle  23.260000   0.410000  23.670000 ( 24.704380)

require 'benchmark'
include Benchmark

class Array
  def shuffle!
    self.each_index { |i|
      r = rand(size - i) + i
      tmp = self[i]
      self[i] = self[r]
      self[r] = tmp
    }
  end

  def shuffle
    values_at(*(0..size - 1).to_a.shuffle!)
  end

  def sort_shuffle
    sort_by { rand }
  end
end

# test array
a = (1..100000).map { rand(100000) }
bm(10) { |b|
  b.report("in-place f_y") { a.shuffle! }
  b.report(" f_y_shuffle") { a.shuffle }
  b.report("sort_shuffle") { a.sort_shuffle }
}

a = (1..1000000).map { rand(1000000) }
bm(10) { |b|
  b.report("in-place f_y") { a.shuffle! }
  b.report(" f_y_shuffle") { a.shuffle }
  b.report("sort_shuffle") { a.sort_shuffle }
}

Paul B. wrote:

On 19/06/06, uncutstone wu [email protected] wrote:

I found it may be a common case that we need to randomly enumerate elements in an array or some other collection.

Have a look at rand.rb:
http://chneukirchen.org/blog/static/projects/rand.html

Paul.

Thanks.

uncutstone


Quoting Matthew S. [email protected]:

You don’t provide anything to back up this assertion. What makes
this implementation unbiased? What makes the other implementation
biased? What’s the crucial difference between the two?

I'm sorry; I thought it would be obvious to anyone who had read the Haskell text. In any case, the problem is the likelihood of collisions. When you use just {rand}, the chance of collisions is very low, on the order of 2**(-how many bits are in the random number implementation). When you use {rand(a.size)} the chance of collision is comparatively high.
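A quick empirical check of those collision odds (illustrative sketch; the array size of 10 is chosen arbitrarily):

a = (1..10).to_a
trials = 10_000
int_collisions   = trials.times.count { a.map { rand(a.size) }.uniq.size < a.size }
float_collisions = trials.times.count { a.map { rand }.uniq.size < a.size }
p int_collisions     # nearly every trial has a repeated integer key
p float_collisions   # float keys essentially never collide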

When keys collide, the order you get depends on the implementation of the sorting algorithm; note that it doesn't matter whether or not the sort is a stable sort - you get whatever order the sort function ends up with when given identical keys. (Note that it appears that ruby's sort_by is stable, but that's irrelevant; you could replace "sort_by" with "reverse.sort_by" and you'd still get bias, just bias in a different way.)

When keys don't collide, the sorting algorithm is irrelevant (well, so long as the algorithm works at all). All sorting algorithms will end up putting the list in the same order given that collection of keys.

Now, some empirical numbers as evidence. First let's try the algorithm I claim is unbiased. Let's see what the first number is when shuffling (1..3):

irb(main):024:0> a = Hash.new(0); 99999.times { a[(1..3).sort_by{rand}[0]] += 1 }; a
=> {1=>33348, 2=>33393, 3=>33258}

Note that each value has approximately the same likelihood, as we would expect. I suppose we could go and do a chi-squared test on the results, but I'm willing to just eyeball it and say that that's a good result. I'll note that the average initial value is 1.99909999099991, which is really close to 2.

Now let's try the algorithm which the Haskell paper was talking about, and which I said was biased:

irb(main):028:0> a = Hash.new(0); 99999.times { a[(1..3).sort_by{rand(3)}[0]] += 1 }; a
=> {1=>52017, 2=>18490, 3=>29492}

Note how comparatively unlikely it is to get a 2 first. The average initial value is around 1.77.

Note that this bias lessens as the array (and therefore, the pool of keys) grows larger. The Haskell paper, when demonstrating bias, used an array of size two.
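An easy way to eyeball that claim (illustrative only; the exact shape of the bias depends on the underlying sort):

[2, 6].each do |n|
  counts = Hash.new(0)
  30_000.times { counts[(1..n).sort_by { rand(n) }.first] += 1 }
  p counts   # the spread between counts is far larger for n = 2 than for n = 6
end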

For sorting purposes, though, the particular range of the keys is irrelevant; all that matters is that the keys are comparable (you could use strings if you felt like it, too).

Except, as I've pointed out, no. It's all about how likely you are to get key collisions. With a limited integer range to choose from, the chance is comparatively high.

Regarding whether or not bias actually exists, I didn’t say that the
sorting implementation is definitely going to be biased, merely that
implementations which rely on a sort are potentially biased
depending on the underlying sorting algorithm.

Actually, the sort_by shuffling algorithm with a high chance of key collision is biased regardless of the details of the underlying sorting algorithm. Let's take a two-element list, as the Haskell paper did, and consider two possible sorting algorithms for sorting two-element lists:

ALGORITHM A: if a[0].key >= a[1].key then a.reverse else a end
ALGORITHM B: if a[0].key > a[1].key then a.reverse else a end

In fact, every sorting algorithm is going to be equivalent to one of these when applied to a two-element list (because, given equal keys, there are only two choices for what to do with the list).

Now, let's look at what happens when we use %w(x y).sort_by{rand(2)}:

1/4 of the time: The keys chosen are [0,1] - both algorithms produce %w(x y)
1/4 of the time: The keys chosen are [1,0] - both algorithms produce %w(y x)
1/2 of the time: The keys chosen are the same:
    Algorithm A produces %w(y x)
    Algorithm B produces %w(x y)

That is, a system with an algorithm A-type sort will produce %w(y x) three-fourths of the time. A system with an algorithm B-type sort will produce %w(x y) three-fourths of the time. Both systems are biased, and only algorithm B-type systems have a stable sort.
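The same analysis, run as a simulation of the two hypothetical tie-handling rules above (a sketch, not anyone's real sort):

# Each lambda takes the list and its random keys and returns the "sorted" list.
algo_a = lambda { |list, keys| keys[0] >= keys[1] ? list.reverse : list }   # ALGORITHM A
algo_b = lambda { |list, keys| keys[0] >  keys[1] ? list.reverse : list }   # ALGORITHM B

{ "A" => algo_a, "B" => algo_b }.each do |name, algo|
  counts = Hash.new(0)
  40_000.times do
    keys = [rand(2), rand(2)]
    counts[algo.call(%w(x y), keys)] += 1
  end
  puts "algorithm #{name}: #{counts.inspect}"   # one ordering shows up ~3/4 of the time
end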

Note that if we threw out and re-shuffled all the cases where there's any key collision, we'd have an unbiased shuffle regardless of sorting algorithms. We'd also have to reshuffle a two-element list an average of two times each time we wanted a proper shuffle, a three-element list 4.5 times on average, a four-element list 10.67 times, and an n-element list n**n / n! times. Not something you want to do.
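For the record, a tiny sketch of how fast that expected reshuffle count n**n / n! grows:

def factorial(n)
  (1..n).inject(1, :*)
end

(2..8).each do |n|
  printf("n=%d  expected reshuffles: %.2f\n", n, n**n / factorial(n).to_f)
end
# n=2 -> 2.00, n=3 -> 4.50, n=4 -> 10.67, ... n=8 -> 416.10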

Makes me wonder what the difference would be if they were in C.

Ew. I meant "efficient". In any case, it's nice to see results that agree with the theory. I had suspected that the sort_by method might prove faster because a large proportion of its computation is done in C (the actual sort) as compared to the all-Ruby F-Y shuffle.

I gotta get a ruby translation of my sig.