Quoting Matthew S. [email protected]:
> You don’t provide anything to back up this assertion. What makes this
> implementation unbiased? What makes the other implementation biased?
> What’s the crucial difference between the two?
I’m sorry; I thought it would be obvious to anyone who had read the Haskell
text. In any case, the problem is the likelihood of collisions. When you use
just {rand}, the chance of collisions is very low, on the order of 2**(-N),
where N is the number of bits in the random number implementation. When you
use {rand(a.size)}, the chance of collision is comparatively high.
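To put a number on that (my own illustration, not from the original post):
with keys drawn via rand(n) for an n-element array, the chance that all n
keys come out distinct is n!/n**n, so the collision chance is 1 - n!/n**n.

```ruby
# Chance that at least two of n keys drawn with rand(n) collide.
# All-distinct probability is n!/n**n, computed incrementally to
# avoid overflow.
def collision_chance(n)
  all_distinct = (1..n).reduce(1.0) { |acc, k| acc * k / n }  # n!/n**n
  1.0 - all_distinct
end

puts collision_chance(2)  # 0.5
puts collision_chance(3)  # ~0.778 -- collisions are the common case
```

So for the three-element example below, more than three-quarters of the
shuffles involve at least one key collision.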
When keys collide, the resulting order depends on the implementation of the
sorting algorithm; note that it doesn’t matter whether or not the sort is a
stable sort - you get whatever order the sort function ends up with when
given identical keys. (It appears that ruby’s sort_by is stable, but that’s
irrelevant; you could replace “sort_by” with “reverse.sort_by” and you’d
still get bias, just bias in a different way.)
When keys don’t collide, the sorting algorithm is irrelevant (well, so long
as the algorithm works at all); all sorting algorithms will end up putting
the list in the same order given that collection of keys.
Now, some empirical numbers as evidence. First let’s try the algorithm I
claim is unbiased. Let’s see what the first number is when shuffling (1..3):

irb(main):024:0> a = Hash.new(0); 99999.times { a[(1..3).sort_by{rand}[0]] += 1 }; a
=> {1=>33348, 2=>33393, 3=>33258}
Note that each value has approximately the same likelihood, as we would
expect. I suppose we could go and do a chi-squared test on the results, but
I’m willing to just eyeball it and say that that’s a good result. I’ll note
that the average initial value is 1.99909999099991, which is really close
to 2.
Now let’s try the algorithm which the Haskell paper was talking about, and
which I said was biased:

irb(main):028:0> a = Hash.new(0); 99999.times { a[(1..3).sort_by{rand(3)}[0]] += 1 }; a
=> {1=>52017, 2=>18490, 3=>29492}
Note how comparatively unlikely it is to get a 2 first. The average initial
value is around 1.77.

Note that this bias lessens as the array (and therefore, the pool of keys)
grows larger. The Haskell paper, when demonstrating bias, used an array of
size two.
> For sorting purposes, though, the particular range of the keys is
> irrelevant; all that matters is that the keys are comparable (you could
> use strings if you felt like it, too).
Except, as I’ve pointed out, no. It’s all about how likely you are to get
key collisions. With a limited integer range to choose from, the chance is
comparatively high.
> Regarding whether or not bias actually exists, I didn’t say that the
> sorting implementation is definitely going to be biased, merely that
> implementations which rely on a sort are potentially biased depending on
> the underlying sorting algorithm.
Actually, the sort_by shuffling algorithm with a high chance of key
collision is biased regardless of the details of the underlying sorting
algorithm. Let’s take a two-element list, as the Haskell paper did, and
consider two possible sorting algorithms for sorting two-element lists:

ALGORITHM A: if a[0].key >= a[1].key then a.reverse else a end
ALGORITHM B: if a[0].key > a[1].key then a.reverse else a end
In fact, every sorting algorithm is going to be equivalent to one of these
when applied to a two-element list (because, given equal keys, there are
only two choices for what to do with the list).
Now, let’s look at what happens when we use %w(x y).sort_by{rand(2)}:

1/4 of the time: the keys chosen are [0,1] - both algorithms produce %w(x y)
1/4 of the time: the keys chosen are [1,0] - both algorithms produce %w(y x)
1/2 of the time: the keys chosen are the same:
  Algorithm A produces %w(y x)
  Algorithm B produces %w(x y)
That is, a system with an algorithm A-type sort will produce %w(y x)
three-fourths of the time. A system with an algorithm B-type sort will
produce %w(x y) three-fourths of the time. Both systems are biased, and
only algorithm B-type systems have a stable sort.
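That 3/4 figure is easy to check by simulation. Here’s a small sketch
(mine, not the poster’s code) of the two fixed sorts above, shuffling
%w(x y) with keys drawn via rand(2):

```ruby
# Shuffle a two-element list by attaching rand(2) keys and applying the
# given two-element "sort" (a lambda over an array of [key, elem] pairs).
def shuffle2(list, sorter)
  keyed = list.map { |e| [rand(2), e] }
  sorter.call(keyed).map { |_, e| e }
end

alg_a = ->(a) { a[0][0] >= a[1][0] ? a.reverse : a }  # reverses on key ties
alg_b = ->(a) { a[0][0] >  a[1][0] ? a.reverse : a }  # keeps order on key ties

srand(42)
counts_a = Hash.new(0)
counts_b = Hash.new(0)
20_000.times { counts_a[shuffle2(%w(x y), alg_a).join] += 1 }
20_000.times { counts_b[shuffle2(%w(x y), alg_b).join] += 1 }
p counts_a  # "yx" about three-fourths of the time
p counts_b  # "xy" about three-fourths of the time
```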
Note that if we threw out and re-shuffled all the cases where there’s any
key collision, we’d have an unbiased shuffle regardless of sorting
algorithms. We’d also have to reshuffle a two-element list an average of
two times each time we wanted a proper shuffle, a three-element list 4.5
times on average, a four-element list 10.67 times, and an n-element list
n**n / n! times. Not something you want to do.
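For concreteness, here’s a sketch (my illustration) of that
redraw-on-collision idea: draw a full set of keys, throw the whole set away
if any two keys collide, and only sort once every key is distinct. Unbiased,
but with the n**n / n! expected redraw cost noted above:

```ruby
# Shuffle by sorting on rand(n) keys, redrawing the entire key set
# whenever any two keys collide. Distinct keys make the sort outcome
# independent of the sorting algorithm's tie-breaking behavior.
def collision_free_shuffle(list)
  n = list.size
  begin
    keys = Array.new(n) { rand(n) }
  end until keys.uniq.size == n
  list.zip(keys).sort_by { |_, k| k }.map(&:first)
end

srand(1)
p collision_free_shuffle((1..5).to_a)  # some permutation of 1..5
```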
> difference would be if they were in C.
Ew. I meant “efficient”. In any case, it’s nice to see results that agree
with the theory. I had suspected that the sort_by method might prove faster
because a large proportion of its computation is done in C (the actual
sort) as compared to the all-ruby F-Y shuffle.
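For reference, a plain-Ruby Fisher-Yates shuffle of the kind being compared
against would look something like this (a sketch; the exact version
benchmarked isn’t shown in the thread):

```ruby
# Classic Fisher-Yates: walk from the end, swapping each position with a
# uniformly chosen position at or before it. O(n), unbiased.
def fisher_yates_shuffle(a)
  a = a.dup                  # don't mutate the caller's array
  (a.size - 1).downto(1) do |i|
    j = rand(i + 1)          # index in the not-yet-fixed prefix, inclusive
    a[i], a[j] = a[j], a[i]
  end
  a
end

p fisher_yates_shuffle((1..10).to_a)  # some permutation of 1..10
```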
I gotta get a ruby translation of my sig.