Is it more convenient to have a random_each in ruby stdlib?

uncutstone · June 19, 2006, 9:53am

I found it may be a common case that we need randomly enumerate elements
in a array or some other collections.
Please take a look at the very simple snippet of code below:

class Array
def random_each
a = self.dup
n = len = self.length
len.times do
index = rand(n)
yield a[index]
a.delete_at(index)
n -= 1
end
end
end

a = [1,2,3,4,5,6,7]
a.random_each {|e| print e , ’ '}

result:
7 1 6 2 4 3 5

I don’t whether it is proper to put this kind of simple but useful
utility in the standard library of ruby.

Thanks.

uncutstone

uncutstone · June 19, 2006, 10:09am

def random_each

uncutstone

I think it’s a little bit to simple to be put in the standard lib.

a.sort_by{rand}.each {|e| print e , ’ '}

cheers

Simon

uncutstone · June 19, 2006, 10:56am

Kroeger, Simon (ext) wrote:

def random_each

uncutstone

I think it’s a little bit to simple to be put in the standard lib.

a.sort_by{rand}.each {|e| print e , ’ '}

It is nice to see you make it so simple!

very thanks for the reply.

uncutstone

uncutstone · June 19, 2006, 12:20pm

Alex Y. wrote:

I think it’s a little bit to simple to be put in the standard lib.

a.sort_by{rand}.each {|e| print e , ’ '}

For large collections, it’d be much nicer not to have to either
duplicate or sort - isn’t this something that quasirandom sequences are
good for? Is there some GSL-guru out there who can comment?

Perhaps…

class Array
def random_each
((0…size).to_a.sort_by{rand}).each { |i|
yield self[i]
}
end
end

T.

uncutstone · June 19, 2006, 11:52am

Kroeger, Simon (ext) wrote:

From: uncutstone wu
Sent: Monday, June 19, 2006 9:54 AM

class Array
def random_each
a = self.dup

I think it’s a little bit to simple to be put in the standard lib.

a.sort_by{rand}.each {|e| print e , ’ '}

For large collections, it’d be much nicer not to have to either
duplicate or sort - isn’t this something that quasirandom sequences are
good for? Is there some GSL-guru out there who can comment?

uncutstone · June 19, 2006, 2:30pm

On 6/19/06, uncutstone wu [email protected] wrote:

It is nice to see you make it so simple!
Surely this would break some sort algorithms, since the same element
can return different sort priorities (using rand)… I assume it
doesn’t break Ruby’s sort algorithm?

uncutstone · June 19, 2006, 12:32pm

[email protected] wrote:

Perhaps…

class Array
def random_each
((0…size).to_a.sort_by{rand}).each { |i|
yield self[i]
}
end
end

On second though, that still sorts an array of the same size. Hmm…

T.

uncutstone · June 19, 2006, 3:28pm

Leslie V. wrote:

It is nice to see you make it so simple!

Surely this would break some sort algorithms, since the same element
can return different sort priorities (using rand)… I assume it
doesn’t break Ruby’s sort algorithm?

sort_by generates the block value once for each collection element then
uses that value for a standard sort, so it’s not a problem.

uncutstone · June 19, 2006, 3:41pm

Matthew S. wrote:

For large collections, it’d be much nicer not to have to either
duplicate or sort - isn’t this something that quasirandom sequences
are good for? Is there some GSL-guru out there who can comment?

A quasi-random sequence is a theoretically better flavour of random than
a pseudo-random sequence in a lot of cases, but there’s no direct
application to this problem (apart from being used in place of
Kernel#rand). It’d be overkill for all but the most specialised uses,
I’d say.
I was thinking of them more in terms of a random walk across a finite
set of elements that’s guaranteed to hit each element once and only
once. You’ve given a good explanation of the shuffle principle below,
and that’s great for when you want to actually reorder things, but what
if you want to walk across the Enumerable without modifying it? Did I
really mean “perfect hash functions” when I wrote “quasirandom
sequences?”

uncutstone · June 19, 2006, 3:31pm

On Jun 19, 2006, at 9:08, Kroeger, Simon (ext) wrote:

I think it’s a little bit to simple to be put in the standard lib.

a.sort_by{rand}.each {|e| print e , ’ '}

But it turns out to not be that simple. This can give a biased sort
depending on the underlying sort algorithm, and it’s not as efficient
as it could be, as Alex observes:

On Jun 19, 2006, at 10:52, Alex Y. wrote:

For large collections, it’d be much nicer not to have to either
duplicate or sort - isn’t this something that quasirandom sequences
are good for? Is there some GSL-guru out there who can comment?

A quasi-random sequence is a theoretically better flavour of random
than a pseudo-random sequence in a lot of cases, but there’s no
direct application to this problem (apart from being used in place of
Kernel#rand). It’d be overkill for all but the most specialised
uses, I’d say.

On Jun 19, 2006, at 11:29, [email protected] wrote:

On second thought, that still sorts an array of the same size. Hmm…
What we’re looking for is a shuffle; a random selection of elements
from an array might not include all the elements from the source,
whereas a shuffle is a permutation of the source.

Like a lot of simple-sounding problems, this one’s already been
solved, see:

The linked Haskell implementation also gives a very thorough
discussion, including a demonstration of how a stable sorting
algorithm will yield biased results. The imperative solution given
in that discussion is identical to the Fisher-Yates shuffle on the
NIST site, which is what I’ll use here:

for i: 0…n, exchange each element e_i with another element
e_i…e_n.

Which is the best possible case of O(n), compared to O(n lg n) when
using #sort_by{rand}. This is proven to be unbiased (see the
reference to Knuth given on the NIST site).

Intuitively, the best place for this seems like Enumerable, since
anything that can be sorted can be shuffled. Is this the sort of
thing that needs an RCR? Anyway, I doubt I’d use it myself, but I
can see the following arguments for its inclusion:

It’s easy to do wrong:
- inefficiently and possibly biased using #sort_by.
- biased under any other sort of element exchange:
  e.g., exchanging e_i with any element in the
  array instead of just the unshuffled elements
  leads to a biased shuffle.
It seems fairly generic.
People migrating from other languages may expect
it (e.g., PHP, Java).

On the other hand, people who REALLY care about unbiased shuffles
will already know how to do them.

Quick Ruby implementation for Array: In-place shuffling is the
simplest case. We can use it to create a more Ruby-esque shuffle by
using an approach similar to trans’ one given above: shuffle an array
of indices and use that as a mapping between the original and
shuffled arrays.

class Array
def shuffle!
self.each_index { |i|
r = rand(size - i) + i
tmp = self[i]
self[i] = self[r]
self[r] = tmp
}
end

this would work if dup produced a deep copy.

def dup_shuffle
self.dup.shuffle!
end

use an in-place shuffle of an array of indices

def shuffle
res = []
indices = (0…size).to_a.shuffle!
indices.each_with_index { |x, i|
block_given? ? yield(self[x]) : res[i] = self[x]
}
block_given? ? self : res
end
end

matthew smillie.

uncutstone · June 19, 2006, 3:47pm

Matthew S. wrote:

On the other hand, people who REALLY care about unbiased shuffles will
already know how to do them.

The problem with this argument is that people might not know when they
need them, and their not being unbiased when you’ve implicitly made that
assumption sounds like the sort of thing that would generate really
subtle problems. Correct me if I’m wrong…

uncutstone · June 19, 2006, 5:22pm

On Jun 19, 2006, at 14:46, Alex Y. wrote:

Matthew S. wrote:

On the other hand, people who REALLY care about unbiased shuffles
will already know how to do them.
The problem with this argument is that people might not know when
they need them, and their not being unbiased when you’ve implicitly
made that assumption sounds like the sort of thing that would
generate really subtle problems. Correct me if I’m wrong…

No, you’re right. The catch, as usual, comes down to something being
“good enough in practice”.

It’s a big circular argument: if you don’t know that you need a
perfect shuffle, then you’re unlikely to actually need one (a biased
one is often good enough). If you do know you need a perfect
shuffle, you’re likely to code your own rather than relying on
library code, since it’s simple enough to do so.

I don’t actually have much preference either way for including any
sort of shuffle in Array or Enumerable or wherever, I was just
presenting pros/cons I could think of.

You’ve given a good explanation of the shuffle principle below, and
that’s great for when you want to actually reorder things, but what
if you want to walk across the Enumerable without modifying it?
Did I really mean “perfect hash functions” when I wrote
“quasirandom sequences?”

“Perfect hash functions” makes more sense to me in this context,
anyway - I’m not aware that quasi-random sequences make any once-and-
only-once guarantee the same way a perfect hash function does, they
just guarantee a uniform distribution within a certain tolerance/
discrepancy. I might be missing something subtle. Or something
obvious, for that matter; I wouldn’t know, since I’m the one missing it.

Anyway, that’s the approach I used in Array#shuffle. The shuffled
array of indices defines a perfect hash function for the original
array: each key/index in the original array is mapped to a distinct
integer; they key/index in the shuffled array of indices. Illustration:

original array:
arr = [‘a’, ‘b’, ‘c’, ‘d’]

shuffled array of indices:
[2, 0, 1, 3]

implicit perfect hash function:
0 => 2, 1 => 0, 2 => 1, 3 => 3

application of the function to arr[0…size]:
[‘c’, ‘a’, ‘b’, ‘d’]

You can see in the implementation that providing a block parameter to
Array#shuffle does walk across the entire Array without modifying
it. Without the block parameter, it returns a new, shuffled Array.

matthew smillie.

uncutstone · June 19, 2006, 5:25pm

Hi Matthew,

I like these implementations, though I have some comments and
simplifications.

Matthew S. wrote:

def shuffle
res = []
indices = (0…size).to_a.shuffle!
indices.each_with_index { |x, i|
block_given? ? yield(self[x]) : res[i] = self[x]
}
block_given? ? self : res
end
end

I’m not too keen on #shuffle having a dual function as both a shuffler
and an iterator – I’d be more comfortable using `ary.shuffle.each’.
That would also allow a major simplification of the implementation:

class Array
def shuffle
values_at(*(0…size-1).to_a.shuffle!)
end
end

Cheers,
Daniel

uncutstone · June 19, 2006, 6:38pm

I’m not too keen on #shuffle having a dual function as both a
shuffler and an iterator – I’d be more comfortable using
`ary.shuffle.each’. That would also allow a major simplification of
the implementation:

class Array
def shuffle
values_at(*(0…size-1).to_a.shuffle!)
end
end

I agree with you about ‘arr.shuffle.each’ being better semantically,
but I was worried about efficiency; since the thread started out with
concerns about shuffling really big arrays, I wanted to avoid
creating an intermediate copy of the array. I didn’t really think
about it that hard, though, because thinking about it now, I realise
an intermediate copy isn’t all that bad (being just references), so
I’m not sure the savings would matter all that much.

Anyway, Array’s just a special case: I’d say that it belongs in
Enumerable along with sort, for which I’d grab your implementation.

module Enumerable
def shuffle
entries.values_at(*(0…entries.size).entries.shuffle!)
end
end

Though this still relies on Array#shuffle!, which I can’t say is a
good or bad thing. Bit of a moot question unless someone does decide
to put it into core or stdlib, though.

matthew smillie.

uncutstone · June 19, 2006, 7:05pm

Matthew S. wrote:

I agree with you about ‘arr.shuffle.each’ being better semantically,
but I was worried about efficiency; since the thread started out with
concerns about shuffling really big arrays, I wanted to avoid
creating an intermediate copy of the array. I didn’t really think
about it that hard, though, because thinking about it now, I realise
an intermediate copy isn’t all that bad (being just references), so
I’m not sure the savings would matter all that much.

Yes, since array just stores object references, (0…size).to_a is as
large as aArray.dup, they cost same time and space.

Best Regards.

uncutstone

uncutstone · June 19, 2006, 7:22pm

All of a sudden I’m beginning to get major performance problems when
calling #shuffle!. Am I the only one who’s experiencing this?

It does so on both 1.8 and 1.9.

Hmm, I’d better investigate further.

uncutstone · June 19, 2006, 7:06pm

Matthew S. wrote:

Anyway, Array’s just a special case: I’d say that it belongs in
Enumerable along with sort, for which I’d grab your implementation.

module Enumerable
def shuffle
entries.values_at(*(0…entries.size).entries.shuffle!)
end
end

Yup, Enumerable’s the place it should be.

Though this still relies on Array#shuffle!, which I can’t say is a good
or bad thing.

Since #shuffle! would alter the receiver, it shouldn’t be defined in
Enumerable. Other ordered collections could implement their own
versions.

I was wondering, is Ruby’s internal variable “switching” faster than
using a temperary variable? #shuffle! can be written like this, but I
don’t know if it’s faster.

def shuffle!
each_index do |i|
r = rand(size - i) + i
self[i], self[r] = self[r], self[i]
end
end

Cheers,
Daniel

uncutstone · June 19, 2006, 7:25pm

Daniel S. wrote:

All of a sudden I’m beginning to get major performance problems when
calling #shuffle!. Am I the only one who’s experiencing this?

It does so on both 1.8 and 1.9.

Hmm, I’d better investigate further.

Never mind, it seemed to go away when I moved the code into its own
file. Weird…

uncutstone · June 19, 2006, 8:10pm

Though this still relies on Array#shuffle!, which I can’t say is a
good or bad thing.

Since #shuffle! would alter the receiver, it shouldn’t be defined
in Enumerable. Other ordered collections could implement their own
versions.

I agree with you there, but it’s not quite what I meant. More
clearly, Enumerable#entries returns an array with no guarantees about
the ordering. Enumerable#sort and #shuffle return similar arrays,
but include guarantees about the ordering. Enumerable#shuffle uses
Array#shuffle! internally to produce the final array, which just
doesn’t quite click for me. It’s just an aesthetic disagreement, and
I can’t quite put my finger on anything concrete.

I was wondering, is Ruby’s internal variable “switching” faster
than using a temperary variable? #shuffle! can be written like
this, but I don’t know if it’s faster.

My hunch is that even if we could reliably measure a difference in a
benchmark, it would disappear into the static in a practical system.
At least, I’d hope that’s the case - one being significantly faster
than the other would set off some alarm bells for me about Ruby’s
implementation.

matthew smillie.

uncutstone · June 19, 2006, 9:13pm

Matthew S. wrote:

Since #shuffle! would alter the receiver, it shouldn’t be defined in
Enumerable. Other ordered collections could implement their own versions.
I agree with you there, but it’s not quite what I meant. More clearly,
Enumerable#entries returns an array with no guarantees about the
ordering. Enumerable#sort and #shuffle return similar arrays, but
include guarantees about the ordering. Enumerable#shuffle uses
Array#shuffle! internally to produce the final array, which just doesn’t
quite click for me. It’s just an aesthetic disagreement, and I can’t
quite put my finger on anything concrete.

I can see what you mean – you’d rather have #shuffle! depend on
#shuffle than the other way round? I.e.

class Array
def shuffle!
replace(shuffle)
end
end

I guess that could be possible.

module Enumerable
def shuffle
ary = entries
ary.each_index do |i|
r = rand(ary.size - 1) + i
ary[i], ary[r] = ary[r], ary[i]
end
end
end

Of course, it would not be much different than `entries.shuffle!’

Cheers,
Daniel