1. How big is the file?
2. Do you want sampling without replacement (shuffle the original file,
   keeping the lines intact) or sampling with replacement (n lines randomly
   chosen from the file)?
I’m going to assume that 1 is “too big for a considerate programmer to
read all into memory on a shared machine” and 2 is “without replacement
(shuffling)”. I’m also going to assume that you’re on some form of UNIX
machine that has the “sort” verb. On Windows, that could mean Cygwin.
So what you want to do is make a copy of the file with random numbers
tacked on to the front of each line. Then sort the tagged copy
numerically using the external “sort” verb, and remove the tags from the
sorted copy. You’ll be doing everything in Ruby except the sort.
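Here is a minimal sketch of that tag/sort/untag pipeline, not a tuned
implementation. The method name shuffle_big_file, the tab separator, and
the 62-bit integer tags are my own choices for illustration. It assumes
a “sort” on your PATH that accepts -n and -o, and that it is fine for
the output to end every line with a newline.

```ruby
require "tempfile"

# Hypothetical helper: shuffle the lines of in_path into out_path
# without ever holding the whole file in memory.
def shuffle_big_file(in_path, out_path)
  tagged = Tempfile.new("tagged")
  sorted = Tempfile.new("sorted")
  begin
    # Step 1: copy the file, prefixing each line with a random integer tag.
    # Integers avoid the exponent notation that floats can print with.
    File.foreach(in_path) do |line|
      tagged.write("#{rand(1 << 62)}\t#{line.chomp}\n")
    end
    tagged.close

    # Step 2: the external sort orders the tagged copy numerically.
    ok = system({ "LC_ALL" => "C" },
                "sort", "-n", "-o", sorted.path, tagged.path)
    raise "external sort failed" unless ok

    # Step 3: strip the tag (everything up to the first tab) and write out.
    File.open(out_path, "w") do |out|
      File.foreach(sorted.path) do |line|
        out.write(line.split("\t", 2).last)
      end
    end
  ensure
    tagged.close!
    sorted.close!
  end
end
```

You would call it as shuffle_big_file("input.txt", "shuffled.txt"). Only
one line at a time lives in Ruby; the heavy lifting is left to “sort”,
which spills to disk on its own when the data is too large for memory.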
If the answer to 1 is “small enough to fit into memory”, just read the
file into memory, tag the lines with random numbers, and use a Ruby
“sort” to do the sorting, then untag the lines and write out the file.
You’ll be doing everything in Ruby.
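A minimal sketch of the in-memory version, following the same
tag/sort/untag steps. The method name shuffle_small_file is hypothetical,
and it assumes the whole file fits comfortably in memory.

```ruby
# Hypothetical helper: shuffle a file small enough to read into memory.
def shuffle_small_file(in_path, out_path)
  # Read every line, dropping the line terminators.
  lines = File.readlines(in_path).map(&:chomp)

  # Tag each line with a random number, sort on the tag, drop the tag.
  shuffled = lines.map { |line| [rand, line] }
                  .sort_by { |tag, _line| tag }
                  .map { |_tag, line| line }

  # Write the lines back out, one per line.
  File.open(out_path, "w") do |out|
    shuffled.each { |line| out.puts(line) }
  end
end
```

The three chained steps are spelled out to mirror the description. In
practice lines.sort_by { rand } does the tagging and untagging for you,
and on a Ruby that has Array#shuffle, lines.shuffle is shorter still.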
By the way, I do this sort of thing rather often. The files in question
are data files that drive performance test scripts. They’re small (under
65536 lines), so I just read them into Excel, tack on a random column,
sort on the random column, delete the random column, and write the file
back out.
If you want sampling with replacement, the easiest way to do it is
using R. I don’t know how to do it in Ruby or Excel, since I have R.
I think the “too big/shuffled” case would make an interesting Ruby quiz,
if you rule out the external sort verb as “cheating”.
M. Edward (Ed) Borasky