Preferable Pairs (#165)

Apologies for being a little late today… Took a little bit to
understand
some of what was going on in the solutions for the Preferable Pairs
quiz,
and I’m not certain I explained it well enough, but… here it is.

This “Preferable Pairs” quiz is, as was noted, very similar to the
Stable
Roommates
problem, though the goal as originally stated is slightly
different from the typical presentation of the Stable Roommates problem.
Still, the minor goal difference does not change that this is an
NP-complete problem, and as such, large input sets would require
some
sort of approximation, heuristics or other techniques to keep the run
time
reasonable.

Fortunately, I had no intention of testing the solutions provided on
large
sets. My test involved a set of 20 people, to be grouped into 10 pairs
according to their preferences. Here are the results of my test on the
solutions provided.

Solution          Time      Score
Andrea F.      0.159     438
Dustin Barker     0.020     589
Eric Ivancich     4.671     311
Steven Hahn       -DNF-
Eric M.      0.211     311
Matthew M.      0.022     589
Thomas ML         0.114     311

I neglected to post my own solution, but it’s nearly identical to
Dustin’s
both in algorithm, results and performance (but mine is much uglier).
Also,
apologies to Steven… I tried to wait, really I did, but it just kept
going, and going…

Andrea, Dustin and myself made straightforward attempts that were fast
but
not optimal. While the numbers presented above don’t show a huge
disparity
in performance, I suspect that would be a different story if we were
attempting to solve larger datasets, of thousands of people as opposed
to
twenty. For the tested dataset, I believe there might be a few grumbling
people, forced to play with someone they didn’t like very much, but the
tournament would go on.

Eric Ivancich provided a genetic algorithm that is notably slower (at
this
sample size), and isn’t guaranteed to get the most optimal answer, but
it
happened to do so here. It would be interesting to compare its
performance
against the optimal solution algorithms for larger data sets.

Eric M. and Thomas ML provided algorithms that find the optimal
solution. There is a fair bit of setup code in both to get the data into
a
convenient form for the optimization algorithm to work. I’m going to
skip
over that and jump right into Thomas’ optimize method. As input to
optimize, the pairings argument looks like this for the sample data:

[
  [["David", "Helen"], 0],
  [["David", "Vicki"], 1],
  [["Helen", "Vicki"], 2],
  [["Joseph", "Vicki"], 4],
  [["Helen", "Joseph"], 5],
  [["David", "Joseph"], 8]
]

Basically, the input is an ordered list of all pairs with the
corresponding
score (as calculated according to the metric as described by the quiz).
It
is a combined score, and reflects the preferences of both people in the
pair. For example, the pair of Helen and Joseph has a score of 5,
because
Helen scores Joseph as 4, while Joseph scores Helen as 1: so, 4 + 1 ==
5.

Before we conquer the whole of the algorithm, let’s look at a simplified
version of optimize to get a feel for the general structure. The input
here (for argument pairings) is similar to the above, but without the
scores.

def optimize(pairings, names, pos, posmax)
    bestpairs = nil
    while pos < posmax
        pair = pairings[pos]
        pos += 1
        if names & pair == pair
            names1 = names - pair
            if names1.size < 2
                bestpairs = [pair]
                bestpairs << names1 unless names1.empty?
                return bestpairs
            elsif (rv = optimize(pairings, names1, pos, posmax)
                bestpairs = rv
                bestpairs << pair
            end
        end
    end
    return bestpairs
end

bestpairs starts off empty, but by the end should be an array of pairs
chosen as the solution. Each iteration of the loop grabs the next pair
from
pairings and checks to see if it is still usable by set intersection:

        if names & pair == pair

names will contain the names of people still available; that is, those
that have not already been chosen previously. The & operator treats
the
two arrays as sets and performs set intersections. If the result of
intersection is equal to one of the arguments, that argument must be a
subset of the other, and in this context, is safe to choose for the
solution.

When a pair is chosen, those names are removed by Array difference (not
quite the same as set difference, but close enough for this case). So it
is
with:

          names1 = names - pair

That we remove the chosen pair from the list of remaining people. If
that is
the last possible pair to be made (there are fewer than 2 names
remaining),
we finish up by setting and returning the bestpairs array

                bestpairs = [pair]
                bestpairs << names1 unless names1.empty?
                return bestpairs

In the case that there is at least one more pair remaining, we continue
onto
the recursive part of this solution:

            elsif (rv = optimize(pairings, names1, pos, posmax)
                bestpairs = rv
                bestpairs << pair

We know that bestpairs, as returned by the recursive call to
optimize,
contains the best pairs from the subset of names1, which we got above
after removing pair. So we make sure to concatenate pair onto
bestpairs before it is returned to the caller.

As simplified, this amounts to a greedy algorithm, since the pairs were
initially sorted according to score, and the simplified algorithm, at
each
level, simply takes the first pair possible.

Now let’s go back to the complete optimize method.

def optimize(pairings, names, pos, posmax, maxweight=nil)
    bestpairs = nil
    maxweight ||= pairings.size ** 2 + 1
    while pos < posmax
        pair, weight1 = pairings[pos]
        break unless weight1 * (names.size / 2).floor < maxweight
        pos += 1
        if names & pair == pair
            names1 = names - pair
            if names1.size < 2
                bestpairs = [pair]
                bestpairs << names1 unless names1.empty?
                return [weight1, bestpairs]
            elsif (rv = optimize(pairings, names1, pos, posmax,

maxweight - weight1))
maxweight, bestpairs = rv
maxweight += weight1
bestpairs << pair
end
end
end
return bestpairs && [maxweight, bestpairs]
end

The algorithm structure is basically the same: recursive, greedily
selecting
pairs and removing those names from the set of names available… The
major
difference here is the inclusion of the weights (i.e. scores) and
possible
rejection of pairs with respect to those weights.

As recursive calls to optimize are made, the maxweight value is
updated
via this code:

                maxweight, bestpairs = rv
                maxweight += weight1

and checked via this code:

        break unless weight1 * (names.size / 2).floor < maxweight

So a pair may be skipped if it’s weight (scaled by the number of names)
exceeds the current maxweight, determined the the current set of
choices.

When a possible solution is found, the maxweight variable represents
the
total score for that solution. But the algorithm does not stop
immediately;
it keeps checking pairs and solutions of pairs, rejecting those
solutions
and partial solutions whose total score (i.e. maxweight) would exceed
that
previously known maxweight.

In the end, the solution reported is the collection of pairs with the
lowest
total score, and is returned via the original invocation of optimize.

Matthew M. [email protected]

On 6/12/08, Matthew M. [email protected] wrote:

Solution          Time      Score
Andrea F.      0.159     438
Dustin Barker     0.020     589
Eric Ivancich     4.671     311
Steven Hahn       -DNF-
Eric M.      0.211     311
Matthew M.      0.022     589
Thomas ML         0.114     311

Did anybody try to analyze the complexity of the final solution from
Thomas?

The problem with analyzing his solution is this line:

        break unless weight1 * (names.size / 2).floor < maxweight

This causes some dependency on the data for run-time. It seems like
this
line should only be needed for improving runtime, but commenting it out
gives non-optimal results. Don’t get it. If you ignore this line, the
analysis is easier. There are O(n2) pairs and each recursive call
decrements n by 2. When you get to a single pair, it is O(1). So,
O(n
2*(n-2)2(n-4)2…*22*1) or O(n!**2). It is probably more,
but
most of this is hidden in the C code (for these size inputs, we can
assume
every C method call is O(1)).

My solution is easy to analyze because it doesn’t have any data
dependent
variation. It should need to fill a O(2n) memoization cache which
takes
O(n) for each entry - so O(n*2
n).

From the random runs I’ve done, the ThomML algorithm doesn’t look like
O(n!2) at all. The randomized complexity looks similar to my
O(n*2
n)
algorithm based on random input experiments.

I was wondering if a worst case scenario could make his algorithm behave
more like O(n!**2). I tried two extreme scenarios: a) love-hate: the
more
one player loves another, the more the second player hates the first, b)
popularity: each player has a popularity ranking and every player will
prefer in the popularity order.

I found some interesting results with these extremes. The love-hate
case
allowed the ThomML algorithm to go up to hundreds of players (400
players in
30seconds). Probably O(n2) or O(n3) for this scenario. The
popularity
case was quite bad though - 14 players: 19s, 12 players: 1.8s, 10
players:
0.2s. This looks to be O(n!).

My algorithm doesn’t have these extreme variations. For all cases, 24
players: 3.8s, 26 players: 10s, 28 players: 30s.

The script I used to generate input for these cases in below. The first
arg
is the # players. Here’s what you can use for the second arg:
nothing/default/-2: the popularity case, -1: love-hate case, 0: random
w/o
seeding, >0: random w/ seeding.

Eric

#!/usr/bin/env ruby

NAMES = %w(Adam Anna Bob Betty Chuck David Daria Evan Ellen Fred Faye
Greg Gail Hank Helen Irene John Janet Laura Matt Maria Nadine Oleg
Olivia Peter Pamela Rick Rosa Steve Susan Theo Tracy Vicki Walter)

def sample(n, seed)
srand(seed) if seed>0
names = NAMES.clone
i = 1
while names.size<n
i += 1
NAMES.each { |name| names << “#{name}#{i}” }
end
names = names[0, n].sort!
prefs = Hash.new { |h,k| h[k] = [] }
names.each { |name1|
i = 0
names2 = (seed>=0) ? names.sort_by { rand } : names
names2.each { |name2|
next if name2==name1
next if prefs[name1].include?(name2)
while prefs[name1][i]
i += 1
end
prefs[name1][i] = name2
if seed==-1
!prefs[name2][names.size-2-i] or raise(“#{name2} #{i}
#{prefs.inspect}”)
prefs[name2][names.size-2-i] = name1
end
}
}
names.each { |name|
puts(“#{name}: #{prefs[name].join(’ ')}”)
}
end

sample(Integer(ARGV[0] || 10), Integer(ARGV[1]||-2))

On 6/13/08, ThoML [email protected] wrote:

players:
memory. I experimented with memoizing subresults but came to the
conclusion that it isn’t worth the trouble for small random data sets
because the cached data is hardly ever used. I assume memoization
could help to avoid those variations you pointed out.

I think to reasonably memoize in this problem and keep the memory low,
you
need to represent each player as a bit. With players<=31 (or <=63 for
64-bit machine), you can represent any set of players as a Fixnum. A
keys
in the memoize cache can a set of players (Fixnum) and the value can be
the
optimal cost (also Fixnum). With 28 players (30 seconds), my attempt
only
uses 26MB.

Here are a few ideas to improve what you have further:

  1. Represent each player as a bit. Translate going in and out of the
    algorithm. This won’t change the complexity, but should give a speedup
    of
    several times because you are dealing with much simpler data (for the
    computer).
  2. Memoize costs. This should change the worst case complexity from
    factorial (squared?) to exponential (base 2).
  3. Try a more accurate lower bound on the total cost. Instead of
    cost*n/2,
    you might try half of the sum of the next n costs. Another possibility
    may
    be to take half of the sum of the lowest remaining costs for each
    player.
    These might not yield anything because the your current estimation is
    O(1)
    and these are O(n).
  4. Instead of sorting by simply pair cost, you might consider sorting by
    a
    total estimated cost when a pair is chosen. Not sure this will help
    since
    it may be valid for the first pair chosen, but not necessarily good for
    ones
    after that.
  5. Instead of maintaining the current best pair sequence when finding
    the
    lowest cost, this could be done in a separate lower complexity phase.
    With
    memoization this works well since you can easily use the costs in the
    cache
    to “follow the crumbs”. Without, you may also be able to come up with
    some
    marker scheme to retrace your steps.

The general scheme you are using is quite similar to the A* search
algorithm. I also tried a more classic A* search on this problem, but
wasn’t able to improve from what I had. I think all of the extra
overhead
needed wasn’t paying off.

Eric

Hi,

I found some interesting results with these extremes. The love-hate case
allowed the ThomML algorithm to go up to hundreds of players (400 players in
30seconds). Probably O(n2) or O(n3) for this scenario. The popularity
case was quite bad though - 14 players: 19s, 12 players: 1.8s, 10 players:
0.2s. This looks to be O(n!).

Thanks for the script for generating specific types of preferences.
The performance depends on the adequacy of the underlying greedy
algorithm for the data set. In the “love-hate” case, a simple greedy
algorithm already returns optimal pairings. In the “popularity” case,
such an approach doesn’t work well.

But the solution submitted is really simple and doesn’t use too much
memory. I experimented with memoizing subresults but came to the
conclusion that it isn’t worth the trouble for small random data sets
because the cached data is hardly ever used. I assume memoization
could help to avoid those variations you pointed out.

Regards,
Thomas.