Posix Pangrams (#97)

This problem turns out to be pretty tough. Finding the best pangram for
a given
constraint can be NP-Hard, as explained by Ken B.:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/219003

Given that, we’re just interested in code that gets reasonably close to
ideal
results in a decent amount of time. All of the submissions did that
quite well,
but I want to show off Ezwan Aizat Bin Abdullah F.'s code below. I’ve
Rubyified it just a bit, but the code still works the same. For
reference, it
finds this pangram in under a second on my box:

String: zcat jobs newgrp mv qhold tty uux mkfifo
Panagram? true
Words? 8
Length? 33
Dupes? 7

Let’s start with the easy stuff:

#!/usr/bin/ruby

# removed c99 fort77 m4
words = %w[ admin alias ar asa at awk basename batch bc bg cal cat cd
            cflow chgrp chmod chown cksum cmp comm command compress cp
            crontab csplit ctags cut cxref date dd delta df diff 

dirname
du echo ed env ex expand expr false fc fg file find fold
fuser gencat get getconf getopts grep hash head iconv id
ipcrm ipcs jobs join kill lex link ln locale localedef
logger
logname lp ls mailx make man mesg mkdir mkfifo more mv
newgrp
nice nl nm nohup od paste patch pathchk pax pr printf prs
ps
pwd qalter qdel qhold qmove qmsg qrerun qrls qselect qsig
qstat qsub read renice rm rmdel rmdir sact sccs sed sh
sleep
sort split strings strip stty tabs tail talk tee test time
touch tput tr true tsort tty type ulimit umask unalias
uname
uncompress unexpand unget uniq unlink uucp uudecode
uuencode
uustat uux val vi wait wc what who write xargs yacc zcat ]
words_line = words.join(" ")

# ...

This code begins by filling an Array with the Posix utilities, except
for the
three that contain digits in their names. Everyone dropped these to
keep things
simple. Note that the words are also joined to form one long line of
utility
names.

Next we need to enhance String a bit with some capabilities we will need
to do
our work:

# ...

class String
  def letters(&block)
    scan(/[a-z]/, &block)
  end

  def pangram?
    letters.uniq.length == 26
  end

  def duplicated_letters
    seen = Hash.new { |found, char| found[char] = 1; 0 }
    letters.inject(0) { |sum, l| sum + seen[l] }
  end
end

# ...

We will need easy access to an Array of letters (without the spaces), so
a
simple wrapper is created over scan(). From there we can check pangram
status
by ensuring that we have 26 unique letters. Finally, when printing
statistics
it would be nice to be able to see how many letters are duplicates, so
we add a
method for that too.

Now the code builds up some handy constants in preparation for the
search:

# ...

OCCURRENCE = Hash.new { |counts, char| counts[char] = 0 }
words_line.letters { |char| OCCURRENCE[char] += 1 }

WORDS_LENGTH = words_line.delete(" ").length

# ...

Here we see the OCCURRENCE Hash being filled. It will be the key to the
search
algorithm. The Hash is simply a count of how many times each letter is
present
in all of the names. “e” appears the most at a count of 62 times and
“z” is
present only once.

Another constant is filled with the overall character count.

Here’s a method used to print statistics when we have our answer:

# ...

def stats(line)
  puts "String: #{line}"
  puts "Pangram? #{line.pangram?}"
  puts "Words? #{line.split.length}"
  puts "Length? #{line.delete(' ').length}"
  puts "Dupes? #{line.duplicated_letters}"
end

# ...

I think that one is pretty self documenting.

We’re finally ready for the primary search algorithm and here it comes:

# ...

=begin
 Suitability should be determined by
  * least number of letters takes out the length of the computer
  * no used letters
  * no duplicates
=end
def suitability(value, line)
  amount, used = 0, ""

  value.letters do |char|
    amount += if used.include?(char) || line.include?(char)
      -OCCURRENCE[char]
    else
      WORDS_LENGTH / OCCURRENCE[char]
    end
    used << char
  end

  amount
end

# ...

This method grades the passed value for inclusion in the line. Each
letter is
scored, creating an overall suitability rating for this value.

If the letter is already in the line or the letters we have seen so far
from
this value, the OCCURRENCE value for that letter is tacked onto the
score as a
penalty. This ensures the duplicates, especially of common letters, are
costly
so the code will try to avoid them.

If this is the first time the letter has been encountered, the score is
the
OCCURRENCE for that letter divided into the total letter count. This
makes it
valuable for the code to place letters like the “z” early on, since we
don’t
have any choices on that one and thus will also need to accept the
letters it
comes with.

Here’s the final piece of the puzzle, where we see suitability() put to
work:

# ...

line = ""

until line.pangram?
  words.sort! do |x, y|
    suitability(y, line) <=> suitability(x, line)
  end

  line += "#{words.shift} "
end

stats line

The process is easy to follow here. First an empty line is created.
Then we
loop until the line has become a pangram. At each step, the utility
list is
sorted by suitability() and the best word added to the line. Just
before exit,
the program dumps the statistics as the result.

My thanks to all you pangram hunters. As always you taught me clever
new tricks
in optimization.

Tomorrow we’ll play with the king of pathfinding algorithms…

Ruby Q. [email protected] writes:

My thanks to all you pangram hunters. As always you taught me
clever new tricks in optimization.

For the record, this is how I found the shortest possible solution:

  1. Look at the words. Recognize “zcat” needs to be in there, and
    remove it from the list.

  2. Download dance.w1 from Donald E. Knuth, tangle and compile.

  3. Write a small program prep.rb to preprocess the words for the
    expected input format:

    words = %w{
    admin alias ar asa at awk basename batch bc bg c99 cal cat cd cflow
    chgrp chmod chown cksum cmp comm command compress cp crontab csplit
    ctags cut cxref date dd delta df diff dirname du echo ed env ex
    expand
    expr false fc fg file find fold fort77 fuser gencat get getconf
    getopts
    grep hash head iconv id ipcrm ipcs jobs join kill lex link ln locale
    localedef logger logname lp ls m4 mailx make man mesg mkdir mkfifo
    more
    mv newgrp nice nl nm nohup od paste patch pathchk pax pr printf prs
    ps
    pwd qalter qdel qhold qmove qmsg qrerun qrls qselect qsig qstat qsub
    read renice rm rmdel rmdir sact sccs sed sh sleep sort split strings
    strip stty tabs tail talk tee test time touch tput tr true tsort tty
    type ulimit umask unalias uname uncompress unexpand unget uniq
    unlink
    uucp uudecode uuencode uustat uux val vi wait wc what who write
    xargs
    yacc zcat
    }

    ----------------- required -----------------|-- optional –

    puts “b d e f g h i j k l m n o p q r s u v w x y | z c a t 4 7 9”
    words.each { |w|
    puts w.split(//).join(" ")
    }

  4. Run it like this:

    $ ruby prep.rb | dance verbose

    1:
    j o b s (1 of 2)
    y t t (1 of 3)
    h g r p c (1 of -2)
    e x l (5 of -3)
    w k a (1 of -2)
    i q u n (4 of -2)
    f d (1 of 0)
    m v (2 of 0)
    2:
    j o i n (2 of 2)
    v m (1 of 2)
    k a w (1 of 3)
    h g r p c (1 of 2)
    b q s u (1 of 1)
    f d (1 of 1)
    l e x (1 of 1)
    y t t (1 of 1)
    3:
    j o i n (2 of 2)
    v m (1 of 2)
    k a w (1 of 3)
    h s (2 of 2)
    q d e l (1 of 1)
    y t t (1 of 1)
    r p (1 of 0)
    b c (1 of 2)
    f g (1 of 1)
    u u x (1 of 2)
    4:
    j o i n (2 of 2)
    v m (1 of 2)
    k a w (1 of 3)
    h s (2 of 2)
    q d e l (1 of 1)
    y t t (1 of 1)
    r p (1 of 0)
    b c (1 of 2)
    f g (1 of 1)
    u x u
    5:
    j o i n (2 of 2)
    v m (1 of 2)
    k a w (1 of 3)
    h s (2 of 2)
    q d e l (1 of 1)
    y t t (1 of 1)
    r p (1 of 0)
    b g (2 of 2)
    f c (1 of 1)
    u u x (1 of 3)
    6:
    j o i n (2 of 2)
    v m (1 of 2)
    k a w (1 of 3)
    h s (2 of 2)
    q d e l (1 of 1)
    y t t (1 of 1)
    r p (1 of 0)
    b g (2 of 2)
    f c (1 of 1)
    u x u
    Altogether 6 solutions, after 3056 updates.
    0 0 1 1 0 0 2 nodes, 597 updates
    1 1 0 1 0 0 3 nodes, 663 updates
    3 0 1 0 0 0 4 nodes, 663 updates
    0 2 0 0 0 0 2 nodes, 355 updates
    0 2 0 0 0 0 2 nodes, 257 updates
    2 1 0 0 0 0 3 nodes, 263 updates
    1 1 1 0 0 0 3 nodes, 139 updates
    3 2 0 0 0 0 5 nodes, 78 updates
    0 0 0 0 1 1 2 nodes, 19 updates
    4 0 0 0 0 0 4 nodes, 22 updates
    Total 31 nodes.

  5. Look at the shortest solutions (1 and 2), and write them down as
    unix commands.

  6. Since dance.w uses DLX to find optimal matches, these are the
    shortest solution (but not with least possible words, as I first
    assumed, too). We needed to remove “zcat”, because there is no way
    with it to have an optimal match (i.e. every letter only once).

    “awk chgrp df jobs lex mv tty uniq zcat”
    “awk chgrp df join lex mv tty qsub zcat”

Enjoy, “and know thy tools”,