Forum: Ruby Crossword Solver (#132)

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
James G. (Guest)
on 2007-08-02 19:08
(Received via mailing list)
Many of these solutions got pretty sluggish when run on large
crosswords.  The
reason is simple:  the search space for this problem is quite large.
You have
to try a significant number of words in each position before you will
find a
good fit for the board.

The good news is that this summer Google has paid to bring a very
powerful
search tool to Ruby.  I've been lucky enough to have a ring-side seat
for this
process and now I want to show you a little about this new tool.

You've probably seen Andreas L. solving a fair number of the recent
quizzes
using his Gecode/R library.  Gecode/R is a wrapper over the C Gecode
library for
constraint programming.  Constraint programming is a technique for
describing
problems in a way that a tool like Gecode can then use to search the
solution
space and find answers for you.  Gecode is a very smart little searcher
though
and will heavily prune the search space based on your constraints.  This
leads
to some mighty quick results, as you can see:

  $ time ruby solve_crossword.rb /path/to/scowl-6/final/english-words.20
                                 < /path/to/test_board.txt
  Reading the dictionary...
  Please enter the template (end with ^D)
  Building the model...
  Searching for a solution...
  A B I D E

  N   C   A

  G H O S T

  E   N   E

  L A S E R

  real    0m0.430s
  user    0m0.360s
  sys     0m0.068s

Let's have a look at Andreas's code to see how it sets things up for
Gecode.
Here's the start of the code:

  require 'enumerator'
  require 'rubygems'
  require 'gecoder'

  # The base we use when converting words to and from numbers.
  BASE = ('a'..'z').to_a.size
  # The offset of characters compared to digits in word-numbers.
  OFFSET = 'a'[0]
  # The range of integers that we allow converted words to be in. We are
  # only using the unsigned half, we could use both halves, but it would
  # complicate things without giving a larger allowed word length.
  ALLOWED_INT_RANGE = 0..Gecode::Raw::Limits::Int::INT_MAX
  # The maximum length of a word allowed.
  MAX_WORD_LENGTH = (Math.log(ALLOWED_INT_RANGE.last) /
    Math.log(BASE)).floor

  # ...

You can see that Andreas loads Enumerator and Gecode/R here as well as
setting
up some constants.  The constants relate to how this code will model
words in
the database.  The plan here is to represent words as numbers made up of
base 26
digits which represent letters of the alphabet.  This will allow Andreas
to use
Gecode's integer variables to model the problem.  The downside is that
word size
will be limited by the maximum size of an integer and thus this solution
has a
weakness in that it can't be used to solve the larger puzzles.

You can see this conversion from numbers to words in the Dictionary
class:

  # ...

  # Describes an immutable dictionary which represents all contained
words
  # as numbers of base BASE where each digit is the corresponding letter
  # itself converted to a number of base BASE.
  class Dictionary
    # Creates a dictionary from the contents of the specified dictionary
    # file which is assumed to contain one word per line and be sorted.
    def initialize(dictionary_location)
      @word_arrays = []
      File.open(dictionary_location) do |dict|
        previous_word = nil
        dict.each_line do |line|
          word = line.chomp.downcase
          # Only allow words that only contain the characters a-z and
are
          # short enough.
          next if previous_word == word or word.size > MAX_WORD_LENGTH
or
            word =~ /[^a-z]/
          (@word_arrays[word.length] ||= []) << self.class.s_to_i(word)
          previous_word = word
        end
      end
    end

    # Gets an enumeration containing all numbers representing word of
the
    # specified length.
    def words_of_size(n)
      @word_arrays[n] || []
    end

    # Converts a string to a number of base BASE (inverse of #i_to_s ).
    def self.s_to_i(string)
      string.downcase.unpack('C*').map{ |x| x - OFFSET }.to_number(BASE)
    end

    # Converts a number of base BASE back to the corresponding string
    # (inverse of #s_to_i ).
    def self.i_to_s(int)
      res = []
      loop do
        digit = int % BASE
        res << digit
        int /= BASE
        break if int.zero?
      end
      res.reverse.map{ |x| x + OFFSET }.pack('C*')
    end
  end

  # ...

We've already talked about the number representation which is the
majority of
the code here.  Do have a look at initialize() and words_of_size()
though, to
see how words are being stored.  An Array is created where indices
represent
word lengths and the values at those indices are nested Arrays of words
with
that length.  This makes getting a list of words that could work in a
given slot
of the puzzle easy and fast.

The s_to_i() method above relies on a helper method added to Array,
which is
simply this:

  class Array
    # Computes a number of the specified base using the array's elements
    # as digits.
    def to_number(base = 10)
      inject{ |result, variable| variable + result * base }
    end
  end

Again, this is just another piece of the conversion I explained earlier.

With a Dictionary created, it's time to model the problem in Gecode
constraints:

  # Models the solution to a partially completed crossword.
  class Crossword < Gecode::Model
    # The template should take the format described in RubyQuiz #132 .
The
    # words used are selected from the specified dictionary.
    def initialize(template, dictionary)
      @dictionary = dictionary

      # Break down the template and create a corresponding square
matrix.
      # We let each square be represented by integer variable with
domain
      # -1...BASE where -1 signifies # and the rest signify letters.
      squares = template.split(/\n\s*\n/).map!{ |line| line.split(' ') }
      @letters = int_var_matrix(squares.size, squares.first.size,
        -1...BASE)

      # Do an initial pass, filling in the prefilled squares.
      squares.each_with_index do |row, i|
        row.each_with_index do |letter, j|
          unless letter == '_'
            # Prefilled letter.
            @letters[i,j].must == self.class.s_to_i(letter)
          end
        end
      end

      # Add the constraint that sequences longer than one letter must
form
      # words. @words will accumulate all word variables created.
      @words = []
      # Left to right pass.
      left_to_right_pass(squares, @letters)
      # Top to bottom pass.
      left_to_right_pass(squares.transpose, @letters.transpose)

      branch_on wrap_enum(@words), :variable => :largest_degree,
        :value => :min
    end

    # Displays the solved crossword in the same format as shown in the
    # quiz examples.
    def to_s
      output = []
      @letters.values.each_slice(@letters.column_size) do |row|
        output << row.map{ |x| self.class.i_to_s(x) }.join(' ')
      end
      output.join("\n\n").upcase.gsub('#', ' ')
    end

    # ...

After storing the dictionary, this code breaks the crossword template
down into
an integer matrix created using the Gecode/R helper method
int_var_matrix().
This will be our puzzle of words Gecode is expected to fill in.

The next two sections of the initialize() method build up the
constraints.
These are the rules that must be satisfied when we have found a correct
solution.

The first of these chunks of code sets rules for any letters that were
given to
us in the template.  This code tells Gecode that the number in that
position of
the matrix must equal the value of the provided letter.  Take a good
look at
this RSpec like syntax because Andreas has spent a considerable effort
on making
it easy to express your constraints in a natural syntax and I hope you
will
agree with me that the end result is quite nice.

The other chunk of constraints are defined using a helper method we will
examine
in just a moment.  The result of this code though is to ensure that the
numbers
selected represent letters that form actual words.

The final step in describing the problem to Gecode is to choose a
branching
strategy.  This tells Gecode which variables it will need to make
guesses about
in order to find a solution as well as selecting a heuristic to use when
guesses
must be made.  In this case, words will be selected based on how much of
the
overall puzzle they affect, hopefully ruling out large sections of the
search
space quickly.

The problem model we just examined is pretty much always how constraint
programming goes.  You just need to remember the three steps:  create
some
variables for Gecode to fill in, define the rules for the values you
want Gecode
to find, and select a strategy for Gecode to use in solving the problem.

The to_s() method above just creates the output used in the quiz
examples for
display to the user.

Let's have a look at the helper methods used in the model definition
now:

  # ...

    private

    # Parses the template from left to right, line for line,
constraining
    # sequences of two or more subsequent squares to form a word in the
    # dictionary.
    def left_to_right_pass(template, variables)
      template.each_with_index do |row, i|
        letters = []
        row.each_with_index do |letter, j|
          if letter == '#'
            must_form_word(letters) if letters.size > 1
            letters = []
          else
            letters << variables[i,j]
          end
        end
        must_form_word(letters) if letters.size > 1
      end
    end

    # Converts a word from integer form to string form, including the #.
    def self.i_to_s(int)
      if int == -1
        return '#'
      else
        Dictionary.i_to_s(int)
      end
    end

    # Converts a word from string form to integer form, including the #.
    def self.s_to_i(string)
      if string == '#'
        return -1
      else
        Dictionary.s_to_i(string)
      end
    end

    # Constrains the specified variables to form a word contained in the
    # dictionary.
    def must_form_word(letter_vars)
      raise 'The word is too long.' if letter_vars.size >
MAX_WORD_LENGTH
      # Create a variable for the word with the dictionary's words as
      # domain and add the constraint.
      word = int_var @dictionary.words_of_size(letter_vars.size)
      letter_vars.to_number(BASE).must == word
      @words << word
    end
  end

  # ...

The i_to_s() and s_to_i() methods here are mostly just wrappers over the
Dictionary counterparts we examined earlier.  The real interest is the
other two
methods that together define the word constraints.

First, left_to_right_pass() is used to walk the puzzle looking for runs
of two
or more letters that will need to become words in the Dictionary.  Each
time it
finds one, a hand-off is made to must_form_word(), which builds the
actual
constraint.

With the problem modeled, it takes just a touch more code to turn this
into a
full solution:

  # ...

  puts 'Reading the dictionary...'
  dictionary = Dictionary.new(ARGV.shift || '/usr/share/dict/words')
  puts 'Please enter the template (end with ^D)'
  template = ''
  loop do
    line = $stdin.gets
    break if line.nil?
    template << line
  end
  puts 'Building the model...'
  model = Crossword.new(template, dictionary)
  puts 'Searching for a solution...'
  puts((model.solve! || 'Failed').to_s)

You can see that this application code is just reading in the dictionary
and
template, then constructing a model and pulling a solution with the
solve!()
method.  That's all it really takes to get answers after describing your
problem
to Gecode.

If you want to continue exploring Andreas's Gecode/R wrapper, drop by
the Web
site which has links to many useful resources:

  http://gecoder.rubyforge.org/

My thanks to all who put a good deal of effort into a hard search
problem.

Tomorrow we will take another peek at our dictionary, this time to see
what
numbers are hiding in there...
This topic is locked and can not be replied to.