Encyclopedia Construction (#205)

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

The three rules of Ruby Q.:

  1. Please do not post any solutions or spoiler discussion for this
    quiz until 48 hours have elapsed from the time this message was
    sent.

  2. Support Ruby Q. by submitting ideas and responses
    as often as you can!
    Visit: http://rubyquiz.strd6.com/suggestions

  3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem
helps everyone on Ruby T. follow the discussion. Please reply to
the original quiz message, if you can.

RSS Feed: http://rubyquiz.strd6.com/quizzes.rss

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Encyclopedia Construction (#205)

This week’s quiz comes from Tim H… Thank you Tim for the great
write up!

When I was a kid we had an encyclopedia. It had thousands of articles
in alphabetical order, divided into about 25 volumes. Sometimes there
were so many articles with the same initial letter (“M”, for example)
that an entire volume was dedicated to articles with that initial
letter. In some cases, though, there weren’t enough articles with the
same initial letter to devote an entire volume to them, so those
articles were grouped in with alphabetically adjacent articles in the
same volume. For example, all the articles starting with W, X, Y, and
Z were in the same volume. Of course, all the articles that started
with the same initial letter were always in the same volume. Each
volume was labeled with the initial character of the articles it
contained. For example, the “M” volume was simply labeled “M”. If
the articles were spread over more than one initial character, the
label displayed the first and last characters in the range. The W, X,
Y, and Z volume was labeled “W-Z”.

The quiz is to devise a way of grouping a list of encyclopedia articles
into
volumes. Since this is Ruby Q., we’ll say that for “articles” we
have a list of about 100 words composed of alphabetical characters,
a-z. A “volume” is an array.

Sort the list and then put the words into arrays using the following
rules.

  1. All the words with the same initial character must be in the same
    array.
  2. If an array contains words with different initial characters, the
    words must be alphabetically adjacent to each other. That is, if
    “cat”, “dog”, and “elephant”, are in the list of articles, and you
    put “cat” and “elephant” in the same array, then “dog” must also be in
    that array.
  3. Without breaking rules 1 and 2, divide the words into about 10
    arrays.
  4. Without breaking rules 1 and 2, each of the arrays should contain
    about the same number of words.

To display the encyclopedia, make a hash with one element for each
array. The element value is the array. The element key is a single
letter (“M”) when the array only contains words with the same initial
character, and a range (“W-Z”) when the array contains words with more
than one initial character. Print the hash with the “pp” command.

Here’s the list of words:

ape
apple
apricot
aquamarine
architect
artichoke
azure
baboon
badger
banana
bat
battleship
beeswax
beetles
black
blogger
blue
bricklayer
Brigadoon
card
cat
cherry
chokeberry
coconut
coral
crimson
currant
dam
debate
deliberations
demagogue
dog
durian
elephant
empathy
encyclopedia
fig
fox
gazelle
giraffe
goldenrod
gray
hippopotamus
huckleberry
ibex
imprint
indigo
ivory
jab
jackrabbit
jake
khaki
kiwi
kudzu
lavender
lemming
leopard
lime
loganberry
mango
maroon
memory
mole
monkey
mouse
muskox
Nigeria
Navajo
olive
opera
panther
party
peach
pear
penguin
persimmon
plum
poet
politician
privy
programmer
rabbit
red
research
rhino
Rumpelstiltskin
salmon
silver
snow
soldier
songsmith
squirrel
starship
steel
strawberry
student
tan
tapir
theater
thesaurus
tree
turquoise
umbrella
warthog
waterloo
wildebeest
wolf
woodchuck
xylophone
yahoo
yellow
zebra

Hi,

here’s my solution. It’s based on classic dynamic programming over the
number of words for each letter.

I used different objective functions to get good book sizes:

  1. Maximize the minimal number of words in one book,
  2. minimize the maximal number of words in one book,
  3. minimize the absolute deviation of the number of words in one
    book to the average number of words in one book ,
  4. minimize the quadratic deviation of the number of words in one
    book to the average number of words in one book.

For the given word list, objective function 4 seems to yield the best
results with good distributed word counts.

The program is called with three arguments:

  1. The name of the word-list file (one word per line),
  2. the number of books in which the words should be distributed,
  3. a number 1,2,3 or 4 selecting the objective function.

The programm prints

  1. the number of words per letter,
  2. the words per book,
  3. the number of words per book.

The current implementation is quite inelegant and could be improved
and be more rubyish.


require ‘facets/enumerable’
require ‘enumerator’
require ‘pp’

word_file = ARGV.shift || “words.txt”
nbooks = ARGV.shift.to_i || 10
algorithm = ARGV.shift.to_i || 3

ruby 1.8

class Integer
def ord; self; end
end

class Wordlist

attr_reader :books

def initialize( words )
    # sort them
    @words = words.sort {|w1, w2| w1.downcase <=> w2.downcase }

    # get number of words per letter
    @nletters = Array.new(26, 0)
    @words.each do |w|
        @nletters[w.downcase[0].ord - ?a.ord] += 1
    end

end


# Returns the number of words per character
def letter_counts
    (?A .. ?Z).mash{|c| [c.chr, @nletters[c.ord - ?A.ord]]}
end


# Returns number of words in each book
def counts
    @books.mash{|range, words| [range, words.size]}
end


# divide into +nbooks+ books by minimizing the maximal number
# of words in one book
def min_maximum( nbooks )
    dyn_min( nbooks ) { |*s| s.max }
end


# divide into +nbooks+ books by maximizing the minmal number
# of words in one book
def max_minimum( nbooks )
    dyn_min( nbooks ) { |*s| -s.min }
end


# divide into +nbooks+ books by minimizing the deviation from the
# average number of words in one book
def min_deviat( nbooks )
    mean = sum( [email protected] ).to_f / nbooks
    dyn_min( nbooks ) { |*s| s.inject(0){|sum,x| sum + (x - 

mean).abs} }
end

# divide into +nbooks+ books by minimizing the quadratic deviation
# from the average number of words in one book
def min_deviat2( nbooks )
    mean = sum( [email protected] ).to_f / nbooks
    dyn_min( nbooks ) { |*s| s.inject(0){|sum,x| sum + (x - mean) ** 

2 } }
end

private

# computes
#   sum_{i\in range} @nletters[i]
def sum( range )
    range.inject(0) {|s,i| @nletters[i] + s}
end


# A range of letters in the same book.
class Book
    def initialize( range, n_words )
        @range = range
        @n_words = n_words
    end

    # The first letter in this book
    def min; @range.min; end

    # The last letter in this book
    def max; @range.max; end

    # Is letter x in this book?
    def include?(x); @range.include?(x); end

    # returns the number of all words in this book
    attr_reader :n_words
end


# Computes a solution where
#   max{ func(part_sum) }
# is minimized
def dyn_min( nbooks, &func )
    mean = sum( [email protected] ).to_f / nbooks

    books = Array.new( @nletters.size )
    s = 0
    (@nletters.size - 1).downto(0) do |i|
        s += @nletters[i]
        range = i ... @nletters.size
        books[i] = [Book.new( range, sum(range) )]
    end

    nbooks.times do

        new_books = Array.new( @nletters.size )

        for s in 0 ... @nletters.size
            best_value = nil
            best_end = nil
            sum = @nletters[s] # value of the current part
            for e in (s + 1) ... @nletters.size
                # stop if no further subdivisions are possible
                break unless books[e]

                value = func.call(sum, *books[e].map{|r| r.n_words})
                if best_value.nil? || value < best_value
                    best_value = value
                    best_end = e
                end

                sum += @nletters[e] || 0
            end

            if best_value
                new_books[s] = [Book.new(s ... best_end,
                             sum(s ... best_end))] +
        books[best_end]
            end
        end

        books = new_books
    end

    @books = Hash.new
    books[0].each do |book|
        words = @words.find_all{|w|
      book.include?(w.downcase[0].ord - ?a.ord)
  }
        if book.min == book.max
      key = (?A.ord + book.min).chr
        else
      key = "#{(?A.ord + book.min).chr}-#{(?A.ord + book.max).chr}"
        end
        @books[key] = words
    end

    @books
end

end

read word list

words = []
File.open( word_file, “r” ) do |f|
f.each_line do |line|
line.strip!
words << line unless line.empty?
end
end

list = Wordlist.new( words )
p list.letter_counts

case algorithm
when 1
list.min_maximum( nbooks )
when 2
list.max_minimum( nbooks )
when 3
list.min_deviat( nbooks )
when 4
list.min_deviat2( nbooks )
else
raise “Unknown algorithm: #{algorithm} (should be in {0,1,2,3})”
end

pp list.books
p list.counts

Bye,
Frank

Here’s my solution:
http://www.pastie.org/482422

Usage:
e = Encyclopedia.new(articles)
e.volumes(20) #<-- how many volumes should it be condensed to?
pp e.hash

puts e.dump #<-- this will print a nice little chart to help you
visualize the distribution of articles.

My strategy was to:
-divide each group of words starting with the same letter into it’s own
array
-if I had to reduce the number of volumes:
-combine 2 adjacent letter arrays into a single array based on which
2 arrays have the smallest combined length
-repeat this procedure until the desired number of volumes is
achieved

Notes:
-This algorithm can result in poorly balanced distribution of volumes
because it glues together adjacent volumes permanently. A more even
distribution could be achieved if you could break a volume off after it
had already been combined.
-Certain number of volumes will result in really even distribution. 17
with the provided articles worked out well.
-I think it would be interesting to split larger volumes as well. I
think in some encyclopedias you’ll find volumes with more common words
split (eg. Ma - Me and Mi -Mz)

My code could be much improved - quick and dirty. Well, maybe not so
much quick, but dirty.

Luke

I updated my solution too. No real improvements to the algorithm, but
the code is more ‘rubyish’. For good or bad, I found it useful to
subclass Array.

http://www.pastie.org/484220

Luke

On 2009-05-18, Frank F. [email protected] wrote:

Here’s a second version, this time returning the correct number of books
(the first version returned the requested number of books + 1). And it
should be a little cleaner and faster.

Frank


require ‘facets/enumerable’
require ‘enumerator’
require ‘pp’

word_file = ARGV.shift || “words.txt”
nbooks = (ARGV.shift || 10).to_i
algorithm = (ARGV.shift || 4).to_i

ruby 1.8

if RUBY_VERSION[0,4] == “1.8.”
class Integer
def ord; self; end
end
end

A book containing the words starting with a letter in first … last

class Book

attr_reader :first, :last, :n_words, :words

# Create a book with n words whose first letter is in range.
def initialize( range, n_words )
    @range = range.min.ord .. range.max.ord
    @n_words = n_words
end


# Set the words in this book
def words=( words )
    @words = words.find_all{|w| include?(w.downcase[0])}
    @n_words = @words.size
end


# Returns true if the words starting with the given letter are in
# this book
def include?( letter )
    @range.include?( letter.ord - ?a.ord )
end

# Returns the character-range.
def range
    if @range.min == @range.max
        (?A.ord + @range.min).chr
    else
        (?A.ord + @range.min).chr + "-" + (?A.ord + @range.max).chr
    end
end


def to_s; "<Book: %3s, size: #@n_words>" % range; end

def inspect; to_s; end

end

class Wordlist

attr_reader :books

def initialize( words )
    # sort them
    @words = words.sort {|w1, w2| w1.downcase <=> w2.downcase }

    # get number of words per letter
    @nletters = Array.new( 26, 0 )
    @words.each do |w|
        @nletters[w.downcase[0].ord - ?a.ord] += 1
    end
end


# Returns the number of words per character
def letter_counts
    ([email protected]).mash{|i| [(?a.ord + i).chr, @nletters[i]]}
end


# divide into +nbooks+ books by minimizing the maximal number
# of words in one book
def min_maximum( nbooks )
    dyn_min( nbooks ) { |s, rest| [s, rest || 0].max }
end


# divide into +nbooks+ books by maximizing the minmal number
# of words in one book
def max_minimum( nbooks )
    dyn_min( nbooks ) { |s, rest| [-s, (rest || -s)].max }
end


# divide into +nbooks+ books by minimizing the deviation from the
# average number of words in one book
def min_deviat( nbooks )
    mean = sum( 0 ... @nletters.size ).to_f / nbooks
    dyn_min( nbooks ) { |s, rest|  (rest || 0) + (s - mean).abs }
end


# divide into +nbooks+ books by minimizing the quadratic deviation
# from the average number of words in one book
def min_deviat2( nbooks )
    mean = sum( 0 ... @nletters.size ).to_f / nbooks
    dyn_min( nbooks ) { |s, rest| (rest || 0) + (s - mean) ** 2 }
end

private

# computes
#   sum_{i\in range} @nletters[i]
def sum( range )
    range.inject(0) {|s,i| @nletters[i] + s}
end


# Computes a solution minimizing the objective function.
def dyn_min( nbooks, &func )
    # create initial books (1 book)
    books = Array.new( @nletters.size )
    s = 0
    (@nletters.size - 1).downto(0) do |i|
        s += @nletters[i]
        range = i ... @nletters.size
        sum = sum(range)
        books[i] = [[Book.new( range, sum )], func.call(sum, nil)]
    end

    (nbooks - 1).times do
        books = (0 ... @nletters.size).map { |s|
            best_value = nil
            best_end = nil
            sum = @nletters[s] # value of the current part
            for e in s.next ... @nletters.size
                if books[e]
                    value = func.call(sum, books[e][1])
                    if best_value.nil? || value < best_value
                        best_value = value
                        best_end = e
                    end
                end
                sum += @nletters[e] || 0
            end

            if best_value
                [[Book.new(s...best_end, sum(s...best_end))] +
                  books[best_end][0],
                 best_value]
            else
                nil
            end
        }

    end

    @books = books[0][0]
    # collect the words into the books
    @books.each do |book|
        book.words = @words
    end
end

end

read word list

words = []
File.open( word_file, “r” ) do |f|
f.each_line do |line|
line.strip!
words << line unless line.empty?
end
end

list = Wordlist.new( words )
p list.letter_counts
puts “Number of words: #{list.letter_counts.values.inject(0){|s,x| s +
x}}”

case algorithm
when 1
list.min_maximum( nbooks )
when 2
list.max_minimum( nbooks )
when 3
list.min_deviat( nbooks )
when 4
list.min_deviat2( nbooks )
else
raise “Unknown algorithm: #{algorithm} (should be in {1,2,3,4})”
end

pp list.books.mash{|book| [book.range, book.words]}
p list.books.mash{|book| [book.range, book.n_words]}

There were three submitted solutions to this week’s quiz. Srijayanth
Sridhar provided the following:

#! /usr/bin/ruby

require 'pp'

words=File.readlines("../words.txt").map { |word| 

word.chomp.capitalize }
dict = {}
words.each do |word|
dict[word[0…0]] ||= []
dict[word[0…0]] << word
end

count=0
sub_array = []
main_dict =  {}
dict.keys.sort.each do |alphawords|
   sub_array += dict[alphawords]
   count += sub_array.flatten.size
   if ( count >= 10 )
      alpha=sub_array.flatten.map { |x| x[0..0] }.uniq
      alpha.size > 1 ? key=Range.new(alpha.min,alpha.max) : alpha[0]
      main_dict[key] = sub_array.flatten
      count = 0
      sub_array=[]
   end
end

# Attach last remaining words
alpha=sub_array.flatten.map { |x| x[0..0] }.uniq
alpha.size > 1 ? key=Range.new(alpha.min,alpha.max) : alpha[0]
main_dict[key] = sub_array.flatten
count = 0
sub_array=[]

pp main_dict

Srijayanth’s solution works by building a dictionary of all the words
grouped by first letter. Starting from the beginning the words in the
next letter are added until the total number of words in the current
volume is greater than or equal to count (10 by default).

The advantage of this solution is that it is simple to implement and
only needs to make one pass through the lists of words. One potential
weakness is that if the words in the second to last letter added and
it goes above the threshold, then the words beginning with the last
letter won’t have any possibility to be merged in and could be left as
a very small volume.

Luke C.'s solution repeatedly merges the smallest adjacent pairs
until the requested number of volumes is reached. This is a simpler
technique and produces pretty good results as long as the input data
is well behaved.

Frank F.'s solution uses dynamic programming1. Dynamic
programming is a method of solving complex problems by breaking them
down into simpler steps. There are two key attributes that a problem
must have in order for dynamic programming to be applicable: optimal
substructure and overlapping subproblems. Optimal substructure means
that the solution to a given optimization problem can be obtained by
the combination of optimal solutions to its subproblems. Overlapping
subproblems means that the space of subproblems must be small, that
is, any recursive algorithm solving the problem should solve the same
subproblems over and over, rather than generating new subproblems.

Frank’s solution provides different methods to optimize the
arrangement of the volumes.

# divide into +nbooks+ books by minimizing the maximal number
# of words in one book
def min_maximum( nbooks )
    dyn_min( nbooks ) { |s, rest| [s, rest || 0].max }
end

#...

# divide into +nbooks+ books by minimizing the quadratic deviation
# from the average number of words in one book
def min_deviat2( nbooks )
    mean = sum( 0 ... @nletters.size ).to_f / nbooks
    dyn_min( nbooks ) { |s, rest| (rest || 0) + (s - mean) ** 2 }
end

Each of these methods passes a different block to the dyn_min method
depending on what objective you wish to optimize. The dyn_min method
iterates through the possible solutions and stores the best
combinations of volumes according to the metric passed in.

Thank you Srijayanth, Frank, and Luke for your contributions this week.

P.S. From now on the summaries will have an attached archive of the
quiz solutions.