Forum: Ruby Encyclopedia Construction (#205)

33117162fff8a9cf50544a604f60c045?d=identicon&s=25 Daniel X Moore (yahivin)
on 2009-05-15 18:40
(Received via mailing list)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

The three rules of Ruby Quiz:

1.  Please do not post any solutions or spoiler discussion for this
quiz until 48 hours have elapsed from the time this message was
sent.

2.  Support Ruby Quiz by submitting ideas and responses
as often as you can!
Visit: <http://rubyquiz.strd6.com/suggestions>

3.  Enjoy!

Suggestion:  A [QUIZ] in the subject of emails about the problem
helps everyone on Ruby Talk follow the discussion.  Please reply to
the original quiz message, if you can.

RSS Feed: http://rubyquiz.strd6.com/quizzes.rss

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

## Encyclopedia Construction (#205)

This week's quiz comes from Tim Hunter. Thank you Tim for the great
write up!

When I was a kid we had an encyclopedia.  It had thousands of articles
in alphabetical order, divided into about 25 volumes.  Sometimes there
were so many articles with the same initial letter ("M", for example)
that an entire volume was dedicated to articles with that initial
letter.  In some cases, though, there weren't enough articles with the
same initial letter to devote an entire volume to them, so those
articles were grouped in with alphabetically adjacent articles in the
same volume.  For example, all the articles starting with W, X, Y, and
Z were in the same volume.  Of course, all the articles that started
with the same initial letter were always in the same volume.  Each
volume was labeled with the initial character of the articles it
contained.  For example, the "M" volume was simply labeled "M".  If
the articles were spread over more than one initial character, the
label displayed the first and last characters in the range.  The W, X,
Y, and Z volume was labeled "W-Z".

The quiz is to devise a way of grouping a list of encyclopedia articles
into
volumes.  Since this is Ruby Quiz, we'll say that for "articles" we
have a list of about 100 words composed of alphabetical characters,
a-z.  A "volume" is an array.

Sort the list and then put the words into arrays using the following
rules.

1. All the words with the same initial character must be in the same
array.
2. If an array contains words with different initial characters, the
words must be alphabetically adjacent to each other.  That is, if
"cat", "dog", and  "elephant", are in the list of articles, and you
put "cat" and "elephant" in the same array, then "dog" must also be in
that array.
3. Without breaking rules 1 and 2, divide the words into about 10
arrays.
4. Without breaking rules 1 and 2, each of the arrays should contain
about the same number of words.

To display the encyclopedia, make a hash with one element for each
array. The element value is the array. The element key is a single
letter ("M") when the array only contains words with the same initial
character, and a range ("W-Z") when the array contains words with more
than one initial character. Print the hash with the "pp" command.

Here's the list of words:

ape
apple
apricot
aquamarine
architect
artichoke
azure
baboon
badger
banana
bat
battleship
beeswax
beetles
black
blogger
blue
bricklayer
Brigadoon
card
cat
cherry
chokeberry
coconut
coral
crimson
currant
dam
debate
deliberations
demagogue
dog
durian
elephant
empathy
encyclopedia
fig
fox
gazelle
giraffe
goldenrod
gray
hippopotamus
huckleberry
ibex
imprint
indigo
ivory
jab
jackrabbit
jake
khaki
kiwi
kudzu
lavender
lemming
leopard
lime
loganberry
mango
maroon
memory
mole
monkey
mouse
muskox
Nigeria
Navajo
olive
opera
panther
party
peach
pear
penguin
persimmon
plum
poet
politician
privy
programmer
rabbit
red
research
rhino
Rumpelstiltskin
salmon
silver
snow
soldier
songsmith
squirrel
starship
steel
strawberry
student
tan
tapir
theater
thesaurus
tree
turquoise
umbrella
warthog
waterloo
wildebeest
wolf
woodchuck
xylophone
yahoo
yellow
zebra
74732449919ac2f5626e726d41cdb893?d=identicon&s=25 Frank Fischer (Guest)
on 2009-05-18 17:41
(Received via mailing list)
Hi,

here's my solution.  It's based on classic dynamic programming over the
number of words for each letter.

I used different objective functions to get good book sizes:

  1. Maximize the minimal number of words in one book,
  2. minimize the maximal number of words in one book,
  3. minimize the absolute deviation of the number of words in one
     book to the average number of words in one book ,
  4. minimize the quadratic deviation of the number of words in one
     book to the average number of words in one book.

For the given word list, objective function 4 seems to yield the best
results with good distributed word counts.

The program is called with three arguments:

  1. The name of the word-list file (one word per line),
  2. the number of books in which the words should be distributed,
  3. a number 1,2,3 or 4 selecting the objective function.

The programm prints

  1. the number of words per letter,
  2. the words per book,
  3. the number of words per book.

The current implementation is quite inelegant and could be improved
and be more rubyish.


---
require 'facets/enumerable'
require 'enumerator'
require 'pp'

word_file = ARGV.shift      || "words.txt"
nbooks = ARGV.shift.to_i    || 10
algorithm = ARGV.shift.to_i || 3

# ruby 1.8
class Integer
    def ord; self; end
end


class Wordlist

    attr_reader :books

    def initialize( words )
        # sort them
        @words = words.sort {|w1, w2| w1.downcase <=> w2.downcase }

        # get number of words per letter
        @nletters = Array.new(26, 0)
        @words.each do |w|
            @nletters[w.downcase[0].ord - ?a.ord] += 1
        end

    end


    # Returns the number of words per character
    def letter_counts
        (?A .. ?Z).mash{|c| [c.chr, @nletters[c.ord - ?A.ord]]}
    end


    # Returns number of words in each book
    def counts
        @books.mash{|range, words| [range, words.size]}
    end


    # divide into +nbooks+ books by minimizing the maximal number
    # of words in one book
    def min_maximum( nbooks )
        dyn_min( nbooks ) { |*s| s.max }
    end


    # divide into +nbooks+ books by maximizing the minmal number
    # of words in one book
    def max_minimum( nbooks )
        dyn_min( nbooks ) { |*s| -s.min }
    end


    # divide into +nbooks+ books by minimizing the deviation from the
    # average number of words in one book
    def min_deviat( nbooks )
        mean = sum( 0...@nletters.size ).to_f / nbooks
        dyn_min( nbooks ) { |*s| s.inject(0){|sum,x| sum + (x -
mean).abs} }
    end


    # divide into +nbooks+ books by minimizing the quadratic deviation
    # from the average number of words in one book
    def min_deviat2( nbooks )
        mean = sum( 0...@nletters.size ).to_f / nbooks
        dyn_min( nbooks ) { |*s| s.inject(0){|sum,x| sum + (x - mean) **
2 } }
    end

private

    # computes
    #   sum_{i\in range} @nletters[i]
    def sum( range )
        range.inject(0) {|s,i| @nletters[i] + s}
    end


    # A range of letters in the same book.
    class Book
        def initialize( range, n_words )
            @range = range
            @n_words = n_words
        end

        # The first letter in this book
        def min; @range.min; end

        # The last letter in this book
        def max; @range.max; end

        # Is letter x in this book?
        def include?(x); @range.include?(x); end

        # returns the number of all words in this book
        attr_reader :n_words
    end


    # Computes a solution where
    #   max{ func(part_sum) }
    # is minimized
    def dyn_min( nbooks, &func )
        mean = sum( 0...@nletters.size ).to_f / nbooks

        books = Array.new( @nletters.size )
        s = 0
        (@nletters.size - 1).downto(0) do |i|
            s += @nletters[i]
            range = i ... @nletters.size
            books[i] = [Book.new( range, sum(range) )]
        end

        nbooks.times do

            new_books = Array.new( @nletters.size )

            for s in 0 ... @nletters.size
                best_value = nil
                best_end = nil
                sum = @nletters[s] # value of the current part
                for e in (s + 1) ... @nletters.size
                    # stop if no further subdivisions are possible
                    break unless books[e]

                    value = func.call(sum, *books[e].map{|r| r.n_words})
                    if best_value.nil? || value < best_value
                        best_value = value
                        best_end = e
                    end

                    sum += @nletters[e] || 0
                end

                if best_value
                    new_books[s] = [Book.new(s ... best_end,
                                 sum(s ... best_end))] +
            books[best_end]
                end
            end

            books = new_books
        end

        @books = Hash.new
        books[0].each do |book|
            words = @words.find_all{|w|
          book.include?(w.downcase[0].ord - ?a.ord)
      }
            if book.min == book.max
          key = (?A.ord + book.min).chr
            else
          key = "#{(?A.ord + book.min).chr}-#{(?A.ord + book.max).chr}"
            end
            @books[key] = words
        end

        @books
    end
end

# read word list
words = []
File.open( word_file, "r" ) do |f|
    f.each_line do |line|
        line.strip!
        words << line unless line.empty?
    end
end



list = Wordlist.new( words )
p list.letter_counts

case algorithm
when 1
    list.min_maximum( nbooks )
when 2
    list.max_minimum( nbooks )
when 3
    list.min_deviat( nbooks )
when 4
    list.min_deviat2( nbooks )
else
    raise "Unknown algorithm: #{algorithm} (should be in {0,1,2,3})"
end

pp list.books
p list.counts
---

Bye,
Frank
1566d4066e11205ec3e3aaeeaf89348b?d=identicon&s=25 Luke Cowell (lcowell)
on 2009-05-19 06:53
Here's my solution:
http://www.pastie.org/482422


Usage:
e = Encyclopedia.new(articles)
e.volumes(20) #<-- how many volumes should it be condensed to?
pp e.hash

puts e.dump #<-- this will print a nice little chart to help you
visualize the distribution of articles.


My strategy was to:
-divide each group of words starting with the same letter into it's own
array
-if I had to reduce the number of volumes:
   -combine 2 adjacent letter arrays into a single array based on which
2 arrays have the smallest combined length
   -repeat this procedure until the desired number of volumes is
achieved

Notes:
-This algorithm can result in poorly balanced distribution of volumes
because it glues together adjacent volumes permanently. A more even
distribution could be achieved if you could break a volume off after it
had already been combined.
-Certain number of volumes will result in really even distribution. 17
with the provided articles worked out well.
-I think it would be interesting to split larger volumes as well. I
think in some encyclopedias you'll find volumes with more common words
split (eg. Ma - Me and Mi -Mz)


My code could be much improved - quick and dirty. Well, maybe not so
much quick, but dirty.

Luke
74732449919ac2f5626e726d41cdb893?d=identicon&s=25 Frank Fischer (Guest)
on 2009-05-20 09:30
(Received via mailing list)
On 2009-05-18, Frank Fischer <frank-fischer@shadow-soft.de> wrote:

Here's a second version, this time returning the correct number of books
(the first version returned the requested number of books + 1).  And it
should be a little cleaner and faster.

Frank

--
require 'facets/enumerable'
require 'enumerator'
require 'pp'

word_file = ARGV.shift      || "words.txt"
nbooks = (ARGV.shift || 10).to_i
algorithm = (ARGV.shift || 4).to_i

# ruby 1.8
if RUBY_VERSION[0,4] == "1.8."
    class Integer
        def ord; self; end
    end
end


# A book containing the words starting with a letter in first .. last
class Book

    attr_reader :first, :last, :n_words, :words

    # Create a book with n words whose first letter is in range.
    def initialize( range, n_words )
        @range = range.min.ord .. range.max.ord
        @n_words = n_words
    end


    # Set the words in this book
    def words=( words )
        @words = words.find_all{|w| include?(w.downcase[0])}
        @n_words = @words.size
    end


    # Returns true if the words starting with the given letter are in
    # this book
    def include?( letter )
        @range.include?( letter.ord - ?a.ord )
    end

    # Returns the character-range.
    def range
        if @range.min == @range.max
            (?A.ord + @range.min).chr
        else
            (?A.ord + @range.min).chr + "-" + (?A.ord + @range.max).chr
        end
    end


    def to_s; "<Book: %3s, size: #@n_words>" % range; end

    def inspect; to_s; end

end


class Wordlist

    attr_reader :books

    def initialize( words )
        # sort them
        @words = words.sort {|w1, w2| w1.downcase <=> w2.downcase }

        # get number of words per letter
        @nletters = Array.new( 26, 0 )
        @words.each do |w|
            @nletters[w.downcase[0].ord - ?a.ord] += 1
        end
    end


    # Returns the number of words per character
    def letter_counts
        (0...@nletters.size).mash{|i| [(?a.ord + i).chr, @nletters[i]]}
    end


    # divide into +nbooks+ books by minimizing the maximal number
    # of words in one book
    def min_maximum( nbooks )
        dyn_min( nbooks ) { |s, rest| [s, rest || 0].max }
    end


    # divide into +nbooks+ books by maximizing the minmal number
    # of words in one book
    def max_minimum( nbooks )
        dyn_min( nbooks ) { |s, rest| [-s, (rest || -s)].max }
    end


    # divide into +nbooks+ books by minimizing the deviation from the
    # average number of words in one book
    def min_deviat( nbooks )
        mean = sum( 0 ... @nletters.size ).to_f / nbooks
        dyn_min( nbooks ) { |s, rest|  (rest || 0) + (s - mean).abs }
    end


    # divide into +nbooks+ books by minimizing the quadratic deviation
    # from the average number of words in one book
    def min_deviat2( nbooks )
        mean = sum( 0 ... @nletters.size ).to_f / nbooks
        dyn_min( nbooks ) { |s, rest| (rest || 0) + (s - mean) ** 2 }
    end


private

    # computes
    #   sum_{i\in range} @nletters[i]
    def sum( range )
        range.inject(0) {|s,i| @nletters[i] + s}
    end


    # Computes a solution minimizing the objective function.
    def dyn_min( nbooks, &func )
        # create initial books (1 book)
        books = Array.new( @nletters.size )
        s = 0
        (@nletters.size - 1).downto(0) do |i|
            s += @nletters[i]
            range = i ... @nletters.size
            sum = sum(range)
            books[i] = [[Book.new( range, sum )], func.call(sum, nil)]
        end

        (nbooks - 1).times do
            books = (0 ... @nletters.size).map { |s|
                best_value = nil
                best_end = nil
                sum = @nletters[s] # value of the current part
                for e in s.next ... @nletters.size
                    if books[e]
                        value = func.call(sum, books[e][1])
                        if best_value.nil? || value < best_value
                            best_value = value
                            best_end = e
                        end
                    end
                    sum += @nletters[e] || 0
                end

                if best_value
                    [[Book.new(s...best_end, sum(s...best_end))] +
                      books[best_end][0],
                     best_value]
                else
                    nil
                end
            }

        end

        @books = books[0][0]
        # collect the words into the books
        @books.each do |book|
            book.words = @words
        end
    end
end

# read word list
words = []
File.open( word_file, "r" ) do |f|
    f.each_line do |line|
        line.strip!
        words << line unless line.empty?
    end
end



list = Wordlist.new( words )
p list.letter_counts
puts "Number of words: #{list.letter_counts.values.inject(0){|s,x| s +
x}}"

case algorithm
when 1
    list.min_maximum( nbooks )
when 2
    list.max_minimum( nbooks )
when 3
    list.min_deviat( nbooks )
when 4
    list.min_deviat2( nbooks )
else
    raise "Unknown algorithm: #{algorithm} (should be in {1,2,3,4})"
end

pp list.books.mash{|book| [book.range, book.words]}
p list.books.mash{|book| [book.range, book.n_words]}
1566d4066e11205ec3e3aaeeaf89348b?d=identicon&s=25 Luke Cowell (lcowell)
on 2009-05-20 17:28
I updated my solution too. No real improvements to the algorithm, but
the code is more 'rubyish'. For good or bad, I found it useful to
subclass Array.

http://www.pastie.org/484220

Luke
33117162fff8a9cf50544a604f60c045?d=identicon&s=25 Daniel X Moore (yahivin)
on 2009-05-24 20:32
(Received via mailing list)
Attachment: 205.tar.gz (5 KB)
There were three submitted solutions to this week's quiz. Srijayanth
Sridhar provided the following:

    #! /usr/bin/ruby

    require 'pp'

    words=File.readlines("../words.txt").map { |word|
word.chomp.capitalize }
    dict = {}
    words.each do |word|
       dict[word[0..0]] ||= []
       dict[word[0..0]] << word
    end

    count=0
    sub_array = []
    main_dict =  {}
    dict.keys.sort.each do |alphawords|
       sub_array += dict[alphawords]
       count += sub_array.flatten.size
       if ( count >= 10 )
          alpha=sub_array.flatten.map { |x| x[0..0] }.uniq
          alpha.size > 1 ? key=Range.new(alpha.min,alpha.max) : alpha[0]
          main_dict[key] = sub_array.flatten
          count = 0
          sub_array=[]
       end
    end

    # Attach last remaining words
    alpha=sub_array.flatten.map { |x| x[0..0] }.uniq
    alpha.size > 1 ? key=Range.new(alpha.min,alpha.max) : alpha[0]
    main_dict[key] = sub_array.flatten
    count = 0
    sub_array=[]

    pp main_dict

Srijayanth's solution works by building a dictionary of all the words
grouped by first letter. Starting from the beginning the words in the
next letter are added until the total number of words in the current
volume is greater than or equal to `count` (10 by default).

The advantage of this solution is that it is simple to implement and
only needs to make one pass through the lists of words. One potential
weakness is that if the words in the second to last letter added and
it goes above the threshold, then the words beginning with the last
letter won't have any possibility to be merged in and could be left as
a very small volume.

Luke Cowell's solution repeatedly merges the smallest adjacent pairs
until the requested number of volumes is reached. This is a simpler
technique and produces pretty good results as long as the input data
is well behaved.

Frank Fischer's solution uses dynamic programming[1]. Dynamic
programming is a method of solving complex problems by breaking them
down into simpler steps. There are two key attributes that a problem
must have in order for dynamic programming to be applicable: optimal
substructure and overlapping subproblems. Optimal substructure means
that the solution to a given optimization problem can be obtained by
the combination of optimal solutions to its subproblems. Overlapping
subproblems means that the space of subproblems must be small, that
is, any recursive algorithm solving the problem should solve the same
subproblems over and over, rather than generating new subproblems.

Frank's solution provides different methods to optimize the
arrangement of the volumes.

    # divide into +nbooks+ books by minimizing the maximal number
    # of words in one book
    def min_maximum( nbooks )
        dyn_min( nbooks ) { |s, rest| [s, rest || 0].max }
    end

    #...

    # divide into +nbooks+ books by minimizing the quadratic deviation
    # from the average number of words in one book
    def min_deviat2( nbooks )
        mean = sum( 0 ... @nletters.size ).to_f / nbooks
        dyn_min( nbooks ) { |s, rest| (rest || 0) + (s - mean) ** 2 }
    end

Each of these methods passes a different block to the `dyn_min` method
depending on what objective you wish to optimize. The `dyn_min` method
iterates through the possible solutions and stores the best
combinations of volumes according to the metric passed in.

Thank you Srijayanth, Frank, and Luke for your contributions this week.

P.S. From now on the summaries will have an attached archive of the
quiz solutions.

[1]: http://en.wikipedia.org/wiki/Dynamic_programming
Please log in before posting. Registration is free and takes only a minute.
Existing account

NEW: Do you have a Google/GoogleMail, Yahoo or Facebook account? No registration required!
Log in with Google account | Log in with Yahoo account | Log in with Facebook account
No account? Register here.