Proper Case (#89)

I’m sad no one but the quiz creator himself gave this problem a shot. This is a
very real problem with all manner of source texts and fixing it is tricky.
There was even a discussion on the mailing list about how you can’t count on
there being two spaces at the end of a sentence.

You really need natural language processing to correctly determine which words
to capitalize. Unfortunately, natural language processing is complex and often
not a perfect solution anyway.

The good news is that we can use some heuristics to get close. A heuristic is a
loosely defined rule or, put another way, the computer science equivalent of a
guess. These are often developed by just trying to get close to a solution and
then tweaking little things here and there to close in on the target. The
result won’t be perfect, of course, but it may be good enough. It’s a very
agile process and Rubyists love that.

Let’s see what heuristics Elliot came up with now, starting with some code used
to correct common Netspeak misspellings:

Abbreviations = { "ppl"  => "people",
                  "btwn" => "between",
                  "ur"   => "your",
                  "u"    => "you",
                  "diff" => "different",
                  "ofc"  => "of course",
                  "liek" => "like",
                  "rly"  => "really",
                  "i"    => "I",
                  "i'm"  => "I'm" }

def fix_abbreviations text
  Abbreviations.each_key do |abbrev|
    text = text.gsub %r[(^|(\s))#{abbrev}((\s)|[.,?!]|$)]i do |m|
      m.gsub(/\w+/, "#{Abbreviations[abbrev]}")
    end
  end
  text
end

# ...

This code is fairly trivial, but still quite effective. Using a predefined
Hash, the method just scans the text for the keys, swapping them out for the
provided values when found. Note that the expression used to find the key tries
to ensure it is not in the middle of some larger word by requiring whitespace
or the start of a line before it and whitespace, punctuation, or the end of a
line after it.

That expression could probably be simplified to %r[\b#{abbrev}\b] which looks
for word boundaries (a \W\w or \w\W transition) and means close to the same
thing. This would allow Elliot to do the search and replace in a single call to
gsub(), instead of the current nested call to avoid replacing the surrounding
space or punctuation. (You can do it with a single gsub() call even without
using \b, just FYI, by putting the captured groups back in the replacement:
text.gsub(%r[(^|(\s))#{abbrev}((\s)|[.,?!]|$)]i,
"\\1#{Abbreviations[abbrev]}\\3").)
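To make the \b idea concrete, here is a minimal sketch of a one-pass version. It is not Elliot's code: the trimmed-down table, the constant names, and fix_abbreviations_in_one_pass() are all made up for illustration, and it assumes a plain word boundary is a good enough test for "not inside a larger word."

```ruby
# Hypothetical, trimmed-down copy of the quiz's abbreviation table.
ABBREVIATIONS = {
  "ppl"  => "people",
  "ur"   => "your",
  "u"    => "you",
  "liek" => "like",
  "i"    => "I",
  "i'm"  => "I'm"
}.freeze

# One alternation of every key, longest first so "i'm" is tried before "i".
LONGEST_FIRST = ABBREVIATIONS.keys.sort_by { |key| -key.length }
ABBREVIATION_PATTERN = /\b(?:#{Regexp.union(LONGEST_FIRST).source})\b/i

def fix_abbreviations_in_one_pass(text)
  # \b is zero-width, so the match is only the abbreviation itself and the
  # surrounding space or punctuation never needs to be put back.
  text.gsub(ABBREVIATION_PATTERN) { |match| ABBREVIATIONS[match.downcase] }
end

puts fix_abbreviations_in_one_pass("i think u and ur ppl rock")
# prints "I think you and your people rock"
```

Because the whole table becomes a single pattern, the text is scanned once rather than once per key. The trade-off is that \b is a looser test than the original expression, so it may fire in spots the whitespace check would have skipped.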

The important aspect of this solution though is that it knows it’s not perfect
and gives you the Hash as a means to make it better. If it doesn’t handle your
text correctly, you can always add or delete entries from the Hash to improve
the results.

Let’s look at some more code, this time for capitalizing proper nouns:

require "yaml"

# ...

def capitalize_proper_nouns text
  if not File.exist?("proper_nouns.yaml")
    make_capitalize_proper_nouns_file
  end
  proper_nouns = YAML.load_file "proper_nouns.yaml"
  text = text.gsub /\w+/ do |word|
    proper_nouns[word] || word
  end
  text
end

def make_capitalize_proper_nouns_file
  words = File.read("/Users/curi/me/words.txt").split "\n"
  lowercase_words = words.select {|w| w =~ /^[a-z]/}.map{|w| w.downcase}
  words = words.map{|w| w.downcase} - lowercase_words
  proper_nouns = words.inject({}) { |h, w| h[w] = w.capitalize; h }
  File.open("proper_nouns.yaml", "w") {|f| YAML.dump(proper_nouns, f)}
end

# ...

This is an interesting two-tiered approach. If the program can locate a
proper_nouns.yaml file, a Hash is pulled from it and used to capitalize the
listed nouns. If the file cannot be found, a hand-off is made to
make_capitalize_proper_nouns_file(). The code in that method appears to read a
word list file and build up its own list of proper nouns. This list is then
flushed to the YAML file, so it will be found on future loads.
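The filtering step is easier to see without the file handling. Here is a small sketch of the same idea working on an in-memory Array. The proper_nouns_from() name and the sample words are invented for the example, and it assumes, as the code above seems to, that any word listed in lowercase is an ordinary word.

```ruby
# Words that appear in lowercase are treated as ordinary words; whatever is
# left over after removing them is assumed to be a proper noun.
def proper_nouns_from(words)
  ordinary  = words.grep(/\A[a-z]/).map(&:downcase)
  leftovers = words.map(&:downcase) - ordinary
  leftovers.each_with_object({}) { |word, nouns| nouns[word] = word.capitalize }
end

p proper_nouns_from(%w[apple Bill bill Paris Ruby])
```

With that sample list only paris and ruby survive as keys. Bill is dropped because bill also shows up in lowercase, which is the kind of call a word list can only guess at.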

What I liked about this was how I could customize it, yet again. When testing
Elliot’s code against the quiz text, I just built a quick Hash with the needed
keys and values:

$ ruby -r yaml -e 'y Hash[*%w[Elliot T.].map { |pn| [pn.downcase, pn] }.flatten]' > proper_nouns.yaml
$ cat proper_nouns.yaml

temple: Temple
elliot: Elliot

Getting back to the code, we’re again using a trivial regular expression based
swap, which you can see in the second half of capitalize_proper_nouns(). It
matches all words (well, a run of \w characters) and replaces them with the
proper noun capitalization, if there is such a thing, or the word itself,
causing no change.
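Here is that swap on its own, as a quick sketch with an in-memory Hash standing in for the YAML file. PROPER_NOUNS and capitalize_known_nouns() are names I made up for the example.

```ruby
# Hypothetical stand-in for the Hash loaded from proper_nouns.yaml.
PROPER_NOUNS = { "elliot" => "Elliot", "ruby" => "Ruby" }.freeze

def capitalize_known_nouns(text, proper_nouns = PROPER_NOUNS)
  # Each run of \w characters is looked up as-is; a miss returns the word
  # unchanged, so unknown words pass straight through.
  text.gsub(/\w+/) { |word| proper_nouns[word] || word }
end

puts capitalize_known_nouns("elliot likes ruby, but RUBY is left alone")
# prints "Elliot likes Ruby, but RUBY is left alone"
```

Note that the lookup is case-sensitive and the keys are all lowercase, so a word that already contains capitals never matches.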

Now we can put all of that together with a few more heuristics to get a complete
solution:

# ...

def capitalize text
  return "" if text.nil?
  text = fix_abbreviations text
  text = text.gsub /([?!.-]\s+)(\w+)/ do |m|
    "#$1#{$2.capitalize}"
  end
  text = text.gsub /(\n)(\w+)/ do |m|
    "#$1#{$2.capitalize}"
  end
  text = text.gsub /\A(\w+)/ do |m|
    "#{$1.capitalize}"
  end
  text = text.gsub %r[\sHttp://] do |m|
    "#{$&.downcase}"
  end
  text = capitalize_proper_nouns text
  text
end

puts capitalize(ARGF.read)

This method triggers the fixes for abbreviations and proper nouns that we have
already examined. In addition, it uses regular expressions to capitalize word
characters following sentence-ending punctuation as well as word characters at
the beginning of a line or the document. It then corrects the protocol
identifier for inline links it may have damaged in the process.
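If you want to play with just the capitalization rules, the three of them can be folded into one expression. This is only a sketch of the same heuristics, not a drop-in replacement for Elliot's method, and capitalize_sentence_starts() is a name invented here.

```ruby
def capitalize_sentence_starts(text)
  # Capitalize a word at the start of the document, at the start of a line,
  # or after sentence-ending punctuation followed by whitespace.
  text.gsub(/(\A|\n|[?!.-]\s+)(\w+)/) { "#{$1}#{$2.capitalize}" }
end

puts capitalize_sentence_starts("hello there. how are you?\nfine! thanks")
# prints "Hello there. How are you?" and then "Fine! Thanks" on the next line
```

One thing to watch for: String#capitalize also lowercases the rest of the word, so a sentence starting with iPod comes out as Ipod.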

So, how does this do on the quiz document? Generally quite well. It makes only
two obvious errors:

By Elliot T.

and:

Sometimes I might want to write about gsub vs. Gsub! Without the...

The first error is that we generally do not capitalize the by in a byline. That
could probably be worked around with another regular expression correction.

The second issue is much harder to get right and here is where we start to miss
a natural language processing facility. When humans read that line we know that
gsub!() and without should not be capitalized because of the context they are
used in. The script is not-so-clever though and the period and exclamation
point throw it off. You could add rules to work around these cases as well, but
you will definitely be fighting an uphill battle at that point.
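To give a feel for that uphill battle, here is a rough sketch of one such rule: skip the capitalization when the punctuation belongs to a known abbreviation. The NON_TERMINAL list and the method name are hypothetical, and nothing like this appears in Elliot's solution.

```ruby
# Hypothetical list of tokens whose trailing period does not end a sentence.
NON_TERMINAL = %w[vs. etc. e.g. i.e.].freeze

def capitalize_unless_abbreviation(text)
  text.gsub(/([?!.])(\s+)(\w+)/) do
    punct, space, word = $1, $2, $3
    # Rebuild the token that ends with this punctuation mark, e.g. "vs.".
    previous = Regexp.last_match.pre_match[/\S+\z/].to_s + punct
    word = word.capitalize unless NON_TERMINAL.include?(previous.downcase)
    "#{punct}#{space}#{word}"
  end
end

puts capitalize_unless_abbreviation("gsub vs. gsub! without it. ok")
# prints "gsub vs. gsub! Without it. Ok"
```

The vs. case is handled, but without is still capitalized because gsub! looks just like the end of an exclamation. Covering that takes yet another rule, and so on.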

I still say the end result is quite good though. Count how many characters are
wrong in the quiz and subtract from that the three output issues. It’s a big
improvement.

My thanks to Elliot T. for the problem and being brave enough to put
together a solution.

Tomorrow we’ll try our hand at another simple pen and paper game and see who can
solve it in record time…

On 8/10/06, Ruby Q. wrote:

> I’m sad no one but the quiz creator himself gave this problem a shot. This is a
> very real problem with all manner of source texts and fixing it is tricky.
> There was even a discussion on the mailing list about how you can’t count on
> there being two spaces at the end of a sentence.

Oh well, I was working on this, and got a fair ways, but got
distracted by a problem with my server. I guess I’ll have to post my
solution on my blog in a few days.