Proper Case (#89)

[email protected] writes:

I’m 100% sure that this rule does not exist in French typography. I suspect that
every country has its own spacing conventions around punctuation,
and if you intend to correct English written by foreigners (and a lot of it is)
or, even better, if you want your program to work with any language written in
the Latin alphabet, you’d better not rely on anything like that ! (I know that I make
loads of English typography errors because I naturally follow the French
rules… unless I make a special effort)

That’s why the command to disable additional space after sentences in
plain TeX is called \frenchspacing.

French differs with respect to some other conventions too, e.g. you
put a space before exclamation marks. If you do that in a
German newsgroup or chat, everyone will laugh at you and tell you that you are
“plenking”.

On 8/5/06, Matthew S. [email protected] wrote:

difficult.
basic propositions for and against using two spaces at the end of a
sentence, wikipedia makes a decent start.

matthew smillie.

You’re talking about Wikipedia; see, I have a memory like an elephant :wink:
Cheers
Robert
Thought I’d add some useful stuff to this interesting thread


Two things are infinite: the universe and human stupidity; as for the
universe, I have not yet acquired absolute certainty.

  • Albert Einstein

My solution will at the minimum capitalize the starts of sentences
just judging by periods.
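
A minimal sketch of that baseline, not Mitchell's actual code: the method name `capitalize_sentences` is made up for illustration, and it assumes a sentence ends at a period followed by whitespace.

```ruby
# Hypothetical helper: upcase the first letter of the text and the first
# letter after every period followed by whitespace. Abbreviations such as
# "e.g." are treated as sentence ends, which is the known weakness of
# judging by periods alone.
def capitalize_sentences(text)
  text.gsub(/(\A\s*|\.\s+)([a-z])/) { "#{$1}#{$2.upcase}" }
end

puts capitalize_sentences("hello there. this is ruby quiz. bye")
# prints "Hello there. This is ruby quiz. Bye"
```

Note that "ruby" stays lowercase here; catching proper nouns needs extra knowledge beyond the punctuation rule.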

If supplied with an example source using the -s option, it will try to
find words that should always be capitalized (I, Ruby, proper nouns in
general), words that imply that the next word should be capitalized
(Lake, General) and words in which punctuation does not imply an end
of a sentence (abbreviations) although this is only helpful if there
is some capitalization in the text.
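
To make the first of those ideas concrete, here is a rough sketch of learning "always capitalized" words from a sample. It is an illustration under assumptions, not the submitted program: the method names `learn_capitalized_words` and `apply_learned` are invented, and it only counts a word as learned when it is capitalized somewhere other than the start of a sentence. The "Lake"/"General" and abbreviation cases are not covered.

```ruby
# Hypothetical sketch: collect words that are capitalized mid-sentence in a
# sample text, keyed by their lowercase form.
def learn_capitalized_words(sample)
  learned = {}
  # The lookbehind requires a lowercase letter or comma plus a space before
  # the word, so capitals that merely start a sentence are ignored.
  sample.scan(/(?<=[a-z,] )[A-Z][A-Za-z]*/) do |word|
    learned[word.downcase] = word
  end
  learned
end

# Replace each word of the input with its learned capitalization, if any.
def apply_learned(text, learned)
  text.gsub(/[A-Za-z]+/) { |word| learned.fetch(word.downcase, word) }
end

sample  = "Yesterday I wrote some Ruby near Lake Tahoe. It was fun."
learned = learn_capitalized_words(sample)
puts apply_learned("i think ruby is fun", learned)
# prints "I think Ruby is fun"
```

As the post says, this only helps if the sample itself contains some capitalization to learn from.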

here’s what i have. it does a few abbreviations, proper nouns, and
some regexs.

require "yaml"

# Maps chat abbreviations and common typos to their expansions.
Abbreviations = {
  "ppl" => "people", "btwn" => "between", "ur" => "your", "u" => "you",
  "diff" => "different", "ofc" => "of course", "liek" => "like",
  "rly" => "really", "i" => "I", "i'm" => "I'm"
}

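# Expand each abbreviation where it stands alone as a word.
def fix_abbreviations(text)
  Abbreviations.each do |abbrev, expansion|
    # The lookahead leaves the trailing delimiter unconsumed, and replacing
    # the whole match keeps "i'm" from being expanded once per \w+ run.
    text = text.gsub(/(^|\s)#{Regexp.escape(abbrev)}(?=\s|[.,?!]|$)/i) do
      "#{$1}#{expansion}"
    end
  end
  text
end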

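# Replace every word found in the proper-noun list with its capitalized form.
def capitalize_proper_nouns(text)
  # File.exists? was removed in Ruby 3.2; File.exist? is the supported name.
  make_capitalize_proper_nouns_file unless File.exist?("proper_nouns.yaml")
  proper_nouns = YAML.load_file("proper_nouns.yaml")
  text.gsub(/\w+/) { |word| proper_nouns[word] || word }
end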

# Build proper_nouns.yaml from a word list. A word counts as a proper noun
# only if the list has no entry for it that starts with a lowercase letter.
def make_capitalize_proper_nouns_file
  words = File.read("/Users/curi/me/words.txt").split("\n")
  lowercase_words = words.select { |w| w =~ /^[a-z]/ }.map { |w| w.downcase }
  words = words.map { |w| w.downcase } - lowercase_words
  proper_nouns = words.inject({}) { |h, w| h[w] = w.capitalize; h }
  File.open("proper_nouns.yaml", "w") { |f| YAML.dump(proper_nouns, f) }
end

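# Capitalize sentence starts, line starts, and known proper nouns.
def capitalize(text)
  return "" if text.nil?
  text = fix_abbreviations(text)
  # First word after sentence-ending punctuation (or a dash) and whitespace.
  text = text.gsub(/([?!.-]\s+)(\w+)/) { "#{$1}#{$2.capitalize}" }
  # First word of each line.
  text = text.gsub(/(\n)(\w+)/) { "#{$1}#{$2.capitalize}" }
  # First word of the whole text.
  text = text.gsub(/\A(\w+)/) { $1.capitalize }
  # Undo the capitalization of URL schemes.
  text = text.gsub(%r[\sHttp://]) { $&.downcase }
  capitalize_proper_nouns(text)
end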

On Aug 9, 2006, at 9:20 PM, Mitchell Koch wrote:

If supplied with an example source using the -s option, it will try to
find words that should always be capitalized (I, Ruby, proper nouns in
general), words that imply that the next word should be capitalized
(Lake, General) and words in which punctuation does not imply an end
of a sentence (abbreviations) although this is only helpful if there
is some capitalization in the text.

This is quite clever Mitchell. Thanks so much for sharing it with us!

Sadly, I wrote the summary earlier today when I had a few free
moments. Don’t take it personally that it doesn’t mention this
code. :frowning:

James Edward G. II

On Aug 9, 2006, at 7:20 PM, Mitchell Koch wrote:

My solution will at the minimum capitalize the starts of sentences
just judging by periods.

If supplied with an example source using the -s option, it will try to
find words that should always be capitalized (I, Ruby, proper nouns in
general), words that imply that the next word should be capitalized
(Lake, General) and words in which punctuation does not imply an end
of a sentence (abbreviations) although this is only helpful if there
is some capitalization in the text.

I’m still reading through the code but just a minor tip:

lines like:

if EOSPunc.index word[-1].chr then true else false end

can be replaced with

EOSPunc.index word[-1].chr

it will be either an index (truthy) or nil (falsy), so the if statement is
giving the same thing.

if you really want to have true or false (and not 3 or “hi” or nil,
even though those will work fine if you treat the variable as a
boolean) one way to do it is !!var. using not twice gets you true or
false. there’s probably something more readable though.
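
a small sketch of the options, under assumptions: the contents of `EOS_PUNC` are a guess (the original `EOSPunc` definition isn't shown in this thread) and `ends_sentence?` is a made-up name. `Array#include?` is one candidate for the "more readable" version, since it returns true or false directly.

```ruby
# Assumed contents; the original EOSPunc array is not shown in the thread.
EOS_PUNC = %w[. ? !].freeze

# Hypothetical predicate: does the word end with sentence-ending punctuation?
def ends_sentence?(word)
  EOS_PUNC.include?(word[-1])
end

p EOS_PUNC.index("hi."[-1])    # prints 0 (an index, which is truthy in Ruby)
p EOS_PUNC.index("hi"[-1])     # prints nil
p !!EOS_PUNC.index("hi."[-1])  # prints true
p ends_sentence?("hi.")        # prints true
p ends_sentence?("hi")         # prints false
```

on Ruby 1.9 and later `word[-1]` is already a one-character string, so the `.chr` from the 1.8-era original isn't needed in this sketch.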

– Elliot T.

> if EOSPunc.index word[-1].chr then true else false end
> can be replaced with
> EOSPunc.index word[-1].chr

Ah, yeah that’s a shorter way to do it. For some reason I had it
stuck in my head that I just wanted to express truth value and tried
to avoid passing on extraneous information (like in this case the
index in the punctuation array of the entry with the punctuation
attached to the last word).

I didn’t spend too much time refactoring; at first I was dreaming up
abstracting parts of both the source reading and the proper casing
into a token parser kind of thing, but then it was more like, okay it
works and it’s the Wednesday before the quiz summary goes up, so let’s
send it off. :wink:

Mitchell Koch

> This is quite clever Mitchell. Thanks so much for sharing it with us!
>
> Sadly, I wrote the summary earlier today when I had a few free
> moments. Don’t take it personally that it doesn’t mention this
> code. :frowning:

No worries. I shouldn’t have put it off to the last minute anyway. :slight_smile:

It’s interesting to me that so few of us submitted code for this
quiz. It’s a problem that has no clear solutions, partly because
capitalization isn’t just about grammatical rules, but does
communicate unique things as hinted in Elliot’s initial examples.

For example, if the word “gray” appears in a lowercase message, it
could mean the color, in which case it should stay lowercase, or it
could be a surname, in which case it should be capitalized. A computer
reader has no way to know; a human reader should be able to tell, but
really only the original author knows for sure.
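
One way to picture the problem in code is a sketch that at least flags such words instead of guessing. The word lists and the method name `capitalization_status` below are invented for illustration; a real program would load much larger lists from a dictionary.

```ruby
# Hypothetical word lists, just large enough to show the three cases.
COMMON_WORDS = %w[gray rose mark the is].freeze
PROPER_NOUNS = %w[gray rose mark ruby tahoe].freeze

# Classify a lowercase word: safe to capitalize, safe to leave, or ambiguous.
def capitalization_status(word)
  w = word.downcase
  common = COMMON_WORDS.include?(w)
  proper = PROPER_NOUNS.include?(w)
  if common && proper
    :ambiguous   # e.g. "gray": color or surname, only the author knows
  elsif proper
    :capitalize
  else
    :leave
  end
end

p capitalization_status("gray")  # prints :ambiguous
p capitalization_status("ruby")  # prints :capitalize
p capitalization_status("the")   # prints :leave
```

Elliot's make_capitalize_proper_nouns_file earlier in the thread settles the ambiguous case by subtracting every word that also has a lowercase entry, so "gray" would always be left lowercase there.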

It’s like image interpolation. If you start out with a small photo,
expand it, and try to infer the extra pixels, a good algorithm will
give you something that looks okay, but it will not be as good as if
you had taken it at the larger size in the first place.

Incidentally, that’s a good reason to not type in lowercase. :wink:

Mitchell Koch