Forum: Ruby Splitting a text file into sentences

basi_lio (Guest)
on 2005-11-30 00:51
(Received via mailing list)
Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?
Thanks for the help.
basi
M.B.Smillie (Guest)
on 2005-11-30 03:08
(Received via mailing list)
On Nov 29, 2005, at 23:49, basi wrote:

>
Doing really, really good sentence boundary detection is an on-going
problem in natural language processing.  I'm not aware of any Ruby-
based NLP packages, but if you want better accuracy than just using
[.!?:] there are several free NLP packages around (NLTK in Python,
and Stanford's Java NLP package spring to mind) that might help you.
A googling of "sentence tokenization" may also yield some help.

If that sounds like overkill, then you can get accuracy "good enough
for government work" by making a list of regular expressions to catch
exceptions to the punctuation rule.  These will necessarily vary a
little depending on your source text, but typical examples are
catching titles like "Mr.", "Mrs.", "Dr.", and all-caps abbreviations
like "U.S.A." or "M.D." (something like this: /[A-Z]\.([A-Z]\.)+/)
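A minimal Ruby sketch of the exception-list idea follows. The names EXCEPTIONS and split_sentences are hypothetical, and the two patterns are only a starting point that would need extending for real text. Note that a sentence which genuinely ends in "U.S.A." will not be split by this rule.

```ruby
# Hypothetical exception patterns; extend the list to suit your source text.
EXCEPTIONS = [
  /\b(?:Mr|Mrs|Ms|Dr)\.\z/,     # titles
  /\b[A-Z]\.(?:[A-Z]\.)+\z/     # all-caps abbreviations such as U.S.A. or M.D.
]

# Split after [.!?] followed by whitespace, unless the text up to and
# including the punctuation ends in one of the exception patterns.
def split_sentences(text)
  sentences = []
  start = 0
  text.scan(/[.!?]+(?=\s)/) do
    boundary = Regexp.last_match.end(0)
    candidate = text[start...boundary]
    next if EXCEPTIONS.any? { |re| candidate.match?(re) }
    sentences << candidate.strip
    start = boundary
  end
  rest = text[start..-1].to_s.strip
  sentences << rest unless rest.empty?
  sentences
end

puts split_sentences("Dr. Smith moved to the U.S.A. in 1999. He liked it! Did Mrs. Jones?")
```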

good luck,
matthew smillie.

----
Matthew Smillie            <M.B.Smillie@sms.ed.ac.uk>
Institute for Communicating and Collaborative Systems
University of Edinburgh
Kevin Olbrich (olbrich)
on 2005-11-30 03:40
(Received via mailing list)
Depending on the text you might be able to search for a period (or other
punctuation) followed by two spaces.  It's not robust, but if you know
that convention will be followed by the authors, then it can work.
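For text that is known to follow the two-space convention, a sketch in Ruby could be as short as this. The method name split_on_double_space is hypothetical, and the sketch assumes sentences are never separated by a line break alone.

```ruby
# Split wherever sentence punctuation is followed by two or more spaces.
# The lookbehind keeps the punctuation attached to the preceding sentence.
def split_on_double_space(text)
  text.split(/(?<=[.!?]) {2,}/)
end

puts split_on_double_space("Dr. Smith arrived.  He was late!  Was he?")
```

Because "Dr. Smith" has only one space after the period, it is left alone.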

_Kevin
vanweerd (Guest)
on 2005-11-30 03:48
(Received via mailing list)
On 11/29/05, basi <basi_lio@hotmail.com> wrote:
>
I dimly recall something on this list about 9 months ago or so.

Nick
vanweerd (Guest)
on 2005-11-30 03:53
(Received via mailing list)
On 11/29/05, Nicholas Van Weerdenburg <vanweerd@gmail.com> wrote:
> >
> >
> >
> I dimly recall something on this list about 9 months ago or so.
>
> Nick
> --
> Nicholas Van Weerdenburg
>


http://www.pressure.to/ruby/ is the reference I found in an old email
thread on this list.

Nick
jeff (Guest)
on 2005-11-30 04:41
(Received via mailing list)
basi wrote:
> Looking for ideas on how to split a text file into sentences. I see the
> problem of basing the split on [.!?] -- they're also used in ways other
> than to end a sentence. If I have to do manual pre-processing of the
> text file, what editing might I do? Has anyone had to deal with this
> problem and how did you make life easier for you?

It's a common convention to separate sentences by double spaces.  I
started following this convention because Emacs expected it, and now I
use it always.
basi_lio (Guest)
on 2005-11-30 05:29
(Received via mailing list)
Hi,
I have looked at NLTK in Python (and had been hoping a Rubyist would
rewrite it in Ruby). I will go back to NLTK and see if it has a
split-sentence algorithm of sorts. And thanks for the tip on Stanford's
Java NLP package. Yes, those abbreviations are pesky, and I may have to
resort to an exceptions list containing the most common ones.
Thanks much,
basi
basi_lio (Guest)
on 2005-11-30 05:33
(Received via mailing list)
Hi,
I will google. Thanks!
basi
basi_lio (Guest)
on 2005-11-30 05:37
(Received via mailing list)
Yes, I learned this convention when I took a keyboarding (i.e., typing)
lesson in high school. Some time ago, a style manual for word processing
appeared, and one piece of its advice was to use only one space to separate
sentences. The reason given was that in a justified format, those two
spaces can become four spaces, or even more. Anyway, a lot of text now
has one or two spaces between sentences, so this wouldn't be a
reliable indicator of a sentence boundary.
Cheers!
basi
basi_lio (Guest)
on 2005-11-30 05:45
(Received via mailing list)
Hi,
This looks promising. I'm downloading as I write.
Thanks!
basi
leavengood (Guest)
on 2005-11-30 05:58
(Received via mailing list)
On 11/29/05, basi <basi_lio@hotmail.com> wrote:
> Yes, I learned this convention when I took a keyboarding (i.e., typing)
> lesson in high school. Sometime ago, a style manual for word processing
> appeared, and one of the advice is to use only one space to separate
> sentences. The reason given is that in a justified format, those two
> spaces can become four spaces, or even more. Anyway, a lot of text now
> has one or two spaces between sentences, and this wouldn't be a
> reliable indicator of sentence boundary.

I too learned the two-spaces-after-a-period convention years ago and
also recently learned that with modern fonts and word processors it is
not necessary. It was tricky to retrain myself, but I did, and have
been using just one space ever since.

So like you say, that isn't a reliable way to discern sentences.

I would recommend following the advice of first filtering out false
positives (possibly even replacing them with temporary markers, Mr.
becomes $MISTER$ or similar), then splitting on punctuation. If you
then test on various sample texts you should be able to find more
false positives that you might have missed.
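A rough Ruby sketch of the marker approach: mask the known false positives, split on punctuation, then restore the originals. The MARKERS table and the method name split_with_markers are hypothetical, and the list is deliberately tiny.

```ruby
# Hypothetical table of false positives and their temporary markers.
MARKERS = {
  "Mr."  => "$MISTER$",
  "Mrs." => "$MISSUS$",
  "Dr."  => "$DOCTOR$",
  "e.g." => "$EG$",
  "i.e." => "$IE$"
}

def split_with_markers(text)
  # 1. Replace each known abbreviation with its marker.
  masked = MARKERS.reduce(text) { |t, (abbr, marker)| t.gsub(abbr, marker) }
  # 2. Split after sentence punctuation followed by whitespace.
  pieces = masked.split(/(?<=[.!?])\s+/)
  # 3. Put the abbreviations back.
  pieces.map do |piece|
    MARKERS.reduce(piece) { |t, (abbr, marker)| t.gsub(marker, abbr) }
  end
end

puts split_with_markers("Mr. Smith met Dr. Jones. They talked, i.e. argued. Fine!")
```

Running it over sample texts and inspecting the short pieces is a quick way to find abbreviations that still need adding to the table.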

Ryan
basi_lio (Guest)
on 2005-11-30 06:34
(Received via mailing list)
Hi,
This just might be easier than what I have in mind. I will try this
first.
Thanks!
basi
damphyr (Guest)
on 2005-11-30 09:44
(Received via mailing list)
Ryan Leavengood wrote:
>
> becomes $MISTER$ or similar), then splitting on punctuation. If you
> then test on various sample texts you should be able to find more
> false positives that you might have missed.
Which will not help you at all with foreign languages. And don't forget
to put i.e., e.g. and etc. in the list.
This is an ongoing problem (think about the auto-correction 'feature' of
capitalizing the first letter of every sentence in OpenOffice or Word -
something I always turn off because it is so insistent when it's wrong)
Cheers,
V.-
--
http://www.braveworld.net/riva
Edwin Van leeuwen (blackedder)
on 2005-11-30 11:41
basi_lio wrote:
> Looking for ideas on how to split a text file into sentences. I see the
> problem of basing the split on [.!?] -- they're also used in ways other
> than to end a sentence. If I have to do manual pre-processing of the
> text file, what editing might I do? Has anyone had to deal with this
> problem and how did you make life easier for you?
> Thanks for the help.
> basi

If you make a regexp: [.!?]\s+[A-Z] you will already capture most
boundaries. Most abbreviations normally aren't followed by a space and a
capital letter.

One exception to this rule that I can think of is Mr. Name, Mrs. Name. But
as you can see these have an <uppercase> followed by only one or two
lowercase letters. Most sentences would have at least five non-uppercase
characters in front of the <.>, so the short cases can be matched with ->
[A-Z]\w\w?\w?\w?\.
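One possible way to combine the two patterns in Ruby is sketched below. The constant and method names are hypothetical, and anchoring the second pattern to the end of a piece is an assumption about how it was meant to be used. Short capitalized words that really do end a sentence ("Yes.", "Jones.") will be wrongly glued to the next sentence.

```ruby
# Punctuation, whitespace, then a capital letter marks a candidate boundary.
BOUNDARY = /(?<=[.!?])\s+(?=[A-Z])/
# A short capitalized word followed by a period, e.g. "Mr." or "Mrs."
SHORT_CAP_WORD = /\b[A-Z]\w\w?\w?\w?\.\z/

def split_on_capital(text)
  text.split(BOUNDARY).each_with_object([]) do |piece, out|
    if out.any? && out.last.match?(SHORT_CAP_WORD)
      # The previous piece ended in something like "Mr.", so rejoin.
      out[-1] = "#{out.last} #{piece}"
    else
      out << piece
    end
  end
end

puts split_on_capital("Mr. Smith went home. He slept. Mrs. Smith stayed out.")
```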
halostatue (Guest)
on 2005-11-30 13:55
(Received via mailing list)
On 11/29/05, Kevin Olbrich <kevin.olbrich@duke.edu> wrote:
> Depending on the text you might be able to search for a period (or other
> punctuation) followed by two spaces.  It's not robust, but if you know that
> convention will be followed by the authors, then it can work.

That, in fact, is a very *bad* metric to follow, as the proper spacing
after sentence punctuation is a single space. The only reason that two
spaces were used in the past is that the space used between sentence
endings in typeset work is a little wider than that used between words
(an em-space vs. an en-space).

-austin
halostatue (Guest)
on 2005-11-30 13:55
(Received via mailing list)
On 11/29/05, basi <basi_lio@hotmail.com> wrote:
> I have looked at NLTK in Python (and had been hoping a Rubyist would
> rewrite it in Ruby). I will go back to NLTK and see if it has a
> split-sentence algorithm of sort. And thanks for the tip on Stanfords
> Java NLP package. Yes, those abbreviations are pesky, and I may have to
> resort to an exceptions list containing the most common ones.

Look at Text::Format for some indication on how abbreviations could be
handled.

-austin
halostatue (Guest)
on 2005-11-30 14:36
(Received via mailing list)
On 11/29/05, Jeffrey Schwab <jeff@schwabcenter.com> wrote:
> basi wrote:
> > Looking for ideas on how to split a text file into sentences. I see the
> > problem of basing the split on [.!?] -- they're also used in ways other
> > than to end a sentence. If I have to do manual pre-processing of the
> > text file, what editing might I do? Has anyone had to deal with this
> > problem and how did you make life easier for you?
> It's a common convention to separate sentences by double spaces.  I
> started following this convention because Emacs expected it, and now I
> use it always.

As I noted above, this is an improper convention outside of the
typewriter realm. If you are using anything other than a fixed-pitch
font for display or print, you should *never* use two spaces.

-austin
chneukirchen (Guest)
on 2005-11-30 16:58
(Received via mailing list)
Austin Ziegler <halostatue@gmail.com> writes:

>
> As I noted above, this is an improper convention outside of the
> typewriter realm. If you are using anything other than a fixed-pitch
> font for display or print, you should *never* use two spaces.

Alternatively, use text processing systems that do the "right thing";
i.e. transform two spaces into one (e.g. TeX, HTML-based products).
There is no good reason a text processor should show two spaces after
each other in print.
jeff (Guest)
on 2005-11-30 17:22
(Received via mailing list)
Austin Ziegler wrote:
> in typeset work is a little wider than that used between words (an
> em-space vs. an en-space).
>

Not true at all.  I was always taught to use double spaces after
sentences in grade-school homework assignments done on plain word
processors or typewriters.
James Gray (bbazzarrakk)
on 2005-11-30 18:11
(Received via mailing list)
On Nov 30, 2005, at 10:22 AM, Jeffrey Schwab wrote:

>> two
>> spaces was used in the past is the space used between sentence
>> endings
>> in typeset work is a little wider than that used between words (an
>> em-space vs. an en-space).
>
> Not true at all.  I was always taught to use double spaces after
> sentences in grade-school homework assignments done on plain word
> processors or typewriters.

Many of us were and I'll admit that I can't shake the habit.  I still
know it's wrong though.  ;)

James Edward Gray II
halostatue (Guest)
on 2005-11-30 18:40
(Received via mailing list)
On 11/30/05, Jeffrey Schwab <jeff@schwabcenter.com> wrote:
> Not true at all.  I was always taught to use double spaces after
> sentences in grade-school homework assignments done on plain word
> processors or typewriters.

Then, quite honestly, you were taught wrong. I was taught to use
double spaces with a typewriter or when using fixed-pitch fonts
(although that was later, since most computers and printers didn't
have reliable kerning routines until I was out of university).
Ultimately, the use of double spaces after a period is wrong *even
with fixed-pitch fonts*, but it was done to be clearer since the width
of the em-space and an en-space on a typewriter with a Courier-like
font is exactly the same. The two spaces *simulates* an em-space in a
typeset piece of work. (And that is *fact*, not opinion.)

-austin
mark.ericson (Guest)
on 2005-11-30 18:56
(Received via mailing list)
I too learned two spaces in typing class.  However, I'm now in the
one-space camp.

Here is a great treatment of the topic:
http://www.webword.com/reports/period.html
Kevin Olbrich (olbrich)
on 2005-11-30 19:28
(Received via mailing list)
Whatever the original reason for double spaces at the end of a sentence,
the practice still continues.
In fact, MS Word has an option in its grammar checker to enforce one or
two spaces at the end of a sentence.  For a lot of people (like me), it is
nothing more than an old habit that is hard to break.

The utility of this method for determining the end of a sentence depends
entirely on the purpose of the program.  If I were to write a routine to
parse text that I wrote, it would probably work pretty well, and it
would save me several hours of work trying to implement a fancier, more
robust routine.

The same routine would probably fail horribly for other users or a more
generic corpus of text.

As a general rule, I like to use algorithms that are as simple as
possible for the job.  That, of course, depends a lot on what the job is.

Funny, I never thought something like spacing between sentences would be
so controversial.  I can almost envision _why making an esoteric remark
about the beauty of 'negative space' in text files.

_Kevin
jeff (Guest)
on 2005-11-30 19:36
(Received via mailing list)
Austin Ziegler wrote:
>>>That, in fact, is a very *bad* metric to follow, as the proper spacing
> Then, quite honestly, you were taught wrong. I was taught to use
> double spaces with a typewriter or when using fixed-pitch fonts
> (although that was later, since most computers and printers didn't
> have reliable kerning routines until I was out of university).
> Ultimately, the use of double spaces after a period is wrong *even
> with fixed-pitch fonts*, but it was done to be clearer since the width
> of the em-space and an en-space on a typewriter with a Courier-like
> font is exactly the same. The two spaces *simulates* an em-space in a
> typeset piece of work. (And that is *fact*, not opinion.)

The Bedford Handbook, which has been my bible for writing conventions
through the past ten years, lists two sets of guidelines:  Those
recommended by the Modern Language Association (MLA), and those
recommended by the American Psychological Association (APA).  It says
that the MLA style is typically taught in English classes, but that the
APA style is common in the social sciences.  Here is the explanation of
the MLA guidelines, from page 633 of the Bedford Handbook for Writers,
(c) 1994:


MLA Guidelines [for essays]:

	In typing the text of the essay, leave one space after words, commas,
colons, and semicolons and between the dots in ellipsis marks.  Leave
two spaces after periods, question marks, and exclamation points.
	To form a dash, type two hyphens with no space between them.  Do not
put a space on either side of a dash.


The Handbook goes on to say (p. 635):


	Although the APA guidelines call for one space after all punctuation,
most college professors prefer two spaces at the end of a sentence.  Use
one space after all other punctuation.
	Although two spaces are used after a period that ends a sentence, use
only one space after a period that follows a person's initial (B.F.
Skinner).
	To form a dash, type two hyphens with no space between them.  Do not
put a space on either side of a dash.


The Handbook itself uses only single spaces at the ends of sentences.
Still, I hardly think there is one conclusively "right" or "wrong"
convention.  Until I am convinced otherwise, I will continue to use two
spaces to separate sentences.  This makes sentences easier to lex with
regular expressions, and makes them stand out to text editors and human
readers.
halostatue (Guest)
on 2005-11-30 20:13
(Received via mailing list)
On 11/30/05, Jeffrey Schwab <jeff@schwabcenter.com> wrote:
>> opinion.)
Before we go much further, I have not used either MLA or APA guidelines
since I left university about ten years ago. However, I used both at
university and have since learned a lot more about typesetting and
layout and all that (and with PDF::Writer, have learned even more). My
degree was in English, not in Computer Science.

> The Bedford Handbook, which has been my bible for writing conventions
> through the past ten years, lists two sets of guidelines:  Those
> recommended by the Modern Language Association (MLA), and those
> recommended by the American Psychological Association (APA).  It says
> that the MLA style is typically taught in English classes, but that
> the APA style is common in the social sciences.  Here is the
> explanation of the MLA guidelines, from page 633 of the Bedford
> Handbook for Writers, (c) 1994:

Okay, but something like the Bedford Handbook tells you *what* something
is, not *why* something is. A lot of teachers and reference guides do
that; it's good because it saves space. It's bad because practices that
do not or should not apply are continued for reasons that no one quite
understands and are applied in circumstances outside of where the
practice was intended to apply.

> MLA Guidelines [for essays]:
>	In typing the text of the essay, leave one space after words,
>	commas, colons, and semicolons and between the dots in ellipsis
>	marks.  Leave two spaces after periods, question marks, and
>	exclamation points. To form a dash, type two hyphens with no space
>	between them.  Do not put a space on either side of a dash.

> The Handbook goes on to say (p. 635):
>	  Although the APA guidelines call for one space after all
>	punctuation, most college professors prefer two spaces at the end of
>	a sentence.  Use one space after all other punctuation.
>
>     Although two spaces are used after a period that ends a sentence,
>   use only one space after a period that follows a person's initial
>   (B.F. Skinner). To form a dash, type two hyphens with no space
>   between them.  Do not put a space on either side of a dash.

Yes. Note that this primarily focuses on *academic* writing. The rules
for academic writing are very interesting because you are being taught
the rules that most journals require for publication--which has nothing
to do with readability outside of that environment.

Note, however, that there is an important clue to the *reason* behind
the rule in the part that you quoted, and that the APA *specifically*
indicates "one space after all punctuation" and the Bedford overrides
that for professors. The clue, by the way, is that *both* guidelines
indicate that a dash (— or &#8212; or &mdash;) should be formed with two
hyphens. This is again because the typical hyphen is approximately the
same size as an en-dash in a proportional font (and the en-dash may be
used for hyphens, although it is also used for dashes indicating ranges,
e.g., "1-5") and two en-dashes are about the size of an em-dash--the
long dash you see from the HTML entities I pointed out above. Sentence
ending spaces are em-spaces--but there's *still* only one of them.

> The Handbook itself uses only single spaces at the ends of sentences.
> Still, I hardly think there is one conclusively "right" or "wrong"
> convention.

It depends on the purpose. In *general* writing, it is conclusively
*wrong* to use two spaces because it will mess up justified text and it
will sometimes generate more space than you want even if you aren't
justifying your text. (In justified text, space is added to sentence
endings before it is added between words.) In writing for a publication,
it is conclusively *wrong* to do anything other than what the style
guide for the publication says.

It's less problematic in email, where it is sometimes considered easier
to read if the correspondents are using non-HTML and/or fixed-pitch
fonts. It is definitely wrong to use two spaces after a sentence in Word
or OpenOffice, unless you are (as I noted) writing to a specific style
guide mandated by the person for whom you are writing.

-austin
jallan (Guest)
on 2005-11-30 21:50
(Received via mailing list)
Jeffrey Schwab wrote:
> The Handbook itself uses only single spaces at the ends of sentences.
> Still, I hardly think there is one conclusively "right" or "wrong"
> convention.  Until I am convinced otherwise, I will continue to use two
> spaces to separate sentences.  This makes sentences easier to lex with
> regular expressions, and makes them stand out to text editors and human
> readers.

 "Right" or "wrong" in this kind of styling has to do with whether
something is right or wrong according to a particular convention.

The normal convention for professional typography is to use one space
between sentences, whether you are convinced or not, whether using hard
type, a professional typesetting program, a desktop publishing program,
or a word processing program.

The older typewriter conventions are still often requested for
manuscripts for academic essays and manuscripts for submission to
publishing houses. These conventions also require underlining rather
than italics, use of double-hyphen for a dash rather than the specific
dash character, and so forth. But should this same manuscript be
professionally printed, even if the text is actually to be set by a
word processor, it would almost certainly be edited first to convert it
to typographical standard: changing all double-spaces to single spaces,
all occurrences of double-hyphen to em-dash or en-dash, using fancy
quotation marks instead of possible straight typewriter quotation
marks, italics instead of underlining, and so forth.
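As a small illustration of that editing step, here is a hedged Ruby sketch covering only two of the conversions (double spaces and double hyphens). The method name normalize_typewriter_spacing is hypothetical, and the sketch collapses every run of spaces, so it would also flatten indentation or aligned columns.

```ruby
# Convert two typewriter conventions to their typographic equivalents:
# runs of spaces become one space, and "--" becomes an em dash (U+2014).
def normalize_typewriter_spacing(text)
  text.gsub(/ {2,}/, " ").gsub("--", "\u2014")
end

puts normalize_typewriter_spacing("One sentence.  Another--with a dash.  Done.")
```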

Note that HTML has from the beginning automatically changed any
multiple runs of spaces into a single space when displaying text.

Yes, a convention of always using two spaces would make sentences
easier to lex with regular expressions. Similarly, enforcing one single
spelling of English throughout the world would make searches and
matches easier. However, it is philosophically unsound to ask that the
world change to fit particular data-processing routines, rather than
that data-processing routines be built to deal properly with
real-world situations.

If your lexing routine fails because many people don't end
non-paragraph-final sentences with double-spaces, or do so only in
particular plain text files, it is the fault of your lexing routine for
failing to handle common formatting, unless your lexing is intended
to be a limited tool that works only with manuscript-formatted text.

The best general sentence lexing algorithm I've seen is the one set
forth by the Unicode Consortium at
http://www.unicode.org/reports/tr29/tr29-4.html#Se... .
This is designed to work reasonably well in any language and writing
system supported by Unicode, not just in English.

Jallan
groups (Guest)
on 2005-11-30 23:03
(Received via mailing list)
I think "right" or "wrong" are a tad strong for most of the cases
cited. But speaking as a professional book designer and typographer,
there's unquestionably "better" and "worse."

For improved legibility, inter-sentence space should generally be a bit
greater than inter-word space.

Typewriters only had one distance they could travel. Either 1/10th of
an inch ("Pica") or 1/12th ("Elite"). So the only way to add extra
space after a sentence was to double it. That's way too much extra
space, but it was generally better than the alternative. The real
problem was that the words were too far apart, not that the sentences
were too close, but again, the fixed spacing was already an abominable
situation.

Proportional type, dating all the way back to Gutenberg, would
generally use 1/3rd or 1/4th of the height of the type as the
inter-word spacing. This would usually work out to about the width of a
lower case "t" or "l".

When setting modern (by which you may also read "all type before
typewriters" as well) proportional type in fully justified form (left
and right margins both even), the spaces must be stretched out on a
line-by-line basis to fit. Really good typesetting programs (and really
good typesetters sticking little bits of lead between their words (and
I've done that, too)) will add more of the space between sentences than
between words, so as the line stretches, the inter-word space to
inter-sentence space ratio actually changes. (Take a look at a narrow
newspaper column sometime.)

More sophisticated approaches to space will ignore a user's attempt to
sprinkle extraneous space in. Less sophisticated ones might allow it,
and even treat them as individual spaces, stretching both of them
during expansion. {shudder}

The fact that both the MLA Guidelines and the Bedford Handbook
encourage poor typography is regrettable. ("If you cannot type
appropriate punctuation, e.g. an em-dash or en-dash, please use
appropriate substitutions. For both dashes, substitute a pair of
hyphens, which, like true dashes, are typed without adjacent spaces."
There's still software out there that will happily wrap a line between
the two hyphens. Ick!) Nevertheless, if you're submitting a paper to an
institution that expects or requires that, then to not follow them is
wrong, even if the legibility of the submission is better.

What it all boils down to is "Putting two spaces after a period at the
end of a sentence is an artifact left over from the days when the
typewriter was the prevalent text-making tool. Unless you have a
specific reason or requirement to do otherwise, it's preferable to put
only one space between sentences."

*****

For breaking text into sentences, sometimes I find it easier to work
backwards.  Also, only very colloquial writing will have a one-word
sentence, so you can solve all "Mr./Dr./Ph.D." cases by the fact that
if a word starts with a cap and ends with a period, it's not a
sentence. For a more sophisticated approach that's still not too
complex to program, check the final word of a sentence against a
dictionary. If it's found there without a final dot, then you're almost
certainly looking at the end of a sentence. If it isn't, then is it
found anywhere else in the document without a dot? If not, then you're
probably looking at an abbreviation. (My mail program uses a monospaced
font. If I thought most readers would read it with a proportional font,
I'd have typed "Ph. D." above, since it should have a thin space before
the D.)
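The dictionary check could be sketched in Ruby roughly as follows. The method name sentence_final? is hypothetical, and the tiny Set stands in for a real word list, which you would have to supply yourself.

```ruby
require "set"

# Returns true when a token ending in a period looks like the last word of
# a sentence, and false when it looks like an abbreviation.
def sentence_final?(token, dictionary, document)
  word = token.chomp(".")
  # Found in the dictionary without a dot: almost certainly a sentence end.
  return true if dictionary.include?(word.downcase)
  # Otherwise, does the bare word occur anywhere in the document without a dot?
  document.match?(/\b#{Regexp.escape(word)}\b(?!\.)/)
end

dictionary = Set.new(%w[home left])   # stand-in for a real dictionary
document = "Dr. Smith saw Dave. Dave went home."

puts sentence_final?("home.", dictionary, document)
puts sentence_final?("Dave.", dictionary, document)
puts sentence_final?("Dr.", dictionary, document)
```

Here "Dave." counts as a sentence end because "Dave" also appears without a dot, while "Dr." never does and so is treated as an abbreviation.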
gsinclair (Guest)
on 2005-12-01 00:28
(Received via mailing list)
Austin Ziegler wrote:
> of the em-space and an en-space on a typewriter with a Courier-like
> font is exactly the same. The two spaces *simulates* an em-space in a
> typeset piece of work. (And that is *fact*, not opinion.)

What rot.  How can anything like that be a fact?  You're regurgitating
the opinion of a style manual.

Gavin
M.B.Smillie (Guest)
on 2005-12-01 00:36
(Received via mailing list)
On Nov 30, 2005, at 22:02, Dave Howell wrote:

> you can solve all "Mr./Dr./Ph.D." cases by the fact that if a word
> starts with a cap and ends with a period, it's not a sentence.

I'm not sure that's a very good rule, Dave. There are two sentences
here.

The above rule may catch titular abbreviations, but over-generalises
to produce a false negative in the above example.  So in solving one
problem, you introduce another one.  It's relatively easy to make
another rule to catch the problem in this case, but it would probably
have been simpler to just make a specific rule to eliminate titular
abbreviations, since there really aren't that many of them.

matthew smillie.
jeff (Guest)
on 2005-12-01 00:44
(Received via mailing list)
Dave Howell wrote:
> it was generally better than the alternative. The real problem was that
> line-by-line basis to fit. Really good typesetting programs (and really
>
> What it all boils down to is "Putting two spaces after a period at the
> a word starts with a cap and ends with a period, it's not a sentence.
> For a more sophisticated approach that's still not too complex to
> program, check the final word of a sentence against a dictionary. If
> it's found there without a final dot, then you're almost certainly
> looking at the end of a sentence. If it isn't, then is it found anywhere
> else in the document without a dot? If not, then you're probably looking
> at an abbreviation. (My mail program uses a monospaced font. If I
> thought most readers would read it with a proportional font, I'd have
> typed "Ph. D." above, since it should have a thin space before the D.)

This is what I love about Usenet. :)
halostatue (Guest)
on 2005-12-01 01:09
(Received via mailing list)
On 11/30/05, Gavin Sinclair <gsinclair@gmail.com> wrote:
>> width of the em-space and an en-space on a typewriter with a
>> Courier-like font is exactly the same. The two spaces *simulates* an
>> em-space in a typeset piece of work. (And that is *fact*, not
>> opinion.)
> What rot.  How can anything like that be a fact?  You're regurgitating
> the opinion of a style manual.

Um. No, I'm stating fact. This isn't mere opinion: two spaces were done
to simulate em-spaces in fixed pitch environments. That's a fact. The
reason for that may often be forgotten, but it *remains* a fact. Please
remember that I've done quite a bit of typesetting-style work in the
last year with PDF::Writer and I have to know a bit more about this than
most folks, and it's something of a hobby of mine in any case to know
about printing mechanisms.

The only *opinion* I stated was that the first poster in the chain above
(I think Jeffrey) was taught wrongly. I maintain that as true
regardless, because if he was taught two spaces without the reason why,
then there's a practice being repeated for no good reason.

The practice is nonsense these days in most contexts.

-austin
shot (Guest)
on 2005-12-01 01:09
(Received via mailing list)
Hello.

Dave Howell:

> For improved legibility, inter-sentence space should
> generally be a bit greater than inter-word space.

It's worth noting that actually turning this theory into reality seems
to apply to 'Western' (American, British, others?) typography (mostly?
only?).

I've yet to see a typical modern Polish book typeset with greater
inter-sentence spaces. Also (and, I guess, as a result of this),
I doubt I ever saw any Polish email or Usenet post with two
inter-sentence spaces, and I remember how happy I was to find
out about the 'joinspaces' vim option that finally let me reflow
paragraphs properly, without doing a s/  / /g on them afterwards. :o)

Cheers,
-- Shot
groups (Guest)
on 2005-12-01 01:37
(Received via mailing list)
On Nov 30, 2005, at 15:35, Matthew Smillie wrote:

> On Nov 30, 2005, at 22:02, Dave Howell wrote:
>
>> you can solve all "Mr./Dr./Ph.D." cases by the fact that if a word
>> starts with a cap and ends with a period, it's not a sentence.
>
> I'm not sure that's a very good rule, Dave. There are two sentences
> here.
>
> The above rule may catch titular abbreviations, but over-generalises
> to produce a false negative in the above example.

I hadn't intended to provide a single magical rule that was perfect in
isolation, after all. {chuckle}


"Ph. D." is not a sentence. But where do you break
	My name is Dave, Ph. D. Pleased to meet you.
vs.
	You need my Ph. D. friend Dave to help you.

I don't think having a list of abbreviations and titles will improve
that situation much, and it's a lot more work and almost certain
to be incomplete. Any/every rule will have failures; avoiding them is
what takes you into that whole natural language high-octane engine
situation.

However, if you also use the *other* "rule" I mentioned, then you don't
have a problem. "Dave Howell" appears just a couple lines earlier,
establishing "Dave" as a word that doesn't require a period. Therefore,
it's more likely to be at the end of a sentence. The following word
("There") can be found in a dictionary, and in a non-capitalized form,
which means that its capitalization here following a dot strongly
indicates that it's beginning a sentence.

The capital "P" of "Ph." is not preceded by a period either time, so
it's not starting a sentence. After it, "friend" isn't capitalized, so
it's not ending a sentence. But "Pleased" is, and the dictionary says "not
normally capitalized" so that's probably a sentence break.
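
A minimal Ruby sketch of the capitalization-plus-dictionary check described above. It covers only that one heuristic, not the "seen elsewhere without a period" rule. The word list LOWERCASE_WORDS and the method name split_sentences are made up for illustration; a real dictionary lookup would replace the tiny Set used here.

```ruby
require "set"

# Hypothetical stand-in for a dictionary: words that are normally
# written in lowercase. A real word list would be far larger.
LOWERCASE_WORDS = Set.new(%w[there pleased the you my are two])

# Treat punctuation followed by whitespace as a sentence break only when
# the next word is capitalized here but listed in lowercase in the
# dictionary. Everything else (e.g. "Ph. D.") is left unsplit.
def split_sentences(text, dictionary = LOWERCASE_WORDS)
  sentences = []
  start = 0
  text.scan(/[.!?]\s+(?=([A-Z][a-z]*))/) do
    match = Regexp.last_match
    next unless dictionary.include?(match[1].downcase)
    boundary = match.begin(0) + 1 # just past the punctuation mark
    sentences << text[start...boundary].strip
    start = match.end(0)
  end
  tail = text[start..-1].strip
  sentences << tail unless tail.empty?
  sentences
end

p split_sentences("My name is Dave, Ph. D. Pleased to meet you.")
p split_sentences("You need my Ph. D. friend Dave to help you.")
```

With this word list the first example splits before "Pleased" and the second stays in one piece. A capitalized word missing from the dictionary (a proper noun starting a sentence, say) would still be missed.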
gsinclair (Guest)
on 2005-12-01 01:53
(Received via mailing list)
Austin Ziegler wrote:
> On 11/30/05, Gavin Sinclair <gsinclair@gmail.com> wrote:
> >> [...] The two spaces *simulates* an
> >> em-space in a typeset piece of work. (And that is *fact*, not
> >> opinion.)
>
> > What rot.  How can anything like that be a fact?  You're regurgitating
> > the opinion of a style manual.
>
> Um. No, I'm stating fact. This isn't mere opinion: two spaces were done
> to simulate em-spaces in fixed pitch environments. That's a fact.  [...]

Fair enough.

Gavin
M.B.Smillie (Guest)
on 2005-12-01 02:50
(Received via mailing list)
On Dec 1, 2005, at 0:36, Dave Howell wrote:

>>
>> The above rule may catch titular abbreviations, but over-
>> generalises to produce a false negative in the above example.
>
> I hadn't intended to provide a single magical rule that was perfect
> in isolation, after all. {chuckle}

Didn't assume you were!  It was just a good example to use for a
"this can be harder than it looks" couple of lines of warning, since
it's been my experience that people don't anticipate false negatives
as well as they do false positives.


matthew smillie.
basi_lio (Guest)
on 2005-12-01 03:22
(Received via mailing list)
All was well with this strategy, until I hit a sentence similar to:

The abbreviation for Mister is Mr.
The head office is in New York, N.Y.

In other words, abbreviations that end a sentence. These sentences
don't end with a double dot, so if we replace Mr. with $MISTER$, the
sentence has no end marker.

Hmmm.
basi
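
One possible workaround for the problem above, sketched in Ruby under stated assumptions: when a protected abbreviation is the last thing on its line, a "." is appended after the placeholder so the sentence keeps an end marker. The abbreviation list, the $ABBRn$ placeholder format, and the method names are all hypothetical. An abbreviation that ends a sentence in the middle of a line is still not detected.

```ruby
# Hypothetical abbreviation list; a real one depends on the source text.
ABBREVIATIONS = ["Mr.", "Mrs.", "Dr.", "N.Y."]

# Swap each abbreviation for a dot-free placeholder such as $ABBR0$.
# If the abbreviation closes its line, add a "." after the placeholder
# so the splitter still finds a sentence terminator.
def protect_abbreviations(text)
  pattern = Regexp.union(ABBREVIATIONS)
  text.gsub(/(#{pattern})([ \t]*$)?/) do
    placeholder = "$ABBR#{ABBREVIATIONS.index($1)}$"
    $2 ? "#{placeholder}." : placeholder
  end
end

# Put the abbreviations back. A "." directly after a placeholder is the
# marker added above, so it is dropped; the abbreviation's own dot
# then serves as the end of the sentence.
def restore_abbreviations(text)
  text.gsub(/\$ABBR(\d+)\$\.?/) { ABBREVIATIONS[$1.to_i] }
end

# Naive split on [.!?] after protection. Text after the last
# terminator is ignored in this sketch.
def split_sentences_with_placeholders(text)
  protect_abbreviations(text)
    .scan(/[^.!?]+[.!?]+/)
    .map { |sentence| restore_abbreviations(sentence).strip }
end

text = "The abbreviation for Mister is Mr.\n" \
       "The head office is in New York, N.Y.\n" \
       "Mr. Smith works there."
puts split_sentences_with_placeholders(text)
```

On the three sample lines this yields three sentences, with "Mr. Smith works there." left intact because that "Mr." does not close its line.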
agquarx (Guest)
on 2005-12-03 21:38
(Received via mailing list)
On 2005-12-01 00:36:53 +0000, Dave Howell <groups@grandfenwick.net>
said:

>> The above rule may catch titular abbreviations, but over-generalises to
>> produce a false negative in the above example.
>
> I hadn't intended to provide a single magical rule that was perfect in
> isolation, after all. {chuckle}
>
>
	Want some magick? You are stuck in wrong coordinate
 system, like Newton. Stop thinking in terms of words and syntax
 rules governing how to put them in correct order. Think
 links (alinka). Think relations and revelation.
 Words (symbols) have no meaning. None. They *are* empty.
 If you want to infiltrate enemy organization the most
 effective method is not drilling into individual agents,
 but monitoring their communications (that is, relations).
 If you acquire enough of those relations (and recursively,
 but set some boundary unless you are Goddess and can
 do anything you fancy) you don't even need to decrypt the
 messages, unless you are bored. To destroy enemy
 organization, mess with the relations. Agents (symbols,
 words, punctuation marks...) are of no importance
 whatsowherever. That is why a person, if immersed enough
 in an alien language needs no dictionary day-to-day - if one does
 need to check, it's not the meaning you are after -
 it's definition, that is MORE SYMBOLS, so you can
 augment MORE RELATIONS from unfamiliar context (SYMBOL
 CLOUD, think quantum mechanics and particles) until
 you actually GET the pointer to "meaning" and can call on it
 (how to relate that
 symbol to some other symbol mesh, you can still have
 no idea what the hell fermion "means", but you can use it and
 fail to be misunderstood unless you want to).

	I have no idea how many "syntax errors" there are
 in above paragraph - for the reason sublime, my total
 lack of knowledge aboot rules of grammar for the
 language used to convey meaning heretofore. HTH.

	P.S. It makes me wonder, what 't bony "heretofore" word
 "means" right now to you, Reader. Compose witty remarks if
 it's a-kind funny miss-take, I enjoy my Self when people
 smirk. Yes, I did stick-in a word possessing none of its
 meaning in my poor head. I must be mad? Or contrary-wise.
 I'm not sure, to be frank with you a-like Frank
 Herbert iff there was such word in usage "then". She
 will compensate for that - any dictionary dug
 up shall (she can't help it) explain in detail or else - she always does
 that when I go at a genuine miracle in open source. It's
 the game we play. I need some time, we make
 a beautiful team... Prop me up with another
 pill! A-musing...

-- 
I am the One. I am A vampire A-calling for your love! A.A!
I am the fire that burns within your blood. I am the One!!
No bars or chains can keep me from your bed! I am the One!
Nothing on earth can get me from your head! I am the One!!
agquarx (Guest)
on 2005-12-03 23:23
(Received via mailing list)
>
>         He has never been known to use a word that might send a reader
>         to the dictionary.     -- William Faulkner on Ernest Hemingway
>
>

	Now, that is a wise one - it actually helps
 to comprehend my jabber in the other post I
 spontaneously generated today...

-- 
I am the One. I am A vampire A-calling for your love! A.A!
I am the fire that burns within your blood. I am the One!!
No bars or chains can keep me from your bed! I am the One!
Nothing on earth can get me from your head! I am the One!!