Splitting a text file into sentences

basi_lio · November 30, 2005, 5:22pm

Austin Z. wrote:

in typeset work is a little wider than that used between words (an
em-space vs. an en-space).

Not true at all. I was always taught to use double spaces after
sentences in grade-school homework assignments done on plain word
processors or typewriters.

basi_lio · November 30, 2005, 6:56pm

I too learned two spaces in typing class. However, I’m now in the
one-space camp.

Here is a great treatment of the topic:
http://www.webword.com/reports/period.html

basi_lio · November 30, 2005, 7:28pm

Whatever the original reason for double spaces at the end of a
sentence, the practice still continues. In fact, MS Word has an option
in its grammar checker to enforce one or two spaces at the end of a
sentence. For a lot of people (like me), it is nothing more than an old
habit that is hard to break.

The utility of this method for determining the end of a sentence depends
entirely on the purpose of the program. If I were to write a routine to
parse text that I wrote, it would probably work pretty well, and it
would save me several hours of work trying to implement a fancier, more
robust routine.

The same routine would probably fail horribly for other users or a more
generic corpus of text.

As a general rule, I like to use algorithms that are as simple as
possible for the job. That, of course, depends a lot on what the job is.

Funny, I never thought something like spacing between sentences would be
so controversial. I can almost envision _why making an esoteric remark
about the beauty of ‘negative space’ in text files.

_Kevin

basi_lio · November 30, 2005, 7:36pm

Austin Z. wrote:

That, in fact, is a very bad metric to follow, as the proper spacing
Then, quite honestly, you were taught wrong. I was taught to use
double spaces with a typewriter or when using fixed-pitch fonts
(although that was later, since most computers and printers didn’t
have reliable kerning routines until I was out of university).
Ultimately, the use of double spaces after a period is wrong even
with fixed-pitch fonts, but it was done to be clearer since the width
of the em-space and an en-space on a typewriter with a Courier-like
font is exactly the same. The two spaces simulates an em-space in a
typeset piece of work. (And that is fact, not opinion.)

The Bedford Handbook, which has been my bible for writing conventions
through the past ten years, lists two sets of guidelines: Those
recommended by the Modern Language Association (MLA), and those
recommended by the American Psychological Association (APA). It says
that the MLA style is typically taught in English classes, but that the
APA style is common in the social sciences. Here is the explanation of
the MLA guidelines, from page 633 of the Bedford Handbook for Writers,
© 1994:

MLA Guidelines [for essays]:

In typing the text of the essay, leave one space after words, commas,
colons, and semicolons and between the dots in ellipsis marks. Leave
two spaces after periods, question marks, and exclamation points.
To form a dash, type two hyphens with no space between them. Do not
put a space on either side of a dash.

The Handbook goes on to say (p. 635):

Although the APA guidelines call for one space after all punctuation,
most college professors prefer two spaces at the end of a sentence. Use
one space after all other punctuation.
Although two spaces are used after a period that ends a sentence, use
only one space after a period that follows a person’s initial (B.F.
Skinner).
To form a dash, type two hyphens with no space between them. Do not
put a space on either side of a dash.

The Handbook itself uses only single spaces at the ends of sentences.
Still, I hardly think there is one conclusively “right” or “wrong”
convention. Until I am convinced otherwise, I will continue to use two
spaces to separate sentences. This makes sentences easier to lex with
regular expressions, and makes them stand out to text editors and human
readers.
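
A minimal Python sketch of the kind of lexing described above, under the assumption that the author always types two spaces between sentences and one space everywhere else. The function name `split_double_spaced` and the sample text are made up for illustration; this is not anyone's actual routine from the thread.

```python
import re

# Break only where '.', '?' or '!' is followed by two or more spaces.
# Assumption: the text was typed with the two-space convention throughout.
SENTENCE_GAP = re.compile(r"(?<=[.?!]) {2,}")

def split_double_spaced(text):
    """Return the sentences of a consistently double-spaced text."""
    return [part.strip() for part in SENTENCE_GAP.split(text) if part.strip()]

sample = "I met Dr. Smith today.  He was late!  Was B.F. Skinner there?"
for sentence in split_double_spaced(sample):
    print(sentence)
```

Because "Dr." and "B.F." are followed by a single space, they do not trigger a break. The same pattern finds no breaks at all in single-spaced text, and it ignores sentence ends that fall on a line break or sit inside closing quotation marks.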

basi_lio · November 30, 2005, 8:13pm

On 11/30/05, Jeffrey S. wrote:

opinion.)
Before we go much further, I have not used either MLA or APA guidelines
since I left university about ten years ago. However, I used both in
University and have since learned a lot more about typesetting and
layout and all that (and with PDF::Writer, have learned even more). My
degree was in English, not in Computer Science.

The Bedford Handbook, which has been my bible for writing conventions
through the past ten years, lists two sets of guidelines: Those
recommended by the Modern Language Association (MLA), and those
recommended by the American Psychological Association (APA). It says
that the MLA style is typically taught in English classes, but that
the APA style is common in the social sciences. Here is the
explanation of the MLA guidelines, from page 633 of the Bedford
Handbook for Writers, (c) 1994:

Okay, but something like the Bedford Handbook tells you what something
is, not why something is. A lot of teachers and reference guides do
that; it’s good because it saves space. It’s bad because practices that
do not or should not apply are continued for reasons that no one quite
understands and are applied in circumstances outside of where the
practice was intended to apply.

MLA Guidelines [for essays]:
In typing the text of the essay, leave one space after words,
commas, colons, and semicolons and between the dots in ellipsis
marks. Leave two spaces after periods, question marks, and
exclamation points. To form a dash, type two hyphens with no space
between them. Do not put a space on either side of a dash.

The Handbook goes on to say (p. 635):
Although the APA guidelines call for one space after all
punctuation, most college professors prefer two spaces at the end of
a sentence. Use one space after all other punctuation.
Although two spaces are used after a period that ends a sentence,
use only one space after a period that follows a person’s initial
(B.F. Skinner). To form a dash, type two hyphens with no space
between them. Do not put a space on either side of a dash.

Yes. Note that this primarily focuses on academic writing. The rules
for academic writing are very interesting because you are being taught
the rules that most journals require for publication–which has nothing
to do with readability outside of that environment.

Note, however, that there is an important clue to the reason behind
the rule in the part that you quoted, and that the APA specifically
indicates “one space after all punctuation” and the Bedford overrides
that for professors. The clue, by the way, is that both guidelines
indicate that a dash (— or &mdash; or &#8212;) should be formed with two
hyphens. This is again because the typical hyphen is approximately the
same size as an en-dash in a proportional font (and the en-dash may be
used for hyphens, although it is also used for dashes indicating ranges,
e.g., “1-5”) and two en-dashes are about the size of an em-dash–the
long dash you see from the HTML entities I pointed out above. Sentence
ending spaces are em-spaces–but there’s still only one of them.

The Handbook itself uses only single spaces at the ends of sentences.
Still, I hardly think there is one conclusively “right” or “wrong”
convention.

It depends on the purpose. In general writing, it is conclusively
wrong to use two spaces because it will mess up justified text and it
will sometimes generate more space than you want even if you aren’t
justifying your text. (In justified text, space is added to sentence
endings before it is added between words.) In writing for a publication,
it is conclusively wrong to do anything other than what the style
guide for the publication says.

It’s less problematic in email, where it is sometimes considered easier
to read if the correspondents are using non-HTML and/or fixed-pitch
fonts. It is definitely wrong to use two spaces after a sentence in Word
or OpenOffice, unless you are (as I noted) writing to a specific style
guide mandated by the person for whom you are writing.

-austin

basi_lio · November 30, 2005, 11:03pm

I think “right” or “wrong” are a tad strong for most of the cases
cited. But as a professional book designer and typographer, there’s
unquestionably “better” and “worse.”

For improved legibility, inter-sentence space should generally be a bit
greater than inter-word space.

Typewriters only had one distance they could travel. Either 1/10th of
an inch (“Pica”) or 1/12th (“Elite”). So the only way to add extra
space after a sentence was to double it. That’s way too much extra
space, but it was generally better than the alternative. The real
problem was that the words were too far apart, not that the sentences
were too close, but again, the fixed spacing was already an abominable
situation.

Proportional type, dating all the way back to Gutenberg, would
generally use 1/3rd or 1/4th of the height of the type as the
inter-word spacing. This would usually work out to about the width of a
lower case “t” or “l”.

When setting modern (by which you may also read “all type before
typewriters” as well) proportional type in fully justified form (left
and right margins both even), the spaces must be stretched out on a
line-by-line basis to fit. Really good typesetting programs (and really
good typesetters sticking little bits of lead between their words (and
I’ve done that, too)) will add more of the space between sentences than
between words, so as the line stretches, the inter-word space to
inter-sentence space ratio actually changes. (Take a look at a narrow
newspaper column sometime.)

More sophisticated approaches to space will ignore a user’s attempt to
sprinkle extraneous space in. Less sophisticated ones might allow it,
and even treat them as individual spaces, stretching both of them
during expansion. {shudder}

The fact that both the MLA Guidelines and the Bedford Handbook
encourage poor typography is regrettable. (“If you cannot type
appropriate punctuation, e.g. an em-dash or en-dash, please use
appropriate substitutions. For both dashes, substitute a pair of
hyphens, which, like true dashes, are typed without adjacent spaces.”
There’s still software out there that will happily wrap a line between
the two hyphens. Ick!) Nevertheless, if you’re submitting a paper to an
institution that expects or requires that, then to not follow them is
wrong, even if the legibility of the submission is better.

What it all boils down to is “Putting two spaces after a period at the
end of a sentence is an artifact left over from the days when the
typewriter was the prevalent text-making tool. Unless you have a
specific reason or requirement to do otherwise, it’s preferable to put
only one space between sentences.”

For breaking text into sentences, sometimes I find it easier to work
backwards. Also, only very colloquial writing will have a one-word
sentence, so you can solve all “Mr./Dr./Ph.D.” cases by the fact that
if a word starts with a cap and ends with a period, it’s not a
sentence. For a more sophisticated approach that’s still not too
complex to program, check the final word of a sentence against a
dictionary. If it’s found there without a final dot, then you’re almost
certainly looking at the end of a sentence. If it isn’t, then is it
found anywhere else in the document without a dot? If not, then you’re
probably looking at an abbreviation. (My mail program uses a monospaced
font. If I thought most readers would read it with a proportional font,
I’d have typed “Ph. D.” above, since it should have a thin space before
the D.)
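
A rough Python sketch of the dictionary check described above, offered as one possible reading of it. The tiny `dictionary` set stands in for a real word list, and the helper names (`undotted_words`, `looks_like_sentence_end`) are invented for this example.

```python
import re

def undotted_words(text):
    """Lower-cased words that appear in the text with no period right after them."""
    return {m.group(0).lower() for m in re.finditer(r"\b[A-Za-z]+\b(?!\.)", text)}

def looks_like_sentence_end(token, dictionary, seen_without_dot):
    """Guess whether a token ending in '.' closes a sentence.

    True if the bare word is in the dictionary or occurs elsewhere in the
    document without a dot; otherwise it is treated as an abbreviation.
    """
    word = token.rstrip(".").lower()
    return word in dictionary or word in seen_without_dot

# Hypothetical sample data, not taken from the thread.
text = "Dr. Kowalski arrived. Nobody expected Kowalski."
dictionary = {"arrived", "nobody", "expected"}
seen = undotted_words(text)

for token in ["Dr.", "arrived.", "Kowalski."]:
    print(token, looks_like_sentence_end(token, dictionary, seen))
```

Here "Dr." is never seen without its dot and is not a dictionary word, so it is classed as an abbreviation, while "Kowalski." is accepted as a sentence end because "Kowalski" also occurs undotted. A name that appears only once, at the end of a sentence, would be misclassified by this sketch.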

basi_lio · November 30, 2005, 9:50pm

Jeffrey S. wrote:

The Handbook itself uses only single spaces at the ends of sentences.
Still, I hardly think there is one conclusively “right” or “wrong”
convention. Until I am convinced otherwise, I will continue to use two
spaces to separate sentences. This makes sentences easier to lex with
regular expressions, and makes them stand out to text editors and human
readers.

“Right” or “wrong” in this kind of styling has to do with whether
something is right or wrong according to a particular convention.

The normal convention for professional typography is to use one space
between sentences, whether you are convinced or not, whether using hard
type, a professional typesetting program, a desktop publishing program,
or a word processing program.

The older typewriter conventions are still often requested for
manuscripts for academic essays and manuscripts for submission to
publishing houses. These conventions also require underlining rather
than italics, use of double-hyphen for a dash rather than the specific
dash character, and so forth. But should this same manuscript be
professionally printed, even if the text is actually to be set by a
word processor, it would almost certainly be edited first to convert it
to typographical standard: changing all double-spaces to single spaces,
all occurrences of double-hyphen to em-dash or en-dash, using fancy
quotation marks instead of possible straight typewriter quotation
marks, italics instead of underlining, and so forth.
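
As a small illustration of that editing pass, here is a Python sketch covering only two of the conversions listed: double spaces to single spaces and double hyphens to an em dash. The function name `manuscript_to_typeset` is hypothetical, and the choice of an em dash (rather than an en dash) for every double hyphen is a simplifying assumption.

```python
import re

EM_DASH = "\u2014"

def manuscript_to_typeset(text):
    """Apply two typewriter-to-typography conversions to manuscript text."""
    # Collapse runs of spaces that follow a non-space character, so any
    # leading indentation on a line is left alone.
    text = re.sub(r"(?<=\S) {2,}", " ", text)
    # Assumption: every double hyphen stands for an em dash.
    return text.replace("--", EM_DASH)

print(manuscript_to_typeset("One sentence.  Another--with a dash."))
```

Quotation marks and underlining need context to convert correctly, so they are left out of this sketch.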

Note that HTML has from the beginning automatically changed any
multiple runs of spaces into a single space when displaying text.

Yes, a convention of always using two spaces would make sentences
easier to lex with regular expressions. Similarly, enforcing one single
spelling of English throughout the world would make searches and
matches easier. However, it is philosophically unsound to ask that the
world change to fit particular data-processing routines, rather than
that data-processing routines be built to deal properly with
real-world situations.

If your lexing routine fails because many people don’t end
non-paragraph-final sentences with double spaces, or do so only in
particular plain text files, it is the fault of your lexing routine for
failing to handle common formatting, unless your lexing is intended
to be a limited tool that works only with manuscript-formatted text.

The best general sentence lexing algorithm I’ve seen is the one set
forth by the Unicode Consortium in
UAX #29: Text Boundaries.
This is designed to work reasonably well in any language and writing
system supported by Unicode, not just in English.

Jallan

basi_lio · December 1, 2005, 12:36am

On Nov 30, 2005, at 22:02, Dave H. wrote:

you can solve all “Mr./Dr./Ph.D.” cases by the fact that if a word
starts with a cap and ends with a period, it’s not a sentence.

I’m not sure that’s a very good rule, Dave. There are two sentences
here.

The above rule may catch titular abbreviations, but over-generalises
to produce a false negative in the above example. So in solving one
problem, you introduce another one. It’s relatively easy to make
another rule to catch the problem in this case, but it would probably
have been simpler to just make a specific rule to eliminate titular
abbreviations, since there really aren’t that many of them.
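
A short Python sketch of what such a specific rule might look like: break after sentence punctuation followed by a capital letter, unless the word just before the break is a known title. The `TITLES` set is illustrative and deliberately incomplete, and the function name is made up.

```python
import re

# Illustrative list only; a real one would need more entries.
TITLES = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}

def split_sentences(text):
    """Split at '.', '?' or '!' + whitespace + capital, skipping known titles."""
    sentences, start = [], 0
    for match in re.finditer(r"[.?!]\s+(?=[A-Z])", text):
        candidate = text[start:match.start() + 1]
        words = candidate.split()
        if words and words[-1] in TITLES:
            continue  # "Dr. Smith": not a sentence boundary
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("I'm not sure that's a very good rule, Dave. There are two sentences here."))
print(split_sentences("Please ask Dr. Smith. He knows."))
```

This handles the "Dave. There" example correctly, but it will wrongly refuse to break when a title really does end a sentence.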

matthew smillie.

basi_lio · December 1, 2005, 12:44am

Dave H. wrote:

it was generally better than the alternative. The real problem was that
line-by-line basis to fit. Really good typesetting programs (and really

What it all boils down to is "Putting two spaces after a period at the
a word starts with a cap and ends with a period, it’s not a sentence.
For a more sophisticated approach that’s still not too complex to
program, check the final word of a sentence against a dictionary. If
it’s found there without a final dot, then you’re almost certainly
looking at the end of a sentence. If it isn’t, then is it found anywhere
else in the document without a dot? If not, then you’re probably looking
at an abbreviation. (My mail program uses a monospaced font. If I
thought most readers would read it with a proportional font, I’d have
typed “Ph. D.” above, since it should have a thin space before the D.)

This is what I love about Usenet.

basi_lio · December 1, 2005, 1:09am

On 11/30/05, Gavin S. wrote:

width of the em-space and an en-space on a typewriter with a
Courier-like font is exactly the same. The two spaces simulates an
em-space in a typeset piece of work. (And that is fact, not
opinion.)
What rot. How can anything like that be a fact? You’re regurgitating
the opinion of a style manual.

Um. No, I’m stating fact. This isn’t mere opinion: two spaces were done
to simulate em-spaces in fixed pitch environments. That’s a fact. The
reason for that may often be forgotten, but it remains a fact. Please
remember that I’ve done quite a bit of typesetting-style work in the
last year with PDF::Writer and I have to know a bit more about this than
most folks, and it’s something of a hobby of mine in any case to know
about printing mechanisms.

The only opinion I stated was that the first poster in the chain above
(I think Jeffrey) was taught wrongly. I maintain that as true
regardless, because if he was taught two spaces without the reason why,
then there’s a practice being repeated for no good reason.

The practice is nonsense these days in most contexts.

-austin

basi_lio · December 1, 2005, 1:09am

Hello.

Dave H.:

For improved legibility, inter-sentence space should
generally be a bit greater than inter-word space.

It’s worth noting that actually turning this theory into reality seems
to apply to ‘Western’ (American, British, others?) typography (mostly?
only?).

I’ve yet to see a typical modern Polish book typeset with greater
inter-sentence spaces. Also (and, I guess, as a result of this),
I doubt I ever saw any Polish email or Usenet post with two
inter-sentence spaces, and I remember how happy I was to find
out about the ‘joinspaces’ vim option that finally let me reflow
paragraphs properly, without doing a s/  / /g on them afterwards. :o)

Cheers,
– Shot

basi_lio · December 1, 2005, 12:28am

Austin Z. wrote:

of the em-space and an en-space on a typewriter with a Courier-like
font is exactly the same. The two spaces simulates an em-space in a
typeset piece of work. (And that is fact, not opinion.)

What rot. How can anything like that be a fact? You’re regurgitating
the opinion of a style manual.

Gavin

basi_lio · December 1, 2005, 1:53am

Austin Z. wrote:

On 11/30/05, Gavin S. wrote:

[…] The two spaces simulates an
em-space in a typeset piece of work. (And that is fact, not
opinion.)

What rot. How can anything like that be a fact? You’re regurgitating
the opinion of a style manual.

Um. No, I’m stating fact. This isn’t mere opinion: two spaces were done
to simulate em-spaces in fixed pitch environments. That’s a fact. […]

Fair enough.

Gavin

basi_lio · December 1, 2005, 1:37am

On Nov 30, 2005, at 15:35, Matthew S. wrote:

On Nov 30, 2005, at 22:02, Dave H. wrote:

you can solve all “Mr./Dr./Ph.D.” cases by the fact that if a word
starts with a cap and ends with a period, it’s not a sentence.

I’m not sure that’s a very good rule, Dave. There are two sentences
here.

The above rule may catch titular abbreviations, but over-generalises
to produce a false negative in the above example.

I hadn’t intended to provide a single magical rule that was perfect in
isolation, after all. {chuckle}

“Ph. D.” is not a sentence. But where do you break
My name is Dave, Ph. D. Pleased to meet you.
vs.
You need my Ph. D. friend Dave to help you.

I don’t think having a list of abbreviations and titles will improve
that situation much, although it’s a lot more work and almost certain
to be incomplete. Any/every rule will have failures; avoiding them is
what takes you into that whole natural language high-octane engine
situation.

However, if you also use the other “rule” I mentioned, then you don’t
have a problem. “Dave H.” appears just a couple lines earlier,
establishing “Dave” as a word that doesn’t require a period. Therefore,
it’s more likely to be at the end of a sentence. The following word
(“There”) can be found in a dictionary, and in a non-capitalized form,
which means that its capitalization here following a dot strongly
indicates that it’s beginning a sentence.

The capital “P” of “Ph.” is not preceded by a period either time, so
it’s not starting a sentence. After it, “friend” isn’t capitalized, so
it’s not ending a sentence. But “Pleased” is, and dictionary says “not
normally capitalized” so that’s probably a sentence break.
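
The capitalization test walked through above can be sketched in Python roughly as follows. The `lowercase_words` set is a stand-in for a real dictionary of words that are normally written in lower case, and the function name is invented for this example.

```python
def split_on_capitalized_dictionary_word(text, lowercase_words):
    """Break after a dotted word when the next word is capitalized here
    but is listed in the dictionary in lower-case form."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        following = tokens[i + 1].strip(".,;:!?\"'") if i + 1 < len(tokens) else ""
        if (token.endswith(".")
                and following[:1].isupper()
                and following.lower() in lowercase_words):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

# Hypothetical mini-dictionary of normally lower-case words.
lowercase_words = {"pleased", "friend", "there", "to", "you", "my", "need"}

print(split_on_capitalized_dictionary_word(
    "My name is Dave, Ph. D. Pleased to meet you.", lowercase_words))
print(split_on_capitalized_dictionary_word(
    "You need my Ph. D. friend Dave to help you.", lowercase_words))
```

The first text is split before "Pleased" and the second is left whole. A sentence that begins with a proper noun would be missed by this test alone, which is why it is only one rule among several.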

basi_lio · December 1, 2005, 3:22am

All was well with this strategy, until I hit sentences similar to:

The abbreviation for Mister is Mr.
The head office is in New York, N.Y.

In other words, abbreviations that end a sentence. These sentences
don’t end with a double dot, so if we replace Mr. with $MISTER$, the
sentence has no end marker.
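
To make the failure concrete, here is a Python sketch of a placeholder strategy of this general shape. It is a guess at the approach, not the actual code: the `PLACEHOLDERS` mapping, the `$NY$` token and the function name are all assumptions made for the example.

```python
import re

# Hypothetical mapping from abbreviation to placeholder token.
PLACEHOLDERS = {"Mr.": "$MISTER$", "N.Y.": "$NY$"}

def split_with_placeholders(text):
    """Hide abbreviations, split after '.', '?' or '!', then restore them."""
    for abbreviation, token in PLACEHOLDERS.items():
        text = text.replace(abbreviation, token)
    parts = [p.strip() for p in re.split(r"(?<=[.?!])\s+", text) if p.strip()]
    restored = []
    for part in parts:
        for abbreviation, token in PLACEHOLDERS.items():
            part = part.replace(token, abbreviation)
        restored.append(part)
    return restored

# Works when the abbreviation is in the middle of a sentence.
print(split_with_placeholders("Mr. Jones called. He left."))

# Fails when the abbreviation's period is also the sentence's period.
print(split_with_placeholders(
    "The abbreviation for Mister is Mr. The head office is in New York, N.Y."))
```

In the second call the only periods in the text belong to the abbreviations, so once they are hidden nothing is left to split on and both sentences come back as one.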

Hmmm.
basi

basi_lio · December 1, 2005, 2:50am

On Dec 1, 2005, at 0:36, Dave H. wrote:

The above rule may catch titular abbreviations, but over-
generalises to produce a false negative in the above example.

I hadn’t intended to provide a single magical rule that was perfect
in isolation, after all. {chuckle}

Didn’t assume you were! It was just a good example to use for a
“this can be harder than it looks” couple of lines of warning, since
it’s been my experience that people don’t anticipate false negatives
as well as they do false positives.

matthew smillie.

basi_lio · December 3, 2005, 11:23pm

    He has never been known to use a word that might send a reader
    to the dictionary.     -- William Faulkner on Ernest Hemingway

Now, that is a wise one - it actually helps

to comprehend my jabber in the other post O
spontaneously generated today…

–
I am the One. I am A vampire A-calling for your love! A.A!
I am the fire that burns within your blood. I am the One!!
No bars or chains can keep me from your bed! I am the One!
Nothing on earth can get me from your head! I am the One!!

basi_lio · December 3, 2005, 9:38pm

On 2005-12-01 00:36:53 +0000, Dave H. said:

The above rule may catch titular abbreviations, but over-generalises to
produce a false negative in the above example.

I hadn’t intended to provide a single magical rule that was perfect in
isolation, after all. {chuckle}

Want some magick? You are stuck in wrong coordinate

system, like Newton. Stop thinking in terms of words and syntax
rules governing how to put them in correct order. Think
links (alinka). Think relations and revelation.
Words (symbols) have no meaning. None. They are empty.
If you want to infiltrate enemy organization the most
effective method is not drilling into individual agents,
but monitoring their communications (that is, relations).
If you acquire enough of those relations (and recursively,
but set some boundary unless you are Goddess and can
do anything you fancy) you don’t even need to decrypt the
messages, unless you are bored. To destroy enemy
organization, mess with the relations. Agents (symbols,
words, punctuation marks…) are of no importance
whatsowherever. That is why a person, if immersed enough
in an alien language needs no dictionary day-to-day - if one does
need to check, it’s not the meaning you are after -
it’s definition, that is MORE SYMBOLS, so you can
augment MORE RELATIONS from unfamiliar context (SYMBOL
CLOUD, think quantum mechanics and particles) until
you actually GET the pointer to “meaning” and can call on it
(how to relate that
symbol to some other symbol mesh, you can still have
no idea what the hell fermion “means”, but you can use it and
fail to be misunderstood unless you want to).

I have no idea how many "syntax errors" there are

in above paragraph - for the reason sublime, my total
lack of knowledge aboot rules of grammar for the
language used to convey meaning heretofore. HTH.

P.S. It makes me wonder, what 't bony "heretofore" word

“means” right now to you, Reader. Compose witty remarks if
it’s a-kind funny miss-take, I enjoy my Self when people
smirk. Yes, I did stick-in a word possessing none of it’s
meaning in my poor head. I must be mad? Or contrary-wise.
I’m not sure, to be frank with you a-like Frank
Herbert iff there was such word in usage “then”. She
will compensate for that - any dictionary dug
up shall (she can’t help it) explain in detail or else - she always
does
that when I go at a genuine miracle in open source. It’s
the game we play. I need some time, we make
a beautiful team… Prop me up with another
pill! A-musing…

–
I am the One. I am A vampire A-calling for your love! A.A!
I am the fire that burns within your blood. I am the One!!
No bars or chains can keep me from your bed! I am the One!
Nothing on earth can get me from your head! I am the One!!