Splitting a text file into sentences

basi_lio · November 30, 2005, 12:51am

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] – they’re also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?
Thanks for the help.
basi

basi_lio · November 30, 2005, 3:08am

On Nov 29, 2005, at 23:49, basi wrote:

Doing really, really good sentence boundary detection is an on-going
problem in natural language processing. I’m not aware of any Ruby-
based NLP packages, but if you want better accuracy than just using
[.!?:] there are several free NLP packages around (NLTK in Python,
and Stanford’s Java NLP package spring to mind) that might help you.
A googling of “sentence tokenization” may also yield some help.

If that sounds like overkill, then you can get accuracy “good enough
for government work” by making a list of regular expressions to catch
exceptions to the punctuation rule. These will necessarily vary a
little depending on your source text, but typical examples are
catching titles like “Mr.”, “Mrs.”, “Dr.”, and all-caps abbreviations
like “U.S.A.” or “M.D.” (something like this: /[A-Z]\.([A-Z]\.)+/).
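
A minimal Ruby sketch of such an exception list might look like the following. The constant names, the helper find_exceptions, and the sample sentence are assumptions made for illustration; the real list would need tuning for your own source text.

```ruby
# Hypothetical exception patterns; extend them for your own source text.
TITLE   = /\b(?:Mr|Mrs|Dr)\./        # titles such as "Mr.", "Mrs.", "Dr."
ACRONYM = /\b[A-Z]\.(?:[A-Z]\.)+/    # all-caps abbreviations such as "U.S.A."
EXCEPTIONS = Regexp.union(TITLE, ACRONYM)

# Return every piece of text whose periods should NOT be treated as
# sentence boundaries.
def find_exceptions(text)
  text.scan(EXCEPTIONS)
end

sample = "Dr. Smith, M.D., moved to the U.S.A. in 1999. Mrs. Jones stayed."
p find_exceptions(sample)
```

Anything this returns can then be skipped or protected before splitting on [.!?].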

good luck,
matthew smillie.

Matthew S.
Institute for Communicating and Collaborative Systems
University of Edinburgh

basi_lio · November 30, 2005, 3:40am

Depending on the text you might be able to search for a period (or other
punctuation) followed by two spaces. It’s not robust, but if you know that
convention will be followed by the authors, then it can work.
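
As a rough sketch in Ruby, and assuming the text really does use two spaces after every sentence, the split could look like this (the method name split_on_double_space is made up for the example):

```ruby
# Split after ., ! or ? only when it is followed by two or more spaces.
# Assumes the author consistently typed two spaces between sentences.
def split_on_double_space(text)
  # The lookbehind keeps the punctuation attached to its sentence.
  text.split(/(?<=[.!?]) {2,}/)
end

p split_on_double_space("Mr. Smith arrived.  He sat down!  Was he late?")
```

Because "Mr." is followed by a single space, it is not treated as a boundary here.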

_Kevin

basi_lio · November 30, 2005, 3:48am

On 11/29/05, basi wrote:

I dimly recall something on this list about 9 months ago or so.

Nick

basi_lio · November 30, 2005, 3:53am

On 11/29/05, Nicholas Van W. wrote:

I dimly recall something on this list about 9 months ago or so.

Nick

Nicholas Van W.

http://www.pressure.to/ruby/ is the reference I found in an old email
thread on this list.

Nick

basi_lio · November 30, 2005, 4:41am

basi wrote:

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] – they’re also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?

It’s a common convention to separate sentences by double spaces. I
started following this convention because Emacs expected it, and now I
use it always.

basi_lio · November 30, 2005, 5:33am

Hi,
I will google. Thanks!
basi

basi_lio · November 30, 2005, 5:29am

Hi,
I have looked at NLTK in Python (and had been hoping a Rubyist would
rewrite it in Ruby). I will go back to NLTK and see if it has a
split-sentence algorithm of sorts. And thanks for the tip on Stanford’s
Java NLP package. Yes, those abbreviations are pesky, and I may have to
resort to an exceptions list containing the most common ones.
Thanks much,
basi

basi_lio · November 30, 2005, 5:37am

Yes, I learned this convention when I took a keyboarding (i.e., typing)
lesson in high school. Some time ago, a style manual for word processing
appeared, and one piece of its advice was to use only one space to separate
sentences. The reason given is that in a justified format, those two
spaces can become four spaces, or even more. Anyway, a lot of text now
has one or two spaces between sentences, so this wouldn’t be a
reliable indicator of a sentence boundary.
Cheers!
basi

basi_lio · November 30, 2005, 5:45am

Hi,
This looks promising. I’m downloading as I write.
Thanks!
basi

basi_lio · November 30, 2005, 5:58am

On 11/29/05, basi wrote:

Yes, I learned this convention when I took a keyboarding (i.e., typing)
lesson in high school. Some time ago, a style manual for word processing
appeared, and one piece of its advice was to use only one space to separate
sentences. The reason given is that in a justified format, those two
spaces can become four spaces, or even more. Anyway, a lot of text now
has one or two spaces between sentences, so this wouldn’t be a
reliable indicator of a sentence boundary.

I too learned the two-spaces-after-a-period convention years ago and
also recently learned that with modern fonts and word processors it is
not necessary. It was tricky to retrain myself, but I did, and have
been using just one space ever since.

So like you say, that isn’t a reliable way to discern sentences.

I would recommend following the advice of first filtering out false
positives (possibly even replacing them with temporary markers, Mr.
becomes $MISTER$ or similar), then splitting on punctuation. If you
then test on various sample texts you should be able to find more
false positives that you might have missed.
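
A minimal Ruby sketch of this mask, split, and restore idea follows. The MARKERS table, its marker strings, and the method name split_with_markers are hypothetical; the list of abbreviations is only a starting point and would grow as you test on sample texts.

```ruby
# Hypothetical placeholder table: each known false positive maps to a
# temporary marker that contains no sentence punctuation.
MARKERS = {
  'Mr.'  => '$MISTER$',
  'Mrs.' => '$MISSUS$',
  'Dr.'  => '$DOCTOR$',
  'e.g.' => '$EG$',
  'i.e.' => '$IE$',
}.freeze

def split_with_markers(text)
  # 1. Replace each known abbreviation with its marker.
  masked = MARKERS.reduce(text) { |t, (abbr, marker)| t.gsub(abbr, marker) }
  # 2. Split after ., ! or ? followed by whitespace.
  sentences = masked.split(/(?<=[.!?])\s+/)
  # 3. Put the original abbreviations back into each sentence.
  sentences.map do |s|
    MARKERS.reduce(s) { |t, (abbr, marker)| t.gsub(marker, abbr) }
  end
end

p split_with_markers("Mr. Smith met Dr. Jones. They talked, e.g. about Ruby. Was it fun?")
```

Note that the plain string match is naive: it would also mask "Dr." at the end of a longer word, and the markers must not occur in the original text.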

Ryan

basi_lio · November 30, 2005, 6:34am

Hi,
This just might be easier than what I have in mind. I will try this
first.
Thanks!
basi

basi_lio · November 30, 2005, 9:44am

Ryan L. wrote:

becomes $MISTER$ or similar), then splitting on punctuation. If you
then test on various sample texts you should be able to find more
false positives that you might have missed.

Which will not help you at all with foreign languages. And don’t forget
to put i.e., e.g. or etc. in the list.
This is an ongoing problem (think about the auto-correction ‘feature’ of
capitalizing the first letter of every sentence in OpenOffice or Word -
something I always turn off because it is so insistent when it’s wrong).
Cheers,
V.-
–
http://www.braveworld.net/riva

basi_lio · November 30, 2005, 11:41am

basi_lio wrote:

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] – they’re also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?
Thanks for the help.
basi

If you make a regexp like [.!?]\s+[A-Z] you will already capture most
boundaries. Most abbreviations normally aren’t followed by a space and a
capital letter.

One exception to this rule that I can think of is Mr. Name, Mrs. Name. But
as you can see, these have a capital letter followed by only one or two
lowercase letters. Most sentences would have at least five non-uppercase
characters in front of the <.> ->
[A-Z]\w\w?\w?\w?\.
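
A small Ruby sketch of the first rule, with the punctuation and the capital letter wrapped in lookarounds so that only the whitespace is consumed by the split (the names BOUNDARY and split_on_capital are assumptions for the example):

```ruby
# Split where ., ! or ? is followed by whitespace and then a capital letter.
# The lookbehind and lookahead keep the punctuation and the capital in
# the resulting sentences; only the whitespace between them is removed.
BOUNDARY = /(?<=[.!?])\s+(?=[A-Z])/

def split_on_capital(text)
  text.split(BOUNDARY)
end

p split_on_capital("It costs approx. five dollars. Is that OK? Ask Mr. Smith.")
```

In this sample "approx. five" stays together because a lowercase letter follows, but "Mr. Smith" is still split in two, which is exactly the title case that needs the extra check on short capitalized words.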

basi_lio · November 30, 2005, 1:55pm

On 11/29/05, Kevin O. wrote:

Depending on the text you might be able to search for a period (or other
punctuation) followed by two spaces. It’s not robust, but if you know that
convention will be followed by the authors, then it can work.

That, in fact, is a very bad metric to follow, as the proper spacing
after sentence punctuation is a single space. The only reason that two
spaces were used in the past is that the space used between sentence
endings in typeset work is a little wider than that used between words
(an em-space vs. an en-space).

-austin

basi_lio · November 30, 2005, 2:36pm

On 11/29/05, Jeffrey S. wrote:

basi wrote:

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] – they’re also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?

It’s a common convention to separate sentences by double spaces. I
started following this convention because Emacs expected it, and now I
use it always.

As I noted above, this is an improper convention outside of the
typewriter realm. If you are using anything other than a fixed-pitch
font for display or print, you should never use two spaces.

-austin

basi_lio · November 30, 2005, 4:58pm

Austin Z. writes:

As I noted above, this is an improper convention outside of the
typewriter realm. If you are using anything other than a fixed-pitch
font for display or print, you should never use two spaces.

Alternatively, use text processing systems that do the “right thing”;
i.e. transform two spaces into one (e.g. TeX, HTML-based products).
There is no good reason a text processor should show two spaces after
each other in print.

basi_lio · November 30, 2005, 1:55pm

On 11/29/05, basi wrote:

I have looked at NLTK in Python (and had been hoping a Rubyist would
rewrite it in Ruby). I will go back to NLTK and see if it has a
split-sentence algorithm of sorts. And thanks for the tip on Stanford’s
Java NLP package. Yes, those abbreviations are pesky, and I may have to
resort to an exceptions list containing the most common ones.

Look at Text::Format for some indication on how abbreviations could be
handled.

-austin

basi_lio · November 30, 2005, 6:11pm

On Nov 30, 2005, at 10:22 AM, Jeffrey S. wrote:

two spaces was used in the past is the space used between sentence
endings in typeset work is a little wider than that used between words (an
em-space vs. an en-space).

Not true at all. I was always taught to use double spaces after
sentences in grade-school homework assignments done on plain word
processors or typewriters.

Many of us were and I’ll admit that I can’t shake the habit. I still
know it’s wrong though.

James Edward G. II

basi_lio · November 30, 2005, 6:40pm

On 11/30/05, Jeffrey S. wrote:

Not true at all. I was always taught to use double spaces after
sentences in grade-school homework assignments done on plain word
processors or typewriters.

Then, quite honestly, you were taught wrong. I was taught to use
double spaces with a typewriter or when using fixed-pitch fonts
(although that was later, since most computers and printers didn’t
have reliable kerning routines until I was out of university).
Ultimately, the use of double spaces after a period is wrong even
with fixed-pitch fonts, but it was done to be clearer since the width
of the em-space and an en-space on a typewriter with a Courier-like
font is exactly the same. The two spaces simulate an em-space in a
typeset piece of work. (And that is fact, not opinion.)

-austin