Splitting a text file into sentences


#1

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] – they’re also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?
Thanks for the help.
basi


#2

On Nov 29, 2005, at 23:49, basi wrote:

Doing really, really good sentence boundary detection is an on-going
problem in natural language processing. I’m not aware of any Ruby-
based NLP packages, but if you want better accuracy than just using
[.!?:] there are several free NLP packages around (NLTK in Python,
and Stanford’s Java NLP package spring to mind) that might help you.
A googling of “sentence tokenization” may also yield some help.

If that sounds like overkill, then you can get accuracy “good enough
for government work” by making a list of regular expressions to catch
exceptions to the punctuation rule. These will necessarily vary a
little depending on your source text, but typical examples are
catching titles like “Mr.”, “Mrs.”, “Dr.”, and all-caps abbreviations
like “U.S.A.” or “M.D.” (something like this: /[A-Z]\.([A-Z]\.)+/).
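
As a rough illustration of such an exception list, here is a minimal Ruby sketch. The constant names, the method name find_exceptions, and the particular titles listed are assumptions made for this example; a real list would need tuning against the source text.

```ruby
# Hypothetical exception patterns; extend them for your own source text.
TITLE_PATTERN      = /\b(?:Mr|Mrs|Ms|Dr|Prof)\./  # titles such as "Mr." or "Dr."
INITIALISM_PATTERN = /\b(?:[A-Z]\.){2,}/          # "U.S.A.", "M.D."

# One combined pattern that matches either kind of exception.
EXCEPTION_PATTERN = Regexp.union(TITLE_PATTERN, INITIALISM_PATTERN)

# Returns every substring whose periods should not be read as sentence ends.
def find_exceptions(text)
  text.scan(EXCEPTION_PATTERN)
end

p find_exceptions("Dr. Smith moved to the U.S.A. in 1999. He is an M.D. now.")
# => ["Dr.", "U.S.A.", "M.D."]
```

Listing what the patterns catch on a sample of the real text is a cheap way to spot both missing exceptions and over-eager matches before any splitting is done.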

good luck,
matthew smillie.


Matthew S. removed_email_address@domain.invalid
Institute for Communicating and Collaborative Systems
University of Edinburgh


#3

Depending on the text you might be able to search for a period (or other
punctuation) followed by two spaces. It’s not robust, but if you know
that convention will be followed by the authors, then it can work.
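
A minimal sketch of that idea in Ruby, assuming the text really does use two or more spaces after every sentence. The method name split_on_double_space is made up for this example.

```ruby
# Split wherever sentence punctuation is followed by two or more spaces.
# The lookbehind keeps the punctuation attached to the sentence it ends.
def split_on_double_space(text)
  text.split(/(?<=[.!?]) {2,}/)
end

p split_on_double_space("It rained.  We stayed in.  Mr. Jones called!  Why?")
# => ["It rained.", "We stayed in.", "Mr. Jones called!", "Why?"]
```

Note that "Mr. Jones" survives only because a single space follows the period; text that has been reflowed or typed with single spacing will defeat this approach.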

_Kevin


#4

On 11/29/05, basi removed_email_address@domain.invalid wrote:

I dimly recall something on this list about 9 months ago or so.

Nick


#5

On 11/29/05, Nicholas Van W. removed_email_address@domain.invalid wrote:

I dimly recall something on this list about 9 months ago or so.

Nick

Nicholas Van W.

http://www.pressure.to/ruby/ is the reference I found in an old email
thread on this list.

Nick


#6

basi wrote:

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] – they’re also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?

It’s a common convention to separate sentences by double spaces. I
started following this convention because Emacs expected it, and now I
use it always.


#7

Hi,
I will google. Thanks!
basi


#8

Hi,
I have looked at NLTK in Python (and had been hoping a Rubyist would
rewrite it in Ruby). I will go back to NLTK and see if it has a
split-sentence algorithm of some sort. And thanks for the tip on Stanford’s
Java NLP package. Yes, those abbreviations are pesky, and I may have to
resort to an exceptions list containing the most common ones.
Thanks much,
basi


#9

Yes, I learned this convention when I took a keyboarding (i.e., typing)
lesson in high school. Some time ago, a style manual for word processing
appeared, and one piece of advice was to use only one space to separate
sentences. The reason given is that in a justified format, those two
spaces can become four spaces, or even more. Anyway, a lot of text now
has one or two spaces between sentences, and this wouldn’t be a
reliable indicator of sentence boundary.
Cheers!
basi


#10

Hi,
This looks promising. I’m downloading as I write.
Thanks!
basi


#11

On 11/29/05, basi removed_email_address@domain.invalid wrote:

Yes, I learned this convention when I took a keyboarding (i.e., typing)
lesson in high school. Some time ago, a style manual for word processing
appeared, and one piece of advice was to use only one space to separate
sentences. The reason given is that in a justified format, those two
spaces can become four spaces, or even more. Anyway, a lot of text now
has one or two spaces between sentences, and this wouldn’t be a
reliable indicator of sentence boundary.

I too learned the two-spaces-after-a-period convention years ago and
also recently learned that with modern fonts and word processors it is
not necessary. It was tricky to retrain myself, but I did, and have
been using just one space ever since.

So like you say, that isn’t a reliable way to discern sentences.

I would recommend following the advice of first filtering out false
positives (possibly even replacing them with temporary markers, Mr.
becomes $MISTER$ or similar), then splitting on punctuation. If you
then test on various sample texts you should be able to find more
false positives that you might have missed.
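
A minimal Ruby sketch of this mask-then-split approach follows. The abbreviation list, the "<DOT>" marker, and the method name split_sentences are all assumptions for illustration; the marker must be a string that never occurs in the input.

```ruby
# Hypothetical starter list; grow it as testing turns up more false positives.
KNOWN_ABBREVIATIONS = %w[Mr. Mrs. Ms. Dr. e.g. i.e.].freeze
DOT_MARKER = "<DOT>" # temporary stand-in for a protected period

def split_sentences(text)
  # Step 1: replace the periods inside known abbreviations with the marker.
  masked = KNOWN_ABBREVIATIONS.reduce(text) do |acc, abbr|
    acc.gsub(/\b#{Regexp.escape(abbr)}/) { |match| match.gsub(".", DOT_MARKER) }
  end

  # Step 2: split after sentence punctuation followed by whitespace,
  # then restore the protected periods in each sentence.
  masked.split(/(?<=[.!?])\s+/).map { |sentence| sentence.gsub(DOT_MARKER, ".") }
end

p split_sentences("Mr. Smith arrived. He met Dr. Jones, i.e. the surgeon. Was it late? Yes!")
# => ["Mr. Smith arrived.", "He met Dr. Jones, i.e. the surgeon.", "Was it late?", "Yes!"]
```

One known weakness of this sketch: an abbreviation that genuinely ends a sentence is masked too, so no split happens there. That is why "etc." is left out of the starter list.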

Ryan


#12

Hi,
This just might be easier than what I have in mind. I will try this
first.
Thanks!
basi


#13

Ryan L. wrote:

becomes $MISTER$ or similar), then splitting on punctuation. If you
then test on various sample texts you should be able to find more
false positives that you might have missed.

Which will not help you at all with foreign languages. And don’t forget
to put i.e., e.g. or etc. in the list.
This is an ongoing problem (think about the auto-correction ‘feature’ of
capitalizing the first letter of every sentence in OpenOffice or Word,
something I always turn off because it is so insistent when it’s wrong).
Cheers,
V.-

http://www.braveworld.net/riva


#14

basi_lio wrote:

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] – they’re also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?
Thanks for the help.
basi

If you make a regexp: [.!?]\s+[A-Z] you will already capture most. Most
abbreviations normally aren’t followed by a space and a capital letter.

One exception to this rule that I can think of is Mr. Name, Mrs. Name. But
as you can see, these have a capital followed by only one or two
lowercase letters. Most sentences would have at least five non-uppercase
letters in front of the period, so the short words can be caught with:
[A-Z]\w\w?\w?\w?\.
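
The two rules above can be combined roughly as follows. This is a sketch under stated assumptions: the method name heuristic_split is invented, and the cutoff of one or two lowercase letters follows the prose rather than the wider pattern given. Chunks are rejoined with a single space, so the original whitespace is not preserved.

```ruby
# Candidate boundary: punctuation, whitespace, then a capital letter.
# Lookarounds keep both the punctuation and the capital in the output.
BOUNDARY = /(?<=[.!?])\s+(?=[A-Z])/
# A capital plus one or two lowercase letters right before a final period,
# e.g. "Mr." or "Mrs." (assumed cutoff; adjust for your text).
SHORT_TITLE = /\b[A-Z][a-z]{1,2}\.\z/

def heuristic_split(text)
  sentences = []
  buffer = String.new
  text.split(BOUNDARY).each do |chunk|
    buffer << " " unless buffer.empty?
    buffer << chunk
    # A chunk ending in a short capitalized word is probably a title,
    # so keep accumulating instead of closing the sentence here.
    next if buffer =~ SHORT_TITLE
    sentences << buffer
    buffer = String.new
  end
  sentences << buffer unless buffer.empty?
  sentences
end

p heuristic_split("I met Mr. Smith today. He said hi to Mrs. Jones. It was e.g. fine.")
# => ["I met Mr. Smith today.", "He said hi to Mrs. Jones.", "It was e.g. fine."]
```

The heuristic misfires on short names or words that really do end a sentence ("I saw Al. He left."), so it is best treated as a first pass to be checked against sample text.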


#15

On 11/29/05, Kevin O. removed_email_address@domain.invalid wrote:

Depending on the text you might be able to search for a period (or other
punctuation) followed by two spaces. It’s not robust, but if you know that
convention will be followed by the authors, then it can work.

That, in fact, is a very bad metric to follow, as the proper spacing
after sentence punctuation is a single space. The only reason that two
spaces were used in the past is that the space between sentences
in typeset work is a little wider than that used between words (an
em-space vs. an en-space).

-austin


#16

On 11/29/05, Jeffrey S. removed_email_address@domain.invalid wrote:

basi wrote:

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] – they’re also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?

It’s a common convention to separate sentences by double spaces. I
started following this convention because Emacs expected it, and now I
use it always.

As I noted above, this is an improper convention outside of the
typewriter realm. If you are using anything other than a fixed-pitch
font for display or print, you should never use two spaces.

-austin


#17

Austin Z. removed_email_address@domain.invalid writes:

As I noted above, this is an improper convention outside of the
typewriter realm. If you are using anything other than a fixed-pitch
font for display or print, you should never use two spaces.

Alternatively, use text processing systems that do the “right thing”;
i.e. transform two spaces into one (e.g. TeX, HTML-based products).
There is no good reason a text processor should show two spaces after
each other in print.


#18

On 11/29/05, basi removed_email_address@domain.invalid wrote:

I have looked at NLTK in Python (and had been hoping a Rubyist would
rewrite it in Ruby). I will go back to NLTK and see if it has a
split-sentence algorithm of some sort. And thanks for the tip on Stanford’s
Java NLP package. Yes, those abbreviations are pesky, and I may have to
resort to an exceptions list containing the most common ones.

Look at Text::Format for some indication on how abbreviations could be
handled.

-austin


#19

On Nov 30, 2005, at 10:22 AM, Jeffrey S. wrote:

two spaces were used in the past is that the space between sentences
in typeset work is a little wider than that used between words (an
em-space vs. an en-space).

Not true at all. I was always taught to use double spaces after
sentences in grade-school homework assignments done on plain word
processors or typewriters.

Many of us were and I’ll admit that I can’t shake the habit. I still
know it’s wrong though. ;)

James Edward G. II


#20

On 11/30/05, Jeffrey S. removed_email_address@domain.invalid wrote:

Not true at all. I was always taught to use double spaces after
sentences in grade-school homework assignments done on plain word
processors or typewriters.

Then, quite honestly, you were taught wrong. I was taught to use
double spaces with a typewriter or when using fixed-pitch fonts
(although that was later, since most computers and printers didn’t
have reliable kerning routines until I was out of university).
Ultimately, the use of double spaces after a period is wrong even
with fixed-pitch fonts, but it was done to be clearer since the width
of an em-space and an en-space on a typewriter with a Courier-like
font is exactly the same. The two spaces simulate an em-space in a
typeset piece of work. (And that is fact, not opinion.)

-austin