Forum: Ruby Tokenizing text

Juan A. (Guest)
on 2009-02-25 01:07

I need to tokenize English text into sentences. I realize this is a very
complex task to get right all of the time (if possible at all) but for
the time being I'm only trying to implement a better solution than

Browsing around I found this snippet:

 string.scan( /\w.+?[.!?]+(?=\s|\Z)/ )

which almost works for what I need except for two cases: ellipses and,
at the very least, the most common abbreviations. Abbreviations are the
hardest part, and I've been tinkering with a couple of possible solutions.
How would you approach this?
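
To make the two failure cases concrete, here is a rough sketch of one way this might be approached: keep the snippet's regex as a first pass, then glue a fragment onto the next one when it ends in a known abbreviation or an ellipsis. The ABBREVIATIONS list and the split_sentences and incomplete? helpers are hypothetical names made up for illustration, and treating every ellipsis as a mid-sentence pause is a simplifying assumption that will be wrong for some text.

```ruby
# Hypothetical, deliberately short list of abbreviations (lowercase, no dot).
# A real list would need to be much longer and tuned to the input text.
ABBREVIATIONS = %w[mr mrs ms dr prof sr jr st vs].freeze

# First pass: the regex from the snippet. Second pass: merge fragments
# that were cut off too early.
def split_sentences(text)
  fragments = text.scan(/\w.+?[.!?]+(?=\s|\Z)/)
  sentences = []
  pending = nil

  fragments.each do |fragment|
    # Re-join with a single space; the original whitespace is not preserved.
    current = pending ? "#{pending} #{fragment}" : fragment
    if incomplete?(current)
      pending = current
    else
      sentences << current
      pending = nil
    end
  end

  sentences << pending if pending # keep a trailing fragment rather than drop it
  sentences
end

# True when the fragment probably does not end a sentence.
def incomplete?(fragment)
  # Assumption: an ellipsis is always a pause, never a sentence end.
  return true if fragment.end_with?("...")

  # Last word before a single final period, if there is one.
  last_word = fragment[/(\w+)\.\z/, 1]
  !last_word.nil? && ABBREVIATIONS.include?(last_word.downcase)
end

text = "Dr. Smith arrived. Well... he was late. Mr. Jones left!"
naive = text.scan(/\w.+?[.!?]+(?=\s|\Z)/)
merged = split_sentences(text)
```

On this sample the plain regex breaks after "Dr.", "Well..." and "Mr.", while the merged version keeps those three sentences whole. It still fails when an abbreviation really does end a sentence, so it is only a step up from the regex alone, not a complete tokenizer.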

Thanks in advance