Hello,
I need to tokenize English text into sentences. I realize this is a very
complex task to get right all of the time (if possible at all) but for
the time being I’m only trying to implement a better solution than
string.split('.').
Browsing around I found this snippet:
string.scan( /\w.+?[.!?]+(?=\s|\Z)/ )
which almost works for what I need except for two cases: ellipses and
at least the most common abbreviations. Abbreviations are the hardest
part and I’ve been tinkering with a couple of possible solutions. How
would you approach this?
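To make the two problem cases concrete, here is a minimal sketch of how the snippet behaves. The helper name split_sentences and the sample strings are made up for illustration; only the regex comes from the snippet above.

```ruby
# Hypothetical wrapper around the snippet; the method name is illustrative only.
SENTENCE_RE = /\w.+?[.!?]+(?=\s|\Z)/

def split_sentences(text)
  text.scan(SENTENCE_RE)
end

# Abbreviation case: "Dr." is followed by whitespace, so it is
# returned as if it were a complete sentence.
p split_sentences("Dr. Smith arrived. He was late.")
# => ["Dr.", "Smith arrived.", "He was late."]

# Ellipsis case: the run of dots is followed by whitespace, so the
# sentence is cut right after "Wait...".
p split_sentences("Wait... what happened? Nothing.")
# => ["Wait...", "what happened?", "Nothing."]
```

In both cases the lookahead only checks for whitespace or end of input after the punctuation, so it cannot tell a real sentence end from an abbreviation or a mid-sentence ellipsis.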
Thanks in advance
Juan