Forum: Ruby Tokenizing text

Juan Alvarez (soltikoff)
on 2009-02-25 00:07
Hello,

I need to tokenize English text into sentences. I realize this is a very
complex task to get right all of the time (if it is possible at all), but
for the time being I'm only trying to implement a better solution than
string.split('.').

Browsing around I found this snippet:

 string.scan( /\w.+?[.!?]+(?=\s|\Z)/ )

which almost works for what I need except for two cases: ellipses in the
middle of a sentence, and at least the most common abbreviations (such as
"Dr."), both of which it treats as sentence ends. Abbreviations are the
hardest part, and I've been tinkering with a couple of possible
solutions. How would you approach this?
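
To make the two problem cases concrete, here is a minimal example of what the snippet does. The sample sentence is made up for illustration; only the regex comes from the snippet above.

```ruby
# Sample text (hypothetical) containing an abbreviation and a
# mid-sentence ellipsis.
text = "Dr. Smith waited... and then left. He was late!"

# The snippet from above: start at a word character, lazily consume
# text up to a run of ., ! or ? that is followed by whitespace or the
# end of the string.
sentences = text.scan(/\w.+?[.!?]+(?=\s|\Z)/)

sentences.each { |s| puts s }
# Dr.
# Smith waited...
# and then left.
# He was late!
```

The intended result would be two sentences, but the abbreviation and the ellipsis each produce an extra split.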

Thanks in advance
Juan