On Jul 30, 2007, at 3:50 AM, mike b. wrote:
I have to parse about 2000 files that are written in multiple
languages (some English, some Korean, some Arabic and some Japanese).
I have to split these UTF-8 encoded files into individual sentences.
As has been stated, Ruby’s regular expression engine has a Unicode
mode and that may be all you need here, depending on how you
recognize sentence boundaries.
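To give a rough idea of what I mean, here is a minimal sketch. It assumes that a sentence ends at one of a small, hand-picked set of terminator characters (ASCII . ! ?, the ideographic full stop, the fullwidth ! and ?, and the Arabic question mark). That list and the split_sentences name are my own assumptions for illustration, not anything standard, and the rule will break sentences at abbreviations like "Dr." and at decimal numbers, so treat it as a starting point only.

```ruby
# encoding: utf-8

# Assumed terminator set (hypothetical, adjust for your data):
#   . ! ?   ASCII, also used in Korean text
#   。！？  ideographic full stop and fullwidth marks (Japanese)
#   ؟       Arabic question mark
# A sentence is a run of non-terminators followed by one or more
# terminators, or by the end of the string.
SENTENCE = /[^.!?。！？؟]+(?:[.!?。！？؟]+|\z)/u

# Hypothetical helper: returns the sentences found in a UTF-8 string,
# with surrounding whitespace removed and empty pieces dropped.
def split_sentences(text)
  text.scan(SENTENCE).map { |s| s.strip }.reject { |s| s.empty? }
end

split_sentences("Hello world. これはペンです。それは本です。")
# => ["Hello world.", "これはペンです。", "それは本です。"]
```

The /u flag asks the engine to treat the pattern as UTF-8, which is what keeps the multibyte terminators from being matched byte by byte.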
Has anyone written a good parser that can parse all these non-Latin
character languages or can someone give me some advice on how to go
about writing a parser that can handle all these fairly different languages?
I’ve released an initial version of my Ghost Wheel parser generator
library. It doesn’t have documentation yet, but it was built using
TDD and you should be able to look over the tests to see how it
works. I’m also happy to answer questions.
My hope is that it works fine for non-Latin languages, but I’ll
confess that I haven’t tested it that way yet. I would try to fix
any issues you uncovered though.
James Edward G. II