All this talk about Unicode support and HTML parsing got me
wondering about how to parse Japanese text. There are no spaces to
separate words, and though the Japanese language has some modifiers,
or particles, they are sometimes used inconsistently. I could
quote examples, but if you can’t read Kanji, Hiragana, and Katakana
they would most likely be meaningless.
So, knowing what little I do of Japanese (I have been studying for a
while and living in Japan for close to four years), I was wondering how
search engines like Google and Yahoo parse Japanese text, let alone
whole web pages. There are numerous filters to extract text from web
pages, but parsing Japanese text is another matter altogether.
I have found one Open Source project which seems to be addressing
this, but I was wondering if there is a solution for Ruby?
Now for the trivia… I’ve been reading some Japanese text:
“Hiragana Times”, a magazine which prints its articles in Japanese
and English as a learning tool; my newspaper, “The Japan Times”,
which has a weekly section devoted to bilingual education; and
my class textbooks. I’ve also read some Manga. They
generally present the Kanji with tiny Hiragana characters
above or beside them, giving the phonetic reading of the Kanji.
Guess what these tiny Hiragana helpers are called… you guessed it:
“Ruby Annotation”. Check out what I found on the W3C site:
http://www.w3.org/TR/ruby/
“Any sufficiently advanced technology is indistinguishable from
magic.”
- A. C. Clarke