Ruby1.9 Encoding

Hey, guys!

I’ve just started learning Ruby from Python and recently posted a
question here that was promptly and effectively answered (thanks,
Glenn!), so I decided to come here once more. I hope I’ll be able to
be answering some of the questions on my own soon. :wink:

Here it is.

I’m writing a wrapper for a Korean Morphological Parser that only
works with EUC-KR encoding and has some trouble with longer texts. So,
first, I have to preprocess the input text to divide it into
sentences, remove Unicode characters which are not related to Korean
and save them for further reinclusion in the postprocessing stage.
This has worked out wonderfully and, I should say, easier than what
I’d done in Python.

The problem I’m having now is converting the string from UTF-8 (I’m
running Ubuntu with the pt_BR.UTF-8 locale) into EUC-KR, running the
Morphological Parser, reading its output and processing it. I have to parse
a whole bunch of data “spidered” from the internet and it works great
until the encoder comes across a Korean typo… Let’s make this point
clearer: the EUC-KR encoding does not cover all the possible
combinations of initial and final consonants + vowels, as Unicode
does. To explain: Unicode has 11172 precomposed hangul syllables whereas
EUC-KR (KS X 1001) has only 2350. In fact, most of the extra chars are not
used in everyday life, but they can still show up as abbreviations,
slang, smileys or typos…
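
To make the coverage gap concrete, here is a minimal sketch (not part of the original post). It assumes Ruby 1.9 or later, whose EUC-KR converter follows the KS X 1001 syllable set; the two sample syllables and the helper name `euc_kr_convertible?` are my own choices for illustration.

```ruby
# "\uD55C" (the syllable "han") is part of the EUC-KR repertoire.
# "\uD58F" is a valid Unicode hangul syllable that EUC-KR cannot represent.
covered   = "\uD55C"
uncovered = "\uD58F"

# Returns true when the whole string can be converted to EUC-KR.
def euc_kr_convertible?(str)
  str.encode("EUC-KR")
  true
rescue Encoding::UndefinedConversionError
  false
end

puts euc_kr_convertible?(covered)
puts euc_kr_convertible?(uncovered)
```

A plain `encode("EUC-KR")` call raises `Encoding::UndefinedConversionError` on the second kind of syllable, which is most likely the exception described further down.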

In Python, I would simply throw these chars away but I really didn’t
manage to understand the “Ruby encoding way”… I know I’m missing
something, but I can’t seem to find enough info around… Google
doesn’t seem to know much of this either… So, I’m coming here to ask
for your enlightenment, dear rubyist friends!

The part of my code which deals with this is as follows:

def run(txt)
  txt = txt.encode("EUC-KR")
  kts_file = Tempfile::new('kts_text')
  kts_file = open(kts_file.path, "w:EUC-KR")
  kts_file << "#{txt}\n"
  kts_file.close
  cmd = "ktspell < #{kts_file.path}" # 2> /dev/null"
  IO::popen(cmd, "r:EUC-KR").read.encode("UTF-8")
end

I found something about “ignoring” the non-existent codepoints, but it
doesn’t work… I’m even thinking that my Ruby installation might have
gotten corrupted somehow… Every time I think I did it right, I still
get The Exception popping up on the screen…

Thanks for your patience reading this looong post.

Juliano

-------- Original Message --------

Date: Thu, 10 Sep 2009 18:20:06 +0900
From: “Juliano 준호” [email protected]
To: [email protected]
Subject: Ruby1.9 Encoding

Dear Juliano,

a disclaimer first: I know no Korean, so what’s below might not work.

I’ve had to do some coding to resolve Arabic ligatures (combinations
of two letters) recently. Similar to what you describe, most of the
time there is no need to use a special combined form, and unfortunately
the same word is sometimes spelled one way and sometimes the other,
giving a list of duplicate words.

I used a list of Unicode characters with the names of the individual
characters to solve that problem.

You might find the table on this page useful:

http://www.kfunigraz.ac.at/~katzer/korean_hangul_unicode.html

I don’t know if that list is exhaustive, but you may try to individually
convert each of the syllables listed there from Unicode to EUC-KR, and
if that doesn’t work, decide what to do with the particular combination
of signs, based on the Latin transcription, creating a transform hash
for these encodings yourself.
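
A rough sketch of that idea, not taken from the thread: the first helper collects the characters that fail to convert, and the second applies a hand-made transform hash. The names `unconvertible_chars`, `TRANSFORM` and `to_euc_kr` are hypothetical, the single table entry is only an illustration, and the code assumes a Ruby version whose `String#encode` accepts the `:fallback` option.

```ruby
# Return the distinct characters of +text+ that cannot be converted
# to +target+ (EUC-KR by default).
def unconvertible_chars(text, target = "EUC-KR")
  text.each_char.select do |ch|
    begin
      ch.encode(target)
      false
    rescue Encoding::UndefinedConversionError
      true
    end
  end.uniq
end

# Hypothetical hand-made transform table: each unconvertible syllable maps
# to the substitute you decide on (here, a Latin transcription).
TRANSFORM = { "\uD58F" => "haeh" }

# Convert to EUC-KR, substituting table entries for unconvertible syllables.
# A syllable missing from the table still raises
# Encoding::UndefinedConversionError.
def to_euc_kr(text)
  text.encode("EUC-KR", fallback: TRANSFORM)
end
```

Running `unconvertible_chars` over a sample of the spidered data first would show which syllables actually need an entry in the table.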

There might also be some locale or OS-related problems with
Iconv::IGNORE.
There’s some discussion of this here :

Best regards,

Axel

Hi,

In message “Re: Ruby1.9 Encoding”
on Thu, 10 Sep 2009 18:20:06 +0900, Juliano 준호 [email protected]
writes:

|The part of my code which deals with this is as follows:
|
| def run(txt)
|   txt = txt.encode("EUC-KR")
|   kts_file = Tempfile::new('kts_text')
|   kts_file = open(kts_file.path, "w:EUC-KR")
|   kts_file << "#{txt}\n"
|   kts_file.close
|   cmd = "ktspell < #{kts_file.path}" # 2> /dev/null"
|   IO::popen(cmd, "r:EUC-KR").read.encode("UTF-8")
| end
|
|I found something about “ignoring” the non-existent codepoints, but it
|doesn’t work… I’m even thinking that my Ruby installation might have
|gotten corrupted somehow… Everytime I think I did it right, I still
|get The Exception popping up on the screen…

I had some difficulty seeing your intention from the code. Could you
show us the exception messages you’ve got?

          matz.

On Sep 10, 2009, at 4:20 AM, Juliano 준호 wrote:

> Hey, guys!
>
> I’ve just started learning Ruby from Python and recently posted a
> question here that was promptly and effectively answered (thanks,
> Glenn!), so I decided to come here once more. I hope I’ll be able to
> be answering some of the questions on my own soon. :wink:

Welcome to Ruby.

> slang, smileys or typos…
>
> In Python, I would simply throw these chars away but I really didn’t
> manage to understand the “Ruby encoding way”…

I think we can throw them away in Ruby too. See below.

> I know I’m missing something, but I can’t seem to find enough info
> around… Google
> doesn’t seem to know much of this either…

I wrote a lot about Ruby’s encoding engine on my blog:

http://blog.grayproductions.net/articles/understanding_m17n

> The part of my code which deals with this is as follows:
>
> def run(txt)
>   txt = txt.encode("EUC-KR")

Try replacing the above line with:

txt = txt.encode("EUC-KR", invalid: :replace, undef: :replace,
                 replace: "")

>   kts_file = Tempfile::new('kts_text')
>   kts_file = open(kts_file.path, "w:EUC-KR")
>   kts_file << "#{txt}\n"
>   kts_file.close
>   cmd = "ktspell < #{kts_file.path}" # 2> /dev/null"
>   IO::popen(cmd, "r:EUC-KR").read.encode("UTF-8")
> end
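
As a small illustration of what the suggested options do (my own sketch, with sample syllables chosen for the purpose rather than taken from the thread): the middle syllable below has no EUC-KR equivalent, so with `undef: :replace` and an empty `replace:` string it is silently dropped instead of raising.

```ruby
# Three hangul syllables; the middle one ("\uD58F") is not in EUC-KR.
mixed = "\uD55C\uD58F\uAE00"

# Unconvertible and invalid characters are replaced by "" (removed).
euc = mixed.encode("EUC-KR", invalid: :replace, undef: :replace, replace: "")

# Converting back shows which characters survived.
round_trip = euc.encode("UTF-8")
puts round_trip.length
```

Note that the characters are discarded for good, so anything that has to be reinserted later must be saved before this call.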

Hope that helps.

James Edward G. II

On Sep 10, 2009, at 09:39, Yukihiro M. wrote:

> I had some difficulty seeing your intention from the code. Could you
> show us the exception messages you’ve got?

My interpretation of the code is:

def run(txt)
  # translate the string into EUC-KR encoding
  txt = txt.encode("EUC-KR")

  # Create a temp file to store the data and
  # write it to the file, using the EUC-KR encoding
  kts_file = Tempfile::new('kts_text')
  kts_file = open(kts_file.path, "w:EUC-KR")
  kts_file << "#{txt}\n"
  kts_file.close

  # Run ktspell, feeding it the data from the file
  cmd = "ktspell < #{kts_file.path}" # 2> /dev/null"

  # Read the result and translate it into UTF-8
  IO::popen(cmd, "r:EUC-KR").read.encode("UTF-8")
end

I don’t know much about 1.9’s encodings, but can suggest a more
rubyish way of writing the method:

require "tempfile"

def run(txt)
  euc_txt = txt.encode("EUC-KR")

  # Declared here so that it is still visible after the block below.
  processed_utf_txt = nil

  # Tempfile::new actually returns a filehandle, so there's no need to
  # re-open the file based on the path. Since you seem to want to open the
  # file, write to it, use the data, and then immediately do away with the
  # file, the 'open' block form is probably more appropriate.
  Tempfile::open("kts_text") do |kts_file|
    # The more common way to write a newline-terminated string to a file is
    # file.puts(foo) rather than file << "#{foo}\n"
    kts_file.puts(euc_txt)
    # Flush so that ktspell sees the data when it reads the file.
    kts_file.flush
    cmd = "ktspell < #{kts_file.path}"

    # You could do all the rest on one line as you did, but this lets you
    # look at the data first using 'p processed_euc_txt' or something.
    processed_euc_txt = IO::popen(cmd, "r:EUC-KR").read

    # Again, a temporary variable to let you see if the data looks right.
    processed_utf_txt = processed_euc_txt.encode("UTF-8")
  end

  processed_utf_txt
end
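
For anyone who wants to try the round trip without ktspell installed, here is a self-contained sketch along the same lines. It is not from the thread: the method name `run_through` is made up, `cat` is only a stand-in for ktspell (any command that reads EUC-KR on stdin and writes EUC-KR on stdout would do), and it assumes a Unix-like shell.

```ruby
require "tempfile"

# Send +txt+ through +command+ as EUC-KR and return the output as UTF-8.
# +command+ defaults to "cat", a stand-in for ktspell.
def run_through(txt, command = "cat")
  # Drop characters that EUC-KR cannot represent instead of raising.
  euc_txt = txt.encode("EUC-KR", undef: :replace, replace: "")

  output = nil
  Tempfile.open("kts_text") do |file|
    file.binmode      # write the EUC-KR bytes exactly as they are
    file.puts(euc_txt)
    file.flush        # make sure the child process sees the data
    # Tag the command's output as EUC-KR while reading it.
    output = IO.popen("#{command} < #{file.path}", "r:EUC-KR") { |io| io.read }
  end

  output.encode("UTF-8")
end

puts run_through("\uD55C\uAE00")
```

The block form of `IO.popen` is used so the pipe is closed as soon as the output has been read.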

Ben