Unicode illegal characters problem

Axel_E · November 3, 2007, 4:39pm

Dear all,

when using Iconv, I am repeatedly running into
problems.
I tried to run this bit of code:

#!/usr/bin/env ruby
$KCODE = 'u'
require 'iconv'

s = 'caffè'
ic_ignore = Iconv.new('US-ASCII//IGNORE', 'UTF-8')
puts ic_ignore.iconv(s) # => caff

ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s) # => caff`e

(from here:
Iconv and incompatible encodings - Ruby - Ruby-Forum),
but instead of the promised result in the comments above,
I am getting:

corr_ebook.rb:29:in `iconv': "\351" (Iconv::InvalidCharacter)
from corr_ebook.rb:29

Why?
I am using ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux] (OpenSuse 10.2)

Thank you very much!

Best regards,

Axel

Axel_E · November 3, 2007, 4:56pm

On Sat, 3 Nov 2007 10:38:22 -0500
[email protected] wrote:

s = 'caffè'
I am getting:

corr_ebook.rb:29:in `iconv': "\351" (Iconv::InvalidCharacter)
from corr_ebook.rb:29

Why ?
I am using ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux] (OpenSuse 10.2)

Thank you very much!

Your data is not UTF-8, from the error, most likely iso-8859-1 (or 15)

man iso_8859-1 shows octal 351 as expected.

351 233 E9 é LATIN SMALL LETTER E WITH ACUTE

-jh
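
A minimal sketch of how the same diagnosis could be checked in code. It assumes Ruby 1.9 or later, whose String encoding API does not exist in the Ruby 1.8.6 used in this thread; the helper names valid_utf8? and latin1_to_utf8 are made up for illustration.

```ruby
# Sketch only: relies on String#force_encoding and String#encode (Ruby 1.9+).

# Returns true when the raw bytes form well-formed UTF-8.
def valid_utf8?(bytes)
  bytes.dup.force_encoding("UTF-8").valid_encoding?
end

# Reinterprets the bytes as ISO-8859-1 and transcodes them to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

latin1 = "caf\xE9".b         # hypothetical sample: "café" as ISO-8859-1 bytes
puts valid_utf8?(latin1)     # false: a lone \xE9 (octal 351) is not valid UTF-8
puts latin1_to_utf8(latin1)  # café
```

If valid_utf8? returns false for the file contents, converting from the real source encoding first should avoid the InvalidCharacter error, although it does not by itself change how //TRANSLIT behaves.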

Axel_E · November 3, 2007, 5:46pm

On Sat, 3 Nov 2007 11:20:33 -0500
[email protected] wrote:

Dear all,

Your data is not UTF-8, from the error, most likely iso-8859-1 (or 15)
of the file I read the text in from,
“caff?” instead of “caff`e” as promised.

I believe that’s a “feature” of ruby iconv.

$ echo café | iconv -f UTF-8 -t ASCII//TRANSLIT
cafe

while

s="café"
ic_translit = Iconv.new('ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s)
=> caf?

-jonathan

Axel_E · November 3, 2007, 5:22pm

-------- Original Message --------

Date: Sun, 4 Nov 2007 00:55:04 +0900
From: Jonathan H. [email protected]
To: [email protected]
Subject: Re: Unicode illegal characters problem

$KCODE = ‘u’
(from here:

Thank you very much!

Your data is not UTF-8, from the error, most likely iso-8859-1 (or 15)

man iso_8859-1 shows octal 351 as expected.

351 233 E9 é LATIN SMALL LETTER E WITH ACUTE

-jh

Dear Jonathan,

thanks for the hint. You are right. I corrected the encoding
of the file I read the text in from,

$KCODE = 'u'
require 'iconv'
s = IO.readlines("/home/axel/text.txt").to_s
p s # => 'caffè'
ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s) # => caff`e

However, now I still get
“caff?” instead of “caff`e” as promised.

I have several novel-length texts to convert with many
different accents.

Thanks for helping me again!

Best regards

Axel

Axel_E · November 3, 2007, 6:11pm

Dear Jonathan,

I believe that’s a “feature” of ruby iconv.

thanks for your clarifications!

Best regards,

Axel

Axel_E · November 3, 2007, 6:46pm

On Sat, 3 Nov 2007 12:06:21 -0500
[email protected] wrote:

Dear Jonathan,

I believe that’s a “feature” of ruby iconv.

thanks for your clarifications!

Further, it’s a feature of iconv on Linux. On my FreeBSD box I
get the expected results, both from iconv in a shell and from ruby => caf’e.

Since, on Linux, the iconv application produces better results than ruby’s
iconv, I tend to pipe data through iconv; at least I get a semblance
of usability that way.

-jonathan
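
A rough sketch of the piping workaround described above, assuming an iconv executable is on the PATH and a Ruby recent enough to support keyword arguments. The method name transliterate_with_iconv and its defaults are illustrative only, and the output for accented input still depends on the platform's iconv and locale data.

```ruby
require "open3"

# Sends text to the system iconv command on stdin and returns its stdout.
# The method name and keyword defaults are illustrative, not from the thread.
def transliterate_with_iconv(text, from: "UTF-8", to: "ASCII//TRANSLIT")
  output, status = Open3.capture2("iconv", "-f", from, "-t", to, stdin_data: text)
  raise "iconv exited with status #{status.exitstatus}" unless status.success?
  output
end

# Example call (result varies with the platform's iconv and locale data):
#   puts transliterate_with_iconv("café")
```

Passing the arguments as separate strings avoids going through a shell, so the text never needs quoting.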

Axel_E · November 3, 2007, 11:33pm

Axel E. wrote:

Jonathan H. wrote:

I believe that’s a “feature” of ruby iconv.

$ echo café | iconv -f UTF-8 -t ASCII//TRANSLIT
cafe

while

s="café"
ic_translit = Iconv.new('ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s)
=> caf?

Dear Jonathan,

I believe that’s a “feature” of ruby iconv.

thanks for your clarifications!

How does that clarify things for you? I read the other thread, and that
doesn’t clarify anything for me. Are you simply interpreting Jonathan
Hudson’s statement to mean the other thread is wrong?

Also, I don’t think it is very helpful to include every possible unicode
statement you can think of in an attempt to solve unicode problems. For
instance, this line:

$KCODE = 'u'

Why are you including that line in your program? According to The Ruby
Way (2nd ed.), p. 141,

“… $KCODE … determines the behavior of many core methods that
manipulate strings.”

However, in the code you posted, as far as I can tell, you aren’t
calling any methods where the $KCODE changes the way they work. Do you
just include that line anytime you are dealing with unicode, or did you
include it for some specific reason?

Thanks.

Axel_E · November 3, 2007, 11:54pm

Axel E. wrote:

-------- Original Message --------

Date: Sun, 4 Nov 2007 00:55:04 +0900
From: Jonathan H. [email protected]
To: [email protected]
Subject: Re: Unicode illegal characters problem

$KCODE = ‘u’
(from here:

Thank you very much!

Your data is not UTF-8, from the error, most likely iso-8859-1 (or 15)

man iso_8859-1 shows octal 351 as expected.

351 233 E9 é LATIN SMALL LETTER E WITH ACUTE

-jh

Dear Jonathan,

thanks for the hint. You are right. I corrected the encoding
of the file I read the text in from,

$KCODE = 'u'
require 'iconv'
s = IO.readlines("/home/axel/text.txt").to_s
p s # => 'caffè'
ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s) # => caff`e

However, now I still get
“caff?” instead of “caff`e” as promised.

Another data point:

require 'iconv'

s = "caf_x_c3_x_a9"
# The last char is the UTF-8 encoding, in hex format, of 'e' with acute.
# I added the underscores so that the encoding won't be rendered
# into the actual character.

puts s

# I see cafe where the 'e' is an 'e' with acute, which means my
# display device understands UTF-8.

ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s)

# I see: caf'e

Axel_E · November 4, 2007, 12:05am

Axel E. wrote:

$KCODE = ‘u’
(from here:

Thank you very much!

Your data is not UTF-8, from the error, most likely iso-8859-1 (or 15)

man iso_8859-1 shows octal 351 as expected.

351 233 E9 é LATIN SMALL LETTER E WITH ACUTE

-jh

Dear Jonathan,

thanks for the hint. You are right. I corrected the encoding
of the file I read the text in from,

$KCODE = 'u'
require 'iconv'
s = IO.readlines("/home/axel/text.txt").to_s
p s # => 'caffè'
ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
puts ic_translit.iconv(s) # => caff`e

However, now I still get
“caff?” instead of “caff`e” as promised.

Try running this code:

require 'iconv'

s = "caf_x_c3_x_a9" # remove underscores
p s
# I see: caf_303_251 (without the underscores)
# _303_251 (without the underscores) is the UTF-8
# encoding in octal format. I really hate that ruby
# displays octal format instead of hex format!

ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
new_s = ic_translit.iconv(s)

p new_s # I see: caf'e
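
For the hex-versus-octal complaint, one possible way to show a string's bytes in hex without relying on how p renders them. This is only a sketch; hex_bytes is a hypothetical helper name, and it uses nothing beyond String#unpack and format.

```ruby
# Returns the bytes of a string as space-separated two-digit hex values.
# (hex_bytes is a hypothetical helper name.)
def hex_bytes(str)
  str.unpack("C*").map { |byte| format("%02x", byte) }.join(" ")
end

puts hex_bytes("caf\xc3\xa9")  # 63 61 66 c3 a9
```

Because the result is plain ASCII, it can be pasted into a post without any display device turning it into a different character.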

Axel_E · November 3, 2007, 10:02pm

[Jonathan H. [email protected], 2007-11-03 18.45 CET]

Further, it’s a feature of iconv on Linux. On my FreeBSD box I
get the expected results, both from iconv in a shell and from ruby => caf’e.

Since, on Linux, the iconv application produces better results than ruby’s
iconv, I tend to pipe data through iconv; at least I get a semblance
of usability that way.

It’s not iconv, it’s your locale data (which iconv uses). In German,
“ü” is probably transliterated to ASCII as “ue”; in Spanish, as “u”. There
isn’t a single way to do it, and the rules are encoded in the system
locale files.

Now, why ruby’s iconv gives a different result than the program iconv…
I don’t know. Maybe ruby hides some LC_* environment variables from the
library (a wild and probably incorrect guess)…

Summing up: don’t use iconv to transliterate to ASCII; build your own
table instead. (It’s easy: the descriptions of all Latin letters with
diacritics follow the same pattern.)

Good luck.
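
One possible shape for the build-your-own-table advice, as a sketch rather than a finished solution. It assumes Ruby 2.2 or later for String#unicode_normalize (not available in the 1.8.6 discussed here) and lets Unicode decomposition stand in for most of the table; ascii_fold and SPECIAL_CASES are hypothetical names. It always drops the accent, so “ü” becomes “u”; a German-style “ue” would need its own entries.

```ruby
# Sketch only: String#unicode_normalize needs Ruby 2.2 or later.

# Hypothetical extra mappings for letters with no base-letter decomposition.
SPECIAL_CASES = { "ß" => "ss", "æ" => "ae", "ø" => "o" }.freeze

def ascii_fold(text)
  decomposed = text.unicode_normalize(:nfd)  # "é" becomes "e" + U+0301
  stripped = decomposed.gsub(/\p{Mn}/, "")   # drop the combining marks
  # Anything still outside ASCII goes through the table, else becomes "?".
  stripped.gsub(/[^\x00-\x7F]/) { |ch| SPECIAL_CASES.fetch(ch, "?") }
end

puts ascii_fold("caffè, café, über")  # caffe, cafe, uber
```

The result does not depend on the system locale files, which is the point of avoiding iconv for this step.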

Axel_E · November 4, 2007, 12:58am

Dear 7stud,

thanks for the effort that you are putting into
solving this problem.
When I thanked Jonathan for his clarifications,
I meant that I now believe the solution I hoped
to get from the thread I took that code from in
the first place isn’t going to work for me as
easily as I thought.
I do indeed get different behaviour from the system
iconv and Ruby’s iconv, as Jonathan said.
With the code you sent me, I
get:

require 'iconv'

s = "caf\xc3\xa9" # (having removed underscores)
p s # => "caf\303\251"
ic_translit = Iconv.new('US-ASCII//TRANSLIT', 'UTF-8')
new_s = ic_translit.iconv(s)
p new_s # => "caf?"

What system are you on?
Mine is Linux OpenSuse 10.2, 64bit, Ruby 1.8.6.
If I had your behaviour on my system, this transliteration
would provide a nice conversion of the accents to LaTeX,
but now I think I’ll write maybe two dozen gsub lines …
unless there already is some script that does
a Unicode-name-to-LaTeX-accent conversion, something like

small latin letter with acute => \'{} ?

Best regards,

Axel
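
A sketch of what such an accent-to-LaTeX conversion might look like, in place of two dozen separate gsub lines. It assumes Ruby 2.2 or later for String#unicode_normalize; latex_accents and the LATEX_ACCENTS table are hypothetical, and the table covers only a handful of marks.

```ruby
# Sketch only: String#unicode_normalize needs Ruby 2.2 or later.

# Combining mark => LaTeX accent command (a small, hypothetical selection).
LATEX_ACCENTS = {
  "\u0300" => "\\`",   # grave
  "\u0301" => "\\'",   # acute
  "\u0302" => "\\^",   # circumflex
  "\u0303" => "\\~",   # tilde
  "\u0308" => "\\\"",  # diaeresis / umlaut
}.freeze

def latex_accents(text)
  # Decompose, then rewrite each letter + combining mark pair.
  text.unicode_normalize(:nfd).gsub(/(\p{L})(\p{Mn})/) do
    base, mark = $1, $2
    command = LATEX_ACCENTS[mark]
    # Marks missing from the table are recomposed and left as they were.
    command ? "#{command}{#{base}}" : (base + mark).unicode_normalize(:nfc)
  end
end

puts latex_accents("caffè café")  # caff\`{e} caf\'{e}
```

Letters carrying more than one combining mark are handled only for the first mark, so texts with stacked diacritics would need extra care.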

-------- Original Message --------

Date: Sun, 4 Nov 2007 08:05:49 +0900
From: 7stud – [email protected]
To: [email protected]
Subject: Re: Unicode illegal characters problem

Axel_E · November 4, 2007, 11:09am

-------- Original Message --------

Date: Sun, 4 Nov 2007 17:06:38 +0900
From: 7stud – [email protected]
To: [email protected]
Subject: Re: Unicode illegal characters problem

First of all, even posting messages about unicode is hard to do because
you have no idea what the other person is seeing. For instance, when
you say 'I see this output:

café

Dear 7stud,

you have no idea how my display device is displaying those characters,
and I have no idea how your display device is displaying those
characters. Does your display device not understand the encoding so
there is a question mark at the end: caf?, and my display device does
understand the encoding, so I see an ‘e’ with acute’? Or, do you see an
‘e’ with acute, but I see a question mark? You just can’t be sure what
the other person is seeing. I used underscores intermingled with my
character encodings to prevent any display device from rendering them.
That way anyone reading the code will know exactly what’s there.

As a result, to be clear what’s going on, you don’t want to be posting:

s = "caf\xc3\xa9" # (having removed underscores)

Well thanks for pointing that out, but that’s
not a problem here and cannot be, as I just posted some code snippet to
tell what I was doing to get that result – as it originally came from
you, you can’t possibly misunderstand it, can you ?

You want to leave the underscores in when posting about character
encodings. Of course, when you run the code, you need to remove the
underscores.

I am intelligent enough to understand this – the comment was just
to say that I indeed did remove those underscores.

What system are you on ?
Mine is Linux OpenSuse 10.2, 64bit, Ruby 1.8.6.
If I had your behaviour on my system, this transliteration
would provide a nice conversion of the accents to Latex,

mac os x 10.4.7, pre-installed ruby 1.8.2

So iconv (both Ruby’s and the Linux/Unix command) seems to behave
differently on different operating systems, independently of any
actual or possible rendering issues, which is what Jonathan,
Carlos and I found in our previous discussion.

Nice to know that, nevertheless.

I’d like to say thanks to all of you for your posts.

Best regards,

Axel

Axel_E · November 4, 2007, 4:44pm

Axel E. wrote:

Well thanks for pointing that out, but that’s
not a problem here and cannot be, as I just posted some code snippet to
tell what I was doing to get that result – as it originally came from
you, you can’t possibly misunderstand it, can you ?

Who says I’m viewing your current post with the same display device
that I used to write my previous post? People these days own PCs,
laptops, cell phones, etc. – all of which can be used to browse the
internet, and all of which may understand different encodings.
Who says the encoding that was set on the display device that I used
to send my earlier post hasn’t been set to another encoding in the
meantime? All it takes is a simple click on View>Text Encoding>some
other encoding.

Axel_E · November 5, 2007, 10:46am

-------- Original Message --------

Date: Mon, 5 Nov 2007 00:44:54 +0900
From: 7stud – [email protected]
To: [email protected]
Subject: Re: Unicode illegal characters problem

Axel E. wrote:

Well thanks for pointing that out, but that’s
not a problem here and cannot be, as I just posted some code snippet to
tell what I was doing to get that result – as it originally came from
you, you can’t possibly misunderstand it, can you ?

Dear 7stud,

Who says I’m viewing your current post with the same display device
that I used to write my previous post? People these days own PCs,
laptops, cell phones, etc. – all of which can be used to browse the
internet, and all of which may understand different encodings.

Nobody.

Who says the encoding that was set on the display device that I used
to send my earlier post hasn’t been set to another encoding in the
meantime? All it takes is a simple click on View>Text Encoding>some
other encoding.

Nobody. Yet if you do these things, you run the risk
of not being perceived as particularly helpful, as the quality of any
advice on this list is, if in doubt, to be measured against whether
it leads to working code on the original poster’s machine, not whether
one might deliberately be able to create misunderstandings.

Best regards,

Axel

Axel_E · November 5, 2007, 10:38pm

Axel E. wrote:

Who says the encoding that was set on the display device that I used
to send my earlier post hasn’t been set to another encoding in the
meantime? All it takes is a simple click on View>Text Encoding>some
other encoding.

nobody. Yet if you do these things, you put yourself into the danger
of not being perceived as particularly helpful, as the quality of any
advice on this list is, if in doubt, to be measured against whether
it leads to working code on the original poster’s machine,

Unfortunately, you don’t get it. I have no idea what the encoding is on
your machine. You have no idea what the encoding is on anyone’s
machine that responds to your post–and most likely they’re all
different. In order to discuss unicode problems without creating
confusion that can often produce conflicting advice, you can’t just post
a bunch of characters which may or may not get rendered for other people
the same way you see them.

I recommended that when you ask unicode questions that you put
underscores in the characters in the code you post. That way no machine
can possibly render them into the character they represent. Then
everyone who reads your post can know exactly what characters you are
dealing with. I also recommended that when you post output that you
describe the output you see, rather than just posting the output–that
way everyone will know what you see. If you don’t care to do that,
that is your choice. Most people won’t even respond to unicode
questions. If you follow my suggestions, I think it will make it easier
for the few people who do.

Personally, I don’t keep track of the current settings for the encodings
on the various machines I use: work pc, home pc, multiple laptops, cell
phones. I certainly don’t synchronize them. And if someone changes the
encoding on one of those machines, or I change it and forget to change
it back, I won’t realize it. Typically, I’ll read a post and if it’s
totally confusing, presumably because what I see is different than what
the op is describing, I move on—which I’m sure at this point is
something you wish I would do. So I will.

Axel_E · November 4, 2007, 9:06am

Axel E. wrote:

With respect to the code you sent me, I
get:

require 'iconv'

s = "caf\xc3\xa9" # (having removed underscores)

First of all, even posting messages about unicode is hard to do because
you have no idea what the other person is seeing. For instance, when
you say 'I see this output:

café

you have no idea how my display device is displaying those characters,
and I have no idea how your display device is displaying those
characters. Does your display device not understand the encoding so
there is a question mark at the end: caf?, and my display device does
understand the encoding, so I see an ‘e’ with acute’? Or, do you see an
‘e’ with acute, but I see a question mark? You just can’t be sure what
the other person is seeing. I used underscores intermingled with my
character encodings to prevent any display device from rendering them.
That way anyone reading the code will know exactly what’s there.

As a result, to be clear what’s going on, you don’t want to be posting:

s = "caf\xc3\xa9" # (having removed underscores)

You want to leave the underscores in when posting about character
encodings. Of course, when you run the code, you need to remove the
underscores.

What system are you on ?
Mine is Linux OpenSuse 10.2, 64bit, Ruby 1.8.6.
If I had your behaviour on my system, this transliteration
would provide a nice conversion of the accents to Latex,

mac os x 10.4.7, pre-installed ruby 1.8.2