Ruby iconv UTF-8 to ISO-8859-2 (Polish)

Tobes · January 3, 2007, 10:37am

Hi there,

I’ve been using Iconv with Ruby (OnRails) quite successfully for some
time. We keep text as UTF-8 in our mysql database, and then convert it
using Iconv to ISO-8859-1 when rendering PDF files in English, Danish,
French or Dutch. Conversions work well, and RPDF outputs the converted
text nicely.

However, we now want to output PDFs in ISO-8859-2. For some reason the
output is garbled, see here:

http://www.tobinharris.com/media/mtq38.jpg

I just don’t get it, can anyone see what might be happening? Is there a
known issue with converting UTF-8 to 8859-2?

Many thanks

Tobin

Tobes · January 3, 2007, 11:37am

On 1/3/07, Tobes [email protected] wrote:

Hi there,

I’ve been using Iconv with Ruby (OnRails) quite successfully for some
time. We keep text as UTF-8 in our mysql database, and then convert it
using Iconv to ISO-8859-1 when rendering PDF files in English, Danish,
French or Dutch. Conversions work well, and RPDF outputs the converted
text nicely.

First of all, I am curious. Why are you converting? If the text is
already in UTF-8 why is there a need to down-convert? I am not familiar
enough with the internals of PDF, but are there issues with using UTF-8
specifically with PDF?

However, we now want to output PDFs in ISO-8859-2. For some reason the
output is garbled, see here:

http://www.tobinharris.com/media/mtq38.jpg

I just don’t get it, can anyone see what might be happening? Is there a
known issue with converting UTF-8 to 8859-2?

It looks like that PDF is 8859-1. It looks like it ignored your
request to translate to Latin-2, or your PDF viewer is assuming that it
is latin-1. The Polish characters are mapped to various western
european ones instead. Double check that the characters in your
database were entered as UTF-8 (the Polish-specific characters will be
2 bytes not 1).

Tobes · January 3, 2007, 11:55am

Tobes wrote:

http://www.tobinharris.com/media/mtq38.jpg

I just don’t get it, can anyone see what might be happening? Is there a
known issue with converting UTF-8 to 8859-2?

No, the problem is that PDF::Writer thinks that it is encoded in
“WinAnsiEncoding”. See the manual[1], page 5.

Try to pass to the writer a text encoded in UTF16-BE (manual, page 7),
or provide a custom mapping between byte codes and characters (page 6).

Good luck.

1: http://ruby-pdf.rubyforge.org/pdf-writer/manual/manual.pdf

Tobes · January 6, 2007, 7:30pm

Just to add a little more to the situation, I discovered that the
client is first entering their Polish language text in Microsoft
Notepad, and saving as Unicode.

Example document here:

http://www.tobinharris.com/media/mtq48.report.polish.1.txt

They are then cutting and pasting this into a web form. The web form is
UTF-8 (this is set in HTML header, and we’re using Lighttpd server).

The web form gets saved by RubyOnRails to the MySQL database, which is
set to store text in UTF-8.

Both the web site and reports use the UTF-8 data entered through the
web form. The web site displays it fine, it’s just that when I come to
pass it into Ruby FPDF, problems start (see
http://www.tobinharris.com/media/mtq38.jpg).

Also, when I view the data in the MySQL query browser, it looks jumbled
(see http://www.tobinharris.com/media/mtq38_db.jpg). Another site using
UTF8 looks fine in the query analyzer (Japanese, Chinese, Swedish etc).

Any thoughts welcome

Thanks

Tobin

Tobes · January 4, 2007, 11:12am

Hi Richard,

Firstly, thanks for the reply!

First of all, I am curious. Why are you converting? If the text is already
in UTF-8 why is there a need to down-convert? I am not familiar enough
with the internals of PDF, but are there issues with using UTF-8 specifically
with PDF?

I’m using RPDF and it has never coped with UTF-8 text, therefore I
convert to Latin.

It looks like that PDF is 8859-1. It looks like it ignored your
request to translate to Latin-2, or your PDF viewer is assuming that it
is latin-1. The Polish characters are mapped to various western
european ones instead. Double check that the characters in your
database were entered as UTF-8 (the Polish-specific characters will be
2 bytes not 1).

I have a feeling it might be in the database now, as I observed that
the text looks strange even in the MySQL query tool. However, if I
manually enter some text through the tool, it appears ok. See the
bottom entry entitled ‘test’ here
http://www.tobinharris.com/media/mtq38_db.jpg. The test is ok, but all
the other stuff entered by the client through the web site is not.

If characters in the DB are not UTF-8, is there any way to convert them
or is the information lost forever?

Many thanks

Tobin

Tobes · January 6, 2007, 7:30pm

Carlos wrote:

No, the problem is that PDF::Writer thinks that it is encoded in
“WinAnsiEncoding”. See the manual[1], page 5.

Try to pass to the writer a text encoded in UTF16-BE (manual, page 7),
or provide a custom mapping between byte codes and characters (page 6).

Thanks for the links and the advice Carlos.

I’m actually using Ruby FPDF (http://zeropluszero.com/software/fpdf/),
and couldn’t see a dependency on PDF::Writer. However, using iconv to
convert to UTF-16 gives a different result
http://www.tobinharris.com/media/mtq38_utf16.jpg.

Do you know of any tools that will let me reliably inspect the data in
the database to see what encoding the information is being stored in?
MySQL was set up to store UTF-8, and since the text data is sent from a
UTF-8 formatted web page, I assumed this would be the case. However,
I’m thinking that it wasn’t UTF-8 at all, and so I need to know what
the original encoding is.

I’m also definitely lacking some knowledge in this area, so any
pointers to resources/tools would be appreciated.

Many thanks

Tobin

Tobes · January 19, 2007, 4:29pm

[Tobes [email protected], 2007-01-04 16.55 CET]

UTF-8 formatted web page, I assumed this would be the case. However,
I’m thinking that it wasn’t UTF-8 at all, and so need to know what the
original encoding is?

I’m also definitely lacking some knowledge in this area, so any
pointers to resources/tools would be appreciated.

Hi. I assumed you were using “railspdfplugin”
http://rubyforge.org/projects/railspdfplugin/

which is the first Google result for RPDF, and depends on PDF::Writer.

I can’t access the Ruby FPDF page right now (“502 Bad Gateway” error
message), but if it is based on PHP’s FPDF, then you just have to follow
the steps here:
Adding new fonts and encodings

(extrapolated to Ruby’s FPDF, of course).

WRT the screenshot of your other message, there are two possibilities:

1. That application, the MySQL query tool, is not UTF-8 aware. So, it
   interprets the 2 bytes of “ł” (197, 130) as 2 characters in some
   single-byte encoding (probably latin-1), which gives “Å” and an
   unprintable character. Your test line wasn’t UTF-8 encoded at all.
2. The application is UTF-8 aware, the test line is in UTF-8, but the
   data from your web pages was already in UTF-8 and you thought it
   wasn’t and encoded it again to UTF-8.

To test if a string is encoded in UTF-8, just examine its bytes
p str.unpack("C*")

and see if the diacritic letters are encoded with 2 or more bytes
(UTF-8), or only one (iso-8859-*, cp*, etc.). (If you see four then you
encoded them twice :).
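The byte counts described above can be checked with a minimal sketch like the following. It assumes a modern Ruby (1.9 or later) and uses String#encode and force_encoding in place of Iconv, which is an assumption of this example and not what the thread itself used; the variable names are illustrative only.

```ruby
# "ł" (U+0142) written with an escape so the source file encoding does not matter.
utf8 = "\u0142"

# The same letter in a single-byte encoding.
latin2 = utf8.encode("ISO-8859-2")

# Simulated double encoding: misread the UTF-8 bytes as Latin-1,
# then encode the result to UTF-8 a second time.
twice = utf8.dup.force_encoding("ISO-8859-1").encode("UTF-8")

p utf8.unpack("C*")    # => [197, 130]            two bytes: UTF-8
p latin2.unpack("C*")  # => [179]                 one byte: ISO-8859-2
p twice.unpack("C*")   # => [195, 133, 194, 130]  four bytes: encoded twice
```

A string that shows four bytes for one accented letter can often be repaired by reversing the second step, provided nothing was lost in between.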

HTH. Good luck.

Tobes · January 19, 2007, 4:30pm

[Tobes [email protected], 2007-01-04 21.50 CET]
[…]

So, I can see that the character “ś” must correspond to the 3rd and
[…]
347 is in fact a “ś”.

This would suggest that the database has UTF-8 text, and it’s getting
into Ruby without corruption! Is this right?

I think yes.

So, the question now is why doesn’t Iconv convert my UTF-8 to Latin2
correctly… That could just be because the original text can’t be
converted due to additional characters outside of the Latin2 set.

I still think the problem is not in the iconv library, but in the FPDF
one. Think about it this way: “³” in UTF-8 ([179], byte sequence
[194, 179]) translates to [179] in latin1. “ł” in UTF-8 ([322], byte
sequence [197, 130]) translates to [179] in latin2. Now you provide a
string with the byte 179 to the PDF library. How should it render that
byte? As a “³” or as a “ł”? You must tell it how you want it by telling
it which encoding you are using. If you don’t tell it, it assumes you
are using latin1; this is why you see a lot of “³” instead of “ł” in
the PDF output at
http://www.tobinharris.com/media/mtq38.jpg .
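The ambiguity of byte 179 can be reproduced with a short sketch. It assumes a modern Ruby with String#force_encoding and String#encode (not Iconv, and not FPDF itself), and only illustrates that one byte decodes to two different characters depending on the declared encoding.

```ruby
# One raw byte, with no encoding information attached.
byte = [179].pack("C")

# Decode the same byte under two different assumptions.
as_latin1 = byte.dup.force_encoding("ISO-8859-1").encode("UTF-8")
as_latin2 = byte.dup.force_encoding("ISO-8859-2").encode("UTF-8")

p as_latin1.unpack("U*")  # => [179]  U+00B3, superscript three
p as_latin2.unpack("U*")  # => [322]  U+0142, l with stroke
```

The byte is identical in both cases; only the label changes, which is why the PDF library has to be told which encoding its font uses.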

The way to tell FPDF to use another encoding (for the PHP version, I
suppose the Ruby one would be similar) is in the link I put in my
previous message.

Good luck.

Tobes · January 19, 2007, 4:29pm

Hi Carlos,

Thanks v much for the advice. Thought I’d start with looking at what’s
already in the database using unpack.

that application, the MySQL query tool, is not UTF-8 aware. So, it
interprets the 2 bytes of “ł” (197, 130) as 2 characters in some single-byte
encoding (probably latin-1), which gives “Å” and an unprintable character.
Your test line wasn’t UTF-8 encoded at all.

Yeah, for another db on the server it works fine, so I’m guessing it’s
your 2nd option. Your explanation of the 2 bytes solves another
question I had though

The application is UTF-8 aware, the test line is in UTF-8, but the data
from your web pages was already in UTF-8 and you thought it wasn’t and
encoded it again to UTF-8.

To test if a string is encoded in UTF-8, just examine its bytes
p str.unpack("C*")

and see if the diacritic letters are encoded with 2 or more bytes (UTF-8),
or only one (iso-8859-*, cp*, etc.). (If you see four then you encoded
them twice :).

Here’s a test case

On web page after being loaded from DB: “Wyślij” [This is correct!]
In MySQL Analyser: “WyÅ›lij” [bad, even though MySQL analyser is UTF-8]
In Interactive Ruby (IRB) printed to console, after loading from DB:
“Wy┼ølij” [expected in a DOS prompt!]
In IRB unpacked, after loading from DB:
[87, 121, 197, 155, 108, 105, 106]
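As a rough check, the three displays in this test case can all be derived from the same seven bytes. The sketch below assumes a modern Ruby, and it assumes the query tool decoded the bytes as Windows-1252 and the DOS console as code page 850; neither code page is stated in the thread, they are simply the ones that fit the observed output.

```ruby
# The bytes reported by unpack("C*") in the test case.
raw = [87, 121, 197, 155, 108, 105, 106].pack("C*")

# Read the same bytes under three different encodings.
as_utf8   = raw.dup.force_encoding("UTF-8")
as_cp1252 = raw.dup.force_encoding("Windows-1252").encode("UTF-8") # assumed query tool
as_cp850  = raw.dup.force_encoding("CP850").encode("UTF-8")        # assumed DOS console

puts as_utf8    # the correct Polish word, as shown on the web page
puts as_cp1252  # two wrong characters in place of the accented letter
puts as_cp850   # a box-drawing character followed by a slashed o
```

All three strings come from identical bytes, which suggests the stored data is fine and only the viewers disagree about how to read it.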

So, I can see that the character “ś” must correspond to the 3rd and
4th bytes of “Wyślij”.

Looking at the Ruby help, I see I can do this

p str.unpack("U*") to get the UTF-8 characters, which gives:

[87, 121, 347, 108, 105, 106]

According to this,
Unicode Character 'LATIN SMALL LETTER S WITH ACUTE' (U+015B), character
347 is in fact a “ś”.
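The two unpack calls can be compared side by side in a small sketch. It rebuilds the string from the bytes listed above instead of loading it from the database, and it assumes a modern Ruby (valid_encoding? and force_encoding did not exist in the Ruby of 2007).

```ruby
# Rebuild the string from the bytes seen in IRB and label it as UTF-8.
str = [87, 121, 197, 155, 108, 105, 106].pack("C*").force_encoding("UTF-8")

p str.valid_encoding?  # => true
p str.unpack("C*")     # => [87, 121, 197, 155, 108, 105, 106]  raw bytes
p str.unpack("U*")     # => [87, 121, 347, 108, 105, 106]       code points
p str.bytesize         # => 7
p str.length           # => 6

# Code point 347 in hexadecimal is the U+015B quoted above.
puts format("U+%04X", str.unpack("U*")[2])  # => U+015B
```

Seven bytes decoding cleanly to six code points is what well-formed UTF-8 with one two-byte letter looks like.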

This would suggest that the database has UTF-8 text, and it’s getting
into Ruby without corruption! Is this right?

So, the question now is why doesn’t Iconv convert my UTF-8 to Latin2
correctly… That could just be because the original text can’t be
converted due to additional characters outside of the Latin2 set.

I could probably give Iconv explicit mapping codes for how to handle
certain characters, that may do the trick… I’ll re-read your post and
see if I can find anything else.

Thanks for the help, feels like I’m a few steps forward now!

If you can spot any errors in the above a hint would be most welcome!

Tobin

Tobes · September 25, 2007, 10:30pm

Carlos wrote:

The way to tell FPDF to use another encoding (for the PHP version, I suppose
the Ruby one would be similar) is in the link I put in my previous message.

Hi Carlos,

You’re right, the problem was not in Iconv. I guess that is backed up
by the fact that Iconv didn’t barf during the translation, the PDF
output just looked wrong.
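That reasoning can be sketched with Ruby's built-in transcoding. This uses String#encode from modern Ruby as a stand-in for Iconv, and the helper name latin2_or_nil is hypothetical; the point is only that a character with no ISO-8859-2 equivalent makes the conversion fail loudly, while Polish text converts cleanly.

```ruby
# Hypothetical helper: convert to ISO-8859-2, or return nil when a
# character has no equivalent in that encoding.
def latin2_or_nil(text)
  text.encode("ISO-8859-2")
rescue Encoding::UndefinedConversionError
  nil
end

polish   = "Wy\u015Blij"   # the word from the test case
japanese = "\u65E5\u672C"  # characters outside ISO-8859-2

p latin2_or_nil(polish).unpack("C*")  # => [87, 121, 182, 108, 105, 106]
p latin2_or_nil(japanese)             # => nil
```

A conversion that completes without an error has mapped every character, so garbled output after that point is more likely a font or encoding setting in the renderer.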

I’ve read those instructions, compiled a few fonts with Latin2
encoding, and now I’m getting pretty good looking PDFs! Thank you so
much for your help on this one, solving the prob felt like peeling an
onion, but it’s cool that I’m now able to check things myself.

Many many thanks.

Tobin