PDF::Writer and Unicode

fxn · February 16, 2007, 12:07pm

According to the current manual, PDF documents generated by
PDF::Writer can use UTF-16BE, but after a few trials with iconv I
can’t get my UTF-8 strings right. Example:

$KCODE = 'u'

require 'rubygems'
require 'pdf/writer'
require 'iconv'

str = Iconv.iconv('UTF-16BE', 'UTF-8', 'á ß €')
pdf = PDF::Writer.new

# renders á and ß right, but not €
pdf.text str

# same output with garbage prepended
pdf.text "\xfe\xff#{str}"
pdf.save_as('unicode_test.pdf')

The manual does not document whether any encoding is needed for
select_font. I’ve played around with variations of

# gives complete garbage
pdf.select_font 'Times-Roman', :encoding => 'UTF-16BE'

without luck.

TextMate is generating UTF-8 source files for sure. Any ideas?

– fxn

fxn · February 16, 2007, 12:59pm

Xavier N. wrote:

The manual does not document whether any encoding is needed for select_font.
I’ve played around with variations of

# gives complete garbage
pdf.select_font 'Times-Roman', :encoding => 'UTF-16BE'

without luck.

I’m not familiar with PDF::Writer, but I would be surprised if you
really had all the glyphs for ‘UTF-16BE’ by default. What is the exact
output? Does it produce the PDF file, or does it simply fail with an
exception, or crash?

If a PDF file is produced (of reasonable size), would you mind posting
it?

Cheers,

Vince

fxn · February 16, 2007, 2:21pm

On Feb 16, 2007, at 12:59 PM, Vincent F. wrote:

I’m not familiar with PDF::Writer, but I would be surprised if you
really had all the glyphs for ‘UTF-16BE’ by default. What is the exact
output? Does it produce the PDF file, or does it simply fail with an
exception, or crash?

If a PDF file is produced (of reasonable size), would you mind
posting it?

Sure, it’s just 4KB. This is the PDF generated by

$KCODE = 'u'

require 'rubygems'
require 'pdf/writer'
require 'iconv'

str = Iconv.iconv('UTF-16BE', 'UTF-8', 'á ß €')
pdf = PDF::Writer.new
pdf.text str
pdf.text "\xfe\xff#{str}"
pdf.save_as('unicode_test.pdf')

As you see, the glyph we get wrong in this small test is the euro
symbol. This is important to me because not only is my database in
UTF-8, fed from an unrestricted UTF-8 frontend (website), but the
application has money here and there and needs to be able to output
that currency symbol.

– fxn

fxn · February 16, 2007, 2:50pm

Xavier N. wrote:

$KCODE = 'u'

As you see, the glyph we get wrong in this small test is the euro
symbol. This is important to me because not only is my database in UTF-8,
fed from an unrestricted UTF-8 frontend (website), but the
application has money here and there and needs to be able to output that
currency symbol.

Actually, what you see on the screen is the latin1 representation of
your UTF-16BE string (see below). ^@ means chr 0 and seems to be ignored
by the PDF viewers, and UTF-16BE has the good taste to map to latin1 for
values up to 255. Here is what less unicode_test.pdf gives me (I’m on a
latin1 locale):

BT 36.000 744.440 Td /F1 10.0 Tf 0 Tr (^@á^@ ^@ß^@ ¬) Tj ET
BT 36.000 732.880 Td /F1 10.0 Tf 0 Tr (þÿ^@á^@ ^@ß^@ ¬) Tj ET
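
The byte layout shown in that dump can be reproduced in a few lines. This is a minimal sketch that assumes Ruby 1.9 or later, where String#encode stands in for the iconv call used earlier in the thread; the variable names are only illustrative.

```ruby
# "á ß €" written with escapes so the source file's own encoding does not matter.
text = "\u00E1 \u00DF \u20AC"

# Same conversion as Iconv.iconv('UTF-16BE', 'UTF-8', ...), done with String#encode.
utf16 = text.encode("UTF-16BE")
hex = utf16.bytes.map { |b| format("%02X", b) }.join(" ")
puts hex

# Reinterpret those bytes as Latin-1, as a viewer expecting an 8-bit encoding
# would, and drop the NUL bytes that the viewers appear to ignore.
seen = utf16.dup.force_encoding("ISO-8859-1").encode("UTF-8").delete("\u0000")
puts seen
```

á (U+00E1) and ß (U+00DF) each become a zero byte followed by their Latin-1 byte, so they survive by accident. € is U+20AC, and its two bytes 0x20 and 0xAC read as a space and ¬ in Latin-1.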

Moreover, in this particular case, you are using the Helvetica
built-in font, and I’m pretty sure it doesn’t have a glyph for the Euro
symbol. Finally, acroread says that the encoding of the font is ‘ansi’.
That is definitely not what you want. Keep in mind that most
fonts (just about everywhere) are defined for a small encoding (ansi/latin1,
or other 8-bit encodings). Unfortunately, I don’t think I can help you
further. If you don’t rely too much on PDF::Writer yet, you could use
pdfLaTeX as an alternative, although the PDFs it produces will be significantly
bigger (for small files)…

Welcome to the nightmare world of fonts and encodings…

Vince

fxn · February 16, 2007, 4:40pm

On 2/16/07, Xavier N. wrote:

According to the current manual PDF documents generated by
PDF::Writer can use UTF-16BE, but after a few trials with iconv I
can’t get my UTF-8 strings right. Example:

The manual is incorrect; I have recently figured out how to write
UTF-16 strings, but the current PDF::Writer doesn’t do this (and there
are issues that I need to resolve before this will even show up in any
release of PDF::Writer).

-austin

fxn · February 16, 2007, 7:00pm

Vincent F. wrote:

Welcome to the nightmare world of fonts and encodings…

… and PDF generation in Ruby.

If this helps, you can see me struggling with the same
problem here:

http://groups.google.de/group/comp.lang.ruby/browse_thread/thread/54336c6a932903fe/f0bb48520dac2ba5

I ended up using libharu (http://libharu.sourceforge.net/)

It is cross-platform, FAST, and has Ruby bindings (it is a little bit
clumsy to use and the Ruby bindings are missing some functions, but
it is the best I could find)

example:

require "hpdf"

pdf = HPDFDoc.new
font = pdf.get_font("Helvetica", "CP1254")

page = pdf.add_page

page.set_size(HPDFDoc::HPDF_PAGE_SIZE_A4, HPDFDoc::HPDF_PAGE_PORTRAIT)
page.set_font_and_size(font, 96)

page.begin_text

page.move_text_pos(100, 700)
page.show_text("\x80")

page.end_text

pdf.save_to_file "c:/temp/test.pdf"

With a little love to the wrapper this could be really good…
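
The "\x80" in the example above is the euro sign because CP1254 (like Windows-1252) places it at byte 0x80. A minimal sketch for checking that from Ruby, assuming Ruby 1.9 or later and its built-in Windows-1254 transcoder; hpdf is not needed for this check, and the variable names are illustrative.

```ruby
# Encode the euro sign (U+20AC) into CP1254 and look at the resulting byte.
euro = "\u20AC"
euro_bytes = euro.encode("Windows-1254").bytes
puts euro_bytes.inspect

# Going the other way: the single byte 0x80, read as CP1254, is the euro sign.
decoded = "\x80".dup.force_encoding("Windows-1254").encode("UTF-8")
puts decoded
```

The same approach converts a whole UTF-8 string to the 8-bit encoding the font was loaded with before handing it to show_text, as long as every character exists in that code page.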

cheers

Simon

fxn · February 17, 2007, 11:01am

On Feb 16, 2007, at 2:49 PM, Vincent F. wrote:

Moreover, in this particular case, you are using the Helvetica
built-in font, and I’m pretty sure it doesn’t have a glyph for the Euro
symbol.

Austin explained the issue. But just to understand that remark,
is the Helvetica in the PDF different from the Helvetica I use
on my system? The Helvetica here on the Mac certainly has the euro
symbol.

– fxn

fxn · February 17, 2007, 11:49am

Xavier N. wrote:

On Feb 16, 2007, at 2:49 PM, Vincent F. wrote:

Moreover, in this particular case, you are using the Helvetica
built-in font, and I’m pretty sure it doesn’t have a glyph for the Euro
symbol.

Austin explained the issue. But just to understand that remark,
is the Helvetica in the PDF different from the Helvetica I use on my
system? The Helvetica here on the Mac certainly has the euro symbol.

Well… It is a long and complex story. A font is (for the PDF
document) just a correspondence (char) -> (nice drawing + metrics). What
we call Helvetica is in reality a fair number of different fonts, which
cover various symbols that have a Helvetica look & feel… Even if a
font is called Helvetica, you can’t be sure that it contains all the
glyphs you’re interested in. And I’m not even talking about more
delicate things like fonts with Chinese or Russian characters… I
didn’t mean to exaggerate when I wrote ‘nightmare’!

But, in this particular case, I was wrong ;-)… I checked the
PDF documentation, which specifies char codes for the Euro symbol. The
real problem was that the font encoding wasn’t the right one. I tweaked
the file manually until I got it to display. To illustrate the problems with
encodings and fonts: I spent a long time trying to get the char \240
displayed as Euro until I realised the encoding wasn’t quite the right
one and \240 meant ‘non-breaking space’! I attached the file just as
an example.
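
The \240 mix-up can be checked without editing a PDF by hand. A minimal sketch, assuming Ruby 1.9 or later and treating Windows-1252 as a stand-in for the font's ‘ansi’ encoding; that mapping is an assumption, since the thread only says the viewer reports ‘ansi’.

```ruby
# Octal 240 is byte 0xA0 and octal 200 is byte 0x80.
# Assumption: the font's 'ansi' encoding behaves like Windows-1252 for these bytes.
nbsp = "\240".dup.force_encoding("Windows-1252").encode("UTF-8")
euro = "\200".dup.force_encoding("Windows-1252").encode("UTF-8")

# Print the Unicode code point each byte decodes to.
puts format("\\240 -> U+%04X", nbsp.ord)
puts format("\\200 -> U+%04X", euro.ord)
```

Under that assumption \240 decodes to U+00A0, the non-breaking space, while the euro sign sits at \200.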

Cheers

Vince