Escaping non-ASCII chars for RTF export

Dan_Herrera · November 1, 2007, 11:51pm

Greetings,

I’m attempting to convert non-ASCII characters to Unicode escape
sequences for export to RTF, and I haven’t had much luck finding
good information by searching Google. Does anyone here have good
resources for this sort of thing?

Thanks!

dan

Dan_Herrera · November 2, 2007, 1:15am

http://ruby-rtf.rubyforge.org/

Ruby RTF library. Creates RTF documents… might be a good start.

Dan_Herrera · November 2, 2007, 4:18am

On Nov 1, 5:14 pm, [email protected] wrote:

http://ruby-rtf.rubyforge.org/

Ruby RTF library. Creates RTF documents… might be a good start.

Hi, thanks for taking a look at my problem.

I am currently using the Ruby RTF library to generate RTF files. The
trouble I’m running into is with strings like ‘gør’. When you add
that ø character, it doesn’t get converted to its Unicode counterpart
and the result is mangled when viewed.

Thanks again for your help,

dan

Dan_Herrera · November 2, 2007, 9:46am

Dan Herrera wrote:

On Nov 1, 5:14 pm, [email protected] wrote:

http://ruby-rtf.rubyforge.org/

Ruby RTF library. Creates RTF documents… might be a good start.

Hi, thanks for taking a look at my problem.

I am using the Ruby RTF library currently to generate RTF files. The
trouble I’m running into is with strings like ‘gør’. When you add
that ø character, it doesn’t get converted to its Unicode counterpart
and the result is mangled when viewed.

A Unicode code point has to be converted into a character language (called
an ‘encoding’) that your display device can understand before the character
can be displayed. Common character languages (or ‘encodings’) are ASCII
and UTF-8. It sounds like the string you are starting with is encoded
in a character language that your display device doesn’t understand.

Therefore, you need to figure out what character language your display
device does understand. UTF-8 is pretty common, so you can start off by
trying to convert your strings to the UTF-8 character language, and then
see if the strings display correctly. But to convert your strings
to UTF-8, you need to know the character language that the string is
currently written in. If you don’t know the current language, you can
start off by trying ISO-8859-15. The characters that make up the
ISO-8859-15 language are listed here:

To convert from ISO-8859-15 to UTF-8, you can do this:

str = "Hell\xf6 w\xf6rld" # \xf6 is 'o' with umlaut in ISO-8859-15
puts str

--output (which my display device shows me):--
Hell? w?rld # I see question marks instead of o's with umlauts

Therefore, my display device does not understand the ISO-8859-15
character language. Since I want my display device to display the o’s
with umlauts, I’ll try converting the string to the UTF-8 character
language:

require 'iconv' # 'Internationalization converter'?

converter = Iconv.new('UTF-8', 'ISO-8859-15') # arguments are (to, from)
new_str = converter.iconv(str)
puts new_str

--output:--
Hellö wörld # I see o's with umlauts
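
For readers on a newer Ruby: Iconv was removed from the standard library in Ruby 2.0 (it lives on as a separate gem), and strings now carry their own encoding. A rough equivalent of the conversion above, assuming the bytes really are ISO-8859-15, might look like this:

```ruby
# The same bytes as above. force_encoding only relabels them as
# ISO-8859-15; it does not change any bytes.
latin = "Hell\xf6 w\xf6rld".dup.force_encoding(Encoding::ISO_8859_15)

# encode does the actual transcoding, here to UTF-8.
utf8 = latin.encode(Encoding::UTF_8)

puts utf8          # Hellö wörld (on a UTF-8 terminal)
p utf8.bytesize    # 13, because each ö now takes two bytes
```

If the guess about the source encoding is wrong, the result will still be valid UTF-8 but may show the wrong characters, so it is worth checking a few known strings by eye.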

Dan_Herrera · November 2, 2007, 10:21pm

Dan Herrera wrote:

This is great information, it’s really helped me move in the right
direction.

Thanks!

There is one missing piece to the puzzle. This is what happens behind
the scenes when you convert from a string written in UTF-8 format to a
string written in ISO-8859-15 format:

UTF-8 encoded character
|
|
V
Unicode integer
|
|
V
ISO-8859-15 encoded character

If for some reason, you ever need to get the unicode integer, you can do
this:

str = "\xc3\xb6" # 'o' with umlaut encoded in UTF-8
arr = str.unpack('U') # 'U' gets the Unicode integer from a char encoded in UTF-8 only

p arr # [246] --> Unicode integer in decimal format

Since unicode integers are usually written in hex format, you can do the
following to get the unicode in hex format:

puts "%04x" % arr[0] # 00f6
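
Coming back to the original question, the Unicode integer is what an RTF Unicode escape is built from: RTF writes a non-ASCII character as \uN followed by a fallback character, where N is a signed 16-bit decimal number. Here is a minimal sketch of that idea. The helper name rtf_escape is made up for illustration and is not part of the Ruby RTF library; it assumes a Ruby with String#encode (1.9 or later) and the default \uc1 setting of one fallback character.

```ruby
# Hypothetical helper (not from the Ruby RTF library): escapes a string
# for use as RTF body text.
def rtf_escape(text)
  text.encode(Encoding::UTF_8).each_char.map do |ch|
    if ch == "\\" || ch == "{" || ch == "}"
      "\\#{ch}"                  # RTF syntax characters need a backslash
    elsif ch.ord < 128
      ch                         # plain ASCII passes through unchanged
    else
      # \uN takes a signed 16-bit decimal. Go through UTF-16 so code
      # points above U+FFFF become surrogate pairs, and wrap values
      # above 32767 into the negative range.
      ch.encode(Encoding::UTF_16BE).unpack("n*").map { |unit|
        unit -= 65_536 if unit > 32_767
        "\\u#{unit}?"            # "?" is shown by readers without \u support
      }.join
    end
  end.join
end

puts rtf_escape("gør")   # g\u248?r
```

This sketch does nothing special for newlines or other control characters, which a real exporter would also have to handle.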

Dan_Herrera · November 2, 2007, 7:23pm

On Nov 2, 1:46 am, [email protected] wrote:

str = "Hell\xf6 w\xf6rld" # \xf6 is 'o' with umlaut in ISO-8859-15
require 'iconv' # 'Internationalization converter'?

converter = Iconv.new('UTF-8', 'ISO-8859-15')
new_str = converter.iconv(str)
puts new_str

--output:--
Hellö wörld # I see o's with umlauts

Hi,

This is great information, it’s really helped me move in the right
direction. I haven’t done enough testing yet, but here is what has
seemed to work.

Using an Iconv solution, where str is the string to convert:

require 'iconv'
converter = Iconv.new('ISO-8859-15', 'UTF-8') # arguments are (to, from)
converted_str = converter.iconv(str)

So a little backwards from what we were thinking. Looks like swapping
UTF-8 and ISO-8859-15 did the trick since it appears that the string
was in UTF-8 to begin with.
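
On Ruby 1.9 and later the same UTF-8 to ISO-8859-15 direction can probably be written without Iconv. This is only a sketch; the undef and replace options are an assumption about how you would want to treat characters that ISO-8859-15 cannot represent.

```ruby
utf8 = "gør"   # source files default to UTF-8 on Ruby 2.0+

# Transcode to ISO-8859-15. Characters with no equivalent there become
# "?" instead of raising Encoding::UndefinedConversionError.
latin = utf8.encode(Encoding::ISO_8859_15, undef: :replace, replace: "?")

p latin.bytes   # [103, 248, 114]
```

Note that this only works for text that fits in ISO-8859-15; anything outside it is lost, which is where the \u escapes of RTF are the safer route.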

Thanks!

dan