Character encoding question

amishera · March 26, 2010, 6:27pm

I have an html file which is encoded in UTF-8. The file contains the
following text:

It&#39;s a wonderful life

Now, character code 39 is the apostrophe in UTF-8. So suppose I extract
the 39 from the text using:

s = "It&#39;s a wonderful life"

s.gsub(/&#(\d+);/, '\1')

The output is

It39s a wonderful life

So firstly I am having trouble making it

It\39s a wonderful life

Secondly I manually did this in test_utf8.rb:

puts "It\39s a wonderful life"

and ran it

ruby test_utf8.rb > utf8.txt

but when I open it in OpenOffice with the encoding set to UTF-8,
the output is

It#9s a wonderful life

So how do I correctly parse the text, convert HTML character
references to UTF-8 encoded characters, and then save the file?

Thanks.
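
For reference, here is a minimal sketch of the regex approach attempted above, assuming only decimal references of the form &#NNN; need handling. The helper name decode_numeric_refs is made up for illustration. It uses the block form of gsub with Array#pack("U*") to turn each code point into a UTF-8 character, and it also shows why "\39" does not do what was hoped: Ruby reads "\3" as an octal escape and leaves the "9" as a literal digit.

```ruby
# Hypothetical helper name; not from the original post.
def decode_numeric_refs(text)
  # The block form of gsub computes each replacement: $1 holds the
  # decimal digits, and pack("U*") encodes that code point as UTF-8.
  text.gsub(/&#(\d+);/) { [$1.to_i].pack("U*") }
end

s = "It&#39;s a wonderful life"
puts decode_numeric_refs(s)   # prints: It's a wonderful life

# "\39" in a double-quoted string is not code point 39. Ruby parses
# "\3" as an octal escape (byte value 3) followed by a literal "9".
p "It\39s".bytes.to_a         # prints: [73, 116, 3, 57, 115]
```

The byte with value 3 is a control character, which is probably why the editor shows a placeholder in its position followed by "9".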

amishera · March 26, 2010, 7:17pm

s = "It&#39;s a wonderful life"

I stumbled across this:

require 'cgi'
s = CGI.unescapeHTML("It&#39;s a wonderful life")
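
A rough sketch of the whole round trip, decoding and then saving, might look like the following. The file name utf8.txt is taken from earlier in the thread. The explicit "w:UTF-8" open mode is an assumption, added so the output encoding does not depend on the environment's defaults.

```ruby
require 'cgi'

# Decode the character reference into a real apostrophe.
s = CGI.unescapeHTML("It&#39;s a wonderful life")

# Write the result as UTF-8. The "w:UTF-8" mode is an assumption made
# for this sketch, so the encoding is explicit rather than inherited.
File.open("utf8.txt", "w:UTF-8") { |f| f.puts s }
```

This avoids the shell redirect and hand-written escapes entirely, since the string already contains the decoded character before it is written.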

David

amishera · March 26, 2010, 9:22pm

David