Encoding

dubstep · June 17, 2011, 8:55am

What’s a good solution for fixing character encoding problems between
ASCII and UTF-8? The database is Postgres and is encoded in UTF-8.

Once in a while a string submitted through a web form triggers a
compatibility error.

Is there a command to fix this besides using
a_string.force_encoding('utf-8')? Even that doesn’t always seem to
work.

Thanks,

Erica

Erica · June 18, 2011, 1:40am

Hi Erica,

I ran into a similar situation a while ago on a web service app where
I had to handle a lot of bad / untrusted non-UTF-8 data. I found a fix
that met the needs of the app using Iconv
(http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/index.html),
following a strategy outlined by Paul B.
(http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/):

…
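def AppUtil.force_utf8(str)
  ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
  # The padding space works around Iconv missing an invalid byte at the
  # very end of the input; [0..-2] strips the space again.
  return ic.iconv("#{str} ")[0..-2]
end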
…
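
On newer Rubies, where Iconv is no longer part of the standard library, the same idea (dropping bytes that are not valid UTF-8) can be sketched with String#scrub. This assumes Ruby 2.1 or later rather than the 1.9.2 used in this thread, and `force_utf8_scrub` is a hypothetical name, not something from the original app:

```ruby
# Hypothetical helper; assumes Ruby 2.1+ (String#scrub), not the Iconv approach above.
def force_utf8_scrub(str)
  # Copy the string and label its bytes as UTF-8 without changing them,
  # then drop any byte sequences that are not valid UTF-8.
  str.dup.force_encoding('UTF-8').scrub('')
end

cleaned = force_utf8_scrub("foo\xAE bar")
puts cleaned.valid_encoding?
```

Like //IGNORE, this silently discards the offending bytes, so it only suits data where losing those characters is acceptable.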

Jeff

Erica · June 21, 2011, 1:34am

Thanks for your response. I tried this on a string that was causing
the error and it didn’t work. The problem is with Microsoft Word
special characters, and I can’t find a way to replace them.
Here is one website I found that describes the special characters:
The Fruits of my Labour,
although it’s not about Rails.

Can anyone help me out?

Thanks,

Erica

Erica · June 21, 2011, 6:10pm

Hi Erica,

I personally haven’t had to deal with encoding issues yet, but I
remember reading a couple of posts from Yehuda K. (of Merb fame and a
core contributor to Rails) on the subject.
Maybe these can help you identify and fix your problem:

The articles are a little long, but if you already know a good deal
about encodings, you can skip towards the end of the posts, where he
writes about how to deal with conversions.

Erica · June 21, 2011, 2:12pm

You probably need to figure out the actual encoding and explicitly
convert from that to UTF-8. This is a snippet of code that I have in a
real project:

   open(DATAFEED_URI) do |file|
     local_filename = local_path
     local_filename.open('w') do |outf|
       file.each do |line|
         begin
           # Transcode each line from Windows-1252 to UTF-8.
           outf.write Iconv.conv('UTF-8//TRANSLIT//IGNORE', 'WINDOWS-1252', line)
         rescue Iconv::IllegalSequence => e
           shlogger.error { "#{DATAFEED_URI} line #{file.lineno} could not be translated:\n#{line}" }
         end
       end
     end
     local_filename.open('r') { |opened| yield opened }
   end

The part that you’re going to be interested in is the line that calls
Iconv and, in particular, the second argument, 'WINDOWS-1252', which
is likely the encoding of your data. There are also a couple of aliases
for that code page:

$ iconv -l | grep -e 1252
CP1252 MS-ANSI WINDOWS-1252

(iconv -l prints a list of all the encodings known by iconv.)
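
For comparison, here is a minimal sketch of the same conversion without Iconv, using Ruby's built-in transcoding. It assumes Ruby 2.0 or later (for String#b) and that the input really is Windows-1252; `cp1252_to_utf8` is a hypothetical name:

```ruby
# Hypothetical helper: reinterpret raw bytes as Windows-1252 and transcode to UTF-8.
def cp1252_to_utf8(bytes)
  bytes.dup
       .force_encoding('Windows-1252')
       # Bytes with no Windows-1252 mapping (e.g. 0x81) become '?'.
       .encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
end

# Word-style curly quotes, an en dash and an ellipsis as Windows-1252 bytes.
raw = "\x93quoted\x94 \x96 done\x85".b
utf8 = cp1252_to_utf8(raw)
puts utf8
```

The key point is the same as with Iconv: the source encoding has to be named explicitly, because the bytes alone do not say which code page produced them.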

I hope that helps.

-Rob

Rob B.
http://AgileConsultingLLC.com/
http://GaslightSoftware.com/

Erica · June 21, 2011, 7:41pm

Hey,

I'm using Rails on a Microsoft platform, so I can't rely on iconv.

I had a lot of problems with encoding, and I finally solved them with
the attached script.

 I hope it will help you!

–
Miquel C. Escarr

“Computers are good at following instructions, but not at reading your
mind.” Donald Knuth.


Erica · June 25, 2011, 12:54am

Thank you everyone for your responses. They helped me figure out
a solution. This seems to work for my problem:

s = s.gsub("\xe2\x80\x9c", '"')
s = s.gsub("\xe2\x80\x9d", '"')
s = s.gsub("\xe2\x80\x98", "'")
s = s.gsub("\xe2\x80\x99", "'")
s = s.gsub("\xe2\x80\x93", "-")
s = s.gsub("\xe2\x80\x94", "--")
s = s.gsub("\xe2\x80\xa6", "...")
s = Iconv.conv('UTF-8//IGNORE', 'UTF-8', s)
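
The substitutions above could also be driven by a lookup table and a single gsub pass. This is only a sketch: it assumes the input is already valid UTF-8, and `SMART_PUNCTUATION`, `SMART_PUNCTUATION_RE` and `asciify_punctuation` are hypothetical names. The code points are the same characters as the byte sequences above.

```ruby
# Hypothetical table mapping Word-style "smart" punctuation to ASCII stand-ins.
SMART_PUNCTUATION = {
  "\u201C" => '"',   # left double quote   (bytes E2 80 9C)
  "\u201D" => '"',   # right double quote  (bytes E2 80 9D)
  "\u2018" => "'",   # left single quote   (bytes E2 80 98)
  "\u2019" => "'",   # right single quote  (bytes E2 80 99)
  "\u2013" => '-',   # en dash             (bytes E2 80 93)
  "\u2014" => '--',  # em dash             (bytes E2 80 94)
  "\u2026" => '...'  # horizontal ellipsis (bytes E2 80 A6)
}.freeze

# One regexp that matches any key of the table.
SMART_PUNCTUATION_RE = Regexp.union(SMART_PUNCTUATION.keys)

def asciify_punctuation(str)
  # gsub with a hash looks up each match and substitutes its value.
  str.gsub(SMART_PUNCTUATION_RE, SMART_PUNCTUATION)
end

puts asciify_punctuation("\u201CIt\u2019s done\u2026\u201D")
```

Keeping the mapping in one place makes it easier to add further characters later without another gsub line for each.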

-Erica

Erica · June 21, 2011, 6:26pm

Maybe post an example of a string/char that’s causing the problem, as
it’s logged in your app’s log?

Here’s an example of a problem string/char that I was seeing in data
posted to my app:

$ ./script/rails console
…
ruby-1.9.2-p136 :001 > s = "foo\xAE bar"
=> "foo\xAE bar"

ruby-1.9.2-p136 :002 > s.is_utf8?
=> false

ruby-1.9.2-p136 :003 > s.valid_encoding?
=> false

ruby-1.9.2-p136 :004 > s.sub(/bar/, 'biz')
ArgumentError: invalid byte sequence in UTF-8
from (irb):4:in `sub'
…

ruby-1.9.2-p136 :005 > s2 = Iconv.new('UTF-8//IGNORE',
'UTF-8').iconv("#{s} ")[0..-2]
=> "foo bar"

ruby-1.9.2-p136 :006 > s2.gsub(/bar/, 'biz')
=> "foo biz"

And if that’s not doing the trick, then maybe try forcing the string
to UTF-8 first:

ruby-1.9.2-p136 :007 > s3 = Iconv.new('UTF-8//IGNORE',
'UTF-8').iconv("#{s.force_encoding('UTF-8')} ")[0..-2]
=> "foo bar"
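
The check-then-repair flow from this console session can be sketched as a small helper using only built-in String methods. `ensure_utf8` is a hypothetical name, and the fallback to Windows-1252 is an assumption about where the bad bytes come from, not something the session above establishes:

```ruby
# Hypothetical helper: keep valid UTF-8 untouched; otherwise assume the bytes
# are Windows-1252 (an assumption about the data source) and transcode them.
def ensure_utf8(str)
  candidate = str.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?

  str.dup.force_encoding('Windows-1252')
     .encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
end

# 0xAE is the registered-trademark sign in Windows-1252.
puts ensure_utf8("foo\xAE bar")
```

Unlike the //IGNORE approach, this keeps the character instead of dropping it, but only if the guess about the source encoding is right.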

Jeff