Forcing a string to valid UTF-8

phrogz · April 27, 2010, 12:45am

I have some legacy text data that’s gone through several databases and
web services in its life, playing promiscuously with dirty web
servers, browsers, and encodings.

It’s coming out of the source database as ASCII-8bit. I’m trying to
bring it all into UTF-8. I’ve found ways to coerce many of the bad
entries into compliance, but now I’ve hit one that is simply bad. I
want to just delete the minimum necessary to make it valid UTF-8. What
I’m trying isn’t working. Here’s my code:

if new_value.is_a? String
  begin
    utf8 = new_value.force_encoding('UTF-8')
    if utf8.valid_encoding?
      new_value = utf8
    else
      new_value.encode!( 'UTF-8', 'Windows-1252' )
    end
  rescue EncodingError => e
    puts "Bad encoding: #{old_table}.#{pk}:#{old_row[pk]} - #{new_value.inspect}"
    new_value.encode!( 'UTF-8', invalid: :replace, undef: :replace, replace: '' )
    p new_value.encoding unless new_value.valid_encoding?
  end
end

When I fall into the rescue clause, I’m getting out:
Bad encoding: bugs.id:2469 - "Indexing C:\\Ï€Ã©â”‚Ï€Ã¢Ã¶\xE3\x81E \x81E \x81EZCa_zu5.264"
#<Encoding:UTF-8>
The conversion resulted in an invalid UTF-8 string (that happens to be
the same as the original, as far as I can tell.) I’m surprised,
because I thought the purpose of invalid/undef replace was to clean
things up.

How do I force it into a valid UTF-8 encoding, losing as little data
as possible but happily throwing out the senseless bits?

phrogz · April 27, 2010, 12:19pm

Gavin K. wrote:

> How do I force it into a valid UTF-8 encoding, losing as little data
> as possible but happily throwing out the senseless bits?

AFAICS, the trouble with your rescue clause is that the string failed to
be encoded into Windows-1252, so it remains with its existing UTF-8 tag,
and so an attempt to “re-encode” as UTF-8 is silently ignored because
it’s already UTF-8, even though it contains invalid characters.

For example, this doesn’t do anything:

a = "abc\xffdef".force_encoding("UTF-8")
=> "abc\xFFdef"

b = a.encode("UTF-8", :invalid=>:replace, :replace=>"?")
=> "abc\xFFdef"

but this does:

b = a.encode("UTF-16BE", :invalid=>:replace, :replace=>"?").encode("UTF-8")
=> "abc?def"

Proviso: ruby 1.9 string handling is undocumented and subject to
continuous change. I tested the above with

RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"

so it may or may not work with your version, or with future versions of
Ruby.
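For anyone reading this on a later Ruby, here is a minimal sketch of the same clean-up, assuming Ruby 2.1 or newer, where String#scrub is available. It puts the UTF-16BE round trip from above next to scrub. Both helper names are made up for illustration, and both assume the input is already tagged as UTF-8.

```ruby
# Hypothetical helpers for illustration; neither name comes from the thread.

# Round trip through UTF-16BE, as in the example above.
# Assumes str is tagged UTF-8 (possibly with invalid bytes).
def replace_invalid_via_utf16(str, replacement = "?")
  str.encode("UTF-16BE", invalid: :replace, replace: replacement).encode("UTF-8")
end

# Assumes Ruby 2.1+, where String#scrub exists.
# Returns a copy with each invalid byte sequence replaced.
def replace_invalid_via_scrub(str, replacement = "?")
  str.scrub(replacement)
end

bad = "abc\xFFdef".dup.force_encoding("UTF-8")
p bad.valid_encoding?                  # false
p replace_invalid_via_utf16(bad)       # "abc?def"
p replace_invalid_via_scrub(bad)       # "abc?def"
p replace_invalid_via_scrub(bad, "")   # "abcdef"
```

Passing an empty replacement deletes the invalid bytes outright, which is closer to "throwing out the senseless bits" than substituting a marker.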

phrogz · April 27, 2010, 4:45pm

On Apr 27, 4:19 am, Brian C. wrote:

> Gavin K. wrote:
>
> > How do I force it into a valid UTF-8 encoding, losing as little data
> > as possible but happily throwing out the senseless bits?
>
> AFAICS, the trouble with your rescue clause is that the string failed to
> be encoded into Windows-1252, so it remains with its existing UTF-8 tag,
> and so an attempt to "re-encode" as UTF-8 is silently ignored because
> it's already UTF-8, even though it contains invalid characters.

Excellent point. Fixing that led me to a similar error earlier: I had
assumed that
s2 = s1.force_encoding(…)
left s1 intact. In fact, it modifies and returns s1. Thank you very
much, Brian.
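A small sketch of that gotcha, in case it helps someone else. The sample bytes and the helper name utf8_copy are illustrative only, and String#b assumes Ruby 2.0 or later.

```ruby
# Hypothetical helper: retag a copy, leaving the caller's string alone.
def utf8_copy(str)
  str.dup.force_encoding("UTF-8")
end

s1 = "caf\xC3\xA9".b                 # ASCII-8BIT copy of the bytes
s2 = s1.force_encoding("UTF-8")
p s2.equal?(s1)                      # true: same object, so s1 was retagged too
p s1.encoding == Encoding::UTF_8     # true

original = "caf\xC3\xA9".b
copy = utf8_copy(original)
p original.encoding == Encoding::ASCII_8BIT   # true: original keeps its tag
p copy.valid_encoding?                        # true
```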

For those that care or stumble upon this via Google, here’s a modified
version that works:

# Converting ASCII-8BIT to UTF-8 based on domain-specific guesses
if new_value.is_a? String
  begin
    # Try it as UTF-8 directly
    cleaned = new_value.dup.force_encoding('UTF-8')
    unless cleaned.valid_encoding?
      # Some of it might be old Windows code page
      cleaned = new_value.encode( 'UTF-8', 'Windows-1252' )
    end
    new_value = cleaned
  rescue EncodingError
    # Force it to UTF-8, replacing bytes that cannot be converted
    # (add replace: '' to delete them instead)
    new_value.encode!( 'UTF-8', invalid: :replace, undef: :replace )
  end
end
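The same logic could be sketched as a standalone method, which makes it easier to try on sample input. The name coerce_to_utf8 and the sample byte strings are assumptions for illustration, not part of the code above. Unlike the code above, this variant never modifies its argument and deletes unconvertible bytes in the last-resort branch instead of substituting a replacement character. String#b assumes Ruby 2.0 or later.

```ruby
# Hypothetical wrapper around the logic above; the method name is an assumption.
def coerce_to_utf8(value)
  return value unless value.is_a?(String)

  # Try it as UTF-8 directly (dup so the caller's string keeps its tag)
  cleaned = value.dup.force_encoding("UTF-8")
  return cleaned if cleaned.valid_encoding?

  begin
    # Some of it might be old Windows code page
    value.encode("UTF-8", "Windows-1252")
  rescue EncodingError
    # Last resort: treat the bytes as binary and drop every byte
    # that has no UTF-8 mapping (everything outside ASCII)
    value.dup.force_encoding("ASCII-8BIT")
         .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
  end
end

p coerce_to_utf8("caf\xC3\xA9".b)   # bytes are already valid UTF-8
p coerce_to_utf8("caf\xE9".b)       # 0xE9 is e-acute in Windows-1252
p coerce_to_utf8("abc\x81def".b)    # 0x81 is unassigned in Windows-1252
```

Note that the last-resort branch is blunt: once a string fails both guesses, every non-ASCII byte in it is dropped, not only the offending one.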

> Proviso: ruby 1.9 string handling is undocumented and subject to
> continuous change. I tested the above with

FWIW my new code works on ruby 1.9.1p243 (2009-07-16 revision 24175)
[i386-mingw32]

Thanks again!