Ruby 1.9.2: How to sanitize text with invalid characters?

I process a lot of text files whose encoding I know, but which might
contain a few invalid bytes (i.e., bytes that make gsub fail with
"ArgumentError: invalid byte sequence in US-ASCII/UTF-8"). What's the
best way to handle this situation gracefully, by ignoring or removing
the invalid characters?

Code / Data samples?

r_string = 'blah'.encode('UTF-8')
r_regex = /#{r_string}/
text = "wahlahblahblahwahbalablah".encode("UTF-8")
text.gsub!(r_regex, '')

That’s a horrible example. Still, if you have ASCII in one place, and
UTF-8 in another, it’s conceivable that the matcher may just throw up
its hands. Force the encoding and try again. If it doesn’t work,
please post more information (preferably with a Gist / pastie). If
that helps, please mention it so that Google can direct other poor
souls to this post.

Scott

Scott G. wrote in post #949026:

Code / Data samples?

Trivial example:
"#{0xFF.chr} abcde".force_encoding("utf-8").gsub(/a/,'')
ArgumentError: invalid byte sequence in UTF-8
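
The failure above can be detected before calling gsub. The following is a minimal sketch, not code from the thread: it assumes only String#force_encoding and String#valid_encoding?, and the variable names are illustrative. It shows that force_encoding merely relabels the bytes without repairing them.

```ruby
# Illustrative only: the variable names are not from the thread.
text = "\xFF abcde".dup.force_encoding("UTF-8")

# force_encoding changes the encoding label, not the bytes, so the
# stray 0xFF byte is still there and the string is still invalid.
puts text.valid_encoding?

# A regex-based gsub on such a string raises ArgumentError.
error = nil
begin
  text.gsub(/a/, "")
rescue ArgumentError => e
  error = e
end
puts "gsub raised: #{error.message}" if error
```

Checking valid_encoding? first makes it possible to clean only the strings that actually need it.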

Will this work?

blah1 = "#{0xFF.chr} abcde"
blah2 = blah1.split(/[^[:print:]]/).join

Using iconv to clean the string works:
Iconv.conv('utf-8//IGNORE','utf-8',"#{0xFF.chr} abcde")
=> " abcde"

However, it would be nicer if there was a way to do this with the
built-in encoding functions of Ruby 1.9.

Scott G. wrote in post #949256:

Will this work?

blah1 = "#{0xFF.chr} abcde"
blah2 = blah1.split(/[^[:print:]]/).join

Only if the desired encoding is ASCII.
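
To illustrate why, here is a small sketch (not from the thread; the sample strings are made up). On a binary (ASCII-8BIT) string, every byte above 0x7F counts as non-printable, so the split also throws away the bytes of perfectly valid multi-byte UTF-8 characters. On a valid UTF-8 string, the same regex leaves them alone.

```ruby
# Hypothetical sample data: "caf\xC3\xA9" is the UTF-8 byte sequence for "café".
binary = "caf\xC3\xA9 abcde".dup.force_encoding("ASCII-8BIT")

# In a binary string both bytes of the "é" are non-printable and get dropped.
stripped = binary.split(/[^[:print:]]/).join
puts stripped

# In a valid UTF-8 string [:print:] is Unicode-aware, so "é" survives.
utf8 = "caf\u00E9 abcde"
kept = utf8.split(/[^[:print:]]/).join
puts kept
```

Note that [:print:] also excludes control characters such as newlines and tabs, so this approach removes more than just invalid bytes.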


On 12.10.2010 01:16, Andreas S. wrote:

Using iconv to clean the string works:
Iconv.conv('utf-8//IGNORE','utf-8',"#{0xFF.chr} abcde")
=> " abcde"

However, it would be nicer if there was a way to do this with the
built-in encoding functions of Ruby 1.9.

String#encode can do this much more nicely:

$ irb
irb(main):001:0> RUBY_DESCRIPTION
=> "ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]"
irb(main):002:0> str = "#{0xFF.chr}"
=> "\xFF"
irb(main):003:0> str.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):004:0> str.encode("UTF-8")
Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8
        from (irb):4:in `encode'
        from (irb):4
        from /opt/rubies/ruby-1.9.2-p0/bin/irb:12:in
irb(main):005:0> str.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
=> "?"
irb(main):006:0>

In order to remove invalid chars completely, use an empty string instead
of "?".

Vale,
Marvin
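
One caveat with the irb session above: the string there is tagged ASCII-8BIT, so every byte above 0x7F is undefined for the conversion and gets replaced, including the bytes of valid UTF-8 characters. When the text is known to be UTF-8, a common workaround is to tag it as UTF-8 and round-trip it through UTF-16, so that only the genuinely invalid bytes are dropped. The sketch below is an assumption-laden illustration, not code from the thread: strip_invalid_utf8 is a hypothetical helper name, and the round trip is offered as one possible approach rather than the canonical one.

```ruby
# Hypothetical helper (name and approach are not from the thread):
# removes bytes that are invalid in UTF-8 while keeping valid
# multi-byte characters intact.
def strip_invalid_utf8(text)
  tagged = text.dup.force_encoding("UTF-8")
  return tagged if tagged.valid_encoding?

  # Converting to a different encoding forces a real transcoding pass,
  # during which invalid byte sequences are replaced by the empty string.
  tagged.encode("UTF-16BE", :invalid => :replace, :replace => "")
        .encode("UTF-8")
end

# Sample input: one stray 0xFF byte followed by valid UTF-8 ("abcdé").
dirty = "\xFF abcd\xC3\xA9"
clean = strip_invalid_utf8(dirty)

# gsub now works because the string is valid UTF-8.
puts clean.gsub(/a/, "")
```

On Ruby 2.1 and later, String#scrub does the same job directly, e.g. text.scrub("") to drop the invalid bytes.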