I get untrusted input from some of our partners that should be in
UTF-8 (or generally plain 7-bit ASCII), but isn’t always (in some
cases it appears to be multiple incompatible string encodings
concatenated together, truncated strangely and then joined, or perhaps
just noise). I’d like to convert the string into something that’s valid
UTF-8 so I can work with it, ideally keeping as much of the validly
encoded parts of the string as possible. I tried encode! but ran into
weirdness where it returns a string that claims to be valid but isn’t
(which seems like a bug).
test strings
1.9.3p0> str1 = "ceramic
rollers1\x82ры/Рейд-боссы/50—59F\xAA\xB3\xF3\xC7\xF9)-\xB0\xA1\xB3\xAA\xB4ټ\xF8.xls&tempFileName=1310611982277\xC1\xA6110ȸ
\xC7հ\xDD\xC0ڸ\xED\xB4\xDC(\xC1\xF6\xBF\xAA\xB3\xF3\xC7\xF9)-\xB0\xA1\xB3\xAA\xB4ټ\xF8.xls"
1.9.3p0> str2 = "hydroxide+caustic 田由\xE7\xBE"
encode!
1.9.3p0> a = str1.dup
1.9.3p0> a.valid_encoding?
=> false
1.9.3p0> a.encode!(Encoding::UTF_8, Encoding::UTF_8, :invalid=>:replace,
:undef=>:replace, :replace=>'')
=> "ceramic
rollers1\x82ры/Рейд-боссы/50—59F\xAA\xB3\xF3\xC7\xF9)-\xB0\xA1\xB3\xAA\xB4ټ\xF8.xls&tempFileName=1310611982277\xC1\xA6110ȸ
\xC7հ\xDD\xC0ڸ\xED\xB4\xDC(\xC1\xF6\xBF\xAA\xB3\xF3\xC7\xF9)-\xB0\xA1\xB3\xAA\xB4ټ\xF8.xls"
1.9.3p0> a.valid_encoding?
=> true
so far so good
1.9.3p0> a.squeeze(' ')
ArgumentError: invalid byte sequence in UTF-8
from (irb):10:in `squeeze'
from (irb):10
from /home/tgarnett/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in `<main>'
!!! Ruby just claimed the encoding was valid! Bug??
a.dup.squeeze(' ') and "#{a} ".squeeze(' ') both fail as well.
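One workaround that might avoid this, sketched under the assumption that this Ruby skips the conversion when the source and destination encodings are the same: round-trip through a different Unicode encoding so the transcoder has to inspect every byte. The helper name strip_invalid_utf8 and the choice of UTF-16LE as the intermediate encoding are illustrative only and are not part of the session above.

```ruby
# Hypothetical helper: drop byte sequences that are not valid UTF-8 by
# transcoding to UTF-16LE (discarding anything invalid) and back again.
def strip_invalid_utf8(str)
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  utf16 = utf8.encode(Encoding::UTF_16LE,
                      :invalid => :replace, :undef => :replace, :replace => '')
  utf16.encode(Encoding::UTF_8)
end

# Same bytes as str2: two valid characters followed by a truncated sequence.
sample = "hydroxide+caustic \xE7\x94\xB0\xE7\x94\xB1\xE7\xBE"
clean = strip_invalid_utf8(sample)

puts clean.valid_encoding?
puts clean.squeeze(' ')
```

The valid characters survive because only the byte sequences the transcoder rejects are replaced with the empty string.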
I also tried iconv with //IGNORE, but it returns invalid strings on
some inputs and crashes on others. I’ve had better luck with
unpack/pack, but I was wondering if anyone knew a better way to do
this.
iconv
1.9.3p0> require 'iconv'
1.9.3p0> a = str1.dup
1.9.3p0> a = Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv(a)
=> "ceramic
rollers1ры/Рейд-боссы/50—59F)-ټ.xls&tempFileName=1310611982277110ȸ
հڸ(\xF6\xBF\xAA\xB3)-ټ.xls"
1.9.3p0> a.valid_encoding?
=> false
no luck here either…
1.9.3p0> b = str2.dup
1.9.3p0> b = Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv(b)
Iconv::InvalidCharacter: "\xE7\xBE"
from (irb):22:in `iconv'
from (irb):22
from /home/tgarnett/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in `<main>'
ok, can crash too…
unpack, pack
1.9.3p0> a = str2.dup
1.9.3p0> a = a.unpack('C*').pack('U*')
=> "hydroxide+caustic ç\u0094°ç\u0094±ç¾"
1.9.3p0> a.valid_encoding?
=> true
1.9.3p0> a.squeeze(' ')
=> "hydroxide+caustic ç\u0094°ç\u0094±ç¾"
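Note that the unpack/pack result is valid only because unpack('C*') yields raw bytes and pack('U*') treats each byte as its own code point, which is why 田由 came back as ç\u0094°ç\u0094±. A sketch of an alternative that might keep the valid characters intact: walk the string character by character and keep only the characters that are themselves valid. The helper name keep_valid_chars is made up for illustration; String#scrub, available from Ruby 2.1 on, does much the same job, so the sketch uses it when present.

```ruby
# Hypothetical helper: keep only the characters that are valid UTF-8.
def keep_valid_chars(str)
  s = str.dup.force_encoding(Encoding::UTF_8)
  # String#scrub exists on Ruby 2.1+; older Rubies fall through to the filter.
  return s.scrub('') if s.respond_to?(:scrub)
  # each_char yields each invalid byte as its own one-byte "character".
  s.each_char.select { |c| c.valid_encoding? }.join
end

# Same bytes as str2: two valid characters followed by a truncated sequence.
sample = "hydroxide+caustic \xE7\x94\xB0\xE7\x94\xB1\xE7\xBE"
kept = keep_valid_chars(sample)

# For comparison: every byte of one valid character becomes a code point.
mojibake = "\u7530".unpack('C*').pack('U*')

puts kept.valid_encoding?
puts kept
puts mojibake.inspect
```

Unlike the byte-wise unpack/pack, this drops only the truncated \xE7\xBE at the end and leaves the multibyte characters alone.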