String#encode! makes valid_encoding? return true even if the resulting string is invalid (1.9.3)

I get some untrusted input from some of our partners that should be in
UTF-8 (or generally plain 7-bit ASCII), but isn’t always (in fact, in
some cases it appears to be multiple incompatible string encodings
concatenated together, truncated strangely and then joined, or perhaps
just noise). I’d like to convert the string into something that’s valid
UTF-8 so I can work with it, ideally keeping as many of the validly
encoded parts of the string as possible. I tried encode! but ran into
weirdness where it returns a string that claims to be valid but isn’t
(which seems like a bug).

test strings

1.9.3p0> str1 = "ceramic
rollers1\x82ры/Рейд-боссы/50—59F\xAA\xB3\xF3\xC7\xF9)-\xB0\xA1\xB3\xAA\xB4ټ\xF8.xls&tempFileName=1310611982277\xC1\xA6110ȸ
\xC7հ\xDD\xC0ڸ\xED\xB4\xDC(\xC1\xF6\xBF\xAA\xB3\xF3\xC7\xF9)-\xB0\xA1\xB3\xAA\xB4ټ\xF8.xls"
1.9.3p0> str2 = "hydroxide+caustic 田由\xE7\xBE"

encode!

1.9.3p0> a = str1.dup
1.9.3p0> a.valid_encoding?
=> false
1.9.3p0> a.encode!(Encoding::UTF_8, Encoding::UTF_8, :invalid=>:replace,
:undef=>:replace, :replace=>'')
=> "ceramic
rollers1\x82ры/Рейд-боссы/50—59F\xAA\xB3\xF3\xC7\xF9)-\xB0\xA1\xB3\xAA\xB4ټ\xF8.xls&tempFileName=1310611982277\xC1\xA6110ȸ
\xC7հ\xDD\xC0ڸ\xED\xB4\xDC(\xC1\xF6\xBF\xAA\xB3\xF3\xC7\xF9)-\xB0\xA1\xB3\xAA\xB4ټ\xF8.xls"
1.9.3p0> a.valid_encoding?
=> true

so far so good

1.9.3p0> a.squeeze(’ ')
ArgumentError: invalid byte sequence in UTF-8
from (irb):10:in squeeze' from (irb):10 from /home/tgarnett/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in

!!! ruby just claimed the encoding was valid! BUG??

a.dup.squeeze(' ') and "#{a} ".squeeze(' ') both fail as well
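
One way to double-check a string without relying on valid_encoding? is to decode its bytes directly. The following is only a rough sketch: it assumes that String#unpack('U*') raises ArgumentError on a malformed UTF-8 sequence, the helper name utf8_bytes_ok? is made up for illustration, and it has not been run against 1.9.3-p0. It may also be more lenient than valid_encoding? for some edge cases.

```ruby
# Hypothetical helper (not part of Ruby): decodes the raw bytes as UTF-8
# instead of asking the string whether it considers itself valid.
def utf8_bytes_ok?(str)
  str.unpack('U*') # raises ArgumentError on a malformed sequence
  true
rescue ArgumentError
  false
end

puts utf8_bytes_ok?("plain ascii")        # well-formed input
puts utf8_bytes_ok?("latin1 \xE9ncoding") # \xE9 is not followed by continuation bytes
```

Because the check walks the bytes itself, its answer does not depend on whatever encode! recorded about the string.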

I also tried iconv with //IGNORE, but it returns invalid strings on
some inputs and raises an exception on others. I’ve had better luck
with unpack/pack, but I was wondering if anyone knew a better way to do
this.

iconv

1.9.3p0> require 'iconv'
1.9.3p0> a = str1.dup
1.9.3p0> a = Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv(a)
=> "ceramic
rollers1ры/Рейд-боссы/50—59F)-ټ.xls&tempFileName=1310611982277110ȸ
հڸ(\xF6\xBF\xAA\xB3)-ټ.xls"
1.9.3p0> a.valid_encoding?
=> false

no luck here either…

1.9.3p0> b = str2.dup
1.9.3p0> b = Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv(b)
Iconv::InvalidCharacter: "\xE7\xBE"
from (irb):22:in `iconv'
from (irb):22
from /home/tgarnett/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in

ok, can crash too…
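
A possible workaround that avoids iconv is to transcode through a different encoding and back, so that the converter has a real conversion to perform and has to look at every byte. This is a sketch under assumptions: it relies on :invalid => :replace dropping malformed and truncated sequences when going from UTF-8 to UTF-16BE, the helper name strip_invalid_utf8 is hypothetical, and it has not been verified on 1.9.3-p0.

```ruby
# Hypothetical helper: removes bytes that are not valid UTF-8 by
# round-tripping the string through UTF-16BE.
def strip_invalid_utf8(str)
  str.encode('UTF-16BE', 'UTF-8',
             :invalid => :replace, :undef => :replace, :replace => '')
     .encode('UTF-8')
end

str2  = "hydroxide+caustic 田由\xE7\xBE"
clean = strip_invalid_utf8(str2)
puts clean.valid_encoding?
puts clean.squeeze(' ')
```

Unlike the byte-wise repacking shown later, this keeps the valid multibyte characters intact and only discards the bytes that cannot be decoded.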

unpack, pack

1.9.3p0> a = str2.dup
1.9.3p0> a = a.unpack('C*').pack('U*')
=> "hydroxide+caustic ç\u0094°ç\u0094±ç¾"
1.9.3p0> a.valid_encoding?
=> true
1.9.3p0> a.squeeze(' ')
=> "hydroxide+caustic ç\u0094°ç\u0094±ç¾"

some success, also works for str1
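
Note what this round trip actually does: unpack('C*') yields one integer per byte, and pack('U*') encodes each integer as its own code point. The result is always valid UTF-8, but every byte above 0x7F is reinterpreted as a Latin-1 character, which is why 田 (bytes E7 94 B0) shows up as ç\u0094° in the output above. A small sketch of that equivalence, with variable names chosen only for illustration:

```ruby
# The three bytes of the UTF-8 character U+7530, tagged as binary.
bytes = "\xE7\x94\xB0".dup.force_encoding('BINARY')

# One integer per byte, then one code point per integer.
repacked = bytes.unpack('C*').pack('U*')

# Same result as declaring the bytes to be ISO-8859-1 and transcoding.
latin1 = bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts repacked.valid_encoding?
puts repacked == latin1
puts repacked.length # three characters where the input had one
```

So the approach is safe to call on anything, at the cost of mangling the parts of the input that were valid UTF-8 to begin with.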

I’m observing the same problem. String#encode neither replaces invalid
characters when the :invalid => :replace option is used nor raises
Encoding::InvalidByteSequenceError when no option is set.

irb(main):023:0> f = File.open("test/fixtures/files/text.txt")
=> #<File:test/fixtures/files/text.txt>
irb(main):024:0> f.binmode
=> #<File:test/fixtures/files/text.txt>
irb(main):025:0> s = f.read
=> "This is a very simple sample text.\nIn latin1 \xE9ncoding \n"
irb(main):026:0> s.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):027:0> u = s.encode('UTF-8', 'UTF-8', :invalid => :replace,
:undef => :replace)
=> "This is a very simple sample text.\nIn latin1 \xE9ncoding \n"
irb(main):028:0> u.valid_encoding?
=> true
irb(main):029:0> u.squeeze(" ")
ArgumentError: invalid byte sequence in UTF-8
from (irb):29:in `squeeze'
from (irb):29
from /usr/local/ruby/1.9.3-p0/bin/irb:12:in
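
If the real encoding of the data is known, as with this latin1 fixture, naming it as the source encoding sidesteps the problem, because the converter then has a genuine transcoding to do. A minimal sketch, assuming the file really is ISO-8859-1; the literal below is a stand-in for the bytes read from test/fixtures/files/text.txt:

```ruby
# Stand-in for the bytes read from the file in binary mode.
s = "In latin1 \xE9ncoding \n".dup.force_encoding('ASCII-8BIT')

# Declare the actual source encoding instead of 'UTF-8'.
u = s.encode('UTF-8', 'ISO-8859-1')

puts u.valid_encoding?
puts u.squeeze(' ')
```

This only helps when the input has a single, known encoding; it does not address mixed or truncated data like the strings in the original post.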