On 9/26/10, Bhay Z. wrote:
in gsub.
content.gsub!("\237", '')
I just cannot figure how to fix this problem and any help would be
greatly appreciated.
In 1.9, every string (and regular expression) has an encoding attached
to it. If there are any byte sequences in your string that don’t match
the encoding, it causes errors. 1.8 was much more permissive about its
strings, allowing arbitrary binary data in any string, which is why it
worked better for you. You can get back the 1.8 behavior under 1.9 by
setting the encoding of your string objects to ‘binary’.
My first suggestion would be to set the encoding of the string in the
variable content to binary before doing any of the gsub!s:
content.force_encoding('binary')
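Here is a minimal sketch of the difference, in case it helps. The sample bytes are made up to stand in for your tool's output, and tagging them as UTF-8 is only my assumption about what your environment picks by default:

```ruby
# Made-up sample: "a", byte 0x9D ("\237" in octal), "b".
content = [0x61, 0x9D, 0x62].pack('C*').force_encoding('UTF-8')
valid_before = content.valid_encoding?   # 0x9D on its own is not valid UTF-8

# Matching a regexp against a string whose bytes do not fit its
# encoding raises under 1.9 and later.
error_class = begin
  content.gsub(/b/, 'c')
  nil
rescue ArgumentError => e
  e.class
end

# Retag the same bytes as binary. force_encoding changes only the
# label on the string, not the bytes themselves.
content.force_encoding('binary')
content.gsub!(/\x9D/n, '')   # /n makes the regexp binary as well
```

Note that force_encoding modifies the string in place, so it has to come before the first gsub! and only needs to be called once per string.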
However, a better way would be to set the encoding of the IO object
the strings are read from. That way you don’t need to force_encoding
each string as it comes in.
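A rough sketch of what that could look like, assuming the data comes from a file. The temp file here is just a hypothetical stand-in for your external tool's output; if you read from a pipe instead, IO.popen accepts the same kind of mode string:

```ruby
require 'tempfile'

# Hypothetical stand-in for the tool's output: "hi", byte 0x9D, newline.
tmp = Tempfile.new('tool_output')
tmp.binmode
tmp.write([0x68, 0x69, 0x9D, 0x0A].pack('C*'))
tmp.close

# Opening with 'rb' tags every string read from this IO as binary
# (ASCII-8BIT), so no per-string force_encoding is needed.
lines = File.open(tmp.path, 'rb') { |io| io.readlines }
tmp.unlink

first_encoding = lines.first.encoding
cleaned = lines.map { |line| line.gsub(/[\x80-\xFF]/n, '') }
```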
Even better is to figure out what encoding this external tool is
using and set the IO’s encoding to that. Then perhaps a lot of this
hacky string mangling could go away.
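For illustration only, suppose the tool turned out to emit Windows-1252. That is purely a guess on my part (bytes like 0x93 and 0x94 are curly quotes there); you would need to confirm it against your actual data. The temp file is again a made-up stand-in:

```ruby
require 'tempfile'

# Hypothetical sample: 0x93 and 0x94 are curly double quotes in
# Windows-1252 (assumed encoding, not confirmed for your tool).
tmp = Tempfile.new('tool_output')
tmp.binmode
tmp.write([0x93, 0x68, 0x69, 0x94, 0x0A].pack('C*'))
tmp.close

# "r:EXTERNAL:INTERNAL" asks ruby to read Windows-1252 bytes and
# hand back properly transcoded UTF-8 strings.
text = File.open(tmp.path, 'r:Windows-1252:UTF-8') { |io| io.read }
tmp.unlink

# The replacement can now name real characters instead of raw bytes.
text = text.gsub(/[\u201C\u201D]/, '"')
```

With the real encoding in place, the substitutions work on characters rather than on individual bytes, which is usually easier to read and maintain.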
But this is still only half the story. You also have to consider the
encoding of the strings and regexps which get passed as the first
argument to gsub. Those string (and regexp) literals default to the
same encoding as the source file they’re contained in. If no explicit
encoding is declared for a specific source file, ruby guesses an
encoding based on your environment (using the LANG env var and some
others that I can’t remember right now). Often, this means ruby
assumes your sources are UTF-8 encoded.
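A small sketch of why the pattern's encoding matters. The sample bytes are made up, and I build the patterns with explicit escapes so the result does not depend on the source file's encoding:

```ruby
# Made-up binary data: "a", byte 0x9D, "b" (pack returns ASCII-8BIT).
content = [0x61, 0x9D, 0x62].pack('C*')

utf8_pattern   = /\u201C/   # \u escapes force a UTF-8 regexp
binary_pattern = /\x9D/n    # /n forces a binary regexp

# A UTF-8 regexp cannot be matched against binary data that
# contains non-ASCII bytes.
mismatch = begin
  content.gsub(utf8_pattern, '')
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

# A binary regexp against binary data works fine.
cleaned = content.gsub(binary_pattern, '')
```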
You can declare a specific encoding explicitly by putting something
like this as the very first line in your source:
#encoding: binary
(or the second line if the first line is a shebang line).
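To see the magic comment take effect, one option is to write a tiny script to a temp file and load it. The file name and the two global variables are just scaffolding I made up for this demo:

```ruby
require 'tempfile'

# Write a throwaway script whose first line declares a binary source.
script = Tempfile.new(['literal_demo', '.rb'])
script.write(<<'RUBY')
# encoding: binary
$literal_encoding = "\237".encoding
$source_encoding  = __ENCODING__
RUBY
script.close

load script.path
script.unlink

# Both globals now report the encoding declared in the magic comment.
puts $source_encoding
puts $literal_encoding
```

__ENCODING__ reports the source encoding of the file it appears in, so it is a handy way to check what ruby decided for any given source file.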
I used the binary encoding in the example line above because that’s
probably the one which will work best for you under the circumstances.
Declaring the source encoding to be binary is a bit hackish, but
probably the easiest way to get you where you want to go. If you
figure out what encoding your data is in, you’re probably better off
declaring the source encoding to be the same thing, but there may be
more work involved there.
PS: there is some redundancy in the sequence of gsub!s you posted. The
first 10 (for "\221" thru "\227") are special cases of the 14th (for
/[\x80-\xFF]/) and can safely be deleted. Also, "\FB01" is the same
thing as "FB01" in both ruby 1.8 and 1.9 and probably not what you
wanted. (Maybe "\xFB\x01" is what you actually meant?)
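If you want to check both claims yourself, something along these lines should do it. The variable names are mine, and I only test the single bytes \221 thru \227 mentioned above:

```ruby
catch_all = /[\x80-\xFF]/n

# Each single byte from octal 221 to 227, as a binary string.
specials = (0o221..0o227).map { |byte| [byte].pack('C') }

# Every one of them is already matched by the catch-all class.
covered = specials.all? { |s| s =~ catch_all }

# "\F" is not a recognized escape, so the backslash is simply dropped.
same = ("\FB01" == "FB01")

# The \x form really does produce the two bytes 0xFB and 0x01.
bytes = "\xFB\x01".bytes.to_a
```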
HTH