Invalid byte sequence utf-8 OR best option to sanitize content brought in with net::http?

hi all,

platform: debian lenny, ruby1.9.1p0, passenger/apache-multithread,
rails2.3 in vendor/postgres and sql server via odbc. all current gems.

i have legacy asp content on win2k servers that i wrap in rails
controllers. this all worked great with ruby1.8, but now that we are
dealing with encoded strings in ruby1.9, i am having page crashes
randomly as users have cut and pasted high ascii code characters (e.g.
ascii 150 - a fancy dash) that are ms only and non-standard.

normally, i wouldn’t have cared or even worried about it that
much; however, after a few mysterious rails page crashes, i did more
experimenting. i found that if i put the following in my asp page, it
will cause the rails page to fail with “invalid byte sequence in utf-8”
at ror/vendor/rails/activesupport/lib/active_support/core_ext/blank.rb:50

the offending asp code is:

<%= chr(150) %>
this is my own doing to reproduce the issue, but there are many non-
standard windows characters that are not utf-8 compliant that probably
riddle my sql server database because users like to cut and paste
content from word and other places.

it turns out that because the content that i bring in via ruby
net::http has non-utf8 characters, the encoding is set to ascii-8bit,
and when i do force_encoding("utf-8"), valid_encoding? is false and the
page just fails. html::sanitize isn’t an option as i don’t want to
strip the tags. the content is from internal trusted servers that i
control. i just need to sanitize, i guess, the bad characters.
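
For anyone who wants to see the failure without an asp server, here is a minimal sketch in plain ruby. It assumes the body arrives as raw bytes containing byte 150 (0x96); the sample string and the variable names are made up for illustration, and the note about blank? is my reading of activesupport rather than something stated in this thread.

```ruby
# A body as it might arrive from the socket: raw bytes, tagged as binary.
# Byte 150 (0x96) is the "fancy dash" in Windows-1252, but on its own it is
# not a valid UTF-8 sequence.
raw = "before \x96 after".force_encoding(Encoding::BINARY)

# Relabeling the bytes as UTF-8 converts nothing; it only changes the tag,
# so the string is now marked UTF-8 without being valid UTF-8.
relabeled = raw.dup.force_encoding(Encoding::UTF_8)
puts relabeled.valid_encoding?

# A regexp match is the kind of operation that trips over such a string
# (as far as I know, String#blank? in activesupport does a regexp match).
# The error is rescued here so the message can be inspected.
error_message = begin
  relabeled.match(/\S/)
  nil
rescue ArgumentError => e
  e.message
end
puts error_message.inspect
```

The point of the sketch is that force_encoding never repairs anything: the bytes stay the same and only the label changes, so the bad byte has to be converted or replaced in a separate step.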

my thoughts/questions:

  1. seems like rails should be less brittle about managing encoding
    such that blank? doesn’t just fail when the valid_encoding is false.
    or you shouldn’t be able to create a string if the encoding is bad. or
    it should make best efforts to transliterate the bad characters.
    something.

  2. is iconv my best option? it seems kind of nuts that i have to reencode
    the entire html page for one character. this does work: using the
    translit//ignore options i get my pages, but i wonder at the
    overhead.

  3. as usual, trying to make my ms iis5 servers do anything useful is a
    non-starter. sure it says it can generate utf-8, but trying it the
    (typically confused and poorly documented) 25 different ways to make
    it do so, results in nothing but more wasted time. so i need a good
    rails solution that “just works.”

  4. it occurs to me that it could also be that ruby is setting the
    default to ascii for net::http regardless of how iis is sending it.
    how do i check/set the encoding.default_external in rails? why does
    rails remove the Encoding class? it isn’t there in console, but is in
    irb. i dislike rails removing native ruby classes.

please. i am so close to having ruby1.9/rails2.3 working, but this
encoding stuff is really a hassle.

  1. 1.9 is the wild wild west unfortunately, even more so in all this
    encoding mess, so as a developer it is currently your responsibility
    to transcode any external data to UTF-8 (or your encoding of choice).
    I have sent a GSoC proposal to resolve these problems and let Rails
    handle them for you so that it will, well, “just work”.

  2. You can use the String#encode method supplied in ruby 1.9,
    which converts between the encodings ruby supports.
    It has options to ignore invalid characters or to replace them with
    a placeholder value:

# encoding: utf-8

pi = "pi = π"
puts pi.encode("iso-8859-1", undef: :replace, replace: "??")
# prints: pi = ??
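
Applied to the byte-150 case from the original post, the same method can be pointed at the legacy bytes. This is only a minimal sketch: it assumes the asp pages really emit Windows-1252 (nothing in this thread confirms that), and `cp1252_to_utf8` is a hypothetical helper name, not part of ruby or rails.

```ruby
# Hypothetical helper: treat the incoming bytes as Windows-1252 (an
# assumption about the source) and transcode them to UTF-8. Anything that
# cannot be converted is replaced with "?" instead of raising.
def cp1252_to_utf8(bytes)
  bytes.dup
       .force_encoding("Windows-1252")
       .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

raw = "a \x96 b".force_encoding(Encoding::BINARY)
clean = cp1252_to_utf8(raw)

puts clean
puts clean.encoding
puts clean.valid_encoding?
```

Unlike a bare force_encoding("utf-8"), this actually converts the byte: 0x96 in Windows-1252 maps to the en dash (U+2013), so the character survives instead of being stripped.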

  1. What you really want is to set the internal_encoding.
    If you have set the internal_encoding of your program,
    all IO is transcoded from its external_encoding to your
    internal_encoding transparently.
    I recommend you read this blog post:

http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings
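
To make the external/internal idea concrete, here is a small sketch that spells the two encodings out per file rather than relying on the process-wide defaults. The temp file name and its contents are made up for illustration; the defaults printed at the top depend on how ruby was started, so no particular output is claimed for them.

```ruby
require "tempfile"

# The process-wide defaults. default_internal is nil unless something
# (for example `ruby -E ext:int`) has set it.
puts Encoding.default_external.inspect
puts Encoding.default_internal.inspect

text = nil
Tempfile.create("latin1-sample") do |file|
  file.binmode
  # "café" as ISO-8859-1 bytes: the é is the single byte 0xE9.
  file.write("caf\xE9".force_encoding(Encoding::BINARY))
  file.flush

  # "r:EXTERNAL:INTERNAL" asks the IO layer to transcode while reading.
  text = File.open(file.path, "r:ISO-8859-1:UTF-8") { |io| io.read }
end

puts text.encoding
puts text
```

This transcoding happens in the buffered IO layer (read, gets and friends), and it can only work when the external encoding is known and correct for the data.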

Rails doesn’t remove the Encoding class; it is available in the console.
I think your console is for some reason using ruby 1.8.

thanks hector,

i think you are right about the console. i tried the non-compat change
to case statements in ruby 1.9 with colons and console seemed fine
with that. so i guess somehow, even though i change script/console to
#!/usr/local/ruby1.9/bin/ruby or even comment out the shebang,
rename it script/console.rb and run it with my /usr/local ruby, i
still get 1.8.

is there a way to set the ruby that console runs? this is one of those
things that i think is pretty convoluted in rails. we should just have
an external config file and set these things. all the calculated paths
and other “convention” stuff works most of the time, but sometimes it
just creates confusion. imho

regardless, any ideas about how to set the ruby version for console?
…gg

ok. found this patch to …/railties/lib/commands/console.rb

https://rails.lighthouseapp.com/attachments/93770/script-console-invoke-used-rubys-irb.diff

as script/console is just a wrapper for console.rb, that is the place
to intervene. stock rails just ends up calling your default irb
without bothering to see what version of ruby you are running.

this fixes my immediate problem so thanks. am going to grep
RUBY_PLATFORM to see if that can just be set somewhere in rails as
that seems to be referenced before searching for system location of
irb.

…gg

hector,

further update:

i was able to set both my internal and external encoding thanks to
hongli lai at phusion passenger. he helped me with a wrapper for my
local ruby that uses the encoding option. not suggesting that this is
his preferred method though, but you don’t seem to be able to pass
ruby options any other way that i’m aware of in passenger’s apache
config.

/usr/local/ruby1.9/bin/ruby_wrapper:
#!/bin/bash
exec /usr/local/ruby1.9/bin/ruby -E utf-8:utf-8 "$@"

then in apache2.conf:
PassengerRuby /usr/local/ruby1.9/bin/ruby_wrapper

restart apache.

in a controller:
raise "#{Encoding.default_external} #{Encoding.default_internal}"

results in:
utf-8 utf-8

so all is good. for my app anyway. irb and script/console is a pain.

unfortunately, after all this, my asp pages still come in ascii-8bit
encoded when fetched by net::http (even after adding all the asp
settings i can to convince iis to use utf-8). also, more unfortunately,
your assertion that with the default encodings set right (particularly
default_internal, which i have set now) ruby will silently and
faultlessly convert my page did not hold. no joy: i got the same utf-8
encoding error that i started with.

so…guess i am back to doing explicit encoding like you suggested or
going back to iconv.

all in all i have to say that ruby1.9 and rails2.3 and encoding and
irb and compiling your own ruby and… are still very rough.

…gg

2009/4/13 buddycat [email protected]

Do you have a test case that I can reproduce the issue that you’re
seeing?

Thanks,

-Conrad

so i use lib/asp.rb module to get legacy asp content from internal
win2k/iis5/asp (classic not .net) servers as a mixin and require it in
my application_controller.rb as i have many asp pages. i do it this
way because it gives me a smooth incremental upgrade path to rails
from asp by replacing page for page as we write a better rails
replacement. this way my routes are all rails and i just call
asp_get_content when i have an asp page to wrap.

controller:
def my_legacy_page
  asp_get_content
end

lib/asp.rb:
require 'net/http'

module Asp
  # host, port, method, path, data and headers are assumed to be defined
  # elsewhere; this post does not show them.
  def asp_get_content
    @asp_response = Net::HTTP.start(host, port) { |http|
      http.read_timeout = 1200
      http.send_request(method, path, data, headers)
    }

    # return false on redirects so we can use custom renders like so:
    # render :foo => :bar if asp_get_content while still allowing just
    # asp_get_content without anything else for standard stuff
    case @asp_response
    when Net::HTTPRedirection
      redirect_to @asp_response['location']
      false
    else
      true
    end
  end
end

view:
<%= @asp_response.body %>

to reproduce the issue, just add

<%= chr(150) %> to the asp page. rails will choke with invalid byte
sequence utf-8 as soon as the response.rb tries to parse
@asp_response.body. see the above comments for the stack trace.

this is just my particular situation. i suspect you can add any high,
non-standard ascii code that windows likes, such as ascii 128-159. my
test case is ascii 150, which will reliably reproduce the issue. my point
is not about encoding per se; i just think that rails should be a bit more
fault tolerant around encodings, as interop makes it almost a certainty
that we will pull in content with bad encodings just as we pull in
malformed html. we cope with the latter well but now need to do so
with the former. imho.

thanks…gg

Sorry for the late response.
I took a dive into the Net::HTTP code and I have some bad news.

It uses a BufferedIO over the connection's socket, and when it
reads from the socket it uses IO#sysread, which is the lowest-level
read you can use in ruby. This method always returns an ASCII-8BIT
string, so you have to transcode or force_encoding the responses from
Net::HTTP explicitly.

Hector
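
The sysread behaviour and the explicit fix can be tried without a network connection. In this sketch a pipe stands in for the socket, `encoding_from_content_type` is a hypothetical helper (not part of Net::HTTP), and the Windows-1252 fallback is an assumption about the legacy servers rather than anything established in this thread.

```ruby
# Hypothetical helper: pick an Encoding from a Content-Type header value,
# falling back to Windows-1252 when no usable charset is present.
def encoding_from_content_type(content_type, fallback = "Windows-1252")
  name = content_type.to_s[/charset=["']?([^;"'\s]+)/i, 1]
  Encoding.find(name || fallback)
rescue ArgumentError
  # Unknown charset name in the header: use the fallback instead.
  Encoding.find(fallback)
end

# Stand-in for the socket: a pipe carrying the bytes "x", " ", 0x96, " ", "y".
reader, writer = IO.pipe
writer.binmode
writer.write("x \x96 y".force_encoding(Encoding::BINARY))
writer.close

chunk = reader.sysread(16)
reader.close
chunk_encoding = chunk.encoding  # sysread hands back a binary-tagged string

# Tag the bytes with the encoding the server claims, then transcode.
source = encoding_from_content_type("text/html; charset=windows-1252")
text = chunk.force_encoding(source)
            .encode("UTF-8", invalid: :replace, undef: :replace)

puts chunk_encoding
puts text
puts text.valid_encoding?
```

In the asp wrapper from earlier in the thread, the same two steps would be applied to the response body before it reaches the view, using the Content-Type header of the response if the server sends a charset.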

also, my particular case is with asp content; but i am sure that the
problem can be reproduced with any web stack or even a static text
file with these characters.