Strip & Sanitize BEFORE saving data

So I’ve googled my brains out, and I see a lot of talk about
TextHelper for views, but next to no discussion about cleaning text
before it is saved.

I figured this had to be asked 4 zillion times, but I’m not finding
anything concrete/obvious.

Using h is fine as a safety catch, but that alone is not acceptable
to me as the means of defusing the impact of HTML or JS in text
data. It needs to be removed, or tested for in validations and
rejected (and of course the check should be wise to all the
entity/unicode/null and other obfuscation tricks).

I did see references to some tools like whitelist, but again, they’re all
focused on the display side of things.

Obviously I can use validates_format_of for the vast majority of
small fields that can be restricted to known characters which would
preclude an HTML/XSS injection. But there are plenty of larger, free-
text fields where that’s not practical.

I’m surprised there’s no basic validation for this (that I can see),
so I’m hoping that’s because there’s a common technique which
combines some other tool with validations to do this?

What I’m thinking of is something like strip_tags except that it is
usable on the model side of things.

– gw (www.railsdev.ws)

On Nov 28, 2007 7:51 PM, Greg W. wrote:

So I’ve googled my brains out, and I see a lot of talk about
TextHelper for views, but next to no discussion about cleaning text
before it is saved.

You can make an ActiveRecord do exactly that. The open nature of Ruby
makes it simple:

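app/models/cleaner.rb:

module Cleaner
  # Runs when the module is included; registers a before_save callback on the base class.
  def self.append_features( base )
    base.before_save do |model|
      # Strip anything that looks like a tag from the 'html' column, if the model has one.
      if model.respond_to?( :html ) && model.html
        model.html = model.html.gsub( /<(.|\n)*?>/, '' )
      end
    end
  end
end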

add to config/environment.rb:

require "#{RAILS_ROOT}/app/models/cleaner"
class ActiveRecord::Base
  include Cleaner
end

Now any columns named ‘html’ in any of your models will get cleaned of
any html tags.


Greg D.
http://destiney.com/

On Nov 28, 2007 9:19 PM, Greg D. wrote:

I use this same design pattern to make my blog titles into ‘path’
fields I use in my blog urls:

http://destiney.com/blog/clean-your-ruby-rails-activerecords-automatically

A route is the real magic:

map.connect "blog/:id",
  :controller => 'blog',
  :action => 'entry',
  :requirements => { :id => /[\w-]+/ },
  :id => nil
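
For readers wondering how a title might become the 'path' segment in that URL, here is a minimal sketch in plain Ruby. The method name title_to_path and the exact substitution rules are assumptions made for illustration, not Greg D.'s actual code; in the pattern he describes, similar logic would presumably run inside the before_save block.

```ruby
# Hypothetical helper: turn a free-text title into a URL-safe path segment.
# The rules (lowercase, other characters collapsed to '-') are an assumption,
# not taken from the original blog code.
def title_to_path(title)
  path = title.to_s.downcase
  path = path.gsub(/[^a-z0-9]+/, '-')  # collapse runs of other characters into one hyphen
  path.gsub(/\A-+|-+\z/, '')           # trim leading and trailing hyphens
end

path = title_to_path("Clean your Ruby Rails ActiveRecords, automatically!")
puts path

# Check the result against the same character class the route requires.
puts(path =~ /\A[\w-]+\z/ ? "routable" : "not routable")
```

Keeping the generated path inside the character class used by the route's :requirements is the point of the design: anything the callback produces can be matched by blog/:id.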


Greg D.
http://destiney.com/

Greg:

  1. I am moved to ask a meta-question: How did you come upon this
    solution?

    • Reading Rails docs?
    • Reading Rails source code?
    • Regular Ruby reflection programming?

    Ah, I see, regular Ruby reflection. I went looking for the
    Cleaner module in the docs and source.

  2. A future version of rails could use the name ‘Cleaner’ and create
    a naming conflict, but that sounds manageable.

  3. Also, the Pickaxe Ruby book deprecates ‘append_features’ in favor
    of ‘Module#included’.

  4. Why would I name a column ‘html’ if I wanted to remove all html
    tags? I’m more likely to use this on a column named ‘description’,
    ‘name’ or even ‘email address’. I guess it’s just an example.

  5. Anyway, I like the code.

fredistic

Rick O. wrote:

The common technique was to mix the helpers directly in to your model
and call them directly. However, I recently just refactored a lot of
that code into the html tokenizer library. You can now access the
classes directly as HTML::Sanitizer, HTML::LinkSanitizer, and
HTML::WhiteListSanitizer.

@Rick – where are these libraries? Don’t see them as part of Rails 1.2
or in your list of plugins on technoweenie.

– gw

On Nov 29, 2007 12:57 AM, fredistic wrote:

  1. I am moved to ask a meta-question: How did you come upon this
    solution?

Honestly I don’t recall. If I had to guess I’d say from one of my Ruby
books:

http://static.destiney.com/ror_vs_c_asm.jpg

  2. A future version of rails could use the name ‘Cleaner’ and create
    a naming conflict, but that sounds manageable.

You can always name your version Cleanerfoobar87, I’m pretty sure that
won’t cause any future problems.

  3. Also, the Pickaxe Ruby book deprecates ‘append_features’ in favor
    of ‘Module#included’.

I wouldn’t be surprised a bit.

  4. Why would I name a column ‘html’ if I wanted to remove all html
    tags? I’m more likely to use this on a column named ‘description’,
    ‘name’ or even ‘email address’. I guess it’s just an example.

Obviously you’re free to rename the variable to something else. Like
I was saying in that other post I use this sort of thing to make
fields into other fields, it’s good for many other uses I’m sure.


Greg D.
http://destiney.com/

On Nov 28, 2007, at 7:30 PM, Rick O. wrote:

The common technique was to mix the helpers directly in to your model
and call them directly.

Yeah, I’m lost on trying to do that. I can’t seem to ever figure out
what the correct syntax is for doing this at any given point.
include? require? quoted? not-quoted? camel case? underscored? path
names needed? name spaces needed? Far too many ways to accomplish
what all looks like the same thing, and I still can’t figure out what
part of it is straight Ruby and where (if) Rails sticks its finger in
the pie trying to make things “easier.”

Leveraging strip_tags inside a custom validator seems the right way
to go, but I’m lost in trying to get that into a model.

Need some spoon feeding to get started.
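
As a starting point for the "reject it in a validation" idea, here is a minimal sketch in plain Ruby, so it runs without Rails. The Comment class, the TAG_PATTERN constant and the error wording are all hypothetical. In a real Rails 1.2 model the same check would presumably live in the model's validate method and report through errors.add. The pattern is deliberately naive and does not address the entity/unicode/null obfuscation tricks raised at the start of the thread.

```ruby
# Plain-Ruby stand-in for a model (hypothetical class, not ActiveRecord),
# so the idea can be run without Rails.
class Comment
  # Naive pattern: anything between '<' and the next '>'.
  # Assumption: enough to illustrate rejection, NOT a real XSS filter.
  TAG_PATTERN = /<[^>]*>/

  attr_accessor :body
  attr_reader :errors

  def initialize(body)
    @body = body
    @errors = []
  end

  # Reject the record instead of silently rewriting the text.
  def valid?
    @errors = []
    @errors << "body must not contain HTML tags" if body.to_s =~ TAG_PATTERN
    @errors.empty?
  end
end

puts Comment.new("Java < Rails").valid?               # a lone '<' is not a tag
puts Comment.new("<script>alert(1)</script>").valid?  # looks like markup
```

Rejecting rather than stripping means the person who typed the text finds out why it was refused, which is the feedback concern that comes up later in the thread.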

However, I recently just refactored a lot of
that code into the html tokenizer library. You can now access the
classes directly as HTML::Sanitizer, HTML::LinkSanitizer, and
HTML::WhiteListSanitizer.

This is a Rails 2.0 enhancement?


def gw
acts_as_n00b
writes_at(www.railsdev.ws)
end

On 11/28/07, Greg W. wrote:

I’m surprised there’s no basic validation for this (that I can see),
so I’m hoping that’s because there’s a common technique which
combines some other tool with validations to do this?

What I’m thinking of is something like strip_tags except that it is
usable on the model side of things.

The common technique was to mix the helpers directly in to your model
and call them directly. However, I recently just refactored a lot of
that code into the html tokenizer library. You can now access the
classes directly as HTML::Sanitizer, HTML::LinkSanitizer, and
HTML::WhiteListSanitizer.


Rick O.
http://lighthouseapp.com
http://weblog.techno-weenie.net
http://mephistoblog.com

Ha. About an hour ago I came across Tony’s blog entry, and I started
to write something based on it (I was going to try some simple stuff
without whitelist to start with though).

I guess I’ll try this. Hopefully it works with 1.2.5?

– gw

On Nov 29, 2007, at 3:03 PM, Marston A. wrote:

There have also been some new plugins that have come out in the last
few weeks:

http://code.al3x.net/svn/acts_as_sanitized/
http://code.google.com/p/sanitizeparams/

This is just a shot in the dark, but couldn’t you add a global
“before_save” callback that would go through the self.attributes hash
and escape all strings? Then you could have a function call in the
model to determine the fields that should not be sanitized.
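
The idea above can be sketched in plain Ruby against an ordinary hash. The helper name escape_attributes and the exempt list are hypothetical, and CGI.escapeHTML from the standard library stands in for whatever escaping the callback would really use; in a model, the hash would presumably be self.attributes and the call would sit in a before_save callback.

```ruby
require 'cgi'

# Hypothetical helper: escape every String value in an attributes hash,
# except for the keys the caller lists as exempt.
def escape_attributes(attributes, exempt = [])
  escaped = {}
  attributes.each do |name, value|
    if value.is_a?(String) && !exempt.include?(name)
      escaped[name] = CGI.escapeHTML(value)  # '<' becomes '&lt;', and so on
    else
      escaped[name] = value                  # non-strings and exempt fields pass through
    end
  end
  escaped
end

record = { 'title' => 'Java < Rails', 'body' => '<b>raw</b>', 'views' => 3 }
p escape_attributes(record, ['body'])
```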

On 11/29/07, Greg W. wrote:

This is a Rails 2.0 enhancement?

Ah yes, sorry. I recently added them to the HTML Tokenizer library
that powers assert_tag and assert_select. This way you don’t have to
mess with including the helpers. The 3 sanitizers all respond to
#sanitize. The WhiteListSanitizer responds to #sanitize_css too.

Also, #append_features isn’t really ‘deprecated’ as far as I know, but
it is one less line of code:

module FooBar
  def self.append_features(base)
    super
    base…
  end

  def self.included(base)
    base… # no super required
  end
end
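
To see why the super call matters, here is a small runnable comparison in plain Ruby. The module and class names are made up for the demonstration. It relies only on standard Ruby behavior: include calls append_features to do the actual mixing in, then calls included as a notification.

```ruby
# Overrides append_features WITHOUT calling super: the hook runs,
# but the module's methods are never mixed into the class.
module NoSuper
  def self.append_features(base)
    base.instance_variable_set(:@hooked, true)
  end

  def greet; "hi"; end
end

# Same override, but super performs the normal mix-in first.
module WithSuper
  def self.append_features(base)
    super
    base.instance_variable_set(:@hooked, true)
  end

  def greet; "hi"; end
end

# included is only a notification, so nothing has to be forwarded.
module UsesIncluded
  def self.included(base)
    base.instance_variable_set(:@hooked, true)
  end

  def greet; "hi"; end
end

class A; include NoSuper; end
class B; include WithSuper; end
class C; include UsesIncluded; end

[A, B, C].each do |klass|
  puts "#{klass}: hook ran=#{klass.instance_variable_get(:@hooked)}, " \
       "has greet=#{klass.method_defined?(:greet)}"
end
```

A module that only registers callbacks and defines no instance methods, like the Cleaner example earlier in the thread, does not notice the difference, which may be why it works without super.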


Rick O.
http://lighthouseapp.com
http://weblog.techno-weenie.net
http://mephistoblog.com

Marston A. wrote:

http://code.al3x.net/svn/acts_as_sanitized/
http://code.google.com/p/sanitizeparams/

I like seeing these two plugins listed side by side because it
highlights one of the issues with sanitizing: deciding where to do the
sanitizing can be as interesting as deciding what to sanitize.

The default Rails method of sanitizing in templates has two problems.
One, you sanitize too late to give feedback to the person who entered
the data. Two, you’re opening yourself to programmer error because
you’re asking programmers to add the sanitization call on every single
display of data. I’m struggling to come up with a benefit to this
approach. Maybe if you were intentionally keeping a database of unsafe
data but wanted to put a safe web front end on it, like a spam database.

Alex’s acts_as_sanitized does sanitization at the model level. That’s
the strictest and hardest to screw up as a programmer. It’s also easier
to give feedback to the user (although I don’t know if aas is
implemented to do that). The drawback is that you don’t always have
enough context at the model level to decide what to sanitize which is
why…

…we wrote sanitizeparams to work at the controller level. We want
admins and moderators to be able to post javascript and that contextual
decision is easier to make at the controller level. I like that we only
had to set it once, in application.rb, and then forgot about it. The
downsides are that we don’t give feedback if your input was sanitized
and that this doesn’t cover all inputs (we also aggregate RSS feeds).

On Dec 12, 2007, at 4:35 PM, Greg W. wrote:

… I’d still like to try your setup, but can’t get white_list and
your plugin to load on 1.2.6.

Actually looking closer at your code, I see it’s more likely a
problem with getting white_list to work on 1.2.6, so nevermind…
it’s not your worry :)

– gw

On Dec 12, 2007, at 4:16 PM, Tony Stubblebine wrote:

The downsides are that we don’t give feedback if your input was sanitized
and that this doesn’t cover all inputs (we also aggregate RSS feeds).

Agreed all around, except that for 98% of the inputs in my apps,
sanitizing in the model works for me. For that I just finished a
series of new validates_as_ class methods:
http://www.railsdev.ws/blog/11/custom-validations-in-rails/
They help reduce the bulk and improve readability, IMO, for many of the
common things I do.

However, for those times where I have large open-editing fields, I’d
still like to try your setup, but can’t get white_list and your
plugin to load on 1.2.6.

Installed using:

script/plugin install http://svn.techno-weenie.net/projects/plugins/white_list/

script/plugin install http://sanitizeparams.googlecode.com/svn/trunk/sanitize_params/

environment.rb

config.plugins = %W( white_list, sanitize_params )

restart the site, and I get this error:

/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.6/lib/initializer.rb:195:in `load_plugins': Cannot find the plugin 'white_list,'! (LoadError)

If it’s worth your time pursuing a 1.2.x working version, I’m all
ears. Not sure yet when I’ll move to 2.0.
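
A note on the error above: the plugin name it reports ends in a comma, which suggests the %W literal in environment.rb may be at least part of the problem. %W splits on whitespace only, so a comma between items becomes part of the first string. A quick plain-Ruby check, using the same names as the config line (whether the plugins then load on 1.2.6 is a separate question this does not answer):

```ruby
# %W splits on whitespace only, so the comma stays attached to the first name.
with_comma    = %W( white_list, sanitize_params )
without_comma = %W( white_list sanitize_params )

p with_comma
p without_comma
```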


def gw
acts_as_n00b
writes_at(www.railsdev.ws)
end

Tony Stubblebine wrote:

The default Rails method of sanitizing in templates has two problems.
One, you sanitize too late to give feedback to the person who entered
the data. Two, you’re opening yourself to programmer error because
you’re asking programmers to add the sanitization call on every single
display of data. I’m struggling to come up with a benefit to this
approach. Maybe if you were intentionally keeping a database of unsafe
data but wanted to put a safe web front end on it, like a spam database.

You would only want to give users feedback on the effects of
sanitization when HTML text is expected to be entered. Early
sanitization can be confusing to users if plain text input is
required instead.

For example, if pre-sanitization is used and a user enters the following
in a text field:

Java < Rails

they’ll see

Java &lt; Rails

when they next go to edit the field.
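
The effect is easy to reproduce in plain Ruby. This sketch uses CGI.escapeHTML from the standard library as a stand-in for whatever escaping is applied on save, and reuses the tag-stripping regex posted earlier in the thread; the sample strings are made up.

```ruby
require 'cgi'

typed = "Java < Rails"

# Escaping on the way in: the entity is what gets stored.
stored = CGI.escapeHTML(typed)
puts stored

# If a view then escapes again (as h does), the entity itself is escaped.
puts CGI.escapeHTML(stored)

# The tag-stripping regex from earlier in the thread treats everything
# between '<' and the next '>' as a tag, even in plain text.
stripped = "Java < Rails and PHP > Perl".gsub(/<(.|\n)*?>/, '')
puts stripped
```

Either way the stored value no longer matches what the user typed, which is the confusion described above.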


We develop, watch us RoR, in numbers too big to ignore.

One reason to postpone sanitizing input until the last minute is because
you may discover that your users may need a particular tag to be able to
be displayed. If you scrubbed it out on the way in, it’s gone forever.
If you do it at display time, then you can just modify your scrubber and
not need to modify the data again.

As a general rule, I like to alter the user input data as little as
possible.

_Kevin
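
A minimal sketch of that argument in plain Ruby: the text is stored exactly as entered, and the list of permitted tags is applied only when rendering, so changing the list later needs no data migration. The scrub method and its regex are hypothetical and deliberately naive (attributes are dropped, and the text inside a removed tag is left in place); this is not a safe sanitizer.

```ruby
# Hypothetical display-time scrubber: keeps only tags named in `allowed`
# and drops every attribute. A naive regex sketch, NOT a safe sanitizer.
def scrub(raw, allowed)
  raw.gsub(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>}) do
    closing, name = $1, $2.downcase
    allowed.include?(name) ? "<#{closing}#{name}>" : ""
  end
end

stored = "<b>bold</b> and <em>emphasis</em>"  # saved exactly as entered

puts scrub(stored, %w[b])     # today's rules
puts scrub(stored, %w[b em])  # later, once <em> is allowed
```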

@kevin: I assume most cases where users enter html include a chance for
them to preview (or at least see the results) immediately after entry.
In those cases it seems wrong to change the rules after the fact. I’d
consider that a flaw in any approach, but in this case slightly worse
because it’s harder to make the decision to ignore past transgressions.

@mark: Agreed that there’s no need to alert people to that sort of
sanitization. But when you’re removing significant tags it would be
polite to tell the user what’s going on. For example, our impetus for
sanitizing was to remove iframe and script tags. The first time someone
pointed out an XSS vulnerability to me they did it by using an iframe to
insert Apple’s homepage in the middle of the page. Big whoop. The second
time, they wrote a script that caused the viewer to friend
everyone in the social network and then non-destructively add the script
into their profile. It was a very effective self-propagating javascript
virus. I was impressed. However, many widgets are distributed as
javascript and plenty of good people try to put them in our forms. They
deserve feedback.

Here’s a link to the virus for your viewing pleasure:
http://www.stubbleblog.com/foocamp_xss_hack.js.txt

My problem with sanitization is that it puts representational logic in
the model. Should the model really care that its data might one day
appear on an HTML page? Or should the HTML page take care of its own
needs?

///ark

If there is a business rule that states “HTML shall not be stored in
the model,” then of course you should validate it out in the model.
If, however, the business rule is “Our HTML pages should not be
vulnerable,” then the model is the wrong place to make that happen, in
my opinion. The more presentational logic you put in the model, the
worse. You need to handle this fully and properly in
the place that knows what dangers exist.

As for the model “allowing” HTML, that’s missing the point. The model
shouldn’t even know that HTML exists.

I know we’re not going to agree on this, of course. I’m not exactly a
belt-and-suspenders kind of guy. :)

///ark