Strip out ALL javascript from HTML source


#1

Hi.

I’ve got a bit of an issue where I accept HTML source as input from
anyone. I need to strip out all javascript: attributes, links, tags,
etc.

At this stage I’m thinking Hpricot is the go. I guess I’m hoping there
is someone out there who has done this and is willing to share.

Cheers
Daniel


#2

One of Rick O.'s many plugins can do what you want:

http://agilewebdevelopment.com/plugins/whitelist

Which tags are handled is controllable by your code


#3

On 4/2/07, CRAZ8 removed_email_address@domain.invalid wrote:

One of Rick O.'s many plugins can do what you want:

http://agilewebdevelopment.com/plugins/whitelist

Which tags are handled is controllable by your code

Thanx for the pointer. But I think I need a bit more than that. I need
to be able to leave tags alone for the most part, except tags, but
attributes need a little more control. What I’ve come up with so far:

  • all on*** attributes have to go
  • any attribute that has “javascript:” in it has to go
  • any attribute with “.js” has to go
  • Also, according to the exploit on myspace by sam
    (http://namb.la/popular/tech.html), it seems that I need to remove
    javascript: in attributes even when it has newlines anywhere in the
    word.
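
The rules in the list above could be sketched roughly as follows. This is only an illustration, not the whitelist plugin's code: `dangerous_attribute?` and `scrub_attributes` are hypothetical helper names, and the attributes are assumed to have already been parsed into a name/value hash by whatever HTML parser is in use.

```ruby
# Hypothetical helper: decide whether a single attribute should be dropped.
# It mirrors the rules listed in the post and is NOT a complete XSS filter.
def dangerous_attribute?(name, value)
  # Rule 1: event handlers such as onclick, onload, onmouseover.
  return true if name.to_s.strip.downcase.start_with?("on")

  # Remove spaces and control characters (tabs, newlines, NULs) before
  # matching, so "java\nscript:" is treated the same as "javascript:".
  squashed = value.to_s.downcase.gsub(/[\x00-\x20]+/, "")

  # Rule 2: "javascript:" anywhere in the value, including split variants.
  return true if squashed.include?("javascript:")

  # Rule 3: any reference to a .js file.
  return true if squashed.include?(".js")

  false
end

# Keep only the attributes that pass all three checks.
def scrub_attributes(attributes)
  attributes.reject { |name, value| dangerous_attribute?(name, value) }
end
```

A call such as `scrub_attributes("href" => "java\nscript:alert(1)", "class" => "note")` would keep only the `class` entry. Entity-encoded variants (for example a character reference standing in for the `j`) are not decoded here, so a real filter would need to handle those as well.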

I hope I’ve got them all. It doesn’t seem that the whitelist plugin
will do this, although I will be very happy if it does.

Cheers
Daniel


#4

Sorry if this got through and is a double post. It got sent back to me.


#5

all on*** attributes have to go
any attribute that has “javascript:” in it has to go
any attribute with “.js” has to go
Also according to the exploit on myspace by sam It seems that I need to
remove javascript: in attributes with newlines anywhere in the word.

I hope I’ve got them all. It doesn’t seem that the whitelist plugin will
do this, although I will be very happy if it does.

You can of course contribute back to the plugin. However, I believe
it’ll do everything you listed short of removing any attribute with
.js. Not sure what the point of that is though.

Another option is to just yank the code and make it bend to your
specific whims. It’s not a very large one.


Rick O.
http://lighthouseapp.com
http://weblog.techno-weenie.net
http://mephistoblog.com


#6

On 4/2/07, Daniel N removed_email_address@domain.invalid wrote:

Thanx for the pointer. But I think I need a bit more than that. I need to
I hope I’ve got them all. It doesn’t seem that the whitelist plugin will
do this, although I will be very happy if it does.

Cheers
Daniel

I dunno how secure you want this to be, but to be truly safe from XSS
you’ll need to handle more cases than Rick’s plugin does - here is one
stab at it:

http://golem.ph.utexas.edu/~distler/blog/archives/001181.html

If you want to get even more depressed about securing a web app today,
go here to get an idea of the insane amount of XSS vectors.

http://ha.ckers.org/xss.html

- Rob

http://robsanheim.com


#7

Thanx for the pointer. But I think I need a bit more than that. I need to
be able to leave tags alone for the most part, except tags, but
attributes need a little more control. what I’ve come up with so far:

Pipe your HTML thru tidy -asxhtml. Then use REXML and XPath to strip
out anything you don’t need (such as the header block that -asxhtml
will install). And strip out the tags, and anything that
looks like a tag, such as the tags.

The absolute safest, of course, is to strip anything not appearing on
a whitelist, such as , , , etc.
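
A minimal sketch of the REXML and XPath step described above, assuming the input has already been turned into well-formed XML (for example by tidy). `strip_scripts` is a hypothetical name, and `script` plus the on* attributes are assumed examples of what to strip. The sample input has no XML namespace; tidy's XHTML output declares one, which may need extra handling in the XPath expressions.

```ruby
require "rexml/document"

# Assumption: `xhtml` is well-formed XML, because REXML rejects tag soup.
def strip_scripts(xhtml)
  doc = REXML::Document.new(xhtml)

  # Remove every <script> element (the tag name is an assumed example).
  REXML::XPath.match(doc, "//script").each(&:remove)

  # Remove on* event-handler attributes from every remaining element.
  REXML::XPath.match(doc, "//*").each do |element|
    element.attributes.to_a.each do |attribute|
      next unless attribute.name.downcase.start_with?("on")
      element.delete_attribute(attribute.name)
    end
  end

  doc.root.to_s
end
```

This is a blacklist pass: it only removes what it knows about, so anything not named in the two XPath steps is left in place.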


Phlip
http://c2.com/cgi/wiki?ZeekLand <-- NOT a blog!!


#8

On 4/2/07, Rick O. removed_email_address@domain.invalid wrote:

do this, although I will be very happy if it does.

You can of course contribute back to the plugin. However, I believe
it’ll do everything you listed short of removing any attribute with
.js. Not sure what the point of that is though.

I’ll certainly look at contributing if I can find a way to extend the
functionality. Perhaps a strip_all_javascript method or something like
that.

I think the point is to make it as difficult as possible to upload
javascript. I want to accept arbitrary HTML source and display it on my
page while minimising the risk of having my page hijacked. The list
above covers the ways I have thought of to include javascript in a
submission. I don’t think it’s possible to completely remove the risk of
submitted javascript, since a URL like http://example.com/stuff could
set its headers to javascript and return whatever script it wants, but I
want to minimise that risk.

I hope I’ve considered most of the ways that people could hijack my
page. I want to include as many tags intact as possible.

Cheers
Daniel


#9

I dunno how secure you want this to be, but to be truly safe from XSS
you’ll need to handle more cases than Rick’s plugin does - here is one
stab at it:

http://golem.ph.utexas.edu/~distler/blog/archives/001181.html

It uses most of the same tests I wrote, adds a lot more allowed
svg/mathml tags, and style attribute sanitizing. I just prefer to
leave it out, but textile uses it. Those tests were written from that
hackers article. You could just port the style stuff to white_list,
and then you don’t have to bother maintaining a plugin.


Rick O.
http://lighthouseapp.com
http://weblog.techno-weenie.net
http://mephistoblog.com


#10

On 4/3/07, Rob S. removed_email_address@domain.invalid wrote:

any attribute with “.js” has to go

- Rob

http://robsanheim.com
http://seekingalpha.com

That is a bit depressing. It seems that no matter how hard I try, I
won’t be able to completely remove the js in submitted source.


#11

Is the object tag really that bad? I mean I think I need to support it
since you tube widgets and I guess others are based on object tags and I
need to support youtube at least.

One idea is to allow a custom format. Perhaps just look for youtube
urls, and convert them to videos? Obviously this should be done after
sanitizing…
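
One way this idea might look, as a sketch only: after sanitizing, scan for lines that contain nothing but a YouTube watch URL and replace each with markup generated by the application. The URL pattern, the `/v/` player path and the exact embed markup are assumptions for illustration, and `expand_youtube_links` and `youtube_embed` are hypothetical names.

```ruby
# Assumed URL shape: http://www.youtube.com/watch?v=VIDEO_ID alone on a line.
# The ID is restricted to a safe character set so it cannot carry markup.
YOUTUBE_LINE = %r{\A\s*https?://(?:www\.)?youtube\.com/watch\?v=([A-Za-z0-9_-]+)\s*\z}

# Hypothetical embed markup, generated by us and never copied from user input.
def youtube_embed(video_id)
  %(<embed src="http://www.youtube.com/v/#{video_id}" type="application/x-shockwave-flash"></embed>)
end

# Run this AFTER sanitizing, so the generated tag is not stripped again.
def expand_youtube_links(sanitized_text)
  sanitized_text.split("\n", -1).map { |line|
    match = YOUTUBE_LINE.match(line)
    match ? youtube_embed(match[1]) : line
  }.join("\n")
end
```

URLs that appear in the middle of a sentence, or that carry extra query parameters, are left untouched by this pattern.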


Rick O.
http://lighthouseapp.com
http://weblog.techno-weenie.net
http://mephistoblog.com


#12

On 4/3/07, Rick O. removed_email_address@domain.invalid wrote:

Is the object tag really that bad? I mean I think I need to support it
since you tube widgets and I guess others are based on object tags and I
need to support youtube at least.

One idea is to allow a custom format. Perhaps just look for youtube
urls, and convert them to videos? Obviously this should be done after
sanitizing…

I’m not really sure what you mean by custom format. Does that mean
something like DOM selection in the whitelist plugin? E.g. allow tag X
if it’s a child of tag Y and has attribute z=‘value’ or z!=‘javascript’.

I really want to be as broad ranging as possible and include as many
tags as possible, in their original form. It’s important for this app
that the tags be left as they were input as much as possible; I just
don’t want the result to hijack my page.


#13

On 4/3/07, Phlip removed_email_address@domain.invalid wrote:

will install). And strip out the tags, and anything that
looks like a tag, such as the tags.

The absolute safest, of course, is to strip anything not appearing on
a whitelist, such as , , , etc.


Phlip
http://c2.com/cgi/wiki?ZeekLand <-- NOT a blog!!

Is the object tag really that bad? I think I need to support it, since
YouTube widgets (and I guess others) are based on object tags, and I
need to support YouTube at least.


#14

On 4/3/07, Rick O. removed_email_address@domain.invalid wrote:

Well, I originally meant something very custom like
video:http://youtubeurl..... Though since most normal folks can’t
grok this, and web power users have enough formats to figure out,
perhaps you could just seek out youtube urls sitting on a single line
or something.

For instance, Tumblr lets me add the raw embed code or just a youtube
video URL if I want to post a video.

I could not change the input to that level (video:...), but I’ve had a
look at the youtube and also odeo widgets, and they both boil down to an
embed tag with a type of shockwave flash.

Do you think it would be a bad idea to enable support for embed tags
with that type and a src from youtube.com, odeo.com, or a list of known
domains? If I did this I could remove the object tag from around the
embed tag, and I don’t think it would have much of an impact.
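
The check being proposed might be sketched like this. `trusted_embed?` is a hypothetical helper, the host list comes from the post, and the MIME type string is the usual one for Flash. It only verifies where the movie is loaded from and says nothing about what the movie itself does.

```ruby
require "uri"

TRUSTED_EMBED_HOSTS = %w[youtube.com odeo.com].freeze  # from the post; extend as needed
FLASH_TYPE = "application/x-shockwave-flash"

# Hypothetical check: true only for Flash embeds served from a trusted host.
def trusted_embed?(src, type)
  return false unless type.to_s.strip.downcase == FLASH_TYPE

  uri = URI.parse(src.to_s.strip)
  # Accept http/https only; this rejects javascript:, data: and relative URLs.
  return false unless uri.is_a?(URI::HTTP) && uri.host

  host = uri.host.downcase
  # Match the domain itself or a true subdomain, not a look-alike suffix.
  TRUSTED_EMBED_HOSTS.any? { |domain| host == domain || host.end_with?(".#{domain}") }
rescue URI::InvalidURIError
  false
end
```

Comparing the parsed host, rather than searching the raw src string for "youtube.com", avoids accepting hosts such as youtube.com.evil.example.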


#15

On 4/2/07, Daniel N removed_email_address@domain.invalid wrote:

want the result to hijack my page.
I could not change the input to that level. video:... but I’ve had a
look at the youtube and also odeo widgets and they both boil down to an
embed tag with a type of shockwave flash.

You’re really not getting the point of what I’m trying to say. I’m
saying, strip all object tags, and use something custom that gets
replaced w/ an object tag that you generate afterwards. If you’re
generating insecure JS, you have issues :)

Do you think it would be a bad idea to enable support for embed tags with
that type with src from youtube.com or odeo.com / (a list of known)
domains? If I did this I could remove the object tag from around the embed
tag and I don’t think it would have much of an impact.

I don’t really know, I haven’t thought about this stuff much. I just
strip all object/embed tags by default. You may have to do some
digging for any attack vectors on object/embed tags. I don’t think
it’d be that different from image tags though.


Rick O.
http://lighthouseapp.com
http://weblog.techno-weenie.net
http://mephistoblog.com


#16

I really want to be as broad ranging as possible and include as many tags
as possible and also in their original form. It’s important for this app
that the tags, as much as possible be left as they’re inputted, I just don’t
want the result to hijack my page.

Well, I originally meant something very custom like
video:http://youtubeurl..... Though since most normal folks can’t
grok this, and web power users have enough formats to figure out,
perhaps you could just seek out youtube urls sitting on a single line
or something.

For instance, Tumblr lets me add the raw embed code or just a youtube
video URL if I want to post a video.


Rick O.
http://lighthouseapp.com
http://weblog.techno-weenie.net
http://mephistoblog.com


#17

On 4/3/07, Rick O. removed_email_address@domain.invalid wrote:

as possible and also in their original form. It’s important for
perhaps you could just seek out youtube urls sitting on a single line
saying, strip all object tags, and use something custom that gets
replaced w/ an object tag that you generate afterwards. If you’re
generating insecure JS, you have issues :)

Ok that makes more sense to me.

Do you think it would be a bad idea to enable support for embed tags with
that type with src from youtube.com or odeo.com / (a list of known)
domains? If I did this I could remove the object tag from around the embed
tag and I don’t think it would have much of an impact.

I don’t really know, I haven’t thought about this stuff much. I just
strip all object/embed tags by default. You may have to do some
digging for any attack vectors on object/embed tags. I don’t think
it’d be that different from image tags though.

K thanx for your help. Looks like I’ve got some digging to do :)

Cheers
Daniel


#18

Daniel N wrote:

Is the object tag really that bad? I mean I think I need to support it
since you tube widgets and I guess others are based on object tags and I
need to support youtube at least.

I didn’t read the original post. If the question is “how do I do safe
markup and transclusions, in a public blog?”, then naturally get
either a Wiki markup (or YAML), or permit a subset of HTML. To
transclude Object tags, invent a new tag called . That way you
prevent shenanigans, right?


Phlip


#19

Daniel N wrote:

That is a bit depressing. It seems that no matter how hard I try I won’t
be able to completely remove the js in submitted source.

(Use the XPath system I suggested, then) remove all tags except those
on a short white-list, and then remove all their attributes.
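
A rough sketch of that stricter approach, assuming well-formed input (for example from tidy) and using REXML to walk the tree. The tag lists and the names `whitelist_node` and `whitelist_html` are illustrative assumptions. Instead of editing the document in place, it rebuilds the output so that only whitelisted tags, with no attributes at all, can appear.

```ruby
require "rexml/document"

ALLOWED_TAGS = %w[p b i em strong ul ol li].freeze  # assumed example whitelist
DROP_WITH_CONTENT = %w[script style].freeze         # assumed: discard the body too

# Rebuild a node as a string containing only whitelisted, attribute-free tags.
def whitelist_node(node)
  case node
  when REXML::CData
    ""         # CDATA is unescaped text, so drop it rather than risk markup
  when REXML::Text
    node.to_s  # parsed text keeps its escaped source form
  when REXML::Element
    name = node.name.downcase
    return "" if DROP_WITH_CONTENT.include?(name)

    inner = node.children.map { |child| whitelist_node(child) }.join
    # Unknown tags are unwrapped: their children stay, the tag itself goes.
    ALLOWED_TAGS.include?(name) ? "<#{name}>#{inner}</#{name}>" : inner
  else
    ""         # comments, processing instructions, and so on
  end
end

def whitelist_html(xhtml)
  whitelist_node(REXML::Document.new(xhtml).root)
end
```

Because every attribute is discarded, links and images lose their href and src too, which is the price of this level of strictness.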


Phlip


#20

We (http://www.jobscore.com) use SafeHtml
(http://pixel-apes.com/safehtml/). It’s really good about leaving the
tags alone while removing potentially dangerous XSS-type stuff.

It’s PHP, but I wrapped it in a class that shells out to the PHP
interpreter.

Alex