Sanitizing and stripping some HTML?

thomasmaas · April 22, 2007, 5:33pm

I have an application that manages a list of feeds. In a scheduled
BackgrounDRb worker, I parse each of these feeds and post the content
to the same site. Some of these feeds contain HTML in the description
of each item in the feed. I would like to first sanitize the HTML to
remove anything particularly harmful, then I would like to strip
certain tags, leaving the content.

I first tested Rick O.'s white_list plugin. It seems that this
strips disallowed tags together with their content. For example, if
I say p is a bad tag,

<p>content</p>

gets completely stripped. I would actually like to
keep the ‘content’ and remove only the tags. Certain tags are
alright, such as b, em, and strong, but most I would like stripped out.
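
To make the behaviour I'm after concrete, here is a rough sketch in plain Ruby using only regular expressions. The helper name strip_disallowed_tags and the two tag lists are made up for illustration and are not part of any plugin. A regex pass like this is not a real sanitizer (an attribute value containing > would confuse it, for instance), so it is meant to describe the desired output rather than to be trusted on arbitrary feeds.

```ruby
# Hypothetical tag lists, chosen only for illustration.
ALLOWED_TAGS      = %w[b em strong].freeze  # kept, but with attributes removed
DROP_WITH_CONTENT = %w[script style].freeze # removed together with their content

# Hypothetical helper: removes disallowed tags but leaves the text inside them.
def strip_disallowed_tags(html, allowed = ALLOWED_TAGS)
  # Step 1: drop script/style elements entirely, content included.
  cleaned = DROP_WITH_CONTENT.inject(html) do |text, tag|
    text.gsub(%r{<#{tag}\b[^>]*>.*?</#{tag}\s*>}mi, "")
  end

  # Step 2: visit every remaining tag. Allowed tags are re-emitted without
  # attributes; all other tags are deleted, leaving their inner text in place.
  cleaned.gsub(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>}) do
    closing = $1
    name    = $2.downcase
    allowed.include?(name) ? "<#{closing}#{name}>" : ""
  end
end

puts strip_disallowed_tags("<p>content</p>")
# prints: content
puts strip_disallowed_tags('<p onclick="x()">Hello <b class="y">world</b></p>')
# prints: Hello <b>world</b>
```

The first pass covers the "remove anything particularly harmful" step in a very crude way; the second pass is the part I actually care about, where the p tags disappear and ‘content’ survives.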

I then tested "Sanitize HTML in Ruby" (from Take the First Step) and it
seems to do the trick. Has anyone else needed to strip HTML while
keeping the content, and if so, how did you go about it? Thanks for
your input.