I have an application that manages a list of feeds. In a scheduled
BackgrounDRb worker, I parse each of these feeds and post the content
to the same site. Some of these feeds contain HTML in the description
of each item in the feed. I would like to first sanitize the HTML to
remove anything particularly harmful, then I would like to strip
certain tags, leaving the content.

I first tested Rick O.'s white_list plugin. It seems that this
simply strips tags and their content. For example, if I say p is a bad
tag, <p>content</p> gets completely stripped. I would actually like to
keep the 'content' and simply remove the HTML. Certain tags are
alright, such as b, em, strong, but most I would like stripped out.

I then tested the code at http://ideoplex.com/id/1138/sanitize-html-in-ruby
and it seems to do the trick. Has anyone else needed to strip HTML while
leaving the content, and if so, how did you go about it? Thanks for your
input.
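
To make the behavior I am after concrete, here is a minimal sketch in plain Ruby. It is not the white_list plugin's API and not the code from the ideoplex article; the helper name strip_tags_keep_content and the ALLOWED_TAGS constant are made up for illustration. It is regex-based, so it assumes reasonably well-formed markup and should not be treated as a complete sanitizer for hostile input.

```ruby
# Hypothetical helper for illustration only; not part of white_list
# or the ideoplex code. Tags listed here are kept, minus attributes.
ALLOWED_TAGS = %w[b em strong].freeze

def strip_tags_keep_content(html, allowed = ALLOWED_TAGS)
  # Remove script and style elements entirely, content included,
  # since their content is never wanted in the output.
  cleaned = html.gsub(%r{<(script|style)\b[^>]*>.*?</\1\s*>}mi, '')

  # For every remaining tag: emit a bare version of an allowed tag
  # (attributes dropped), and delete any other tag while leaving the
  # text around it untouched.
  cleaned.gsub(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>}) do
    closing = Regexp.last_match(1)
    name = Regexp.last_match(2).downcase
    allowed.include?(name) ? "<#{closing}#{name}>" : ''
  end
end

puts strip_tags_keep_content('<p>Hello <b class="x">world</b></p>')
```

With this approach a disallowed p tag disappears but its text stays, while b, em and strong survive without their attributes. A parser-based approach would be more robust against malformed or deliberately tricky markup.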