I have an application that manages a list of feeds. In a scheduled BackgrounDRb worker, I parse each feed and post its content to the same site. Some of the feeds contain HTML in the description of each item.

I would like to first sanitize the HTML to remove anything particularly harmful, and then strip certain tags while leaving their content in place. A few tags are fine, such as b, em, and strong, but most I would like stripped out.

I first tested Rick Olson's white_list plugin. It seems to strip both the tags and their content. For example, if I say p is a bad tag, <p>content</p> gets removed completely. I would actually like to keep 'content' and remove only the HTML tags around it.

I then tested http://ideoplex.com/id/1138/sanitize-html-in-ruby and it seems to do the trick.

I was just wondering whether anyone else has needed to strip HTML but keep the content, and how they went about it. Thanks for your input.
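
To make the goal concrete, here is a rough sketch of the behaviour I am after. This is not the white_list plugin or the code from the link above; `strip_tags_keep_content` and `ALLOWED_TAGS` are names made up for illustration, and a regex approach like this is only an approximation (it will trip over a `>` inside an attribute value, for instance), not a real sanitizer.

```ruby
# Hypothetical helper, for illustration only: removes disallowed tags
# but keeps the text that was inside them.
ALLOWED_TAGS = %w[b em strong].freeze

def strip_tags_keep_content(html, allowed = ALLOWED_TAGS)
  # Drop script/style blocks entirely, content included, since their
  # content is never something we want to display.
  cleaned = html.gsub(%r{<(script|style)\b[^>]*>.*?</\1\s*>}mi, "")

  # For every remaining tag: keep it only if its name is allowed.
  # Allowed tags are rebuilt without attributes (so no onclick etc.);
  # disallowed tags are replaced with nothing, leaving their content.
  cleaned.gsub(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>}) do
    closing = $1
    name = $2.downcase
    allowed.include?(name) ? "<#{closing}#{name}>" : ""
  end
end

puts strip_tags_keep_content('<p>Some <b onclick="x()">bold</b> and a <a href="#">link</a></p>')
```

With p and a treated as bad tags, the paragraph and link tags disappear but their text stays, while b survives with its attributes removed.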
on 2007-04-22 17:33