Regex in Ruby - Strip HTML out of comments - help

hreyaatnh · August 21, 2006, 4:14am

I want to deny any comments that have HTML tags in them, and I know this
can be done using Regex, but I’m not very familiar with how to do this.
Like, if I had a method in “application_helper.rb” and passed the comment
body (or any text really), it would search the text and verify that it
only has letters and numbers. I don’t want to replace anything (gsub),
I want to return true if there is HTML in the comment.

def is_html( text )
  # use the regex here?
  # return false if no HTML or tags show up
  # return true if there is HTML or tags in the comment
end

# usage
if is_html(@comment.body)
  render :text => "Sorry - No HTML allowed."
else
  @comment.save
end

Something like that… Any help on this??? Thanks!

hreyaatnh · August 21, 2006, 4:35am

Regexes to parse HTML or XML are notoriously difficult.

Far better to use an HTML or XML parser to handle this task.

hreyaatnh · August 21, 2006, 4:44am

Tom M. wrote:

Regexes to parse HTML or XML are notoriously difficult.

Far better to use an HTML or XML parser to handle this task.

Oh ok, thanks for the tip. Basically, I just wanted the pseudocode
above with a function that would detect HTML tags in any way, shape or
form. It doesn’t have to use regex at all, I was just speaking out of a
lack of experience.

So how would I go about the solution you suggested? Or any other
solutions for the “is_html” function?

hreyaatnh · August 21, 2006, 4:56am

Typo extends the String class to add the following method:

# Strips any html markup from a string
TYPO_TAG_KEY = TYPO_ATTRIBUTE_KEY = /[\w:_-]+/
TYPO_ATTRIBUTE_VALUE = /(?:[A-Za-z0-9]+|(?:'[^']*?'|"[^"]*?"))/
TYPO_ATTRIBUTE = /(?:#{TYPO_ATTRIBUTE_KEY}(?:\s*=\s*#{TYPO_ATTRIBUTE_VALUE})?)/
TYPO_ATTRIBUTES = /(?:#{TYPO_ATTRIBUTE}(?:\s+#{TYPO_ATTRIBUTE})*)/
TAG = %r{<[!/?\[]?(?:#{TYPO_TAG_KEY}|--)(?:\s+#{TYPO_ATTRIBUTES})?\s*(?:[!/?\]]+|--)?>}
def strip_html
  self.gsub(TAG, '').gsub(/\s+/, ' ').strip
end

I haven’t run into any edge cases of it failing yet, but I am sure if
anyone finds one a bug report would be welcome.

–
Thanks,
-Steve
http://www.stevelongdo.com
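
Since the original question asks for a true/false answer rather than stripping, the same kind of pattern could be used for detection. This is only a sketch: it assumes the Typo constants read as written below (quantifiers and bracket escapes spelled out), and `html_tag?` is a hypothetical helper name, not something Typo provides.

```ruby
# Assumed reading of the Typo patterns; verify against the Typo source before relying on it.
TYPO_TAG_KEY = TYPO_ATTRIBUTE_KEY = /[\w:_-]+/
TYPO_ATTRIBUTE_VALUE = /(?:[A-Za-z0-9]+|(?:'[^']*?'|"[^"]*?"))/
TYPO_ATTRIBUTE = /(?:#{TYPO_ATTRIBUTE_KEY}(?:\s*=\s*#{TYPO_ATTRIBUTE_VALUE})?)/
TYPO_ATTRIBUTES = /(?:#{TYPO_ATTRIBUTE}(?:\s+#{TYPO_ATTRIBUTE})*)/
TAG = %r{<[!/?\[]?(?:#{TYPO_TAG_KEY}|--)(?:\s+#{TYPO_ATTRIBUTES})?\s*(?:[!/?\]]+|--)?>}

# Hypothetical helper: true if anything in the text matches the tag pattern.
def html_tag?(text)
  !!(text =~ TAG)
end
```

With this sketch, `html_tag?('<b>bold</b>')` is true, while `html_tag?('2 < 3')` is false because a bare `<` is not followed by a tag name.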

hreyaatnh · August 21, 2006, 5:24am

On Aug 20, 2006, at 7:44 PM, ry an wrote:

So how would I go about the solution you suggested? Or any other
solutions for the “is_html” function?

I’ve not invested the libraries in Ruby enough to make a good suggestion
as to which library to use and how to use it. Sorry.

–
– Tom M.

hreyaatnh · August 21, 2006, 4:58am

On Sun Aug 20, 2006 at 07:21:14PM -0700, Tom M. wrote:

Regexes to parse HTML or XML are notoriously difficult.

I dunno about that. There are a million HTML parsers out there, and some
are even written in Java. At its core there’s a very small number of reserved
chars and there’s little syntactic magic on the delimiter level. This is
probably one reason it’s so successful, compared to the .doc format, or
CSV…

Re: the thread topic, I’ve always used this to strip HTML:

string =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
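
The substitution above is Perl. A minimal Ruby sketch of the same idea follows. It is an adaptation, not a literal port: the pattern consumes one character per step instead of nesting `*` inside `*`, and `TAG_PATTERN`, `strip_tags_with_regex`, and `contains_tag?` are names made up for illustration.

```ruby
# Matches "<", then any mix of ordinary characters and quoted strings, then ">".
# The /m flag lets "." span newlines, like Perl's /s.
TAG_PATTERN = /<(?:[^>'"]|(['"]).*?\1)*>/m

# Remove everything that looks like a tag.
def strip_tags_with_regex(text)
  text.gsub(TAG_PATTERN, '')
end

# Answer the yes/no question instead of stripping.
def contains_tag?(text)
  !!(text =~ TAG_PATTERN)
end
```

Quoted attribute values may contain `>`, so `strip_tags_with_regex('<a href="x>y">link</a>')` leaves `link`. The pattern does not check for a tag name, though, so ordinary text such as `2 < 3 and 4 > 1` also counts as containing a tag.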

hreyaatnh · August 21, 2006, 6:35am

ry an wrote:

Tom M. wrote:

Regexes to parse HTML or XML are notoriously difficult.

Far better to use an HTML or XML parser to handle this task.

Oh ok, thanks for the tip. Basically, I just wanted the pseudocode
above with a function that would detect HTML tags in any way, shape or
form. It doesn’t have to use regex at all, I was just speaking out of a
lack of experience.

So how would I go about the solution you suggested? Or any other
solutions for the “is_html” function?

Having written several regexes recently to manipulate some HTML, I can
attest to the “notoriously difficult” comment.

If you’re looking for a parser, Rubyful Soup is pretty good, but tends
to be quite slow for larger pages.

Can you just do something like

def is_html(input_string)
  input_string =~ /<html/i
end

That’s assuming that you’re looking at a full HTML page. If you want
to just detect a fragment, you’d have to check for the presence of any
HTML tag.

Wes
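
For the fragment case, one rough heuristic is to look for anything shaped like an opening or closing tag, that is, `<` immediately followed by an optional `/` and a tag name. This is a sketch under that assumption only; `TAG_LIKE` and `looks_like_html?` are hypothetical names, and the check ignores HTML comments and doctype declarations.

```ruby
# Something shaped like <b>, </p>, <br/> or <a href="...">.
TAG_LIKE = /<\/?[a-z][a-z0-9]*\b[^>]*>/i

# Hypothetical helper: true when the text contains a tag-shaped substring.
def looks_like_html?(text)
  text.match?(TAG_LIKE)
end
```

Because a letter must follow the `<`, text like `a < b and c > d` is not flagged, and neither are `<` or `<>` on their own.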

hreyaatnh · August 21, 2006, 6:20am

On Aug 20, 2006, at 8:10 PM, Tom M. wrote:

shape or
form. It doesn’t have to use regex at all, I was just speaking out
of a
lack of experience.

So how would I go about the solution you suggested? Or any other
solutions for the “is_html” function?

I’ve not invested the libraries in Ruby enough to make a good
suggestion
as to which library to use and how to use it. Sorry.

Nor have I investigated the libraries…

–
– Tom M.

hreyaatnh · August 21, 2006, 7:58am

Hammed M. wrote:

If you drop the requirement to KNOW if someone entered HTML in the comment,
you could simply strip the html tags using TextHelper::strip_tags.

Hammed

On 8/21/06, ry an [email protected] wrote:

I want to deny any comments that have HTML tags in them, and I know this
can be done using Regex, but I’m not very familiar with how to do this.

Why don’t you just escape the comment before displaying it?

Justin
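
Escaping on output could look roughly like this. The sketch uses `ERB::Util.html_escape` from Ruby's standard library, which is what the `h` helper in ERB templates is an alias for; `display_comment` is a hypothetical name used only for illustration.

```ruby
require 'erb'

# Hypothetical helper: store the comment as typed, escape it when rendering.
def display_comment(body)
  ERB::Util.html_escape(body)
end
```

With this approach nothing is rejected or stripped: `display_comment('<b>hi</b>')` yields `&lt;b&gt;hi&lt;/b&gt;`, so the browser shows the tags as literal text instead of interpreting them.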

hreyaatnh · August 21, 2006, 6:50am

If you drop the requirement to KNOW if someone entered HTML in the
comment,
you could simply strip the html tags using TextHelper::strip_tags.

Hammed

hreyaatnh · August 21, 2006, 7:53pm

Here’s a possible solution for you using the Tokenizer. Only tested
briefly, so look for bugs.

Let’s see how the Tokenizer works first using strip_tags as an example,
and we’ll make our own is_html function with that knowledge!

def strip_tags(html)
  if html.index("<")
    text = ""
    tokenizer = HTML::Tokenizer.new(html)

    while token = tokenizer.next
      node = HTML::Node.parse(nil, 0, 0, token, false)
      # result is only the content of any Text nodes
      text << node.to_s if node.class == HTML::Text
    end
    # strip any comments, and if they have a newline at the end (ie. line with
    # only a comment) strip that too
    text.gsub(/<!--(.*?)-->[\n]?/m, "")
  else
    html # already plain text
  end
end

def is_html(text)
  if text.index('<') # might be html
    tokenizer = HTML::Tokenizer.new(text)

    while token = tokenizer.next
      node = HTML::Node.parse(nil, 0, 0, token, false)
      # if any nodes are not text, then it must be HTML
      return true if node.class != HTML::Text
    end
  end

  false
end

Results:

is_html('should not be')
=> false

is_html('<')
=> false

is_html('<>')
=> false

is_html(’’)
=> true

is_html(’’)
=> true

is_html(‘Does this work like it should?’)
=> true