Regex in Ruby - Strip HTML out of comments - help

I want to deny any comments that have HTML tags in them, and I know this
can be done using Regex, but I’m not very familiar with how to do this.
For example, if I had a method in “application_helper.rb” and passed the comment
body (or any text really), it would search the text and verify that it
only has letters and numbers. I don’t want to replace anything (gsub),
I want to return true if there is HTML in the comment.

def is_html( text )
  # use the regex here?
  # return false if no HTML or tags show up
  # return true if there is HTML or tags in the comment
end

# usage
if is_html(comment.body)
  render :text => "Sorry - No HTML allowed."
else
  @comment.save
end

Something like that… Any help on this??? Thanks!

Regexes to parse HTML or XML are notoriously difficult.

Far better to use an HTML or XML parser to handle this task.

Tom M. wrote:

Regexes to parse HTML or XML are notoriously difficult.

Far better to use an HTML or XML parser to handle this task.

Oh ok, thanks for the tip. Basically, I just wanted the pseudocode
above with a function that would detect HTML tags in any way, shape or
form. It doesn’t have to use regex at all, I was just speaking out of a
lack of experience.

So how would I go about the solution you suggested? Or any other
solutions for the “is_html” function?

Typo extends the String class to add the following method:

class String
  # Strips any html markup from a string
  TYPO_TAG_KEY = TYPO_ATTRIBUTE_KEY = /[\w:_-]+/
  TYPO_ATTRIBUTE_VALUE = /(?:[A-Za-z0-9]+|(?:'[^']*?'|"[^"]*?"))/
  TYPO_ATTRIBUTE = /(?:#{TYPO_ATTRIBUTE_KEY}(?:\s*=\s*#{TYPO_ATTRIBUTE_VALUE})?)/
  TYPO_ATTRIBUTES = /(?:#{TYPO_ATTRIBUTE}(?:\s+#{TYPO_ATTRIBUTE})*)/
  TAG = %r{<[!/?\[]?(?:#{TYPO_TAG_KEY}|--)(?:\s+#{TYPO_ATTRIBUTES})?\s*(?:[!/?\]]+|--)?>}

  def strip_html
    # remove every tag, then collapse runs of whitespace
    self.gsub(TAG, '').gsub(/\s+/, ' ').strip
  end
end

I haven’t run into any edge cases of it failing yet, but I am sure if
anyone finds one, a bug report would be welcome :)

Thanks,
-Steve
http://www.stevelongdo.com

On Aug 20, 2006, at 7:44 PM, ry an wrote:

So how would I go about the solution you suggested? Or any other
solutions for the “is_html” function?

I’ve not invested the libraries in Ruby enough to make a good suggestion
as to which library to use and how to use it. Sorry.


– Tom M.

On Sun Aug 20, 2006 at 07:21:14PM -0700, Tom M. wrote:

Regexes to parse HTML or XML are notoriously difficult.

I dunno about that. There are a million HTML parsers out there, and some
are even written in Java. At its core, HTML has a very small number of
reserved characters, and there’s little syntactic magic at the delimiter
level. That’s probably one reason it’s so successful compared to the .doc
format, or CSV…

Re: the thread topic, I’ve always used this (Perl) to strip HTML:

$string =~ s/<(?:[^>'"]|(['"]).*?\1)*>//gs;
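
For readers who want the same idea in Ruby, here is a minimal sketch of how that substitution might translate. The constant and method names are hypothetical, not from any library. It is a heuristic: it will also remove text such as "< 2 >" that merely looks like a tag.

```ruby
# Ruby rendering of the Perl substitution above (names are illustrative).
# /m lets "." match newlines (Perl's /s); gsub replaces every match (Perl's /g).
# The quoted-string branch lets a ">" appear inside an attribute value.
QUOTE_AWARE_TAG = /<(?:[^>'"]|(['"]).*?\1)*>/m

# Remove anything that looks like a tag.
def strip_tags_quote_aware(text)
  text.gsub(QUOTE_AWARE_TAG, '')
end

# Detect, rather than remove, anything that looks like a tag.
def tag_like?(text)
  !!(text =~ QUOTE_AWARE_TAG)
end

strip_tags_quote_aware('<a title="x > y">link</a>')  # => "link"
tag_like?('no markup here')                          # => false
```

The second method reuses the pattern for detection only, which is closer to what the original question asked for.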

ry an wrote:

Tom M. wrote:

Regexes to parse HTML or XML are notoriously difficult.

Far better to use an HTML or XML parser to handle this task.

Oh ok, thanks for the tip. Basically, I just wanted the pseudocode
above with a function that would detect HTML tags in any way, shape or
form. It doesn’t have to use regex at all, I was just speaking out of a
lack of experience.

So how would I go about the solution you suggested? Or any other
solutions for the “is_html” function?

Having written several regexes recently to manipulate some HTML, I can
attest to the “notoriously difficult” comment.

If you’re looking for a parser, Rubyful Soup is pretty good, but tends
to be quite slow for larger pages.

Can you just do something like

def is_html(input_string)
  input_string =~ /<html/i
end

That’s assuming that you’re looking at a full HTML page. If you want
to just detect a fragment, you’d have to check for the presence of any
HTML tag.

Wes
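
For the fragment case, a rough sketch of "check for the presence of any HTML tag" could look like the following. The pattern and the helper name are assumptions made for illustration. It only looks for tag-shaped text, so input like "a<b and c>d" would be flagged as well.

```ruby
# Hypothetical helper; the pattern is a heuristic, not a parser.
# First branch:  opening or closing tags such as <b>, </p>, <a href="x">
# Second branch: comments and declarations that start with "<!"
TAG_LIKE = /<\/?[a-z][^<>]*>|<![^<>]*>/i

def contains_html_tag?(text)
  !!(text =~ TAG_LIKE)
end

contains_html_tag?('1 < 2')          # => false
contains_html_tag?('<b>bold</b>')    # => true
contains_html_tag?('<!-- note -->')  # => true
```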

On Aug 20, 2006, at 8:10 PM, Tom M. wrote:

I’ve not invested the libraries in Ruby enough to make a good suggestion
as to which library to use and how to use it. Sorry.

Nor have I investigated the libraries… :)


– Tom M.

Hammed M. wrote:

If you drop the requirement to KNOW if someone entered HTML in the comment,
you could simply strip the html tags using TextHelper::strip_tags.

Hammed

On 8/21/06, ry an [email protected] wrote:

I want to deny any comments that have HTML tags in them, and I know this
can be done using Regex, but I’m not very familiar with how to do this.

Why don’t you just escape the comment before displaying it?

Justin
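
As a small illustration of the escaping approach, the standard library's ERB::Util.html_escape can be used outside of a view. The wrapper name below is hypothetical. Inside a Rails template, the h helper does the same job.

```ruby
require 'erb'

# Hypothetical wrapper: escape markup so it is displayed as text
# instead of being interpreted by the browser.
def escape_comment(text)
  ERB::Util.html_escape(text)
end

escape_comment('<b>hi</b> & bye')  # => "&lt;b&gt;hi&lt;/b&gt; &amp; bye"
```

With this approach the comment is stored as typed and only escaped on output, so nothing has to be rejected.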

Here’s a possible solution for you using the Tokenizer. Only tested
briefly, so look for bugs.

Let’s see how the Tokenizer works first using strip_tags as an example,
and we’ll make our own is_html function with that knowledge!

def strip_tags(html)
  if html.index("<")
    text = ""
    tokenizer = HTML::Tokenizer.new(html)

    while token = tokenizer.next
      node = HTML::Node.parse(nil, 0, 0, token, false)
      # result is only the content of any Text nodes
      text << node.to_s if node.class == HTML::Text
    end
    # strip any comments, and if they have a newline at the end (ie. line with
    # only a comment) strip that too
    text.gsub(/<!--(.*?)-->[\n]?/m, "")
  else
    html # already plain text
  end
end

def is_html(text)
  if text.index('<') # might be html
    tokenizer = HTML::Tokenizer.new(text)

    while token = tokenizer.next
      node = HTML::Node.parse(nil, 0, 0, token, false)
      # if any nodes are not text, then it must be HTML
      return true if node.class != HTML::Text
    end
  end

  false
end

Results:

is_html('should not be')
=> false

is_html('<')
=> false

is_html('<>')
=> false

is_html('')
=> true

is_html('')
=> true

is_html('Does this work like it should?')
=> true