Strip tags?

maxbohr · July 23, 2006, 8:45am

Is there an easy way to strip html tags from strings?
Thanks

maxbohr · July 23, 2006, 9:34am

On 7/23/06, Max B. wrote:

Is there an easy way to strip html tags from strings?
Thanks

How about using Lynx?
http://lynx.isc.org/lynx2.8.5/index.html

hth,
-Harold

maxbohr · July 23, 2006, 9:47am

Max B. wrote:

Is there an easy way to strip html tags from strings?

A regex isn’t always the best way to deal with markup
languages, but for an easy way it’s good enough.

$ irb
irb(main):001:0> a = 'This is <strong>strong</strong> stuff!<img src="foo.png" alt="Some foo">'
=> "This is <strong>strong</strong> stuff!<img src=\"foo.png\" alt=\"Some foo\">"
irb(main):002:0> a.gsub(/<.*?>/, '')
=> "This is strong stuff!"

maxbohr · July 23, 2006, 9:53am

On 7/23/06, Stefan S. wrote:

Max B. wrote:

Is there an easy way to strip html tags from strings?

A regex isn’t always the best way to deal with markup
languages, but for an easy way it’s good enough.

The problem is, it’s not always the correct way.

But having a > in an attribute is rare; if once-in-a-blue-moon errors are OK, regexes are a nice, easy solution.
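
To make the caveat concrete, here is a small sketch of where the non-greedy pattern from earlier in the thread goes wrong. The sample strings and variable names are made up for illustration; they are not from anyone's post.

```ruby
# Sample inputs are hypothetical, chosen to show two failure modes.
naive_tag = /<.*?>/

# 1. A literal > inside an attribute value ends the match too early,
#    so part of the tag leaks into the output.
with_attr = '<a title="a > b" href="#">link</a>'
stripped_attr = with_attr.gsub(naive_tag, '')
puts stripped_attr.inspect

# 2. Without the /m flag, . does not match a newline, so a tag that
#    spans two lines is left untouched.
multiline = "<img src=\"foo.png\"\nalt=\"Some foo\">caption"
stripped_plain = multiline.gsub(naive_tag, '')
stripped_m     = multiline.gsub(/<.*?>/m, '')
puts stripped_plain.inspect
puts stripped_m.inspect
```

Adding /m fixes the second case but not the first, which is why the thread moves on to patterns that understand quoted attribute values.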

;D

maxbohr · July 23, 2006, 12:04pm

On 7/23/06, Andreas S. wrote:

Daniel B. wrote:

On 7/23/06, Stefan S. wrote:

A regex isn’t always the best way to deal with markup
languages, but for an easy way it’s good enough.

the problem is, it’s not always the correct way.

This is not correct HTML; < and > have to be encoded as entities.

That is true… if the original poster has the luxury of only dealing with correct HTML, he’s a lucky fellow, and can kludge up some regexen that will do the job. Even in a well-coded site, though, it’s not unthinkable that you could forget to do some encoding and end up with angle brackets inside a textarea or something.

How useful a regex approach is depends on the data. I have used a bit of regex-type HTML parsing before and it worked fine for the data that I was parsing. Horses for courses.
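
As a rough illustration of the unencoded-bracket case, here is what the simple pattern does to plain text that merely contains comparison operators. The input string is a made-up example of form data, not something from the thread.

```ruby
# Hypothetical form input containing unencoded < and > that are not tags.
comment = 'Check that 1 < 2 and 3 > 2 before saving.'

# The pattern treats everything from the first < to the next > as a tag.
cleaned = comment.gsub(/<.*?>/, '')
puts cleaned
```

The text between the two brackets is silently dropped, which is the kind of data-dependent surprise described above.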

;D

maxbohr · July 23, 2006, 6:17pm

Thanks for the quick replies.
I should have been more explicit in my question: I want to strip HTML tags in order to sanitize form input. I’m a bit of a Ruby noob and I was hoping to find a function similar to PHP’s strip_tags, one that would remove both HTML and Ruby code.
Best

maxbohr · July 23, 2006, 6:27pm

“Andreas S.” writes:

the problem is, it’s not always the correct way.

This is not correct HTML; < and > have to be encoded as entities.

It’s valid XHTML:

$ echo ‘’ | xmllint -

<?xml version="1.0"?>

However, ‘<’ needs to be escaped:

$ echo ‘’ | xmllint -
-:1: parser error : Unescaped ‘<’ not allowed in attributes values

maxbohr · July 23, 2006, 9:32pm

Thanks for the help everybody.
Best,
Max

maxbohr · July 23, 2006, 8:35pm

On Jul 23, 2006, at 12:17 PM, Max B. wrote:

Thanks for the quick replies.
I should have been more explicit in my question: I want to strip HTML tags in order to sanitize form input. I’m a bit of a Ruby noob and I was hoping to find a function similar to PHP’s strip_tags, one that would remove both HTML and Ruby code.
Best

For sanitizing input, just escaping might be a better idea because it has less chance of being destructive. If you’re on Rails, there’s an h() function for this. If you’re doing something else, maybe check out how Rails does it and replicate it. There might be something easy that someone on this list knows that I don’t.

If you really want to strip them, I’d bet the regexp solution is no less effective than PHP’s strip_tags.
-Mat
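
Outside Rails, one way to try the escaping approach is CGI.escapeHTML from Ruby's standard library. This is only a sketch with a made-up input string; it is a stand-in for the idea, not a claim about how the Rails helper is implemented.

```ruby
require 'cgi'

# Hypothetical form input; escaping keeps every character but makes the
# markup inert when it is later written into a page.
input = '<script>alert("hi")</script> 1 < 2'

escaped = CGI.escapeHTML(input)
puts escaped

# Unescaping gives the original text back, so nothing is destroyed.
restored = CGI.unescapeHTML(escaped)
puts restored == input
```

Because the original text can be recovered, a wrong guess about what counts as a tag costs nothing, unlike stripping.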

maxbohr · July 23, 2006, 10:28am

Daniel B. wrote:

On 7/23/06, Stefan S. wrote:

Max B. wrote:

Is there an easy way to strip html tags from strings?

A regex isn’t always the best way to deal with markup
languages, but for an easy way it’s good enough.

the problem is, it’s not always the correct way.

This is not correct HTML; < and > have to be encoded as entities.

maxbohr · July 25, 2006, 12:49pm

“William J.” writes:

          # Accept any escaped character.
          \\.
          |
          [^"\\] +
      ) *
    "
  ) *
>

}xm

print DATA.read.gsub( re, '' )

maxbohr · July 25, 2006, 8:31pm

Christian N. wrote:

    "
print DATA.read.gsub( re, '' )

re = %r{
  <
  (?:
      [^>"'] +
    |
      "
      (?: \\. | [^\\"] + ) *
      "
    |
      '
      (?: \\. | [^\\'] + ) *
      '
  ) *
  >
}xm

print DATA.read.gsub( re, '' )

__END__
Some<><"">
text
to <?xml version="1.0"?>
save
for
later
<bar quux="\"foo>" />reading.
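
For anyone who wants something closer to a strip_tags call, the quote-aware pattern can be wrapped in a small method. This is a sketch only: the method name strip_tags and the sample strings are my own, and the pattern still treats any < ... > run as a tag, so it is not a full HTML parser.

```ruby
# Hypothetical helper name; the pattern is the quote-aware one from this post.
def strip_tags(html)
  tag = %r{
    <
    (?:
        [^>"'] +                      # anything but > and the two quote marks
      | " (?: \\. | [^\\"] + ) * "    # double-quoted run, backslash escapes allowed
      | ' (?: \\. | [^\\'] + ) * '    # single-quoted run, backslash escapes allowed
    ) *
    >
  }xm
  html.gsub(tag, '')
end

# A > inside either kind of quoted attribute no longer ends the tag early.
sample = %q{Some<b title="a > b">bold</b> and <i class='x>y'>italic</i> text}
plain = strip_tags(sample)
puts plain

# A backslash-escaped quote inside an attribute value is skipped over.
escaped_quote = strip_tags(%q{<bar quux="\"foo>" />reading.})
puts escaped_quote
```

Note that HTML itself does not use backslash escapes inside attribute values; allowing them is a choice carried over from the pattern above.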

maxbohr · July 24, 2006, 6:20am

Christian N. wrote:

<?xml version="1.0"?>
However, ‘<’ needs to be escaped:

$ echo ‘’ | xmllint -
-:1: parser error : Unescaped ‘<’ not allowed in attributes values

–
Christian N. http://chneukirchen.org

re = %r{
  <
  (?:
      # Any characters but > or " .
      [^>"] +
    |
      # Characters within quotes.
      # Allow escaped quotes.
      "
      (?:
          # Accept any escaped character.
          \\.
        |
          [^"\\] +
      ) *
      "
  ) *
  >
}xm

print DATA.read.gsub( re, '' )

__END__
Some<><"">
<bar quux="\"foo>bar" /> text
to <?xml version="1.0"?>
save
for
<bar quux="\"foo>" />reading.