Forum: Ruby Remove HTML from String?

jotto (Guest)
on 2006-01-09 09:40
(Received via mailing list)
I can't find a method to remove HTML from a string in the core API. PHP
has something called strip_tags. Does Ruby have anything like this?
http://us3.php.net/manual/en/function.strip-tags.php
Horacio Sanson (Guest)
on 2006-01-09 13:26
(Received via mailing list)
A regular expression can strip the HTML tags from any string...

I use this

# Get the html data in a string by any method
html_string = get_html_method

# strip all the html tags from the html data
html_string.gsub!(/<[^>]*>|\n|\t/, " ")


This may not be the best way (robust or fast), but it is enough for my
needs.

Horacio

On Monday 09 January 2006 17:38, jotto wrote:
Austin Ziegler (Guest)
on 2006-01-09 13:59
(Received via mailing list)
On 09/01/06, jotto <jonathan.otto@gmail.com> wrote:
> I can't find a method to remove HTML from a string in the core API. PHP
> has something called strip_tags. Does Ruby have anything like this?
> http://us3.php.net/manual/en/function.strip-tags.php

Not built in. It's not really appropriate for the core language.
That's one of the things that makes PHP easy to use for people who are
trying to do simple things, but makes it hard when you get into
engineering and maintaining real programs. As was suggested by the
other respondent, it's relatively easy to remove:

  a.gsub(%r{</?[^>]+?>}, '')

-austin
Gavin Kistner (Guest)
on 2006-01-10 00:29
(Received via mailing list)
On Jan 9, 2006, at 5:57 AM, Austin Ziegler wrote:
> other respondent, it's relatively easy to remove:
>
>   a.gsub(%r{</?[^>]+?>}, '')

...just pray that the HTML you are modifying is valid, and not some
garbage file that web browsers happen to treat as intended. For
example, watch the above regexp go to town on some invalid HTML:


class String
	def strip_tags
		self.gsub( %r{</?[^>]+?>}, '' )
	end
end

source = <<ENDHTML
<html><body>
<p>I'm pretending to know how to code. I <3 HTML, it's teh best!!!!</p>
<script>
for ( i=0; i<10; i++ ){ document.write(i+'<br>') }
</script>
BLASTOFFS!!!!
</body>
ENDHTML

puts source.strip_tags
#=> I'm pretending to know how to code. I
#=>
#=> for ( i=0; i') }
#=>
#=> BLASTOFFS!!!!
Eric Schwartz (Guest)
on 2006-01-10 01:05
(Received via mailing list)
Gavin Kistner <gavin@refinery.com> writes:
> > engineering and maintaining real programs. As was suggested by the
> > other respondent, it's relatively easy to remove:
> >
> >   a.gsub(%r{</?[^>]+?>}, '')
>
> ..just pray that the HTML you are modifying is valid, and not some
> garbage file that web browsers happen to treat as intended.

More like, "Just pray the HTML you are modifying doesn't happen to be
completely valid, but not formed in exactly the way you are
expecting."  For instance, the following HTML snippet is completely
valid, but screws up the regex:

<p>a <img src="greaterthan.gif" alt=">" /> b</p>

irb(main):010:0> a='<p>a <img src="greaterthan.gif" alt=">" /> b</p>'
=> "<p>a <img src=\"greaterthan.gif\" alt=\">\" /> b</p>"
irb(main):011:0> a.gsub(%r{</?[^>]+?>}, '')
=> "a \" /> b"

Finding other such examples is an exercise for the reader.  This sort
of thing is why, as a rule, I avoid parsing HTML with regexes.

-=Eric
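For illustration, a pattern that skips over quoted attribute values does cope with the snippet above, at the cost of being harder to read. This is only a sketch and was not posted in the thread: the constant name is made up, and the pattern still breaks on script bodies, comments, and stray < characters like the ones in the earlier example.

```ruby
# A tag is "<", then any mix of quoted strings and characters that are
# neither quotes nor ">", then the closing ">".
QUOTE_AWARE_TAG = /<(?:"[^"]*"|'[^']*'|[^'">])*>/

a = '<p>a <img src="greaterthan.gif" alt=">" /> b</p>'
stripped = a.gsub(QUOTE_AWARE_TAG, '')
p stripped
```

The ">" inside the alt attribute is consumed as part of the quoted string, so the whole img tag is removed in one match.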
J. Ryan Sobol (Guest)
on 2006-01-10 02:44
(Received via mailing list)
If you're concerned about preventing browsers from rendering the HTML in
your string, replacing < and > with &lt; and &gt; is more
effective than trying to remove the tags.

~ ryan ~
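A minimal sketch of the escaping approach, assuming the standard library's CGI.escapeHTML is acceptable for the job. The input string and variable names are made up for illustration.

```ruby
require 'cgi'

# Hypothetical user-supplied input.
comment = %q{I <3 HTML <script>alert("hi")</script>}

# CGI.escapeHTML replaces &, <, > and quote characters with entities,
# so a browser displays the markup as text instead of interpreting it.
escaped = CGI.escapeHTML(comment)
puts escaped
```

Nothing is thrown away here, so CGI.unescapeHTML can recover the original string later if needed.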
Austin Ziegler (Guest)
on 2006-01-10 16:19
(Received via mailing list)
On 09/01/06, Eric Schwartz <emschwar@mail.ericschwartz.us> wrote:
> More like, "Just pray the HTML you are modifying doesn't happen to be
> completely valid, but not formed in exactly the way you are
> expecting."  For instance, the following HTML snippet is completely
> valid, but screws up the regex:
>
> <p>a <img src="greaterthan.gif" alt=">" /> b</p>

Actually, that is *not* completely valid, at least not valid XHTML
(which is what I use these days). You have to do that as:

  <p>a <img src="greaterthan.gif" alt="&gt;" /> b</p>

But my regexp wasn't intended to be complete; there are full libraries
out there for that.

-austin
Eric Schwartz (Guest)
on 2006-01-11 00:35
(Received via mailing list)
Austin Ziegler <halostatue@gmail.com> writes:
> On 09/01/06, Eric Schwartz <emschwar@mail.ericschwartz.us> wrote:
> > More like, "Just pray the HTML you are modifying doesn't happen to be
> > completely valid, but not formed in exactly the way you are
> > expecting."  For instance, the following HTML snippet is completely
> > valid, but screws up the regex:
> >
> > <p>a <img src="greaterthan.gif" alt=">" /> b</p>
>
> Actually, that is *not* completely valid, at least not valid XHTML
> (which is what I use these days).

When wrapped with the appropriate tags, it validated as HTML 4.01, which
is what I recommend most people generate these days (because of some,
but not all, of the reasons elucidated at
http://codinginparadise.org/weblog/2005/08/xhtml-c...).
So yes, it is valid HTML, which is all I claimed it to be.

I specifically didn't mention XHTML, since the bits of the thread I
saw referenced HTML, and the two are different enough that I figured XHTML
would have been mentioned if that's what was wanted.  Of course with
XHTML, you have CDATA sections, which can contain all sorts of
nastiness that can trip you up just as badly.

> You have to do that as:
>   <p>a <img src="greaterthan.gif" alt="&gt;" /> b</p>
>
> But my regexp wasn't intended to be complete; there are full libraries
> out there for that.

Right; my point was that in my experience, regexes seem to work just
fine, until suddenly they don't, and then you have to spend silly
amounts of time compensating for them-- or you could just use a proper
library in the first place, and not have to worry about it.

-=Eric
ruby talk (Guest)
on 2006-01-11 05:15
(Received via mailing list)
I like this code I found. I did not write it, and I wish I could give
credit to whoever created it, but I lost the website.

 require 'cgi'

def html2text html
  text = html.
    gsub(/(&nbsp;|\n|\s)+/im, ' ').squeeze(' ').strip.
    gsub(/<([^\s]+)[^>]*(src|href)=\s*(.?)([^>\s]*)\3[^>]*>\4<\/\1>/i, '\4')

  links = []
  linkregex = /<[^>]*(src|href)=\s*(.?)([^>\s]*)\2[^>]*>\s*/i
  while linkregex.match(text)
    links << $~[3]
    text.sub!(linkregex, "[#{links.size}]")
  end

  text = CGI.unescapeHTML(
    text.
      gsub(/<(script|style)[^>]*>.*<\/\1>/im, '').
      gsub(/<!--.*-->/m, '').
      gsub(/<hr(| [^>]*)>/i, "___\n").
      gsub(/<li(| [^>]*)>/i, "\n* ").
      gsub(/<blockquote(| [^>]*)>/i, '> ').
      gsub(/<(br)(| [^>]*)>/i, "\n").
      gsub(/<(\/h[\d]+|p)(| [^>]*)>/i, "\n\n").
      gsub(/<[^>]*>/, '')
  ).lstrip.gsub(/\n[ ]+/, "\n") + "\n"

  for i in (0...links.size).to_a
    text = text + "\n  [#{i+1}] <#{CGI.unescapeHTML(links[i])}>" unless links[i].nil?
  end
  links = nil
  text
end


input = " <h1>Title</h1> This is the body. Testing <a href='http://www.google.com/'>link to Google</a>.<p /> Testing image <img src='/noimage.png'>.<br /> The End."

print html2text(input)
Valery Visnakov (balepc)
on 2009-08-04 16:34
jotto wrote:
> I can't find a method to remove HTML from a string in the core API. PHP
> has something called strip_tags. Does Ruby have anything like this?
> http://us3.php.net/manual/en/function.strip-tags.php

Here is a gem for sanitizing strings http://wonko.com/post/sanitize
Daniel P. C. (danielp_c)
on 2012-06-12 04:39
I hate regex.  I've written some Ruby functions to remove HTML tags in
blocks and not just special characters... rules for swapping HTML
code for anything else are also included.  For example, <br> will be swapped out
for \n with the existing rules.  My code is available at
https://github.com/6ftDan/regex-is-evil
Robert Klemme (robert_k78)
on 2012-06-13 11:35
(Received via mailing list)
On Tue, Jun 12, 2012 at 4:39 AM, Daniel P. C. <lists@ruby-forum.com> wrote:
> I hate regex. I've written some ruby functions to remove html tags in
> blocks and not just special characters... also rules for swapping html
> code for anything else is included. Example <br> will be swapped out
> for \n with existing rules. My code is available at
> https://github.com/6ftDan/regex-is-evil

What a mess.  This is extremely inefficient.  You create new strings
all the time.  You go over the string multiple times.  You do not pass
start and end index down to strip_seq().  There is no test which
ensures start index is lower than end index (try with string ">foo<").

I'd prefer a regexp solution anytime.  It's likely faster and easier
to read - for me at least.  Btw. /x goes a long way at making a regexp
more readable - you can even include comments.  Just a simple example:
https://gist.github.com/2923072

But the proper tool is of course an HTML parser like Nokogiri.

Kind regards

robert
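As a rough illustration of the /x suggestion (not the contents of the gist linked above), here is the tag pattern from earlier in the thread written in extended mode with comments. The constant name and sample string are made up for this sketch, and the pattern has the same limitations already discussed in the thread.

```ruby
# Extended mode (/x) ignores whitespace in the pattern and allows comments.
TAG = %r{
  <            # opening angle bracket
  /?           # optional slash for a closing tag
  [^>]+?       # tag name and attributes, up to the first ">"
  >            # closing angle bracket
}x

html = "<p>Hello, <b>world</b>!</p>"
text = html.gsub(TAG, '')
puts text
```

The behaviour is identical to the one-line version; only the layout of the pattern changes.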