Using gsub to remove embedded newlines in HTML file

weyus · August 3, 2006, 12:58am

I have an HTML file that is in a string.

I want to use gsub! to remove any embedded newlines and
whitespace between two known delimiters.

Given a string that includes this kind of string:

~^LNK:http://slashdot.org/login.pl?op=newuserform~
Create a new account
^~

I want to replace the above with:

~^LNK:http://slashdot.org/login.pl?op=newuserform~Create a new account^~

(stripping out the newlines and whitespace)

Having trouble writing the regex for this.

I think I want something like:

/~\^LNK:.*?([\s\r\n])+.*?\^~/

that I could use in:

str.gsub!(/~\^LNK:.*?([\s\r\n])+.*?\^~/, '')

to replace all of the whitespace and any newline characters with
empty strings.

But I don’t think this will work because I really need to loop within
each substring of my large HTML string. The thing about gsub is that it
will substitute the entire matched string.
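To make the concern concrete, here is a minimal sketch of what happens when gsub is given a plain replacement string. The SAMPLE constant and the delete_links helper are made up for illustration, and the pattern is a simplified stand-in for the one above, not the exact regex from the post.

```ruby
# Sample input modeled on the example above (hypothetical surrounding text).
SAMPLE = "before ~^LNK:http://slashdot.org/login.pl?op=newuserform~\nCreate a new account\n^~ after"

# Hypothetical helper: with a string replacement, gsub swaps out the
# whole matched span, not just the whitespace inside it.
def delete_links(html)
  # /m lets "." match newlines in Ruby; \^ matches a literal caret.
  html.gsub(/~\^LNK:.*?\^~/m, '')
end

puts delete_links(SAMPLE).inspect
# => "before  after"
```

The whole link is gone, which is why a string replacement alone cannot clean up only part of a match.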

Do I need to scan out the ~\^LNK:.*?\^~ matches, operate on those and then put them
back into the larger string?

I’m not sure I’m asking this very well, so I apologize if that’s the
case.

Thanks,
Wes

weyus · August 3, 2006, 1:04am

Something like:

@html.scan(/~\^LNK:.*?\^~/mi).each do |link_line|
  new_link_line = link_line.gsub(/[\s\r\n]/, '')
  @html.gsub!(/#{link_line}/mi, new_link_line)
end

weyus · August 3, 2006, 1:39am

Wes G. wrote:

Something like:

@html.scan(/~\^LNK:.*?\^~/mi).each do |link_line|
  new_link_line = link_line.gsub(/[\s\r\n]/, '')
  @html.gsub!(/#{link_line}/mi, new_link_line)
end

This seems to work well:

@html.scan(/~\^LNK:.*?\^~/mi).each do |link_line|
  new_link_line = link_line.gsub(/[\t\r\n]/, '')
  @html.gsub!(/#{Regexp.escape(link_line)}/mi, new_link_line) if link_line != new_link_line
end

I wonder if I could have done this with one @html.gsub!() call, but
this is much more understandable to me anyway so I’ll stick with this.
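For anyone wondering why the Regexp.escape call matters, here is a rough, self-contained sketch. The strip_link_newlines helper and the raw sample are assumptions for illustration, and the helper works on a copy instead of mutating @html in place.

```ruby
# Hypothetical helper wrapping the scan-and-replace loop.
def strip_link_newlines(html)
  result = html.dup
  html.scan(/~\^LNK:.*?\^~/mi).each do |link_line|
    new_link_line = link_line.gsub(/[\t\r\n]/, '')
    next if link_line == new_link_line
    # Regexp.escape turns metacharacters such as "?", "." and "^"
    # into literals, so the matched text can be found again verbatim.
    result.gsub!(/#{Regexp.escape(link_line)}/mi, new_link_line)
  end
  result
end

raw = "~^LNK:http://slashdot.org/login.pl?op=newuserform~\nCreate a new account\n^~"
puts strip_link_newlines(raw)

# Without escaping, "^" is treated as a line anchor and "?" as a
# quantifier, so the interpolated pattern never matches the text.
puts raw.gsub(/#{raw}/mi, 'X') == raw
```

That would explain why the earlier version without Regexp.escape did nothing for links whose URLs contain characters like "?".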

Thanks,
Wes

weyus · August 3, 2006, 2:23pm

Wes G. wrote:

This seems to work well:

@html.scan(/~\^LNK:.*?\^~/mi).each do |link_line|
  new_link_line = link_line.gsub(/[\t\r\n]/, '')
  @html.gsub!(/#{Regexp.escape(link_line)}/mi, new_link_line) if link_line != new_link_line
end

You can use a block with gsub:
@html.gsub!(/~\^LNK:.*?\^~/mi) { |s| s.gsub(/\s/, '') }

or something like that.
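Here is a small runnable sketch of the block form, under a few assumptions: the helper names and sample string are made up, and the first helper uses the [\t\r\n] class from the earlier loop so that the spaces inside the link text survive.

```ruby
# Hypothetical helper: the block receives each matched ~^LNK:...^~ span
# and its return value becomes the replacement for that span.
def collapse_links(html)
  html.gsub(/~\^LNK:.*?\^~/m) { |link| link.gsub(/[\t\r\n]/, '') }
end

# Same idea with \s, which also removes ordinary spaces.
def collapse_links_all_whitespace(html)
  html.gsub(/~\^LNK:.*?\^~/m) { |link| link.gsub(/\s/, '') }
end

raw = "<p>\n~^LNK:http://slashdot.org/login.pl?op=newuserform~\nCreate a new account\n^~\n</p>"
puts collapse_links(raw).inspect
# => "<p>\n~^LNK:http://slashdot.org/login.pl?op=newuserform~Create a new account^~\n</p>"
puts collapse_links_all_whitespace(raw).inspect
# => "<p>\n~^LNK:http://slashdot.org/login.pl?op=newuserform~Createanewaccount^~\n</p>"
```

Text outside the delimiters is untouched in both cases. Which character class to use depends on whether the words in the link text need to stay separated.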

Good luck.

weyus · August 3, 2006, 10:49pm

Thanks. That is the Ruby way to do it, and that’s what I wanted to
know :).

I’ve used blocks with gsub but I keep forgetting that I can put anything
in there - so far I’ve only used backrefs to pull out pieces of the
matching regex to rearrange things.
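The two ideas combine, too. As a rough sketch, the block can use backrefs to rebuild the link and trim only the text part. The tidy_links helper is hypothetical, and it assumes the URL itself never contains a "~".

```ruby
# Hypothetical helper: capture the URL and the link text separately,
# then rebuild the link with surrounding whitespace trimmed from the text.
# Assumption: the URL contains no "~", so the first "~" ends it.
def tidy_links(html)
  html.gsub(/~\^LNK:(.*?)~(.*?)\^~/m) do
    url, text = $1, $2
    "~^LNK:#{url}~#{text.strip}^~"
  end
end

raw = "~^LNK:http://slashdot.org/login.pl?op=newuserform~\n  Create a new account\n^~"
puts tidy_links(raw)
# => ~^LNK:http://slashdot.org/login.pl?op=newuserform~Create a new account^~
```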

Wes