Need help with a regexp

rpheath · December 8, 2006, 4:15am

I’m trying to write a regular expression to replace a <pre>…</pre>
block or a <blockquote>…</blockquote> block with a blank ('').
I can only get the <pre>…</pre> to work correctly. Here’s what I
have:

text.gsub(/^<pre>[^<]*<\/pre>$|^<blockquote><p>(.*?)<\/p><\/blockquote>$/,'')

Can someone help me figure out why the blockquote is still showing
up? Thanks in advance.

rpheath · December 8, 2006, 4:47am

Why are you doing a gsub but then anchoring the Regexp to the start &
ends? Use a normal sub or take out all the ^s and $s (except for the
character class definitions, i.e., the ones in square brackets).
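
A minimal sketch of the point about anchors, using a made-up sample string rather than the poster's actual input. In Ruby, ^ and $ match at the start and end of each line, so an anchored pattern only matches a block that fills whole lines by itself:

```ruby
# Hypothetical sample: the <pre> block sits in the middle of a line.
inline = "intro <pre>code</pre> outro"

anchored   = /^<pre>[^<]*<\/pre>$/
unanchored = /<pre>[^<]*<\/pre>/

# The anchored pattern needs <pre> at the start of a line and </pre> at
# the end of one, so nothing is removed here.
anchored_result = inline.gsub(anchored, '')

# Without the anchors, the block is removed wherever it appears.
unanchored_result = inline.gsub(unanchored, '')

puts anchored_result
puts unanchored_result
```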

Please post some sample text, not of what you would like to remove but
of what you would like to remove it from.

Dan

rpheath · December 8, 2006, 4:55am

Thanks for the reply. I’m relatively new to regular expressions, and
misinterpreted the ^s and $s. I was thinking they were for that
specific check, so it was either the first string “|” (or) the second
string.

Here’s sample text that would be passed into it.

This is the first sentence. Now I'll post a code snippet:

def strip_blocks(text)
  text.gsub([regex],'')
end

This is another sentence before the block quote.

This is a quote

This is one more sentence

What I would like to have left is this:

This is the first sentence. Now I'll post a code snippet:

This is another sentence before the block quote.

This is one more sentence

Hopefully that helps. Sorry the question is not organized and kind of
basic, but I’m new to this. Thanks again for any help.

rpheath · December 8, 2006, 6:25am

You are missing the ‘m’ flag, which allows ‘.’ to match newlines.

pre_match = /<pre>.*?<\/pre>/m

block_match = /<blockquote>.*?<\/p>.*?<\/blockquote>/m
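
To illustrate what the ‘m’ flag changes, here is a small sketch with a made-up blockquote that spans several lines (the sample string and variable names are assumptions, not taken from the thread):

```ruby
# Hypothetical sample: a blockquote whose contents span several lines.
html = "<blockquote>\n<p>This is a quote</p>\n</blockquote>"

single_line = /<blockquote>.*?<\/blockquote>/   # '.' stops at "\n"
multi_line  = /<blockquote>.*?<\/blockquote>/m  # '.' also matches "\n"

without_m = html.gsub(single_line, '')
with_m    = html.gsub(multi_line, '')

p without_m  # unchanged, because '.' cannot cross the newlines
p with_m     # the whole block is removed
```

A negated character class such as [^<] matches newlines even without the flag, which may be why the pre half of the original pattern appeared to work while the blockquote half did not.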

rpheath · December 8, 2006, 3:40pm

On Dec 7, 2006, at 10:55 PM, rpheath wrote:

Thanks for the reply. I’m relatively new to regular expressions, and
misinterpreted the ^s and $s. I was thinking they were for that
specific check, so it was either the first string “|” (or) the second
string.

http://groups.google.com/group/rubyonrails-talk/browse_frm/thread/6c75d5d4df368186/2743494eb303014c#2743494eb303014c

And might I suggest picking ONE mailing list on which to ask your
questions (Ruby is actually the better one for this question about
regular expressions), and then JUST ASK ONCE.

-Rob

Rob B. http://agileconsultingllc.com

rpheath · December 8, 2006, 6:17am

rpheath wrote:

Thanks for the reply. I’m relatively new to regular expressions, and
misinterpreted the ^s and $s. I was thinking they were for that
specific check, so it was either the first string “|” (or) the second
string.

Here’s sample text that would be passed into it.

This is the first sentence. Now I'll post a code snippet:
def strip_blocks(text)
  text.gsub([regex],'')
end
This is another sentence before the block quote.

This is a quote

This is one more sentence
----------------------
What I would like to have left is this:

This is the first sentence. Now I'll post a code snippet:

This is another sentence before the block quote.

This is one more sentence
----------------------
Hopefully that helps. Sorry the question is not organized and kind of
basic, but I’m new to this. Thanks again for any help.

Try this. It uses the “non-greedy” operator ‘?’ and multiline,
case-insensitive matching. Without the ‘non-greedy’ operator, the match
would gobble up everything from the first opening tag to the last
closing tag of the same name, including any text between separate
blocks. This is probably not what you would want.

def remove_tag_block(tag, text)
  text.gsub(/<#{tag}>.*?<\/#{tag}>/im, '')
end
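
To see the difference the non-greedy operator makes, here is a short sketch on a made-up string with two separate pre blocks (the sample text is an assumption for illustration):

```ruby
# Hypothetical sample with two separate <pre> blocks.
html = "<pre>one</pre><p>keep me</p><pre>two</pre>"

# Greedy: '.*' runs from the first <pre> to the LAST </pre>.
greedy = html.gsub(/<pre>.*<\/pre>/im, '')

# Non-greedy: '.*?' stops at the first </pre> it can reach.
non_greedy = html.gsub(/<pre>.*?<\/pre>/im, '')

p greedy      # the paragraph between the blocks is lost
p non_greedy  # only the two blocks are removed
```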

irb(main):054:0> text
=> "<p>This is the first sentence. Now I'll post a code snippet:</p>\n\n<pre>\ndef strip_blocks(text)\n  text.gsub([regex],'')\nend\n</pre>\n\n<p>This is another sentence before the block quote.</p>\n\n<blockquote>\n<p>This is a quote</p>\n</blockquote>\n\n<p>This is one more sentence</p>"

irb(main):055:0> t=remove_tag_block("pre", text)
=> "<p>This is the first sentence. Now I'll post a code snippet:</p>\n\n\n\n<p>This is another sentence before the block quote.</p>\n\n<blockquote>\n<p>This is a quote</p>\n</blockquote>\n\n<p>This is one more sentence</p>"

irb(main):056:0> remove_tag_block("blockquote", t)
=> "<p>This is the first sentence. Now I'll post a code snippet:</p>\n\n\n\n<p>This is another sentence before the block quote.</p>\n\n\n\n<p>This is one more sentence</p>"

The problem is that this won’t work with nested tags, e.g.

stuff

irb(main):065:0>
x=“

stuff

”
=> “

stuff

”
irb(main):066:0> remove_tag_block(“table”, x)
=> “”

This is because ordinary regular expressions can’t match nested
pairs, such as “((()(())()))”. I think I read somewhere that regexps
can’t count. You have to use recursive regular expressions, which are
found in PCRE (Perl-compatible regular expressions), but AFAIK not in
the current Ruby regexp engine. Maybe Oniguruma has it - I dunno. I saw
a PCRE extension for Ruby somewhere, but I don’t know anything about it.
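
A small sketch of the nested-tag failure, using made-up markup rather than the input from the irb session above. The helper is repeated so the snippet runs by itself:

```ruby
# Helper from the post, repeated here so the snippet is self-contained.
def remove_tag_block(tag, text)
  text.gsub(/<#{tag}>.*?<\/#{tag}>/im, '')
end

# Hypothetical nested markup, not the poster's original input.
nested = "<div>outer <div>inner</div> tail</div>"

# The non-greedy match ends at the FIRST </div>, which closes the inner
# block, so the end of the outer block is left behind.
leftover = remove_tag_block("div", nested)
p leftover
```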

The Perl RE for matching nested parentheses is apparently as follows
(from
The Joy of Regular Expressions [1] — SitePoint)

\(((?>[^()]+)|(?R))*\)
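
For comparison, Ruby 1.9 and later (which use the Oniguruma/Onigmo engine) support subexpression calls written \g<name>, and these allow a similar recursive pattern. A sketch only; the group name paren and the sample string are made up for illustration:

```ruby
# Matches one balanced, possibly nested, parenthesised group.
# \g<paren> re-invokes the group named "paren", which is how the
# pattern recurses; (?>...) is an atomic group, as in the Perl version.
NESTED_PARENS = /(?<paren>\((?:(?>[^()]+)|\g<paren>)*\))/

sample   = "a ((b)(c(d))) e"
balanced = sample[NESTED_PARENS]  # String#[] returns the matched text

p balanced
```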

I believe that to do this correctly without PCRE, you have to resort to
some text parsing or use a SAX parser or similar. Maybe some Ruby guru
(i.e. not me) will be able to pull out an RE or some easy way to do
this.
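
One possible shape for the "text parsing" route is to split the text on the tag and track nesting depth by hand. This is a rough sketch, not a tested library: the method name remove_nested_tag_block is hypothetical, and it assumes bare tags without attributes, like remove_tag_block does.

```ruby
# Removes every top-level <tag>...</tag> block, together with any blocks
# of the same tag nested inside it, by tracking nesting depth.
# Assumption: tags carry no attributes (e.g. <table>, not <table id="x">).
def remove_nested_tag_block(tag, text)
  name   = Regexp.escape(tag)
  # The capture group makes split keep the tags in the token list.
  tokens = text.split(/(<\/?#{name}>)/i)
  depth  = 0
  kept   = []

  tokens.each do |token|
    if token =~ /\A<#{name}>\z/i
      depth += 1                 # entering a (possibly nested) block
    elsif token =~ /\A<\/#{name}>\z/i
      if depth > 0
        depth -= 1               # leaving a block
      else
        kept << token            # stray closing tag: leave it alone
      end
    elsif depth.zero?
      kept << token              # text outside any block is kept
    end
  end

  kept.join
end

p remove_nested_tag_block("div", "<div>outer <div>inner</div> tail</div>after")
p remove_nested_tag_block("table", "a<table>x</table>b<table>y</table>c")
```

If an opening tag is never closed, everything after it is dropped, so a real HTML parser is likely the safer choice for untrusted markup.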