Regular expression


#1

Hi,
I know this is a simple problem, but I’m new to Ruby and have not
learnt much about regular expressions in Ruby yet.

Can anybody tell me how to extract all the contents enclosed between
the ‘<html>’ and ‘</html>’ tags, and also the text between the ‘<a>’
and ‘</a>’ tags, using a regular expression? I know it can be done
with the ‘scan’ method, but I don’t know what the matching pattern
should be. Can anybody please help me?

Regards
Arun


#2

Ryan D. wrote:

On Mar 22, 2009, at 23:49 , Arun K. wrote:

can be extracted using the ‘scan’ method but I dont know what should
be
the matching patterns or expressions. Can anybody pls help me

regexps are about the worst thing to use in this case. Look at this
instead:

http://mechanize.rubyforge.org/files/GUIDE_txt.html

I know that using mechanize or hpricot is a far better option in this
case. But I’m just asking out of curiosity, to learn about regexps.

Regards
ArunKumar


#3

On Mar 22, 2009, at 23:49 , Arun K. wrote:

can be extracted using the ‘scan’ method but I dont know what should
be
the matching patterns or expressions. Can anybody pls help me

regexps are about the worst thing to use in this case. Look at this
instead:

http://mechanize.rubyforge.org/files/GUIDE_txt.html


#4

Arun K. wrote:

Hi,
I know that what i’m going to ask is for the solution for a simple
problem. But as I’m new to Ruby I have not learnt a lot about regular
expressions in Ruby.

Can anybody tell me how to extract all the contents which are included
inside the ‘<html>’ and ‘</html>’ tag and also to extract the text given
in between the ‘<a>’ and ‘</a>’ tag using regular expression. I know it
can be extracted using the ‘scan’ method but I dont know what should be
the matching patterns or expressions. Can anybody pls help me

Regards
Arun

s = "<p>hello world</p>"  # sample markup assumed; the original tags did not survive
new_s = s.gsub(/<.*?>/, "")
puts new_s

--output:--
hello world
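
The `?` after `.*` in that pattern matters. Here is a small sketch of the difference, using a made-up input string (not from the thread): the lazy `.*?` stops at the first `>`, so each tag is removed on its own, while a greedy `.*` runs from the first `<` to the last `>` and takes the text with it.

```ruby
# Hypothetical input, chosen only to show the difference.
s = "<b>hello</b> <i>world</i>"

# Lazy quantifier: each match ends at the nearest ">", so only tags go.
lazy_result = s.gsub(/<.*?>/, "")

# Greedy quantifier: one match spans from the first "<" to the last ">".
greedy_result = s.gsub(/<.*>/, "")

p lazy_result
p greedy_result
```

With this input the lazy version leaves the two words and the greedy version leaves an empty string.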

html = DATA.read()
regex = Regexp.new("<html>(.*)</html>", Regexp::MULTILINE)
puts html[regex, 1]

__END__
<html>
html page
hello
world
goodbye
</html>

--output:--

html page
hello
world
goodbye

In the expression:

html[regex, 1]

The 1 says to return the first parenthesized group in the regex.
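
A minimal sketch of that indexing form, assuming a made-up sample string: `String#[]` with a regex returns the whole match, with a regex plus an integer it returns what that capture group matched, and it returns nil when the regex does not match at all.

```ruby
# Hypothetical sample input, not taken from the original post.
html = "<html>\nhello\nworld\n</html>"

regex = Regexp.new("<html>(.*)</html>", Regexp::MULTILINE)

whole_match = html[regex]             # no index: the entire matched text
group_one   = html[regex, 1]          # 1: only what the first (...) matched
no_match    = "plain text"[regex, 1]  # nil, because the regex does not match

# The same capture through MatchData, for comparison.
via_match = html.match(regex)[1]

p whole_match
p group_one
p no_match
```

Because the result can be nil, it is worth checking before calling string methods on it.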


#5

7stud – wrote:

regex = Regexp.new("<html>(.*)</html>", Regexp::MULTILINE)

…oh, yeah. Normally, a . matches any character except a newline. The
regex .* matches any character 0 or more times, but to get it to match
newlines as well, you have to specify Regexp::MULTILINE.
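
A quick way to see this, sketched with a made-up two-line string: the same pattern fails without the flag and succeeds with it. The `/m` suffix on a regex literal sets the same option as Regexp::MULTILINE.

```ruby
# Hypothetical two-line input.
text = "first\nsecond"

# Without /m, "." cannot cross the newline, so there is no match.
without_m = text[/first(.*)second/, 1]

# With /m, "." also matches "\n", so the group captures the newline.
with_m = text[/first(.*)second/m, 1]

# Regexp.new with the constant behaves like the /m literal.
with_constant = text[Regexp.new("first(.*)second", Regexp::MULTILINE), 1]

p without_m
p with_m
p with_constant
```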


#6

On Mon, Mar 23, 2009 at 9:49 AM, Arun K. wrote:

Can anybody tell me how to extract all the contents which are included
inside the ‘<html>’ and ‘</html>’ tag and also to extract the text given
in between the ‘<a>’ and ‘</a>’ tag using regular expression. I know it
can be extracted using the ‘scan’ method but I dont know what should be
the matching patterns or expressions. Can anybody pls help me

Let’s assume we have the following content:

<html>Want a Ruby regular expression editor? Check out <a href="http://www.rubular.com/">Rubular</a>.</html>

Here are two quick and dirty regexps:

/<html>(.*)<\/html>/m
This regexp will capture anything between an opening html tag and a
closing one. The /m option specifies “Multiline Mode”: “.” will match
any character, including a newline.
For our content, it will capture:

Want a Ruby regular expression editor? Check out <a href="http://www.rubular.com/">Rubular</a>.

/<a.*>(.*)<\/a>/
This regexp will capture the text between an opening anchor element and
a closing one. The first “.*” is there to deal with href and any other
attribute. You might wanna throw the /m option in there too.
For our content, it will capture:
Rubular
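
Since the original question mentioned ‘scan’, here is a rough sketch of how an anchor pattern behaves with it. The page below is made up for illustration, and the tightened pattern is just one possible variant, not something from the thread. With more than one link, greedy “.*” runs past the first closing tag, so a negated class and a lazy quantifier are used to keep each match inside a single element.

```ruby
# Hypothetical page with two links; not taken from the thread.
page = <<~HTML
  <html>
  <a href="http://www.rubular.com/">Rubular</a>
  <a href="http://www.ruby-lang.org/">Ruby</a>
  </html>
HTML

# Greedy: with /m the first ".*" runs as far as it can, so the whole span
# from the first "<a" to the last "</a>" is a single match.
greedy = page.scan(/<a.*>(.*)<\/a>/m)

# Tighter: "[^>]*" stays inside the opening tag, ".*?" stops at the first </a>.
lazy = page.scan(/<a[^>]*>(.*?)<\/a>/m)

p greedy
p lazy.flatten
```

scan returns one array per match when the pattern has groups, hence the flatten. Note that “<a[^>]*>” would also match other tags starting with “a”, such as “<abbr>”, which is one more reason a real parser is the safer choice.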

On Mon, Mar 23, 2009 at 11:18 AM, Arun K. wrote:

I know that using mechanize or hpricot is a far better option in this
case. But I’m just asking out of curiosity, to learn about regexps.

Dare I say, a man should use regexps if only to satisfy his curiosity.
;)

Regards,
Yaser


#7

7stud – wrote:

In the expression:

html[regex, 1]

The 1 says to return the first parenthesized group in the regex.

To be a little clearer, the 1 says to return whatever matched the first
parenthesized group in the regex.


#8

Check out the site http://www.rubular.com/
It is very helpful for solving regex problems.

ciao,
Arjun
http://arjunghosh.wordpress.com
twitter.com/arjunghosh

On Mon, Mar 23, 2009 at 12:19 PM, Arun K.