Hi,
I know that what I’m going to ask has a simple solution, but as I’m new
to Ruby I have not learnt much about regular expressions in Ruby.
Can anybody tell me how to extract all the content enclosed between
the ‘<html>’ and ‘</html>’ tags, and also the text between the ‘<a>’
and ‘</a>’ tags, using regular expressions? I know it can be extracted
using the ‘scan’ method, but I don’t know what the matching patterns or
expressions should be. Can anybody please help me?
Regards
Arun
Ryan D. wrote:
On Mar 22, 2009, at 23:49 , Arun K. wrote:
can be extracted using the ‘scan’ method but I dont know what should be
the matching patterns or expressions. Can anybody pls help me
Regexps are about the worst thing to use in this case. Look at this
instead:
http://mechanize.rubyforge.org/files/GUIDE_txt.html
I know that using mechanize or hpricot is a far better option in this
case. But I’m just asking out of curiosity, to learn about regexps.
Regards
ArunKumar
Arun K. wrote:
Hi,
I know that what I’m going to ask has a simple solution, but as I’m new
to Ruby I have not learnt much about regular expressions in Ruby.
Can anybody tell me how to extract all the content enclosed between
the ‘<html>’ and ‘</html>’ tags, and also the text between the ‘<a>’
and ‘</a>’ tags, using regular expressions? I know it can be extracted
using the ‘scan’ method, but I don’t know what the matching patterns or
expressions should be. Can anybody please help me?
Regards
Arun
s = "<p>hello world</p>"
new_s = s.gsub(/<.*?>/, "")
puts new_s
--output:--
hello world
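The "?" in that pattern matters. As a rough sketch (the helper names and the sample markup are made up for illustration), here is the same substitution with a lazy and a greedy quantifier:

```ruby
# Hypothetical helpers; only String#gsub and the patterns come from the post.
def strip_tags_lazy(text)
  # <.*?> stops at the first ">" it finds, so each tag is removed on its own.
  text.gsub(/<.*?>/, "")
end

def strip_tags_greedy(text)
  # <.*> runs on to the last ">" on the line, swallowing the text in between.
  text.gsub(/<.*>/, "")
end

sample = "<b>hello</b> <i>world</i>"
puts strip_tags_lazy(sample)            # hello world
puts strip_tags_greedy(sample).inspect  # ""
```

Neither version copes with a ">" inside an attribute value, which is one reason a real parser is usually the safer choice.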
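html = DATA.read()
regex = Regexp.new("<html>(.*)</html>", Regexp::MULTILINE)
puts html[regex, 1]
__END__
<html>
<title>html page</title>
<p>hello</p>
<p>world</p>
<p>goodbye</p>
</html>
--output:--

<title>html page</title>
<p>hello</p>
<p>world</p>
<p>goodbye</p>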
In the expression:
html[regex, 1]
The 1 says to return the first parenthesized group in the regex.
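A small sketch of how that second argument behaves, assuming a made-up page string and a hypothetical helper name (String#[] with a regex and an index is standard Ruby):

```ruby
# Sample markup and the helper name are made up for illustration.
def html_body(page)
  # Index 1 selects the text matched by the first (...) group.
  page[/<html>(.*)<\/html>/m, 1]
end

page = "<html>\n<p>hello</p>\n</html>"

p page[/<html>(.*)<\/html>/m]   # no index: the whole match, tags included
p html_body(page)               # only what the group matched
p html_body("no markup here")   # nil, because the pattern does not match
```

Checking for nil before using the result is worth doing whenever the input might not contain the tags at all.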
7stud – wrote:
regex = Regexp.new("<html>(.*)</html>", Regexp::MULTILINE)
…oh, yeah. Normally, a . matches any character except a newline. The
regex .* matches any character 0 or more times, but to get it to match
newlines as well, you have to specify Regexp::MULTILINE.
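To see the difference, here is a minimal sketch with a made-up two-line string and a hypothetical helper name:

```ruby
# Hypothetical helper: captures .* with or without Regexp::MULTILINE.
def dot_star(text, multiline)
  regex = multiline ? Regexp.new("(.*)", Regexp::MULTILINE) : Regexp.new("(.*)")
  text[regex, 1]
end

text = "first line\nsecond line"

p dot_star(text, false)  # "first line"  (the dot stops at the newline)
p dot_star(text, true)   # "first line\nsecond line"
```

Writing the pattern as a literal with the m flag, /(.*)/m, sets the same option.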
On Mon, Mar 23, 2009 at 9:49 AM, Arun K. wrote:
Can anybody tell me how to extract all the content enclosed between
the ‘<html>’ and ‘</html>’ tags, and also the text between the ‘<a>’
and ‘</a>’ tags, using regular expressions? I know it can be extracted
using the ‘scan’ method, but I don’t know what the matching patterns or
expressions should be. Can anybody please help me?
Let’s assume we have the following content:
<html>Want a Ruby regular expression editor? Check out <a href="http://www.rubular.com/">Rubular</a>.</html>
Here are two quick and dirty regexps:
/<html>(.*)<\/html>/m
This regexp will capture anything between an opening html tag and a
closing one. The /m option specifies multiline mode: “.” will match any
character, including a newline.
For our content, it will capture:
Want a Ruby regular expression editor? Check out <a href="http://www.rubular.com/">Rubular</a>.
/<a.*>(.*)<\/a>/
This regexp will capture the text between an opening anchor element and
a closing one. The first “.*” is there to deal with href and any other
attribute. You might wanna throw the /m option in there too.
For our content, it will capture:
Rubular
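The anchor pattern above returns the text of one link. Since the question mentioned ‘scan’, here is a rough sketch that collects the text of every link on a page. The helper name and sample markup are made up, and the pattern is a tightened variant rather than the one above: [^>]* keeps the attribute match inside the opening tag, and (.*?) is lazy so that two links on one line stay separate.

```ruby
# Hypothetical helper; String#scan returns one array per match when the
# pattern has a group, so flatten turns [["a"], ["b"]] into ["a", "b"].
def link_texts(html)
  html.scan(/<a\b[^>]*>(.*?)<\/a>/m).flatten
end

page = <<~HTML
  <html>
  <p>Try <a href="http://www.rubular.com/">Rubular</a> or
  <a href="http://mechanize.rubyforge.org/files/GUIDE_txt.html">the Mechanize guide</a>.</p>
  </html>
HTML

p link_texts(page)  # ["Rubular", "the Mechanize guide"]
```

This still breaks on nested or malformed markup, so treat it as a way to learn scan rather than a replacement for a parser.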
On Mon, Mar 23, 2009 at 11:18 AM, Arun K. wrote:
I know that using mechanize or hpricot is a far better option in this
case. But I’m just asking out of curiosity, to learn about regexps.
Dare I say, a man should use regexps if only to satisfy his curiosity.
Regards,
Yaser
7stud – wrote:
In the expression:
html[regex, 1]
The 1 says to return the first parenthesized group in the regex.
To be a little clearer, the 1 says to return whatever matched the first
parenthesized group in the regex.
Check out the site http://www.rubular.com/
It is very helpful for solving regex problems.
ciao,
Arjun
twitter.com/arjunghosh
On Mon, Mar 23, 2009 at 12:19 PM, Arun K.