Forum: Ruby Regular expression

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- and Ruby-related community platforms.
Arun Kumar (arun_nss)
on 2009-03-23 07:52
Hi,
I know this is a simple problem, but I'm new to Ruby and have not yet
learnt much about regular expressions in Ruby.

Can anybody tell me how to extract all the content enclosed between the
'<html>' and '</html>' tags, and also the text between the '<a>' and
'</a>' tags, using regular expressions? I know it can be extracted using
the 'scan' method, but I don't know what the matching patterns or
expressions should be. Can anybody please help me?

Regards
Arun
Ryan Davis (Guest)
on 2009-03-23 08:24
(Received via mailing list)
On Mar 22, 2009, at 23:49 , Arun Kumar wrote:

> can be extracted using the 'scan' method, but I don't know what the
> matching patterns or expressions should be. Can anybody please help me?

regexps are about the worst thing to use in this case. Look at this
instead:

   http://mechanize.rubyforge.org/files/GUIDE_txt.html
Arun Kumar (arun_nss)
on 2009-03-23 09:21
Ryan Davis wrote:
> On Mar 22, 2009, at 23:49 , Arun Kumar wrote:
>
>> can be extracted using the 'scan' method, but I don't know what the
>> matching patterns or expressions should be. Can anybody please help me?
>
> regexps are about the worst thing to use in this case. Look at this
> instead:
>
>    http://mechanize.rubyforge.org/files/GUIDE_txt.html

I know that using mechanize or hpricot is a far better option in this
case, but I'm just asking out of curiosity, to learn about regexps.

Regards
ArunKumar
7stud -- (7stud)
on 2009-03-23 10:40
Arun Kumar wrote:
> Hi,
> I know this is a simple problem, but I'm new to Ruby and have not yet
> learnt much about regular expressions in Ruby.
>
> Can anybody tell me how to extract all the content enclosed between the
> '<html>' and '</html>' tags, and also the text between the '<a>' and
> '</a>' tags, using regular expressions? I know it can be extracted using
> the 'scan' method, but I don't know what the matching patterns or
> expressions should be. Can anybody please help me?
>
> Regards
> Arun

s = "<a>hello world</a>"
new_s = s.gsub(/<.*?>/, "")
puts new_s

--output:--
hello world




html = DATA.read()
regex = Regexp.new("<html>(.*)</html>", Regexp::MULTILINE)
puts html[regex, 1]

__END__
<html>
<head>
  <title>html page</title>
</head>
<body>
  <div>hello</div>
  <div>world</div>
  <div>goodbye</div>
</body>
</html>


--output:--
<head>
  <title>html page</title>
</head>
<body>
  <div>hello</div>
  <div>world</div>
  <div>goodbye</div>
</body>


In the expression:

html[regex, 1]

The 1 says to return the first parenthesized group in the regex.
7stud -- (7stud)
on 2009-03-23 10:44
7stud -- wrote:

> regex = Regexp.new("<html>(.*)</html>", Regexp::MULTILINE)

...oh, yeah.  Normally, a . matches any character except a newline.  The
regex .* matches any character 0 or more times--but to get it to match
newlines as well, you have to specify Regexp::MULTILINE.
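
A quick sketch of the difference, using a made-up two-line string (the
constant names here are only for illustration):

```ruby
# Hypothetical two-line input, not taken from the thread.
TWO_LINES = "<html>\nhello\n</html>"

# Without MULTILINE, "." stops at a newline, so the pattern cannot span lines.
PLAIN_RE = Regexp.new("<html>(.*)</html>")

# With MULTILINE (the same as the /m flag on a literal), "." also matches "\n".
MULTI_RE = Regexp.new("<html>(.*)</html>", Regexp::MULTILINE)

p TWO_LINES[PLAIN_RE, 1]   # no match across the newlines
p TWO_LINES[MULTI_RE, 1]   # captures the text between the tags, newlines included
```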
7stud -- (7stud)
on 2009-03-23 10:47
7stud -- wrote:
> In the expression:
>
> html[regex, 1]
>
> The 1 says to return the first parenthesized group in the regex.

To be a little clearer, the 1 says to return whatever matched the first
parenthesized group in the regex.
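
As a rough sketch, here is the same lookup next to the equivalent call
to match. The sample string and the names HTML_SAMPLE and HTML_BODY_RE
are invented for illustration:

```ruby
# Hypothetical sample input, not taken from the thread.
HTML_SAMPLE = "<html>\n<body>hi</body>\n</html>"

# Same pattern as above, written as a literal; /m lets "." match newlines.
HTML_BODY_RE = /<html>(.*)<\/html>/m

# String#[] with a regex and an index returns the text of that capture group.
inner = HTML_SAMPLE[HTML_BODY_RE, 1]

# Regexp#match returns a MatchData: index 0 is the whole match,
# index 1 is whatever matched the first parenthesized group.
md = HTML_BODY_RE.match(HTML_SAMPLE)

p inner
p md[0]
p md[1] == inner

# When nothing matches, String#[] returns nil rather than raising.
p "no html here"[HTML_BODY_RE, 1]
```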
Yaser Sulaiman (Guest)
on 2009-03-23 11:03
(Received via mailing list)
On Mon, Mar 23, 2009 at 9:49 AM, Arun Kumar
<arunkumar@innovaturelabs.com> wrote:

> Can anybody tell me how to extract all the content enclosed between the
> '<html>' and '</html>' tags, and also the text between the '<a>' and
> '</a>' tags, using regular expressions? I know it can be extracted using
> the 'scan' method, but I don't know what the matching patterns or
> expressions should be. Can anybody please help me?


Let's assume we have the following content:

<html>
<body>
<p>
Want a Ruby regular expression editor? Check out <a href="http://www.rubular.com/">Rubular</a>.
</p>
</body>
</html>

Here are two quick and dirty regexps:

/<html>(.*)<\/html>/m
This regexp will capture anything between an opening html tag and a
closing one. The /m option turns on multiline mode, in which "." matches
any character, including a newline.
For our content, it will capture:
<body>
<p>
Want a Ruby regular expression editor? Check out <a href="http://www.rubular.com/">Rubular</a>.
</p>
</body>

/<a.*>(.*)<\/a>/
This regexp will capture the text between an opening anchor tag and a
closing one. The first ".*" is there to deal with href and any other
attributes. You might wanna throw the /m option in there too.
For our content, it will capture:
Rubular
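
Since the original question mentioned 'scan', here is a rough sketch of
how the anchor pattern behaves with it. The input line with two links is
invented for illustration. Both ".*" in the pattern are greedy, so on a
line with more than one link only the last link text is captured. The
tighter variant shown next to it is an adjustment for illustration, not
one of the patterns given above:

```ruby
# Hypothetical line containing two links.
TWO_LINKS = 'See <a href="http://www.rubular.com/">Rubular</a> and ' \
            '<a href="http://www.ruby-lang.org/">Ruby</a>.'

# The pattern from above: both ".*" are greedy.
GREEDY_RE = /<a.*>(.*)<\/a>/

# Tighter sketch: "\b" keeps "<a" from matching tags such as "<abbr>",
# "[^>]*" stays inside the opening tag, and ".*?" stops at the first "</a>".
LAZY_RE = /<a\b[^>]*>(.*?)<\/a>/m

# With one capture group, scan returns an array of one-element arrays,
# so flatten gives a plain list of link texts.
p TWO_LINKS.scan(GREEDY_RE).flatten
p TWO_LINKS.scan(LAZY_RE).flatten
```

Even the tighter pattern will trip over nested or malformed markup, so
for anything beyond experimenting a real HTML parser is the safer route.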

On Mon, Mar 23, 2009 at 11:18 AM, Arun Kumar
<arunkumar@innovaturelabs.com> wrote:

> I know that using mechanize or hpricot is a far better option in this
> case, but I'm just asking out of curiosity, to learn about regexps.


Dare I say, a man should use regexps if only to satisfy his curiosity.
;-)

Regards,
Yaser
arjun ghosh (Guest)
on 2009-03-23 13:22
(Received via mailing list)
Check out the site http://www.rubular.com/
It is very helpful in solving RegEx problems

ciao,
Arjun
http://arjunghosh.wordpress.com
twitter.com/arjunghosh
