Forum: Ruby on Rails Extracting URL and text from HTML?

softwareengineer 99 (Guest)
on 2006-02-18 09:02
(Received via mailing list)
For an application I am working on I have to extract URLs and the text
used to link.

  For example,

  ..... <a href="http://www.rubyonrails.org" title="rails" >Ruby on
Rails</a>....

  I have been trying all night but cannot come up with the regular
expression needed to extract the URLs and the text.

  I have tried:

   myurls=response.scan(/href\s*=\s*["'](http|https)(.*)["']\s*.*>(.*)<\/a>/)

  However, I am left with:

  ://domain.com/filename" rel="tag

  and

  ://domain.com/filename " title="permanent link

  Can anyone please show me how to match everything up to the next
single or double quote character? Or how else can I go about
extracting the URL and the linked text?

  I will greatly appreciate it.

  Thanks
  Frank
mehryar (Guest)
on 2006-02-18 09:26
(Received via mailing list)
irb> response = %{Here is some link <a href="http://www.rubyonrails.org" title="rails" >Ruby on Rails</a>
and <a href="http://www.google.com">Google ofcourse</a>
and <a href="ftp://www.foo.bar" title="bar">Foo!</a>}

irb> response.scan(/href="([^"]+)".*?>([^>]+)</)
=> [["http://www.rubyonrails.org", "Ruby on Rails"],
["http://www.google.com", "Google ofcourse"], ["ftp://www.foo.bar",
"Foo!"]]

What you're looking for is a negated character class, so
href="([^"]+)"
       ^^^^^^
       matches anything that is not a double quote, all the way until
it bumps into one.

And similarly
>([^>]+)<
  ^^^^^^
  matches only the text that sits between a > and a <
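
The same idea can be sketched as a small standalone script. This is only an illustration and not part of the original reply: the LINK_RE constant and the extract_links helper are names made up for the example, and the pattern is extended with a backreference so that it also accepts single-quoted attributes, which the original question asked about. It is still a regex over HTML, so it should only be expected to cope with simple, well-formed anchors.

```ruby
# Hypothetical names (not from the thread): LINK_RE and extract_links.
# Group 1 captures the opening quote, \1 requires the same quote to close.
# [^"']+ stops at the first quote, [^>]* skips any remaining attributes,
# and [^<]+ captures the link text up to the next tag.
LINK_RE = /href\s*=\s*(["'])([^"']+)\1[^>]*>([^<]+)</

def extract_links(html)
  # scan returns one array per match with one entry per capture group;
  # the quote character is dropped and only [url, text] is kept.
  html.scan(LINK_RE).map { |_quote, url, text| [url, text.strip] }
end

html = %{Here is some link <a href="http://www.rubyonrails.org" title="rails" >Ruby on Rails</a>
and <a href='http://www.google.com'>Google ofcourse</a>}

extract_links(html).each { |url, text| puts "#{text} -> #{url}" }
```

Because every repeated part of the pattern is a negated class rather than .*, no part of it can run past the closing quote or the end of the tag, which is what went wrong in the original attempt.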


cheers,
-Mehryar



-------------------------------------------------------
... with proper design, the features come cheaply. This
approach is arduous, but continues to succeed.
                                     ---Dennis Ritchie
softwareengineer 99 (Guest)
on 2006-02-18 10:17
(Received via mailing list)
Hello Mehryar,
  This works like a charm :)

  Thank you so much. I really appreciate it.

  Frank


