For an application I am working on I have to extract URLs and the text
used to link.
For example,
… Ruby on
Rails…
I have been trying all night but cannot come up with the regular
expression needed to extract the URLs and the text.
I have tried:
myurls=response.scan(/href\s*=\s*"'(.)["']\s.>(.)</a>/)
However I am left with :
://domain.com/filename" rel="tag
and
://domain.com/filename " title="permanent link
Can anyone please help me as to how I can specify to extract
everything till the next single or double quote character? Or how can I
go about extracting URL and the linked text?
I will greatly appreciate it.
Thanks
Frank
irb> response = %{Here is some link Ruby on Rails
and Google ofcourse and Foo!}
irb> puts response.scan(/href=“([^”]+)".*?>([^>]+)</)
=> [[“http://www.rubyonrails.org”, “Ruby on Rails”],
[“http://www.google.com”, “Google ofcourse”], [“ftp://www.foo.bar”,
“Foo!”]]
what you’re looking for is the negation class so
href=“([^”]+)"
^^^^^
match anything that is not a doublequote all the way until you
bump into one.
and similarly
([^>]+)<
^^^^^^
match everything but only between two > and <
cheers,
-Mehryar
On Fri, 17 Feb 2006, softwareengineer 99 wrote:
myurls=response.scan(/href\s*=\s*"'(.)["']\s.>(.)</a>/)
I will greatly appreciate it.
Thanks
Frank
What are the most popular cars? Find out at Yahoo! Autos
… with proper design, the features come cheaply. This
approach is arduous, but continues to succeed.
—Dennis Ritchie
Hello Mehryar,
This works like a charm 
Thank you so much. I really appreciate it.
Frank
mehryar [email protected] wrote:
irb> response = %{Here is some link Ruby on Rails
and Google ofcourse and
href=“ftp://www.foo.bar” title=“bar”>Foo!}
irb> puts response.scan(/href=“([^”]+)".*?>([^>]+)=>
[[“http://www.rubyonrails.org”, “Ruby on Rails”],
[“http://www.google.com”, “Google ofcourse”], [“ftp://www.foo.bar”,
“Foo!”]]
what you’re looking for is the negation class so
href=“([^”]+)"
^^^^^
match anything that is not a doublequote all the way until you
bump into one.
and similarly
([^>]+)<
^^^^^^
match everything but only between two > and <
cheers,
-Mehryar
On Fri, 17 Feb 2006, softwareengineer 99 wrote:
myurls=response.scan(/href\s*=\s*"'(.)["']\s.>(.)</a>/)
I will greatly appreciate it.
Thanks
Frank
What are the most popular cars? Find out at Yahoo! Autos
… with proper design, the features come cheaply. This
approach is arduous, but continues to succeed.
—Dennis Ritchie