Extracting URL and text from HTML?


#1

For an application I am working on I have to extract URLs and the text
used to link.

For example,

Ruby on
Rails

I have been trying all night but cannot come up with the regular
expression needed to extract the URLs and the text.

I have tried:

myurls=response.scan(/href\s*=\s*"’(.)["’]\s.>(.)</a>/)

However I am left with :

://domain.com/filename" rel="tag

and

://domain.com/filename " title="permanent link

Can anyone please help me as to how I can specify to extract
everything till the next single or double quote character? Or how can I
go about extracting URL and the linked text?

I will greatly appreciate it.

Thanks
Frank


#2

irb> response = %{Here is some link Ruby on Rails
and Google ofcourse and Foo!}

irb> puts response.scan(/href="([^"]+)".*?>([^>]+)</)
=> [[“http://www.rubyonrails.org”, “Ruby on Rails”],
[“http://www.google.com”, “Google ofcourse”], [“ftp://www.foo.bar”,
“Foo!”]]

what you’re looking for is the negation class so
href="([^"]+)"
^^^^^
match anything that is not a doublequote all the way until you
bump into one.

and similarly

([^>]+)<
^^^^^^
match everything but only between two > and <

cheers,
-Mehryar

On Fri, 17 Feb 2006, softwareengineer 99 wrote:

myurls=response.scan(/href\s*=\s*"’(.)["’]\s.>(.)</a>/)

I will greatly appreciate it.

Thanks
Frank


What are the most popular cars? Find out at Yahoo! Autos


… with proper design, the features come cheaply. This
approach is arduous, but continues to succeed.
—Dennis Ritchie


#3

Hello Mehryar,
This works like a charm :slight_smile:

Thank you so much. I really appreciate it.

Frank

mehryar removed_email_address@domain.invalid wrote:
irb> response = %{Here is some link Ruby on Rails
and Google ofcourse and
href=“ftp://www.foo.bar” title=“bar”>Foo!}

irb> puts response.scan(/href="([^"]+)".*?>([^>]+)=>
[[“http://www.rubyonrails.org”, “Ruby on Rails”],
[“http://www.google.com”, “Google ofcourse”], [“ftp://www.foo.bar”,
“Foo!”]]

what you’re looking for is the negation class so
href="([^"]+)"
^^^^^
match anything that is not a doublequote all the way until you
bump into one.

and similarly

([^>]+)<
^^^^^^
match everything but only between two > and <

cheers,
-Mehryar

On Fri, 17 Feb 2006, softwareengineer 99 wrote:

myurls=response.scan(/href\s*=\s*"’(.)["’]\s.>(.)</a>/)

I will greatly appreciate it.

Thanks
Frank


What are the most popular cars? Find out at Yahoo! Autos


… with proper design, the features come cheaply. This
approach is arduous, but continues to succeed.
—Dennis Ritchie