Hpricot and regexp parsing help needed please

newrubygirl · September 29, 2008, 10:08pm

Hello.

I’m using Hpricot for the first time on a project. I need to get some
URLs from a web site, but I only want certain URLs.

I can grab all of the URLs from the page without a problem, but how can
I refine this to select http://www.goodsite.com and skip
http://www.wrongsite.com?

I’d like to test for the string “goodsite”.

Thanks!

Philip H. · September 30, 2008, 12:38am

I’m using Hpricot for the first time on a project. I need to get some
URLs from a web site, but I only want certain URLs.

I can grab all of the URLs from the page without a problem, but how can
I refine this to select http://www.goodsite.com and skip
http://www.wrongsite.com?

I’d like to test for the string “goodsite”.

…assuming doc is an Hpricot object…

doc.search("a[@href*='goodsite']") do |result|
  # …
end
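
For anyone unsure what the *= in that selector does: it is a substring test on the attribute value. Here is a rough plain-Ruby equivalent, just to illustrate the idea. The hrefs array is made-up sample data standing in for whatever the page contains, not something from the original post.

```ruby
# Hypothetical sample data standing in for href values pulled from a page.
hrefs = [
  'http://www.goodsite.com/page1',
  'http://www.wrongsite.com/index.html',
  'http://www.goodsite.com/about'
]

# Keep only the hrefs that contain the substring "goodsite",
# which is the same kind of test the *= attribute selector performs.
good_links = hrefs.select { |href| href.include?('goodsite') }

puts good_links
```

Note that a substring test matches "goodsite" anywhere in the value, including the path or query string.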

newrubygirl · September 30, 2008, 1:03am

Philip H. wrote:

…assuming doc is an Hpricot object…

doc.search("a[@href*='goodsite']") do |result|
  # …
end

Yes, that works to grab only the links that I need. Previously, though,
I had used

(doc/:a).each do |link|

This only gave me the HTML string.

Can I do this the same way, but instead of getting back

<a href= "http://…

get only the http:// URL itself, so that I can use these links?

THANKS!

newrubygirl · September 30, 2008, 1:08am

Actually, I was wrong in my previous post. Sorry!! Both results are
the same, i.e., I get back the <a href…

Is there a way for me to have a clean link? I want to insert this into
a table and then pull up the pages.

Thanks!

newrubygirl · September 30, 2008, 1:40am

Figured it out.

doc.search("a[@href*='goodsite']") do |result|
  # result is the matched <a> element; pull out just its href value
  link = result.attributes['href']
  puts link
end
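
One possible follow-up, offered as a sketch rather than part of the solution above: since the selector matches "goodsite" anywhere in the href, a link such as http://www.wrongsite.com/?ref=goodsite would also get through. If that matters, the extracted links could be checked by host with Ruby's standard uri library before they go into the table. The good_host? helper, its domain default, and the sample links are all hypothetical names and data chosen for illustration.

```ruby
require 'uri'

# Hypothetical helper: true when the link's host is the given domain
# or one of its subdomains (e.g. www.goodsite.com).
def good_host?(link, domain = 'goodsite.com')
  host = URI.parse(link).host
  return false if host.nil? # relative links have no host
  host == domain || host.end_with?(".#{domain}")
rescue URI::InvalidURIError
  false # treat unparseable strings as non-matching
end

# Made-up sample of href values, as they might come out of the loop above.
links = [
  'http://www.goodsite.com/page1',
  'http://www.wrongsite.com/?ref=goodsite'
]

clean = links.select { |link| good_host?(link) }
puts clean
```

This trades the simplicity of the substring test for a stricter match on the host part only.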