Page crawling and URL grabbing


#1

Hey guys,
I’m trying to write an application that goes onto a website (istockphoto
specifically), opens up istockphoto.com/file_browse.php and grabs the
URLs of the photos that appear there.

It’s my first time doing something like this. I’m reading some
documentation right now…but a hand would be greatly appreciated. I’m
not really sure how to run a regex over an HTML file…or even how to find
the right stuff within that file. I’m guessing it’s…

open('http://www.istockphoto.com/file_browse.php/') do |f|
f.find # dot something something
end

but I really have no idea. Any help would be great - thanks in advance!


#2

On Tue, Jan 27, 2009 at 1:55 AM, Patrick L. removed_email_address@domain.invalid wrote:

Hey guys,
I’m trying to write an application that goes onto a website (istockphoto
specifically), opens up istockphoto.com/file_browse.php and grabs the
URLs of the photos that appear there.

It’s my first time doing something like this. I’m reading some
documentation right now…but a hand would be greatly appreciated. I’m
not really sure how to run a regex over an HTML file…or even how to find
the right stuff within that file. I’m guessing it’s…

Generally speaking, regular expressions are not the best tool to extract
information from HTML. Take a look at these other tools:

Mechanize
Hpricot
Scrubyt
Nokogiri
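
To illustrate why a regular expression is fragile here, this is a minimal sketch using only the standard library. The markup is made up for illustration (it is not copied from the real page); only the class name searchImg comes from this thread. Both tags describe the same kind of thumbnail, but a pattern written against the first tag's exact shape misses the second:

```ruby
# Two thumbnails that mean the same thing but differ in attribute order and quoting.
# Hypothetical markup, not taken from istockphoto.com.
html = <<~HTML
  <img class="searchImg" src="/thumbs/1.jpg">
  <img src='/thumbs/2.jpg' class='searchImg'>
HTML

# A pattern tailored to the exact shape of the first tag.
naive = html.scan(/<img class="searchImg" src="([^"]+)"/).flatten

# Only the first image is found; the second is silently skipped.
puts naive.inspect
```

An HTML parser looks at elements and attributes rather than raw characters, so it is not thrown off by this kind of variation.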

This is an example that might get you started, although I recommend
taking a look at the above tools:

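require 'open-uri'
require 'hpricot'

# Fetch the page and parse it
h = Hpricot(open("http://www.istockphoto.com/file_browse.php"))
# Select every element whose class attribute is searchImg
imgs = h.search("//*[@class='searchImg']")
# Collect the src attribute of each match
imgs.map { |img| img["src"] }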

=>

["http://www2.istockphoto.com/file_thumbview_approve/8137463/1/istockphoto_8137463-budapest-by-night.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8139472/1/istockphoto_8139472-four-antique-wood-tennis-racquets.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/6731990/1/istockphoto_6731990-two-female-lovers.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8308377/1/istockphoto_8308377-beauty.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/6349299/1/istockphoto_6349299-lovers-interested-in-smth.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8322403/1/istockphoto_8322403-happy-piggy-bank.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8138976/1/istockphoto_8138976-tower-guard-of-cetara-little-town-in-amalfi-coast-italy.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8322394/1/istockphoto_8322394-yellow-red-paper.jpg",
 "http://www1.istockphoto.com/file_thumbview_approve/4660654/1/istockphoto_4660654-the-art-of-eye-shadows.jpg",
 "http://www1.istockphoto.com/file_thumbview_approve/8301075/1/istockphoto_8301075-3d-render-of-the-olive-tree.jpg",
 "http://www1.istockphoto.com/file_thumbview_approve/6921717/1/istockphoto_6921717-manicure.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8322391/1/istockphoto_8322391-pomegranate.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8138975/1/istockphoto_8138975-junger-mann-seitlich.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8139815/1/istockphoto_8139815-winter.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8137153/1/istockphoto_8137153-beadworkafrican_pictureframe_p3406-jpg.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8139787/1/istockphoto_8139787-statue-of-liberty.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8322388/1/istockphoto_8322388-cold-winter-day.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8139602/1/istockphoto_8139602-statue-of-liberty.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8137801/1/istockphoto_8137801-litchi.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8139406/1/istockphoto_8139406-statue-of-liberty.jpg",
 "http://www1.istockphoto.com/file_thumbview_approve/6850893/1/istockphoto_6850893-polka-dot-wedding-cake.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8139802/1/istockphoto_8139802-snow-woman.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8322364/1/istockphoto_8322364-white-cherry-blossom.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8139808/1/istockphoto_8139808-airport.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8322357/1/istockphoto_8322357-ciruit.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8139597/1/istockphoto_8139597-cheese-and-wine.jpg",
 "http://www2.istockphoto.com/file_thumbview_approve/8138075/1/istockphoto_8138075-employee-of-office.jpg"]

You should customize the criteria used to choose the images. In my
little example I selected every tag with the class searchImg, which at a
quick glance seemed to be what you wanted, but double-check.

I recall reading somewhere that Nokogiri has better XPath support than
Hpricot, so check it out.

Jesus.


#3

Miroslaw N. wrote:

2009/1/27 Patrick L. removed_email_address@domain.invalid:

open('http://www.istockphoto.com/file_browse.php/') do |f|
f.find # dot something something
end

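Try Mechanize.
It’s easy:

require 'mechanize'

agent = WWW::Mechanize.new  # newer Mechanize releases drop the WWW:: prefix
agent.user_agent_alias = 'Mac Safari'
page = agent.get('http://www.istockphoto.com/file_browse.php')
page.links.text(/jpg/)  # links whose text matches /jpg/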

That’s great, or it sounds great. Is there any documentation aside from
blog posts and this: http://mechanize.rubyforge.org/mechanize/ ? What
did you use to learn it?


#4

Mechanize is very easy and intuitive. You could basically learn to
use it just by playing with it in irb. Combine that with reading
the docs, and you’re good to go.


#5

2009/1/27 Patrick L. removed_email_address@domain.invalid:

open('http://www.istockphoto.com/file_browse.php/') do |f|
f.find # dot something something
end

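Try Mechanize.
It’s easy:

require 'mechanize'

agent = WWW::Mechanize.new  # newer Mechanize releases drop the WWW:: prefix
agent.user_agent_alias = 'Mac Safari'
page = agent.get('http://www.istockphoto.com/file_browse.php')
page.links.text(/jpg/)  # links whose text matches /jpg/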