Forum: Ruby Page crawling and URL grabbing

Patrick L. (greenguy109)
on 2009-01-27 01:57
Hey guys,
I'm trying to write an application that goes onto a website (istockphoto
specifically), opens up istockphoto.com/file_browse.php and grabs the
URLs of the photos that appear there.

It's my first time doing something like this. I'm reading some
documentation right now...but a hand would be greatly appreciated. I'm
not really sure how to do regex on an html file...or even find the right
stuff within that file. I'm guessing it's something like:

open('http://www.istockphoto.com/file_browse.php/') do |f|
f.find # dot something something
end

but I really have no idea. Any help would be great - thanks in advance!
Jesús Gabriel y Galán (Guest)
on 2009-01-27 09:39
(Received via mailing list)
On Tue, Jan 27, 2009 at 1:55 AM, Patrick L. <leahy16@gmail.com> wrote:
> Hey guys,
> I'm trying to write an application that goes onto a website (istockphoto
> specifically), opens up istockphoto.com/file_browse.php and grabs the
> URLs of the photos that appear there.
>
> It's my first time doing something like this. I'm reading some
> documentation right now...but a hand would be greatly appreciated. I'm
> not really sure how to do regex on an html file...or even find the right
> stuff within that file. I'm guessing its..

Generally speaking, regular expressions are not the best tool to extract
information from HTML. Take a look at these other tools:

Mechanize
Hpricot
Scrubyt
Nokogiri
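
To illustrate why a regex gets fragile here, a small sketch using only core Ruby. The sample tags and the example.com URLs are made up for illustration, they are not taken from the istockphoto page. The same image can be written several equivalent ways in HTML, and a pattern written against one of them silently misses the others:

```ruby
# Three equivalent ways a page might write the same <img> tag (hypothetical markup).
SAMPLES = [
  '<img class="searchImg" src="http://example.com/a.jpg">',
  "<img class='searchImg' src='http://example.com/a.jpg'>",
  '<img src="http://example.com/a.jpg" class="searchImg">'
]

# A naive pattern written against the first variant only.
NAIVE = /<img class="searchImg" src="([^"]+)">/

# Returns the captured src for each sample, or nil where the pattern misses.
def naive_srcs(samples)
  samples.map { |html| html[NAIVE, 1] }
end

p naive_srcs(SAMPLES)
# => ["http://example.com/a.jpg", nil, nil]
```

A parser-based tool treats all three variants as the same element, which is the main reason to prefer one of the libraries listed above.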

This is an example that might get you started, although I recommend
taking a look at the above tools:

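require 'open-uri'
require 'hpricot'

# Fetch and parse the page (on Ruby 3.0+, call URI.open instead of open)
h = Hpricot(open("http://www.istockphoto.com/file_browse.php"))
# Select every element whose class attribute is "searchImg"
imgs = h.search("//*[@class='searchImg']")
# Collect the src attribute of each matched element
imgs.map {|img| img["src"]}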

# =>
["http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www1.istockphoto.com/file_thumbview_approve...,
"http://www1.istockphoto.com/file_thumbview_approve...,
"http://www1.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www1.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...,
"http://www2.istockphoto.com/file_thumbview_approve...]


You should customize the criteria to choose the images (in my little
example I selected all tags which had a class searchImg, which at a
quick glance seemed what you wanted, but double check).
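
As a rough sketch of how changing the criteria changes the result, here is the same idea run on a tiny fragment. Two assumptions to flag: the fragment and its example.com URLs are invented, and REXML (an XML parser that ships with Ruby) is swapped in for Hpricot only so the snippet runs without extra gems. REXML needs well-formed markup, so it is not a substitute for a real HTML parser on a live page. The helper name srcs_for is hypothetical.

```ruby
require 'rexml/document'

# Hypothetical, well-formed fragment standing in for the real page.
FRAGMENT = <<~XML
  <div>
    <img class="searchImg" src="http://example.com/1.jpg"/>
    <img class="logo" src="http://example.com/logo.png"/>
    <a class="searchImg" href="http://example.com/page"/>
  </div>
XML

# Returns the src attribute of every element matched by the XPath expression
# (nil for a matched element that has no src attribute).
def srcs_for(xml, xpath)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, xpath).map { |el| el.attributes['src'] }
end

# Any element with class "searchImg": also matches the <a>, which has no src.
p srcs_for(FRAGMENT, "//*[@class='searchImg']")
# => ["http://example.com/1.jpg", nil]

# Only <img> elements with class "searchImg".
p srcs_for(FRAGMENT, "//img[@class='searchImg']")
# => ["http://example.com/1.jpg"]
```

Restricting the expression to img elements avoids the nil entries that show up when some other tag happens to share the class.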

I recall reading somewhere that nokogiri has better XPath support than
Hpricot, so check it out.

Jesus.
Miroslaw Niegowski (Guest)
on 2009-01-27 09:42
(Received via mailing list)
2009/1/27 Patrick L. <leahy16@gmail.com>:
> open('http://www.istockphoto.com/file_browse.php/') do |f|
> f.find # dot something something
> end


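Try Mechanize.
It's easy:

require 'mechanize'

# Newer Mechanize releases use Mechanize instead of WWW::Mechanize
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get('http://www.istockphoto.com/file_browse.php')
# Links whose text matches /jpg/
page.links_with(:text => /jpg/)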
...
Patrick L. (greenguy109)
on 2009-01-28 00:43
Miroslaw Niegowski wrote:
> 2009/1/27 Patrick L. <leahy16@gmail.com>:
>> open('http://www.istockphoto.com/file_browse.php/') do |f|
>> f.find # dot something something
>> end
>
>
> Try Mechanize.
> It's easy :
>
> agent = WWW::Mechanize.new
> agent.user_agent_alias='Mac Safari'
> page = agent.get('http://www.istockphoto.com/file_browse.php');
> page.links.text(/jpg/)
> ...

That's great, or it sounds great. Is there any documentation aside from
blog posts and this: http://mechanize.rubyforge.org/mechanize/ ? What
did you use to learn it?
Tsunami Scripter (tsunami)
on 2009-01-28 00:47
Mechanize is very easy and intuitive... you could basically learn to
use Mechanize just by playing with it in irb. Combine that with reading
some of the docs, and you're good to go.