I’m trying to scrape images from a page. I’m using Hpricot to scrape the
actual image URLs into an array but I’ve encountered a problem regarding
resolving the full image paths.
Example:
The src of the images can be like any of the following:
http://external.site.com/images/image.jpg (Full URL)
/images/image.jpg (Absolute Path)
../images/image.jpg (Relative Path)
images/image.jpg (Relative Path)
Is there a way to resolve these paths to the proper URLs? So I can copy
the images to my server or whatever else I need to do with them?
Hope that makes sense.
Cheers,
Jim
On Dec 3, 2007 10:45 AM, Jim N. [email protected]
wrote:
> /images/image.jpg (Absolute Path)
> ../images/image.jpg (Relative Path)
> images/image.jpg (Relative Path)
> Is there a way to resolve these paths to the proper URLs? So I can copy
> the images to my server or whatever else I need to do with them?
You might try making a local mirror of the site using:
    wget -m -np http://external.site.com
That will resolve all the URLs for you and download the images.
--
Greg D.
http://destiney.com/
I would do something similar, but the problem is that the script is
going to be working on lots of different URLs.
It’s for a social bookmarking site that I’m currently working on. The
user bookmarks a page, a script scrapes all the images from the page and
resizes them, then the user can choose which thumbnail they want to use
for their bookmark.
Running wget against every site probably isn’t the best plan for so many
sites.
On Dec 3, 2007 2:21 PM, Philip H. [email protected] wrote:
> Parse the url into pieces… extract the domain name and the “directory”
> part of the path.
> Then just match them up. If your image starts with http just use that.
> If it starts with a slash then prepend the domain name. Otherwise domain
/me watches while wget gets reinvented.
--
Greg D.
http://destiney.com/
> ../images/image.jpg (Relative Path)
> images/image.jpg (Relative Path)
> Is there a way to resolve these paths to the proper URLs? So I can copy
> the images to my server or whatever else I need to do with them?
Parse the url into pieces… extract the domain name and the “directory”
part of the path.
Then just match them up. If your image starts with http just use that.
If it starts with a slash then prepend the domain name. Otherwise
domain
-philip
On Dec 3, 2007, at 11:45 AM, Jim N. wrote:
> http://external.site.com/images/image.jpg (Full URL)
> Cheers,
> Jim
You use URI.join
irb> require 'uri'
=> true
irb> page_and_images = {
?>   'http://external.site.com/somedir/somepage.html' =>
?>     ['http://external.site.com/images/image.jpg',
?>      '/images/image.jpg',
?>      '../images/image.jpg' ],
?>   'http://external.site.com/sometoppage.html' =>
?>     ['http://external.site.com/images/image.jpg',
?>      'images/image.jpg' ],
?> }
irb> page_and_images.each do |page,images|
?>   page_url = URI.parse(page)
irb>   puts "Starting from: #{page}"
irb>   images.each do |image|
?>     image_url = URI.join(page, image)
irb>     puts " #{image} becomes #{image_url}"
irb>   end
irb> end; nil
Starting from: http://external.site.com/sometoppage.html
 http://external.site.com/images/image.jpg becomes http://external.site.com/images/image.jpg
 images/image.jpg becomes http://external.site.com/images/image.jpg
Starting from: http://external.site.com/somedir/somepage.html
 http://external.site.com/images/image.jpg becomes http://external.site.com/images/image.jpg
 /images/image.jpg becomes http://external.site.com/images/image.jpg
 ../images/image.jpg becomes http://external.site.com/images/image.jpg
=> nil
-Rob
Rob B. http://agileconsultingllc.com
[email protected]
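
For the bookmarking use case described earlier in the thread, the URI.join approach could be wrapped in a small helper. This is only a minimal sketch: the method name resolve_image_urls is hypothetical (it does not come from the thread or from any library), and it assumes the src values have already been scraped into an array of strings. It skips any src that URI cannot parse and drops duplicates after resolution.

```ruby
require 'uri'

# Hypothetical helper (not from the thread): resolve each scraped src
# against the URL of the page it was found on. Returns absolute URL
# strings, skipping srcs that URI cannot parse.
def resolve_image_urls(page_url, srcs)
  srcs.map do |src|
    begin
      URI.join(page_url, src.strip).to_s
    rescue URI::Error
      nil # malformed src (for example, one containing a space); skip it
    end
  end.compact.uniq
end

srcs = [
  'http://external.site.com/images/image.jpg',
  '/images/image.jpg',
  '../images/image.jpg',
  'images/image.jpg'
]

puts resolve_image_urls('http://external.site.com/somedir/somepage.html', srcs)
# Prints:
#   http://external.site.com/images/image.jpg
#   http://external.site.com/somedir/images/image.jpg
```

The first three srcs all resolve to the same URL from this page, so only two distinct URLs remain. Note that 'images/image.jpg' resolves relative to the page's directory (/somedir/), which is why it differs from the others here.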