Hpricot Relative Path

I’m trying to write a script that pulls out an image from a yfrog page.

So this is what I have:

require 'rubygems'
require 'hpricot'
require 'open-uri'

url = 'ImageShack - Best place for all of your image hosting and image sharing needs'
doc = Hpricot(open(url))

(doc%"#main_image").attributes['src'] # => "/img3/7036/gssac.jpg"

The problem is that the path is relative.
I’ve done a little googling, queried my ruby and rails ML archives, glanced
at hpricot code, and looked through the method lists for open-uri and
hpricot.
So far, I don’t see anything that looks very useful.
Is there a way to have it give me the absolute path so that I can reference
the picture later?

The only thing I’ve found that works so far involves string manipulation,
which seems like a brittle workaround to replace something that probably
exists if I could just find it.

url = 'ImageShack - Best place for all of your image hosting and image sharing needs'
page = open(url)
base = page.base_uri.to_s[ /(?:http:\/\/)?[^\/]*\// ] # => "http://img3.yfrog.com/"
relative = (Hpricot(page)%"#main_image").attributes['src'] # => "/img3/7036/gssac.jpg"
absolute = URI.join( base , relative )
absolute.to_s # => "http://img3.yfrog.com/img3/7036/gssac.jpg"

Anyone know of a better solution?

On Fri, Mar 12, 2010 at 1:22 PM, Josh C. wrote:

> The problem is that the path is relative.
> I’ve done a little googling, queried my ruby and rails ML archives, glanced
> at hpricot code, and looked through the method lists for open-uri and
> hpricot.
> So far, I don’t see anything that looks very useful.
> Is there a way to have it give me the absolute path so that I can reference
> the picture later?

Hpricot is just telling you what’s in the HTML. Munging the
document’s contents is your responsibility, not the parser’s :)

> The only thing I’ve found that works so far involves string manipulation,
> which seems like a brittle workaround to replace something that probably
> exists if I could just find it.

Look into the URI library.

require 'uri'

uri = URI.parse( "ImageShack - Best place for all of your image hosting and image sharing needs" )
uri.path = # your hpricot magic to get the image path goes here

Ben
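To make the `uri.path =` suggestion concrete, here is a minimal sketch. The page URL is a hypothetical stand-in (the real yfrog page address is not shown in the thread), and the image path is hard-coded where the Hpricot lookup would normally go.

```ruby
require 'uri'

# Hypothetical page URL; the real one would come from page.base_uri.
page_uri = URI.parse('http://img3.yfrog.com/pages/view.html')

# Work on a copy so the original page URI is left untouched.
image_uri = page_uri.dup

# In the real script this value would come from
# (Hpricot(page)%"#main_image").attributes['src'].
image_uri.path = '/img3/7036/gssac.jpg'

puts image_uri.to_s # => http://img3.yfrog.com/img3/7036/gssac.jpg
```

Assigning to `path` replaces the whole path component, so this only suits `src` values that start at the site root (a leading `/`), as the one in this thread does.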

On Fri, Mar 12, 2010 at 3:54 PM, Ben B. [email protected]
wrote:

> uri = URI.parse( "ImageShack - Best place for all of your image hosting and image sharing needs" )
> uri.path = # your hpricot magic to get the image path goes here
>
> Ben

Thanks, this is what I am using now:

page = open url
image_path = URI.parse page.base_uri.to_s.sub( %r(/$) , '' )
image_path.path = (Hpricot(page)%"#main_image").attributes['src']
image_path.to_s

It still seems a little excessive, but it’s a lot better than what I had before.
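For what it’s worth, `URI.join` from the standard library can resolve a reference against the full page URL directly, so the host does not have to be cut out by hand first. A small sketch, assuming a hypothetical helper name (`absolute_url`) and a hypothetical page URL standing in for the real one:

```ruby
require 'uri'

# Hypothetical helper: resolve an image reference against the URL of the
# page it was found on. URI.join handles root-relative, document-relative,
# and already-absolute references.
def absolute_url(base, reference)
  URI.join(base.to_s, reference.to_s).to_s
end

# Hypothetical page URL; in the script above this would be page.base_uri.
base = 'http://img3.yfrog.com/pages/view.html'

puts absolute_url(base, '/img3/7036/gssac.jpg')
# => http://img3.yfrog.com/img3/7036/gssac.jpg
puts absolute_url(base, 'thumb.jpg')
# => http://img3.yfrog.com/pages/thumb.jpg
puts absolute_url(base, 'http://example.com/a.png')
# => http://example.com/a.png
```

In the script from this thread, that would presumably come down to `URI.join(page.base_uri.to_s, src)`, where `src` is the attribute value Hpricot returns.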