Hpricot Relative Path

jcain · March 12, 2010, 10:22pm

I’m trying to write a script that pulls an image out of a yfrog page.

This is what I have so far:

require 'rubygems'
require 'hpricot'
require 'open-uri'

url = 'ImageShack - Best place for all of your image hosting and image sharing needs'
doc = Hpricot(open(url))

(doc%"#main_image").attributes['src'] # => "/img3/7036/gssac.jpg"

The problem is that the path is relative.
I’ve done a little googling, queried my Ruby and Rails ML archives,
glanced at the Hpricot code, and looked through the method lists for
open-uri and Hpricot.
So far, I don’t see anything that looks very useful.
Is there a way to have it give me the absolute path so that I can
reference the picture later?

The only thing I’ve found that works so far involves string manipulation,
which seems like a brittle workaround to replace something that probably
exists if I could just find it.

url = 'ImageShack - Best place for all of your image hosting and image sharing needs'
page = open(url)
base = page.base_uri.to_s[ /(?:http:\/\/)?[^\/]*\// ] # => "http://img3.yfrog.com/"
relative = (Hpricot(page)%"#main_image").attributes['src'] # => "/img3/7036/gssac.jpg"
absolute = URI.join( base , relative )
absolute.to_s # => "http://img3.yfrog.com/img3/7036/gssac.jpg"

Anyone know of a better solution?

jcain · March 12, 2010, 10:54pm

On Fri, Mar 12, 2010 at 1:22 PM, Josh C. wrote:

The problem is that the path is relative.
I’ve done a little googling, queried my ruby and rails ML archives, glanced
at hpricot code, and looked through the method lists for open-uri and
hpricot.
So far, I don’t see anything that looks very useful.
Is there a way to have it give me the absolute path so that I can reference
the picture later?

Hpricot is just telling you what’s in the HTML. Munging the
document’s contents is your responsibility, not the parser’s.

The only thing I’ve found that works so far involves string manipulation,
which seems like a brittle workaround to replace something that probably
exists if I could just find it.

Look into the URI library.

require 'uri'

uri = URI.parse( "ImageShack - Best place for all of your image hosting and image sharing needs" )
uri.path = # your hpricot magic to get the image path goes here

Ben
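For anyone reading along, here is a minimal sketch of that suggestion using only the standard library. The page URL and the `image_url` helper name are hypothetical (the real page address is not shown in this thread); the `src` value is the one reported in the first post.

```ruby
require 'uri'

# Hypothetical helper: replace the path of the page URL with the image's
# src value, keeping the scheme and host of the page.
def image_url(page_url, image_src)
  uri = URI.parse(page_url)
  uri.path = image_src   # only the path component changes
  uri.to_s
end

# 'http://img3.yfrog.com/some/page' is an assumed stand-in for the real page URL.
puts image_url('http://img3.yfrog.com/some/page', '/img3/7036/gssac.jpg')
```

This relies on the `src` value starting with a slash. Any query string on the page URL would be carried over unchanged, since only the path is replaced.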

jcain · March 13, 2010, 12:23am

On Fri, Mar 12, 2010 at 3:54 PM, Ben B. wrote:

uri = URI.parse( "ImageShack - Best place for all of your image hosting and image sharing needs" )
uri.path = # your hpricot magic to get the image path goes here

Ben

Thanks, this is what I am using now:

page = open url
image_path = URI.parse page.base_uri.to_s.sub( %r(/$) , '' )
image_path.path = (Hpricot(page)%“#main_image”).attributes[‘src’]
image_path.to_s

It still seems a little excessive, but it’s a lot better than what I had before.
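A possibly shorter route, sketched here as a suggestion and not something proposed in the thread: `URI::Generic#merge` resolves a relative reference against a base URI, so the value of `page.base_uri` could be merged with the `src` attribute directly, without trimming the trailing slash or assigning to `path`. The base URL and the `absolutize` helper name below are hypothetical stand-ins, as is the `thumb.jpg` value.

```ruby
require 'uri'

# Hypothetical helper: resolve a src attribute against the page's URI
# using the standard relative-reference rules.
def absolutize(base, src)
  base.merge(src).to_s
end

# Assumed stand-in for page.base_uri; the real page URL is not shown in the thread.
base = URI.parse('http://img3.yfrog.com/some/page')

root_relative = absolutize(base, '/img3/7036/gssac.jpg')  # src value from the thread
doc_relative  = absolutize(base, 'thumb.jpg')             # hypothetical src with no leading slash

puts root_relative
puts doc_relative
```

Unlike assigning to `path`, this should also cope with `src` values that have no leading slash or that are already absolute URLs.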