Hpricot problem on class initialize

Hello, I’m new to Ruby, so there is a good chance the error is not even
with Hpricot but something I’m missing in the syntax.
I have the following code:

It’s supposed to read a file where I have a bunch of Facebook IDs and
return a list of the links to those IDs along with the page title (so I
can see the name before clicking on the link).
The link part is working fine.
Thanks, and sorry if it’s a silly question =)

On Sep 23, 2013, at 12:04 PM, Mario Me [email protected] wrote:

> Hello, I’m new to Ruby, so there is a good chance the error is not even
> with Hpricot but something I’m missing in the syntax.
> I have the following code:
> [Pastebin.com: require 'sinatra' / require 'hpricot' / require 'open-uri' / set :server, 'webric]
> It’s supposed to read a file where I have a bunch of Facebook IDs and
> return a list of the links to those IDs along with the page title (so I
> can see the name before clicking on the link).
> The link part is working fine.
> Thanks, and sorry if it’s a silly question =)

What’s the part that is not working?

On Sep 23, 2013, at 13:28 , Tamara T. [email protected]
wrote:

> > The link part is working fine.
> > Thanks, and sorry if it’s a silly question =)
>
> What’s the part that is not working?

The part where they’re trying to use hpricot.

Use nokogiri instead. It’s more correct in nearly every way.

When I try to get the page with Hpricot:
doc = Hpricot(open(@link))
I always get a 404 error, but if I print the @link variable to the view
I can access the URL just fine. So I guess it’s the way I’m trying to open
it with Hpricot that is wrong.

tamouse m. wrote in post #1122204:

> On Sep 23, 2013, at 3:41 PM, Mario Me [email protected] wrote:
>
> > When I try to get the page with Hpricot:
> > doc = Hpricot(open(@link))
> > I always get a 404 error, but if I print the @link variable to the view
> > I can access the URL just fine. So I guess it’s the way I’m trying to open
> > it with Hpricot that is wrong.
>
> From what I can see, you are sending page.link back to the client in
> index.erb. Can you show the content that is sent back? (HTML source,
> please.)

I’m sorry, but I don’t think I understood.
The page.link is working fine; the HTML in index.erb is just a bunch
of links to the profiles.

> Use nokogiri instead. It’s more correct in nearly every way.

I tried this:

def initialize(link)
  site = "https://www.facebook.com/"
  @link = site + link
  page = Nokogiri::HTML(open(site + link))
  @title = page.css("title").text
end

And it still returns a 404 error: “OpenURI::HTTPError at / 404 Not
Found”.
It’s strange, because I’m using the same address to create the page.link
and it works!

On Sep 24, 2013, at 7:25 AM, Mario Me [email protected] wrote:

> > index.erb. Can you show the content that is sent back? (HTML source,
> > please.)
>
> I’m sorry, but I don’t think I understood.
> The page.link is working fine; the HTML in index.erb is just a bunch
> of links to the profiles.

That’s what I said.

> > Use nokogiri instead. It’s more correct in nearly every way.
>
> I tried this:
>
> def initialize(link)
>   site = "https://www.facebook.com/"
>   @link = site + link
>   page = Nokogiri::HTML(open(site + link))
>   @title = page.css("title").text

Try this, just for me. Change the above two lines to this:

html_doc = Nokogiri::HTML(open(@link))
@title = html_doc.css("title").text

(Note you could put those on one line, like so:

@title = Nokogiri::HTML(open(@link)).css("title").text

)

Two things:

  1. The variable page might be used elsewhere; giving it a unique name here
    might help.
  2. Reusing @link, the variable you just set and will be using in the ERB,
    makes sure the URL you open is the same one you display.

> end
>
> And it still returns a 404 error: “OpenURI::HTTPError at / 404 Not
> Found”.
> It’s strange, because I’m using the same address to create the page.link
> and it works!

Do this for me as well: from the command line, take one of those
page.link URLs from your output and fetch it with either curl or wget.

I ask because I have known sites (not sure whether FB is one of them)
that respond differently depending on the contents of the
request headers, which can differ between open-uri, curl, wget and
various browsers.
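
The same check can be done from Ruby. What follows is only a minimal sketch, not the code from the thread: probe is a hypothetical helper name, and the opener: parameter exists purely so the logic can be tried without network access. It uses URI.open, which newer Ruby versions require in place of the bare open(...) call shown above, and it rescues OpenURI::HTTPError so the status the server sent back can be inspected instead of crashing the request.

```ruby
require 'open-uri'

# Hypothetical helper (not from the thread): fetch `url` with the given
# request headers and report the HTTP status rather than letting
# OpenURI::HTTPError propagate. `opener` is injectable for offline testing.
def probe(url, headers = {}, opener: URI.method(:open))
  io = opener.call(url, headers)
  # On success open-uri returns an IO-like object that also exposes
  # the response status as a [code, message] pair.
  { ok: true, status: io.status.first, bytes: io.read.bytesize }
rescue OpenURI::HTTPError => e
  # e.io is the failed response; its status is also [code, message].
  code, message = e.io.status
  { ok: false, status: code, message: message }
end

# Possible usage (performs a real request, so it is left commented out):
#   p probe('https://www.facebook.com/', 'User-Agent' => 'Mozilla/5.0')
```

Calling it with a few different header sets and comparing the returned status values would show whether the server is reacting to the request headers, which is the same comparison the curl and wget test makes.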

> On Sep 23, 2013, at 3:41 PM, Mario Me [email protected] wrote:
>
> > When I try to get the page with Hpricot:
> > doc = Hpricot(open(@link))
> > I always get a 404 error, but if I print the @link variable to the view
> > I can access the URL just fine. So I guess it’s the way I’m trying to open
> > it with Hpricot that is wrong.
>
> From what I can see, you are sending page.link back to the client in
> index.erb. Can you show the content that is sent back? (HTML source,
> please.)

I was wondering about that myself. I had a sneaky suspicion that
Facebook wouldn’t allow you to “crawl” that site like that.

- Wayne

From: Mario Me [email protected]
To: [email protected]
Sent: Tuesday, September 24, 2013 11:22 AM
Subject: Re: Hpricot problem on class initialize

> This may be the problem: wget returns an “unsupported browser” page and
> curl a “page not found” message.
>
> > This is because I have known of sites (not sure if FB is like this or
> > not) that respond differently depending on specific contents of the
> > request header, which can be different between open-uri, curl, wget and
> > various browsers.

This may be the problem: wget returns an “unsupported browser” page and
curl a “page not found” message.
I tried changing the user agent:
page = Nokogiri::HTML(open(site + link.to_s,
  'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu '))
Still no success…
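
One detail that may be worth ruling out when passing a custom User-Agent: if the string literal is wrapped across several lines in the source file, Ruby keeps the line breaks inside the string, and a header value containing line breaks is likely to be rejected or mangled. The sketch below is an illustration only; agent_headers and fetch_with_agent are hypothetical names, the opener: parameter is there just for offline testing, and nothing here guarantees that Facebook will answer differently.

```ruby
require 'open-uri'

# Hypothetical helper: build the request headers with the User-Agent
# collapsed onto a single line, dropping stray newlines and extra spaces.
def agent_headers(user_agent)
  { 'User-Agent' => user_agent.split.join(' ') }
end

# Hypothetical helper: fetch `url` with that User-Agent and return the body.
# `opener` defaults to URI.open and is injectable so no network is needed
# to exercise the code.
def fetch_with_agent(url, user_agent, opener: URI.method(:open))
  opener.call(url, agent_headers(user_agent)).read
end

# Possible usage (performs a real request, so it is left commented out):
#   html = fetch_with_agent('https://www.facebook.com/', 'Mozilla/5.0 (X11; Linux x86_64)')
```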