Urgent please: HTTPS HTML parsing

arunvoip · March 17, 2009, 1:05pm

Hi,
Is there any way to extract the HTML of an https:// website with
Hpricot? When I use Hpricot to access an https:// website, I receive the
following error:

/usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:31:in `gem_original_require': no such file to load -- net/https (LoadError)
        from /usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
        from /usr/lib/ruby/1.8/open-uri.rb:230:in `open_http'
        from /usr/lib/ruby/1.8/open-uri.rb:616:in `buffer_open'
        from /usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
        from /usr/lib/ruby/1.8/open-uri.rb:162:in `catch'
        from /usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
        from /usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
        from /usr/lib/ruby/1.8/open-uri.rb:518:in `open'
        from /usr/lib/ruby/1.8/open-uri.rb:30:in `open'
        from demo.rb:15:in `valid'
        from demo.rb:93

I’m also not able to load the HTML from Gmail, YouTube, etc. Is it
because I’m using Hpricot? Is there any other way to extract HTML from
https websites? Please help me.

Regards
Arun K.

arunvoip · March 17, 2009, 1:20pm

Did you require ‘net/https’? It seems that that lib is just not loaded/
present.


arunvoip · March 17, 2009, 1:23pm

Harm wrote:

Did you require ‘net/https’? It seems that that lib is just not loaded/
present.

On Mar 17, 1:05 pm, Arun K. wrote:

Can you please explain how to include ‘net/https’?

Thanks a lot
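
For anyone unsure what requiring ‘net/https’ looks like in practice, here is a minimal sketch. It assumes the OpenSSL bindings for Ruby are installed; build_client and fetch_body are hypothetical helper names, and the URL in the trailing comment is only a placeholder.

```ruby
require 'net/https' # on Ruby 1.8 this adds SSL support to Net::HTTP
require 'uri'

# Build a Net::HTTP client for +url+, switching SSL on for https:// URLs.
# (Hypothetical helper; no connection is opened here.)
def build_client(url)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http
end

# Fetch the body of the page at +url+. This performs a network request.
def fetch_body(url)
  uri = URI.parse(url)
  build_client(url).get(uri.request_uri).body
end

# Example with a placeholder URL:
#   html = fetch_body('https://example.com/')
#   doc  = Hpricot(html)   # needs require 'hpricot'
```

If the first line itself fails with a LoadError, the problem is the Ruby installation rather than the script, which is what the stack trace in the first post suggests.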

arunvoip · March 17, 2009, 2:27pm

-1 for effort on the part of the poster…

Please go read
http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html

and learn about what you are trying to use

arunvoip · March 17, 2009, 3:37pm

Please help. I’ll be really thankful

Regards
Arun K.

A quick Google search for ‘rails https’ yielded this on the 5th entry
found. It seems like exactly what you need to do.

http://railsruby.blogspot.com/2006/02/https-open-uri-basic-authentication.html

arunvoip · March 17, 2009, 3:11pm

Ar Chron wrote:

-1 for effort on the part of the poster…

Please go read
http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html

and learn about what you are trying to use

I read about ‘net/http’ and ‘hpricot’, but it is showing the same
error even for YouTube. The code snippet I used for URL extraction is:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'dbi'

class Url
  # Asks for a domain name and returns [domain, final URL of the site].
  def valid
    begin
      puts "Enter domain name :"
      domain = gets.chomp
      # Prepend 'http://' to the domain to build the URL to open
      url = "http://#{domain}"
      document = open(url, "User-Agent" => "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; .NET CLR 1.0.3705)")
      # Get the final URL of the site (after any redirects)
      realUrl = document.base_uri.to_s
    rescue
      puts "Unable to open the URL. Please check if you have entered a valid URL."
    end
    parms = [domain, realUrl]
  end
end

I’m able to extract the data from every site except
‘http://www.youtube.com’, ‘gmail.com’ and other ‘https’ sites.
Please help. I’ll be really thankful.

Regards
Arun K.
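
The snippet above always prepends ‘http://’ to whatever is typed, so an input that already carries a scheme (for example an https:// address) ends up with two. A small sketch of one way to normalize the input first; normalize_url is a hypothetical helper, and treating "http" as the default scheme is an assumption:

```ruby
require 'uri'

# Return +input+ as a URL string, adding +default_scheme+ only when the
# input does not already start with a scheme such as http:// or https://.
def normalize_url(input, default_scheme = 'http')
  text = input.strip
  text = "#{default_scheme}://#{text}" unless text =~ %r{\A[a-z][a-z0-9+.-]*://}i
  URI.parse(text).to_s
end
```

With this, both a bare domain and a full https:// address can be passed to the same code path.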

arunvoip · March 17, 2009, 3:54pm

On Mar 17, 2:37 pm, Ar Chron wrote:

Please help. I’ll be really thankful

Regards
Arun K.

A quick Google search for ‘rails https’ yielded this on the 5th entry
found. It seems like exactly what you need to do.

http://railsruby.blogspot.com/2006/02/https-open-uri-basic-authentica…

It also looks completely out of date: that patch doesn’t look like it
would apply to current versions of Ruby 1.8.6.
If net/https can’t be required, then I would assume this is on a Linux
distribution where Ruby is split into multiple packages, one of which
is usually the one with the SSL stuff in it (libopenssl-ruby in Ubuntu).

Fred
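
One way to confirm whether the SSL pieces are actually installed is to try loading them. A minimal sketch; library_available? is a hypothetical helper name:

```ruby
# Try to load a library and report whether it is available.
def library_available?(name)
  require name
  true
rescue LoadError
  false
end

# Check the two libraries that HTTPS support depends on.
%w[openssl net/https].each do |lib|
  status = library_available?(lib) ? 'ok' : 'missing'
  puts "#{lib}: #{status}"
end
```

If either line reports "missing", installing the distribution's OpenSSL package for Ruby (libopenssl-ruby on Ubuntu, as mentioned above) is the likely fix.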

arunvoip · March 17, 2009, 4:04pm

Frederick C. wrote:

On Mar 17, 2:37 pm, Ar Chron wrote:

Please help. I’ll be really thankful

Regards
Arun K.

A quick Google search for ‘rails https’ yielded this on the 5th entry
found. It seems like exactly what you need to do.

http://railsruby.blogspot.com/2006/02/https-open-uri-basic-authentica…

It also looks completely out of date: that patch doesn’t look like it
would apply to current versions of Ruby 1.8.6.
If net/https can’t be required, then I would assume this is on a Linux
distribution where Ruby is split into multiple packages, one of which
is usually the one with the SSL stuff in it (libopenssl-ruby in Ubuntu).

Fred

Yes, I think so too. As a newcomer to Ruby, I didn’t understand a bit
of the code and, as you said, it looks outdated. Do you have any tricks
of the trade to parse HTML content from at least this site:
‘http://www.youtube.com’?

I’m receiving an error like this while extracting data from the site:
`open_http': 400 Bad Request (OpenURI::HTTPError)
This is not the error that is displayed in the case of ‘https://’
sites.

Please help

Regards
Arun K.
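
The two failures in this thread are different kinds of exception: the https case raises a LoadError, while the YouTube case raises OpenURI::HTTPError. A bare rescue only catches StandardError, so the LoadError passes straight through it. A sketch of telling them apart; describe_open_error and fetch_page are hypothetical helpers, the User-Agent value is a placeholder, and URI.open is the spelling on newer Rubies (on 1.8 it would be plain open):

```ruby
require 'open-uri'

# Turn an exception raised while opening a URL into a short message.
def describe_open_error(error)
  case error
  when OpenURI::HTTPError then "server answered with an error: #{error.message}"
  when LoadError          then "a required library is missing: #{error.message}"
  else "unexpected #{error.class}: #{error.message}"
  end
end

# Open +url+ and return [final_url, html], or print a diagnostic and
# return nil. This performs a network request.
def fetch_page(url)
  page = URI.open(url, 'User-Agent' => 'Mozilla/5.0')
  [page.base_uri.to_s, page.read]
rescue StandardError, LoadError => e
  warn describe_open_error(e)
  nil
end
```

Printing the real message instead of a fixed "Unable to open the URL" line makes it clear which of the two problems a given site is hitting.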