What's the Best Way to Mimic an HTTP Request?

I’m trying to write a tool that will take a domain as an argument, make
a request to http://onsamehost.com, and then capture the list of
domains that share that same IP. I want to parse out those domains and
put them into an array that I can print to a file later.

Here’s the code I’m trying to use:


require 'net/http'
require 'uri'

PATH = '/query.jsp'
USERAGENT = 'Opera'
HOST = 'onsamehost.com'

@http = Net::HTTP.new(HOST, 80)

resp, data = @http.get2(PATH, {'User-Agent' => USERAGENT})

puts resp
puts data

The problem is that I keep getting a redirect
(#&lt;Net::HTTPMovedPermanently:0xb7c35ffc&gt;), which doesn’t happen when I
make the request from a regular browser.

So I sniffed the regular request with wireshark, and a browser sends a
bunch of additional headers when it makes the request. Cookies,
referrer, etc.

Are any of these headers more necessary than others, and is there a
preferred way to send the headers using Ruby?

Thanks for any thoughts…

Is there a Ruby front end for Curl?

James

On Wed, Nov 5, 2008 at 8:24 AM, Daniel M. [email protected]
wrote:

The problem is that I keep getting a redirect
(#&lt;Net::HTTPMovedPermanently:0xb7c35ffc&gt;), which doesn’t happen when I
make the request from a regular browser.

Actually, it does – you just don’t see it.

When you request e.g. http://example.com most servers will send
a redirect to the default page, e.g. http://example.com/index.html.

You need to either handle it or pass the default page’s full URL.
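For example, a quick sketch of why the bare domain alone isn’t enough (the hostname here is just the one from your script):

```ruby
require 'uri'

# A bare-domain URL parses with an empty path; when building a
# Net::HTTP request you need to supply at least '/' (or the full
# path of the page you actually want, e.g. '/query.jsp').
uri = URI.parse('http://onsamehost.com')
path = uri.path.empty? ? '/' : uri.path
puts path  # => "/"
```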

HTH,

On Wed, Nov 5, 2008 at 10:24 AM, Daniel M. [email protected]
wrote:

The problem is that I keep getting a redirect
(#&lt;Net::HTTPMovedPermanently:0xb7c35ffc&gt;), which doesn’t happen when I
make the request from a regular browser.

That site makes heavy use of redirects. Watch closely while running
queries or check your browser history.

So I sniffed the regular request with wireshark, and a browser sends a
bunch of additional headers when it makes the request. Cookies,
referrer, etc.

Are any of these headers more necessary than others, and is there a
preferred way to send the headers using Ruby?

Headers probably have no effect here.

What you probably want is code like this:

require 'net/http'
require 'uri'

def fetch(uri_str, limit = 10)
  # You should choose a better exception.
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  response = Net::HTTP.get_response(URI.parse(uri_str))
  case response
  when Net::HTTPSuccess     then response
  when Net::HTTPRedirection then fetch(response['location'], limit - 1)
  else
    response.error!
  end
end

resp = fetch('http://www.ruby-lang.org')
puts resp.body

(from http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/index.html
“Following Redirection”)

regards,
Michael L.

On Wed, Nov 5, 2008 at 11:08 AM, Daniel M. [email protected]
wrote:

Thanks, much, Michael. Unfortunately I’m not quite tracking on why that
was necessary. It just seems a bit elaborate given what I thought was a
simple problem.

But I totally appreciate it…I just wish it were something simpler.

The site you’re hitting makes heavy use of redirects (and not really
for their intended purpose). What this means is that you submit your
request for a given URL and the server responds with a redirect and a
new URL. If you are working in a browser, your browser automatically
requests that URL, and the server again responds with a redirect and a
new URL. Again, a web browser handles requesting that next URL
automatically. This URL is the actual results page with the data you
want. It’s the web site making you jump through hoops to get where you
want to go.

Net::HTTP does not have a built in facility for following redirects
the way your browser does. So you have to write code to follow
redirects by submitting new requests until you get to one that is not
a redirect, which is what the fetch() method from the Net::HTTP
example does.
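To illustrate the idea without touching the network, here is the same follow-the-redirects loop written iteratively, with a hypothetical stubbed requester standing in for Net::HTTP.get_response (the URLs and Response struct are made up for the sketch):

```ruby
# Minimal stand-in for an HTTP response, just for this sketch.
Response = Struct.new(:code, :location, :body) do
  def redirect?
    code.between?(300, 399)
  end
end

# Keep requesting until the response is no longer a redirect.
# The requester lambda is injected so the loop can run offline;
# in real use it would wrap Net::HTTP.get_response.
def follow_redirects(url, requester, limit = 10)
  limit.times do
    resp = requester.call(url)
    return resp unless resp.redirect?
    url = resp.location
  end
  raise 'HTTP redirect too deep'
end

# Stubbed "network": one redirect hop, then the results page.
pages = {
  'http://a.example/' => Response.new(301, 'http://b.example/', nil),
  'http://b.example/' => Response.new(200, nil, 'results'),
}

final = follow_redirects('http://a.example/', ->(u) { pages.fetch(u) })
puts final.body  # => "results"
```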

-Michael

Thanks, much, Michael. Unfortunately I’m not quite tracking on why that
was necessary. It just seems a bit elaborate given what I thought was a
simple problem.

But I totally appreciate it…I just wish it were something simpler.

Michael L. wrote:

The site you’re hitting makes heavy use of redirects (and not really
for their intended purpose). What this means is that you submit your
request for a given URL and the server responds with a redirect and a
new URL. If you are working in a browser, your browser automatically
requests that URL, and the server again responds with a redirect and a
new URL. Again, a web browser handles requesting that next URL
automatically. This URL is the actual results page with the data you
want. It’s the web site making you jump through hoops to get where you
want to go.

Ah, I see.

You appear, by my estimation, to rock.

: Daniel :

Avdi G. wrote:

You may want to look into using Mechanize rather than straight-up
Net::HTTP.

Mechanize for Ruby? Interesting. I didn’t know Ruby had an
implementation. Thanks, Avdi.

You may want to look into using Mechanize rather than straight-up
Net::HTTP.


Avdi

Home: http://avdi.org
Developer Blog: Avdi Grimm, Code Cleric
Twitter: http://twitter.com/avdi
Journal: http://avdi.livejournal.com

Daniel M. wrote:

The problem is that I keep getting a redirect
(#&lt;Net::HTTPMovedPermanently:0xb7c35ffc&gt;), which doesn’t happen when I
make the request from a regular browser.

So I sniffed the regular request with wireshark, and a browser sends a
bunch of additional headers when it makes the request. Cookies,
referrer, etc.

Are any of these headers more necessary than others, and is there a
preferred way to send the headers using Ruby?

We have had similar issues where we didn’t see a redirect when sniffing
the browser but it happened for our code. The reason was HTTP/1.1.
With HTTP/1.1 it is required to specify the host you expect to be
talking with (as more than one virtual host may be serviced by one
server):
GET / HTTP/1.1
Host: www.apache.org
(see the Apache Week article on HTTP/1.1 for reference)
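As a sketch, a hand-rolled HTTP/1.1 request has to carry that Host header itself (Net::HTTP adds it for you automatically); the host and path here are just the ones from the original script:

```ruby
# Build the raw request text an HTTP/1.1 client would write to the
# socket. Without the Host line, a name-based virtual host cannot
# tell which site you want and may redirect (or refuse) the request.
request = [
  'GET /query.jsp HTTP/1.1',
  'Host: onsamehost.com',
  'User-Agent: Opera',
  'Connection: close',
  '',
  ''
].join("\r\n")

puts request
```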

Hope that helps in avoiding the redirect ;-)

Uwe