Trying to GET google with socket....problem

r3madi · April 7, 2007, 10:28pm

Well I don’t know why the socket can’t connect to Google. Here is my
source code:

require 'socket'
h = TCPSocket.new('www.google.ca', 80)
h.print "GET /index.html HTTP/1.0\n\n"
a = h.read
puts a

I tried changing the HTTP to 1.1 but it still doesn’t work.

r3madi · April 7, 2007, 10:40pm

I just ran this code in irb, and it worked without issue.

Can you provide the specific exception or unexpected results?

r3madi · April 7, 2007, 10:41pm

Also, can you provide the platform that you are using? I was using OS
X.

r3madi · April 7, 2007, 10:43pm

On Apr 7, 2007, at 13:28 , Hey Y. wrote:

Well I don’t know why the socket can’t connect to Google. Here is my
source code:

require 'socket'
h = TCPSocket.new('www.google.ca', 80)
h.print "GET /index.html HTTP/1.0\n\n"
a = h.read
puts a

If you just want to get google (or whatever), use:

ruby -ropen-uri -e 'puts URI.parse("http://www.google.com/index.html").read'

If you want to know the inner-workings of HTTP clients and servers,
use the above and trace it backwards. There is a lot of good code in
there.

r3madi · April 7, 2007, 11:04pm

Michael G. wrote:

Also, can you provide the platform that you are using? I was using OS
X.
Well I don’t know what you meant right there but I’m using Windows XP.

r3madi · April 7, 2007, 10:53pm

Michael G. wrote:

I just ran this code in irb, and it worked without issue.

Can you provide the specific exception or unexpected results?
Well I just ran the code and got this:

HTTP/1.0 302 Found

Location: Google

Cache-Control: private

Set-Cookie:
PREF=ID=e20f9edec5958042:TM=1175979001:LM=1175979001:S=shwmC1m6Amdg20nV;
expires=Sun, 17-Jan-2038 19:14:07 GMT; path=/; domain=.google.com

Content-Type: text/html

Server: GWS/2.1

Content-Length: 228

Date: Sat, 07 Apr 2007 20:50:01 GMT

Connection: Keep-Alive

302 Moved

The document has moved here.

Also I would like to stick to using sockets instead of other HTTP
clients :).
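
For anyone staying with raw sockets, here is a minimal sketch of picking the status code and headers out of a reply like the one above. The method name parse_response is made up for illustration, and the Location value in the sample is a placeholder, not what Google actually sent.

```ruby
# Splits a raw HTTP response into status code, headers and body.
# Accepts CRLF or bare LF line endings.
def parse_response(raw)
  head, body = raw.split(/\r?\n\r?\n/, 2)
  status_line, *header_lines = head.split(/\r?\n/)
  _version, code, _reason = status_line.split(' ', 3)
  headers = header_lines.each_with_object({}) do |line, h|
    name, value = line.split(/:\s*/, 2)
    h[name.downcase] = value   # header names are case-insensitive
  end
  [code.to_i, headers, body.to_s]
end

# Sample response; the Location URL is a placeholder (assumption).
raw = "HTTP/1.0 302 Found\r\n" \
      "Location: http://www.example.com/\r\n" \
      "Content-Type: text/html\r\n" \
      "\r\n" \
      "302 Moved"

code, headers, _body = parse_response(raw)
puts "Redirected (#{code}) to #{headers['location']}" if (300..399).cover?(code)
```

A 3xx code means the socket worked; the server is just asking the client to repeat the request at the URL in the Location header.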

r3madi · April 7, 2007, 11:51pm

Michael G. wrote:

OK, so you are getting a response back from the server.

I have no idea why you’re getting a redirect from them, but you are
getting a proper response over your socket.
Well thank you for the answer :). The thing is that it’s weird that even
when I put the host as google.ca it still redirects me to google.ca.
Well thank you to everyone that has helped me and I appreciate it but I
am wondering something else now: Why when I put HTTP/1.1 the program
loads but it just stays blank, not doing anything.

r3madi · April 7, 2007, 11:29pm

OK, so you are getting a response back from the server.

I have no idea why you’re getting a redirect from them, but you are
getting a proper response over your socket.

r3madi · April 8, 2007, 2:10am

Hi!

The answers to both of your questions are simple…

Thus spake Hey Y. on 04/07/2007 11:51 PM:

Well thank you for the answer :). The thing is that it’s weird that even
when I put the host as google.ca it still redirects me to google.ca.

That’s because google redirects you to your localized version of
google and you did not specify the hostname in your get. You open a
socket to www.google.ca, but you only tell it to deliver some
“index.html”. If that machine hosted multiple domains (which in fact
it does), it would not know whether to send you
www.google.ca/index.html or perhaps Google.
So it informs you that it has an “/index.html” for you which it
figures might best suit your needs and that this page can be found
by issuing the following HTTP command:

GET www.google.ca/index.html HTTP/1.0\n\n

Well thank you to everyone that has helped me and I appreciate it but I
am wondering something else now: Why when I put HTTP/1.1 the program
loads but it just stays blank, not doing anything.

The answer to that question is even simpler:
In HTTP/1.0, you open a socket, issue a request, get a response and
close the socket again for each and every single item you need. You
open a socket for the html-page itself, another one to request an
image specified in that page and so on. So after each request, the
socket is closed by the server.

When you specify HTTP/1.1, you have another option: persistent
connections. When you request a resource via HTTP/1.1, a compliant
server MAY keep the socket open for you after its response so that you
might send another request without having to open a whole new socket.
If the server does this, it is the client’s responsibility to close the
socket when it does not require any more data.
Try it: open up a telnet connection to www.google.ca and issue your
request as HTTP/1.0. The socket will close immediately after the
response from the server.
Now do the same thing again but specify HTTP/1.1. This time the
socket stays open and you can issue another request (or the same
request again, to keep things simple).
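
This is also why the original script seems to hang with HTTP/1.1: h.read with no argument waits for end-of-file, and that never arrives while the server keeps the socket open. A rough sketch of one way around it is to read the headers and then exactly Content-Length bytes. The local stand-in server and the method names below are assumptions made so the example runs without network access; chunked responses are not handled.

```ruby
require 'socket'

# Stand-in for a remote HTTP/1.1 server (an assumption for this sketch): it
# answers one request on localhost and then leaves the connection open.
def start_keep_alive_server
  server = TCPServer.new('127.0.0.1', 0)   # port 0: let the OS pick a free port
  Thread.new do
    client = server.accept
    loop do                                # skip the request headers
      line = client.gets
      break if line.nil? || line == "\r\n"
    end
    body = "hello\n"
    client.print "HTTP/1.1 200 OK\r\nContent-Length: #{body.bytesize}\r\n\r\n#{body}"
    sleep 2                                # socket stays open; a bare read would block here
    client.close
  end
  server.addr[1]                           # the port that was picked
end

# Reads one response: headers up to the blank line, then Content-Length bytes.
def read_response(sock)
  status = sock.gets("\r\n").to_s.chomp
  headers = {}
  while (line = sock.gets("\r\n")) && line != "\r\n"
    name, value = line.chomp.split(/:\s*/, 2)
    headers[name.downcase] = value
  end
  length = headers.fetch('content-length', '0').to_i
  body = length > 0 ? sock.read(length) : ''
  [status, headers, body]
end

def fetch(port, path = '/index.html')
  sock = TCPSocket.new('127.0.0.1', port)
  sock.print "GET #{path} HTTP/1.1\r\nHost: localhost\r\n\r\n"
  read_response(sock)
ensure
  sock&.close   # the client closes the socket once it has what it needs
end

status, _headers, body = fetch(start_keep_alive_server)
puts status
puts body
```

Another option is to add a "Connection: close" header to the request, which asks the server to close the socket after the response so that a plain read returns.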

For further information I suggest you read rfc1945 and rfc2616
respectively.

HTH, HAND,

Phil

r3madi · April 8, 2007, 5:54am

Philipp T. wrote:

Hi!

The answers to both of your questions are simple…

Thus spake Hey Y. on 04/07/2007 11:51 PM:

Well thank you for the answer :). The thing is that it’s weird that even
when I put the host as google.ca it still redirects me to google.ca.

That’s because google redirects you to your localized version of
google and you did not specify the hostname in your get. You open a
socket to www.google.ca, but you only tell it to deliver some
“index.html”. If that machine hosted multiple domains (which in fact
it does), it would not know whether to send you
www.google.ca/index.html or perhaps Google.
So it informs you that it has an “/index.html” for you which it
figures might best suit your needs and that this page can be found
by issuing the following HTTP command:

GET www.google.ca/index.html HTTP/1.0\n\n

Well thank you to everyone that has helped me and I appreciate it but I
am wondering something else now: Why when I put HTTP/1.1 the program
loads but it just stays blank, not doing anything.

The answer to that question is even simpler:
In HTTP/1.0, you open a socket, issue a request, get a response and
close the socket again for each and every single item you need. You
open a socket for the html-page itself, another one to request an
image specified in that page and so on. So after each request, the
socket is closed by the server.

When you specify HTTP/1.1, you have another option: persistent
connections. When you request a resource via HTTP/1.1, a compliant
server MAY keep the socket open for you after its response so that you
might send another request without having to open a whole new socket.
If the server does this, it is the client’s responsibility to close the
socket when it does not require any more data.
Try it: open up a telnet connection to www.google.ca and issue your
request as HTTP/1.0. The socket will close immediately after the
response from the server.
Now do the same thing again but specify HTTP/1.1. This time the
socket stays open and you can issue another request (or the same
request again, to keep things simple).

For further information I suggest you read rfc1945 and rfc2616
respectively.

HTH, HAND,

Phil
Thank you a lot Phil! I have learned a lot from you like how to POST
data (Yup, I learned) and much more and I am very grateful for all the
help you have given me. It makes sense why it didn’t connect to
google.ca and I learned how to fix it right after my last post but I had
to go offline. I have also read RFC2616 but only bits and pieces of what
I have read are stuck in my head so I will keep re-reading it to learn
more. I will also read RFC1945 and I’m sorry for my newbish posts. It’s
not that I’m lazy, because I really am a hard worker; it’s just that I
needed someone to point me in the right direction, and that is what you
did :).

r3madi · April 8, 2007, 4:43pm

On Sun, Apr 08, 2007 at 05:28:07AM +0900, Hey Y. wrote:

Well I don’t know why the socket can’t connect to Google. Here is my
source code:

require 'socket'
h = TCPSocket.new('www.google.ca', 80)
h.print "GET /index.html HTTP/1.0\n\n"
a = h.read
puts a

I tried changing the HTTP to 1.1 but it still doesn’t work.

Two problems:
(1) Line terminator for HTTP is \r\n not \n
(2) You have not supplied a Host: header

h.print "GET /index.html HTTP/1.0\r\nHost: www.google.ca\r\n\r\n"
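
The same fix as a small sketch, with the request built line by line so the terminators are easy to see. The helper name build_get_request is only an illustration, not part of Ruby's socket library, and the lines that touch the network are left commented out.

```ruby
require 'socket'

CRLF = "\r\n"

# Hypothetical helper: builds an HTTP/1.0 GET request with a Host header.
def build_get_request(host, path = '/')
  [
    "GET #{path} HTTP/1.0",
    "Host: #{host}",
    '',   # joining these two empty strings yields the CRLF that ends
    ''    # the Host line plus the blank line that ends the headers
  ].join(CRLF)
end

request = build_get_request('www.google.ca', '/index.html')
p request

# To send it for real (needs network access):
#   h = TCPSocket.new('www.google.ca', 80)
#   h.print request
#   puts h.read
#   h.close
```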

I say again: you must read and understand RFC 2616.

This documents HTTP/1.1, which has gained a lot of features. You could try
reading the earlier RFCs for HTTP/1.0 or HTTP/0.9 for a simplified
protocol.

B.

r3madi · April 8, 2007, 8:19pm

On 08.04.2007, at 16:42, Brian C. wrote:

I tried changing the HTTP to 1.1 but it still doesn’t work.

Two problems:
(1) Line terminator for HTTP is \r\n not \n
(2) You have not supplied a Host: header

This means to send something like

GET /index.html HTTP/1.1\r\n
Host: www.google.ca\r\n
\r\n

Regards
Karl-Heinz

r3madi · April 8, 2007, 5:57pm

Brian C. wrote:

On Sun, Apr 08, 2007 at 05:28:07AM +0900, Hey Y. wrote:

Well I don’t know why the socket can’t connect to Google. Here is my
source code:

require 'socket'
h = TCPSocket.new('www.google.ca', 80)
h.print "GET /index.html HTTP/1.0\n\n"
a = h.read
puts a

I tried changing the HTTP to 1.1 but it still doesn’t work.

Two problems:
(1) Line terminator for HTTP is \r\n not \n
(2) You have not supplied a Host: header

h.print "GET /index.html HTTP/1.0\r\nHost: www.google.ca\r\n\r\n"

I say again: you must read and understand RFC 2616.

This documents HTTP/1.1, which has gained a lot of features. You could try
reading the earlier RFCs for HTTP/1.0 or HTTP/0.9 for a simplified
protocol.

B.

Have you read what I last posted? Or did you just ignore it and give me
the answer to an already answered question? Yes I have read RFC2616 more
than once and I do understand a lot of it but not all stays in my head
in the few times I read the document. I don’t know but I have read in a
lot of places that for a line terminator you can also use “\n\n” and it
seems to work fine. Also putting the Host header or adding the full
domain to the code such as “GET www.google.ca/index.html” both specify
which host we want so I don’t see why I should change them.

r3madi · April 8, 2007, 9:07pm

On Mon, Apr 09, 2007 at 12:57:09AM +0900, Hey Y. wrote:

reading the earlier RFCs for HTTP/1.0 or HTTP/0.9 for a simplified
protocol.

B.

Have you read what I last posted? Or did you just ignore it and give me
the answer to an already answered question? Yes I have read RFC2616 more
than once and I do understand a lot of it but not all stays in my head
in the few times I read the document. I don’t know but I have read in a
lot of places that for a line terminator you can also use “\n\n” and it
seems to work fine.

Read RFC 2616 section 2.2:

" HTTP/1.1 defines the sequence CR LF as the end-of-line marker for all
protocol elements except the entity-body (see appendix 19.3 for
tolerant applications)."

and appendix 19.3 says:

" The line terminator for message-header fields is the sequence CRLF.
However, we recommend that applications, when parsing such headers,
recognize a single LF as a line terminator and ignore the leading
CR."

So the upshot is: you’re sending a malformed request, but some servers
may honour it.
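
To make the two sides of that concrete, here is a small sketch: a strict check a sender could run on its own request, and a tolerant splitter of the kind appendix 19.3 recommends for receivers. Both method names are invented for this example.

```ruby
# Strict view: true only if every LF in the text is preceded by a CR.
def crlf_only?(text)
  !text.match?(/(?<!\r)\n/)
end

# Tolerant view: split header lines on LF and drop the CR before it, if any.
def tolerant_header_lines(block)
  block.split("\n").map { |line| line.chomp("\r") }.reject(&:empty?)
end

sloppy = "GET /index.html HTTP/1.0\nHost: www.google.ca\n\n"
strict = "GET /index.html HTTP/1.0\r\nHost: www.google.ca\r\n\r\n"

p crlf_only?(sloppy)
p crlf_only?(strict)
p tolerant_header_lines(sloppy) == tolerant_header_lines(strict)
```

A tolerant server sees the same lines either way, which is why the "\n\n" version appears to work; a strict one is free to reject it.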

Also putting the Host header or adding the full
domain to the code such as “GET www.google.ca/index.html” both specify
which host we want so I don’t see why I should change them.

No, “GET www.google.ca/index.html” is a completely malformed request and
will be rejected. In any case this is different to the GET request you
actually sent, quoted at the very top of this posting.

The hostname is never supplied as part of the GET line.

Of course you supplied it to Ruby’s TCPSocket.new method, but at that
point the hostname is converted to an IP address before the connection is
opened. The name is not passed to the far end and therefore you must
provide a Host: header.

I’m sorry, but I’m dropping out of this conversation now. Your response
was arrogant. If you know nothing about HTTP, then I suggest you don’t go
around telling people who know something about HTTP that they are wrong.

Regards,

Brian.

r3madi · April 9, 2007, 1:19am

On Apr 8, 2007, at 11:57 AM, Hey Y. wrote:

Also putting the Host header or adding the full
domain to the code such as “GET www.google.ca/index.html” both specify
which host we want so I don’t see why I should change them.

The URI provided in the GET request can be an absolute URI only if
the request is going to a proxy server. In that case the GET would
look like:

GET http://www.google.ca/index.html HTTP/1.0

Otherwise the URI must be an absolute path (i.e., a path starting
with ‘/’).
In that case the GET would look like:

GET /index.html
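
A short sketch of the two forms using Ruby's standard uri library. The method name request_line and its via_proxy keyword are made up for this example; URI.parse, to_s and request_uri are the real library calls.

```ruby
require 'uri'

# Returns the GET request line in proxy form (absolute URI) or in
# origin-server form (absolute path).
def request_line(url, via_proxy: false, version: 'HTTP/1.0')
  uri = URI.parse(url)
  target = via_proxy ? uri.to_s : uri.request_uri
  "GET #{target} #{version}"
end

puts request_line('http://www.google.ca/index.html')
puts request_line('http://www.google.ca/index.html', via_proxy: true)
```

In the proxy form the absolute URI names the site being fetched; the socket itself is opened to the proxy.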

The problem with only having the path is that a web server that is hosting
several websites can’t determine from the GET request which site the
request pertains to. The incoming TCP connections only have a destination
IP address, not a destination domain name. The solution to this problem
is the “Host:” header. By looking at the “Host:” header, the web server
can multiplex several websites at the same IP address. Without the Host:
header you would have to have a separate IP address for every website.

So your request should be sent as:

GET /index.html HTTP/1.0
Host: www.google.ca
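
Seen from the server's side, the multiplexing could look roughly like this. The site table, host names and method names are all hypothetical; a real server does far more, but the lookup by Host header is the point.

```ruby
# Hypothetical table of sites served from one IP address.
SITES = {
  'www.example.com' => 'Welcome to example.com',
  'www.example.org' => 'Welcome to example.org'
}.freeze

# Extracts the Host header value from a raw request, or nil if there is none.
def host_of(request)
  line = request.split(/\r?\n/).find { |l| l =~ /\AHost:/i }
  return nil unless line
  line.split(':', 2).last.strip.downcase.sub(/:\d+\z/, '')   # drop any port
end

# Picks the site for a request, falling back to a default when Host is missing.
def select_site(request)
  SITES.fetch(host_of(request)) { 'default site (no usable Host header)' }
end

puts select_site("GET /index.html HTTP/1.0\r\nHost: www.example.org\r\n\r\n")
puts select_site("GET /index.html HTTP/1.0\r\n\r\n")
```

Without the header the server can only guess, which is the situation the original request put Google in.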

r3madi · September 25, 2007, 11:10pm

Brian C. wrote:

I tried changing the HTTP to 1.1 but it still doesn’t work.
reading the earlier RFCs for HTTP/1.0 or HTTP/0.9 for a simplified protocol.

B.

That would be the issue.

r3madi · April 8, 2007, 9:17pm

On Apr 8, 2007, at 5:57 PM, Hey Y. wrote:

Have you read what I last posted? Or did you just ignore it and give me
the answer to an already answered question? Yes I have read RFC2616 more
than once and I do understand a lot of it but not all stays in my head
in the few times I read the document. I don’t know but I have read in a
lot of places that for a line terminator you can also use “\n\n” and it
seems to work fine.

Perhaps you read that in a CGI context?

“The server MUST translate the header data from the CGI header field
syntax to the HTTP header field syntax if these differ. For example,
the character sequence for newline (such as Unix’s ASCII NL) used by
CGI scripts may not be the same as that used by HTTP (ASCII CR
followed by LF).”

That’s what allows CGIs to output things like

print "Content-Type: text/plain\n\n"

and forget about CRLFs.
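
The translation step the CGI spec describes might be sketched like this. The method name cgi_headers_to_http is invented for the example, and a real server does more than this (status handling, for one).

```ruby
# Converts the header block of CGI output (bare LF line endings allowed)
# to HTTP's CRLF form. The body is passed through untouched.
def cgi_headers_to_http(cgi_output)
  head, body = cgi_output.split(/\r?\n\r?\n/, 2)
  head.split(/\r?\n/).join("\r\n") + "\r\n\r\n" + body.to_s
end

p cgi_headers_to_http("Content-Type: text/plain\n\nhello\n")
```

So the "\n\n" is fine when a web server sits between the script and the client, but not when the script talks HTTP over the socket itself.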

– fxn