First post, noob here. Just learning Ruby and I’ve been struggling with
this for over an hour now and I must be missing something. I need to
get the file extension of the default document for a website. It’s easy
enough when you supply the extension in the URI request like:
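For instance, when the path names a file, the extension can be read straight from the URL. A minimal sketch, assuming a hypothetical URL and a helper name, `extension_from_url`, that is not from the original post:

```ruby
require "uri"

# Return the extension (including the dot) of the path in a URL,
# or an empty string when the path names no file with an extension.
# `extension_from_url` is a hypothetical helper name.
def extension_from_url(url)
  File.extname(URI.parse(url).path.to_s)
end

puts extension_from_url("http://example.com/docs/page.html")  # prints .html
puts extension_from_url("http://example.com/").empty?         # prints true
```

The second call shows the problem: a request for the bare site has no file name in the path, so there is nothing to take an extension from.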
I know I can wget the site, which by default downloads the default
document and saves it with the proper name and extension. Since I couldn’t
find the RIGHT way to discover this otherwise, I’ve tried writing out the
IO stream, but it requires you to specify the file destination, so that’s
So far the only way I’ve been able to work around this is to loop
through guesses for the default document (index.html, index.htm,
default.aspx, etc.) and stop at the first positive response code. Not
Not only not ideal, but plain wrong. Imagine that both index.html and
index.htm exist. How do you know which one is the “default” document?
And what about the cases where the wget request is redirected to a default
CGI script, which then generates the HTML that is sent back?
From the client’s viewpoint, this is indistinguishable from the
case where a static HTML file is returned.
I think that, in the general case, you can never reliably find out the
name of a document fetched via HTTP. You request something and you get
something in return, but that doesn’t mean the thing being returned
had the same name on the web server.
HTTP/1.1 200 OK
Date: Tue, 15 Dec 2015 17:18:13 GMT
Expires: Tue, 22 Dec 2015 17:18:13 GMT
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Server: ECS (iad/182A)
The most you can tell is that it returns a text/html document (see the
Content-Type header), but as you can see, the server does not tell you
what it is doing internally or how it generates this data when you make
a GET request to http://example.com. In particular, you can create a
simple server with Ruby sockets that responds with a pre-defined
response that is not read from any file and may or may not include a
Content-Type header, but see http://www.w3.org/Protocols/rfc1341/4_Content-Type.html:
Default RFC 822 messages are typed by this protocol as plain text in the
US-ASCII character set, which can be explicitly specified as
“Content-type: text/plain; charset=us-ascii”. If no Content-Type is
specified, either by error or by an older user agent, this default is
If no Content-Type is given, assume text/plain; charset=us-ascii.
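The socket server mentioned above could look roughly like this. It is a minimal sketch, not code from the thread: the names `CANNED_RESPONSE` and `serve_once` and the port in the usage comment are made up for illustration, and the response deliberately omits Content-Type.

```ruby
require "socket"

# A fixed response that is not read from any file on disk.
# It intentionally carries no Content-Type header.
CANNED_RESPONSE = "HTTP/1.1 200 OK\r\n" \
                  "Content-Length: 5\r\n" \
                  "Connection: close\r\n" \
                  "\r\n" \
                  "hello"

# Accept one connection, skip the request head, send the canned response.
def serve_once(server)
  client = server.accept
  # Read request lines until the blank line that ends the headers.
  while (line = client.gets) && line != "\r\n"
  end
  client.write(CANNED_RESPONSE)
ensure
  client&.close
end

# Usage (port 8080 is an arbitrary choice):
#   server = TCPServer.new("127.0.0.1", 8080)
#   loop { serve_once(server) }
```

A client fetching any path from this server gets the same five bytes back, so there is no file name or extension on the server side to discover.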
You can use the MIME type from the Content-Type header to select an
appropriate file name, but some MIME types may have multiple extensions
(text/html is served from both .html and .htm files, for example).
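A sketch of that idea, assuming you already have the Content-Type header value as a string. The table and the names `EXTENSION_BY_MIME_TYPE` and `extension_for` are hypothetical, and picking ".html" over ".htm" is an arbitrary choice, which is exactly the ambiguity mentioned above.

```ruby
# Hypothetical, deliberately tiny lookup table. Where a MIME type has
# several common extensions, one is picked arbitrarily.
EXTENSION_BY_MIME_TYPE = {
  "text/html"        => ".html",
  "text/plain"       => ".txt",
  "application/json" => ".json",
  "image/jpeg"       => ".jpg",
  "image/png"        => ".png"
}.freeze

# Map a Content-Type header value such as "text/html; charset=UTF-8"
# to a file extension, or to `fallback` when the type is missing or unknown.
def extension_for(content_type, fallback = ".bin")
  return fallback if content_type.nil?

  # Drop parameters like "; charset=UTF-8" and normalise the case.
  mime_type = content_type.split(";").first.to_s.strip.downcase
  EXTENSION_BY_MIME_TYPE.fetch(mime_type, fallback)
end

puts extension_for("text/html; charset=UTF-8")  # prints .html
```

This only tells you what kind of data came back, not what the file was called on the server.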