Retrieving file extension from default document

Rafa_F · December 12, 2015, 3:59am

Hi everyone,

First post, noob here. Just learning Ruby and I’ve been struggling with
this for over an hour now and I must be missing something. I need to
get the file extension of the default document for a website. It’s easy
enough when you supply the extension in the URI request like:

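require 'open-uri'

$URL = 'http://www.example.com/index.html'
# URI.open needs Ruby 2.5+; plain open(url) stopped accepting URLs in Ruby 3.0
urlConn = URI.open($URL)

puts File.extname($URL)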

But I need to be able to grab the extension of the default document,
when it’s not explicitly provided:

$URL = 'http://www.example.com/'

I know I can wget the file, which by default downloads the default
document and saves it with the proper name and extension. Since I
couldn’t find the RIGHT way to discover this otherwise, I tried writing
out the IO stream, but that requires you to specify the destination file
name, so that’s no use.

IO.copy_stream(urlConn, 'website.html')

Anyone have any ideas?

Thanks much!

justinskolnick · December 15, 2015, 2:40pm

What is the “default document”? The one that a web server returns when
you request only the website’s root URI?

IMO this is impossible to do without actually fetching it, because it
depends on the configuration of the web server.

justinskolnick · December 15, 2015, 2:47pm

Hi Ronald,

Thanks for the reply!

I agree, we have to get the file for sure, but I want to be able to get
the file without explicitly requesting the file. I want to request:

http://www.example.com/

And determine that the server responded with:

http://www.example.com/index.php

So far the only workaround I’ve found is to loop through guesses for the
default document (index.html, index.htm, default.aspx, etc.) and stop
at the first one that returns a success response code. Not ideal.

If anyone has a better idea, I’d love to hear it.
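
One case that can be detected from the client is a real HTTP redirect: if the server answers the request for / with a 3xx response pointing at /index.php, open-uri follows it and exposes the final URI as base_uri. The sketch below assumes that behaviour; the helper names path_extension and redirected_extension are made up for illustration. If the server rewrites the request internally without redirecting, the final URI is still /, and this returns nil.

```ruby
require 'open-uri'
require 'uri'

# Extension of the path component of a URI, or nil when the path has none
# (for example "/" or "/docs/"). The query string is ignored.
def path_extension(uri)
  ext = File.extname(URI(uri.to_s).path.to_s)
  ext.empty? ? nil : ext
end

# Hypothetical helper: fetches url and reports the extension of the URI that
# was finally served after any redirects. Needs network access.
def redirected_extension(url)
  URI.open(url) do |response|
    # base_uri is the URI of the final response, after redirects were followed
    path_extension(response.base_uri)
  end
end
```

Calling redirected_extension('http://www.example.com/') only yields an extension when the server actually redirected to a URI whose path has one.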

justinskolnick · December 15, 2015, 3:39pm

Not only not ideal, but plain wrong. Imagine that both index.html and
index.htm exist. How do you know which one is the “default” document?

And what about the cases where the wget request redirects to a default
CGI script, which then generates the HTML that is sent back? From the
viewpoint of the client, this is indistinguishable from the case where
a static HTML file is returned.

I think that, in the general case, you can never reliably find out the
name of a document fetched via HTTP. You request something and you get
something in return, but that doesn’t mean the thing being returned had
the same name on the web server.

Ronald

justinskolnick · December 15, 2015, 6:33pm

Use Ruby with curb (the libcurl bindings) to make an HTTP GET request to
http://example.com and inspect the headers the server returns.

require 'curb'
puts Curl::Easy.http_get("http://www.example.com/").header_str

HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: max-age=604800
Content-Type: text/html
Date: Tue, 15 Dec 2015 17:18:13 GMT
Etag: "359670651+gzip"
Expires: Tue, 22 Dec 2015 17:18:13 GMT
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Server: ECS (iad/182A)
Vary: Accept-Encoding
X-Cache: HIT
x-ec-custom-error: 1
Content-Length: 1270

The most you can tell is that it returns a text/html document (see the
Content-Type header). As you can see, the server does not tell you what
it is doing internally or how it generates this data when you make a GET
request to http://example.com. In particular, you can create a simple
server with Ruby sockets that responds with a pre-defined response that
is not read from any file and may or may not include a Content-Type
header, but see
Content-Type fields in MIME:

Default RFC 822 messages are typed by this protocol as plain text in the
US-ASCII character set, which can be explicitly specified as
"Content-type: text/plain; charset=us-ascii". If no Content-Type is
specified, either by error or by an older user agent, this default is
assumed.

If no Content-Type is given, assume text/plain; charset=us-ascii.
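
To make the socket-server point concrete, here is a minimal sketch of a server whose response comes from a string in memory and carries no Content-Type header at all. The method name serve_canned and the body text are invented for the example; it listens on the loopback interface only and serves a single request.

```ruby
require 'socket'
require 'net/http'

# Serve `count` requests with a canned response that is not read from any file
# and deliberately has no Content-Type header.
def serve_canned(server, body, count)
  count.times do
    client = server.accept
    # Read and discard the request line and headers, up to the blank line.
    loop do
      line = client.gets
      break if line.nil? || line == "\r\n"
    end
    head = ["HTTP/1.1 200 OK",
            "Content-Length: #{body.bytesize}",
            "Connection: close"].join("\r\n")
    client.write(head + "\r\n\r\n" + body)
    client.close
  end
end

canned_body = "<html><body>hello</body></html>"
server = TCPServer.new('127.0.0.1', 0)   # port 0: let the OS pick a free port
port = server.addr[1]
thread = Thread.new { serve_canned(server, canned_body, 1) }

http = Net::HTTP.new('127.0.0.1', port, nil)  # nil: ignore proxy settings from the environment
response = http.get('/')
thread.join
server.close

puts response.code
puts response['Content-Type'].inspect
puts response.body
```

From the client side there is no file name or extension to discover here: the only things on the wire are the status line, two headers, and the body.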

You can use the MIME type from the Content-Type header to select an
appropriate file name, but some MIME types have multiple extensions.
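
A rough sketch of that last step, assuming a small hand-written lookup table (it is illustrative only and far from a complete registry; the constant and method names are made up). It strips parameters such as charset from the header value and falls back to text/plain when the header is missing, following the MIME default quoted above.

```ruby
# Illustrative table only. Some MIME types map to more than one extension,
# so each value is a list with a preferred extension first.
EXTENSIONS_BY_MIME_TYPE = {
  'text/html'        => ['.html', '.htm'],
  'text/plain'       => ['.txt'],
  'image/jpeg'       => ['.jpg', '.jpeg'],
  'application/json' => ['.json']
}.freeze

# Default assumed when no Content-Type header is present.
DEFAULT_MIME_TYPE = 'text/plain'

# "text/html; charset=UTF-8" -> "text/html"; nil or blank -> the default.
def mime_type_from(content_type)
  type = content_type.to_s.split(';').first.to_s.strip.downcase
  type.empty? ? DEFAULT_MIME_TYPE : type
end

# All extensions known for the header value; empty when the type is not in the table.
def candidate_extensions(content_type)
  EXTENSIONS_BY_MIME_TYPE.fetch(mime_type_from(content_type), [])
end

p candidate_extensions('text/html; charset=UTF-8')
p candidate_extensions(nil)
```

Note that this picks an extension that suits the content; it still says nothing about what the document was called on the server.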