Forum: Ruby net::http and caching files

Shea Martin (Guest)
on 2007-08-02 17:25
(Received via mailing list)
I would like to open a webpage, only if the page is newer than what I
already have.

It looks like I have to get the whole page to get the last_modified
value.  I can't see any other way to get the value, say from a
Net::HTTP::Head request.

I was hoping to save some bandwidth, am I SOL?

~S
Lionel Bouton (Guest)
on 2007-08-02 17:36
(Received via mailing list)
You don't use the last_modified value like that. You make a simple GET,
but you pass headers that tell the server to send the whole page only if
the content has been modified.

So you'll have to
1/ store the 'last-modified' and 'etag' headers from the response (when
the page has been modified, on the first fetch, or once the server
starts sending them).
2/ put them in the headers of your GET request when you have them, like
this:

headers = {}
headers["If-Modified-Since"] = last_modified if last_modified
headers["If-None-Match"] = etag if etag

3/ check whether response.is_a?(Net::HTTPNotModified); if it is, the
server answered 304 and the copy you already have is still current.
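
The three steps above can be put together roughly as follows. This is
only a sketch: the helper names conditional_headers and fetch_if_changed
are made up for illustration, it assumes a reasonably recent Ruby where
Net::HTTP.start accepts use_ssl: and get returns just the response
object, and it does not follow redirects.

```ruby
require 'net/http'
require 'uri'

# Step 2: build the conditional headers from whatever validators we stored.
# Both arguments are the raw header strings from an earlier response, or nil.
def conditional_headers(last_modified, etag)
  headers = {}
  headers['If-Modified-Since'] = last_modified if last_modified
  headers['If-None-Match'] = etag if etag
  headers
end

# Hypothetical helper: GET url, sending the stored validators if any.
# Returns [body, last_modified, etag]; body is nil when the server
# answered 304 Not Modified, meaning the cached copy is still good.
def fetch_if_changed(url, last_modified = nil, etag = nil)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https') do |http|
    http.get(uri.request_uri, conditional_headers(last_modified, etag))
  end

  if response.is_a?(Net::HTTPNotModified) # step 3
    [nil, last_modified, etag]
  else
    # Step 1: remember the validators for the next request.
    [response.body, response['last-modified'], response['etag']]
  end
end
```

On the first call you pass only the URL; on later calls you pass back
the last_modified and etag values the previous call returned, and a nil
body tells you nothing has changed.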

Lionel
Shea Martin (Guest)
on 2007-08-02 17:40
(Received via mailing list)
It looks like Net::HTTP::Options might be what I want; I'm just trying
to decipher the docs for it now.

~S
Shea Martin (Guest)
on 2007-08-02 19:06
(Received via mailing list)
Lionel Bouton wrote:
>> ~S
> 2/ put them in the headers of your get request when you have them, like
> that:
>
> headers = {}
> headers["If-Modified-Since"] = last_modified if last_modified
> headers["If-None-Match"] = etag if etag
>
> 3/ check that response.is_a?(Net::HTTPNotModified)

Just read what 'etag' is.  Do I actually need mtime, if I have etag?

Thanks,

~S
Lionel Bouton (Guest)
on 2007-08-02 19:14
(Received via mailing list)
>
> Just read what 'etag' is.  Do I actually need mtime, if I have etag?

Depends on the server on the other side. Both serve roughly the same
purpose ('last_modified' can't be reliably parsed as an accurate date,
as there are servers with inaccurate clocks or bad timezone settings),
but either one can be used at the server's discretion. If you don't know
in advance which server you'll fetch information from and which header
it will respond with, better to implement support for both.
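
One way to support both without ever parsing the date is to store the
header values as opaque strings and echo them back unchanged. The sketch
below does that with a small JSON file; the ValidatorStore class, its
method names and the file layout are all assumptions made for this
example, not part of Net::HTTP.

```ruby
require 'json'

# Hypothetical on-disk store for HTTP validators, keyed by URL.
# The header values are kept as opaque strings and never parsed as dates.
class ValidatorStore
  def initialize(path)
    @path = path
    @data = File.exist?(path) ? JSON.parse(File.read(path)) : {}
  end

  # Remember whichever validators the server sent; either may be nil.
  def remember(url, last_modified, etag)
    entry = {}
    entry['last_modified'] = last_modified if last_modified
    entry['etag'] = etag if etag
    @data[url] = entry
    File.write(@path, JSON.generate(@data))
  end

  # Headers to send on the next request for url ({} if nothing is stored).
  def headers_for(url)
    entry = @data.fetch(url, {})
    headers = {}
    headers['If-Modified-Since'] = entry['last_modified'] if entry['last_modified']
    headers['If-None-Match'] = entry['etag'] if entry['etag']
    headers
  end
end
```

A server that only sends one of the two headers then simply gets that
one back, and a server that sends neither gets a plain GET.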

Lionel
Shea Martin (Guest)
on 2007-08-02 20:36
(Received via mailing list)
I have just tried about 20 servers (random urls), and have not seen an
etag or last_modified on any of them.  Are there really that few
servers which support the two?

Am I doing something wrong?

I am on win32 if it matters.

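CODE:
require 'net/http'
require 'open-uri'  # only needed for the commented-out URI.open variant

# h = {}
# h['If-Modified-Since'] = 'Thu, 09 Aug 2007 17:33:40 GMT'
http = Net::HTTP.new( "www.google.com" )
# On Ruby 1.9 and later, get returns only the response (body: resp.body)
resp = http.get( "/index.html" )
p "r is #{resp}"
p "code is #{resp.code}"
# Print every response header; names are yielded in lowercase
resp.each_header { |k,v| p "#{k} = #{v}" }

#URI.open( "http://google.ca" ) do |f|
#  p f.last_modified
#end

exit 0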


~S
Lionel Bouton (Guest)
on 2007-08-02 21:46
(Received via mailing list)
Of course Google won't send you last_modified or etag headers, since
they don't have documents to tag with an etag or to mark as generated at
a given time. If they want to optimize bandwidth they are far more
likely to use cache-control headers, which they do for their main page
with:
"Cache-Control: private"

Look at RSS or Atom feeds: between 1/3 and 1/2 of them have
"last-modified" headers, and they are "modified" when a new article or
comment is posted...
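
To see which of these headers a given feed actually offers without
downloading it, a HEAD request is enough, since it returns the headers
and no body. A rough sketch follows; cache_hints and probe are
hypothetical helper names, and it assumes the server answers HEAD with
the same headers it would send for GET, which not every server does.

```ruby
require 'net/http'
require 'uri'

# Pick the caching-related headers out of any Net::HTTP response.
# Header lookup with [] is case-insensitive; missing headers come back nil.
def cache_hints(response)
  {
    'last-modified' => response['last-modified'],
    'etag'          => response['etag'],
    'cache-control' => response['cache-control']
  }
end

# Hypothetical helper: ask for the headers only (HEAD transfers no body)
# and report which caching headers the server offers for url.
def probe(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
  cache_hints(response)
end
```

Calling probe on a feed URL returns a hash whose nil entries show which
headers that server does not send.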

Lionel.