Thinking about caching



I’ve been thinking about how to get the benefits of page caching back
without the pain of the same.

It occurred to me that, for a snappy response to ‘real’ readers, we
could probably get away with just caching the first page of /articles,
/articles/category/whatever, /articles/tag/whatever and so on, along
with their associated feeds and the full articles. It’s my gut feeling
that the vast majority of hits are to the first page of an index, or
directly to an article[1]. In order for this to work, we’d have to
ensure that none of our links were ever generated in the form
‘/articles/tag/whatever?some_arg=…’, but that’s merely tedious, not
necessarily a showstopper.
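
As a rough sketch of that rule, here is a plain Ruby predicate that decides whether a URL would be a candidate for the page cache. The name page_cacheable? and the URL patterns are assumptions made for illustration (in particular the date-based shape of an article permalink), and feed URLs are left out because their layout isn’t given here.

```ruby
require 'uri'

# Hypothetical patterns for the pages worth caching. The article
# permalink shape (/articles/YYYY/MM/DD/title) is an assumption.
CACHEABLE_PATTERNS = [
  %r{\A/articles/?\z},                          # first page of the main index
  %r{\A/articles/(category|tag)/[^/]+/?\z},     # first page of a category or tag index
  %r{\A/articles/\d{4}/\d{2}/\d{2}/[^/]+/?\z}   # a full article
]

# Hypothetical helper: true if the URL is a candidate for page caching.
def page_cacheable?(url)
  uri = URI.parse(url)
  return false if uri.query   # anything with a query string always goes to the app
  CACHEABLE_PATTERNS.any? { |pattern| pattern =~ uri.path }
end
```

With these patterns, ‘/articles/tag/whatever’ would be cached, while ‘/articles/tag/whatever?some_arg=1’ and ‘/articles/page/2’ would fall through to the application.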

However, there are problems with this approach because we’ve taken
advantage of fragment caching to introduce goodies like time limited
sidebars and future publication times[2]. One approach to cache
sweeping, suggested by Tobi on IRC, is to have each request process
fire off a sleeper thread, set to wake up at next expiry time (or in
an hour say, whichever is sooner) and zap the cache. Because every
dispatch thread would be doing this, it doesn’t really matter if one
of them gets killed before time. A double expiry is not really a
problem either: if the cache gets zapped twice, it will simply get
rebuilt.
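
A minimal sketch of the sleeper thread idea in plain Ruby follows. The names sleep_interval and start_cache_sleeper are hypothetical, and the block passed in stands for whatever actually zaps the cache; none of this is existing typo code.

```ruby
MAX_WAIT = 60 * 60 # one hour, in seconds

# Hypothetical helper: how long to sleep before the next sweep. Wakes at
# the next expiry time or after max_wait seconds, whichever is sooner.
def sleep_interval(next_expiry, now = Time.now, max_wait = MAX_WAIT)
  return max_wait if next_expiry.nil?   # nothing scheduled: use the cap
  wait = next_expiry - now
  wait = 0 if wait < 0                  # already overdue: sweep at once
  [wait, max_wait].min
end

# Hypothetical helper: start a thread that sleeps, then calls the block.
# Running the block twice is assumed to be harmless, as argued above.
def start_cache_sleeper(next_expiry, max_wait = MAX_WAIT, &zap)
  Thread.new do
    sleep(sleep_interval(next_expiry, Time.now, max_wait))
    zap.call
  end
end
```

Each request process would call something like start_cache_sleeper(next_expiry) { ... } and carry on; nothing ever needs to join the thread.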

Another possibility is to stick a bit of javascript in the default
layout that fires off a request to some kind of uncached heartbeat
action, so a page may get served up without touching typo (so the
browser gets a fast response) but typo still gets hit ‘at leisure’, so
to speak, and can do any posted cache sweeps/publication actions in an
after_filter. I like this idea rather less, because it doesn’t work for
feeds, and it’s the sort of suspicious javascript activity that gets
apps a bad name.

A third option is to have a separate process that gets fired from a
cron script or something and handles any cache sweeping required, but
I don’t think that’s going to fly for a lot of people using hosting
services.

Whatever option gets chosen, I reckon there’s a case for unifying the
handling of future events. Here’s a sketch of a possible approach:

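class Trigger < ActiveRecord::Base
  belongs_to :pending_item, :polymorphic => true

  class << self
    # Record that item.send(method) should happen at due_at, then fire
    # anything that is already due.
    def post_action(due_at, item, method='came_due')
      create!(:due_at => due_at, :pending_item => item,
              :trigger_method => method)
      fire
    end

    # destroy_all instantiates each due trigger and calls #destroy on it.
    def fire
      destroy_all 'due_at < now()'
      true
    end
  end

  # Run the pending action, then delete the row so it only fires once.
  # The column is called trigger_method here because an attribute named
  # 'method' would clash with Object#method.
  def destroy
    pending_item.send(trigger_method)
    super
  end
end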

This allows arbitrary ActiveRecord based objects to post trigger
requests. The application could then declare an after_filter that calls
‘Trigger.fire’ at the end of every request. Here’s an example of how
Article could take advantage of triggers:

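class Article
  # Post the trigger once the row exists, so that the polymorphic
  # association has an id to point at.
  def after_create
    Trigger.post_action(published_at || created_at, self, 'publish!')
  end

  after_save :ping_on_publication

  def publish!
    unless published?
      self.published = true
      self.save!
    end
  end

  # Remember a false-to-true transition so that notifications go out
  # after the save.
  def published=(publication_state)
    if publication_state && !published?
      @just_published = true
    end
    self[:published] = publication_state
  end

  def ping_on_publication
    if @just_published
      @just_published = false   # only notify once per publication
      send_notifications
    end
  end
end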

(Note that there’s no need to do the cache sweeping logic here
because the cache sweeper already handles that)
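
To play with the post/fire behaviour without a database, here is a small in-memory stand-in. TriggerQueue and its pending method are hypothetical names invented for this sketch; it only mirrors the intended semantics (run whatever is due, then forget it) and is not a replacement for the ActiveRecord version.

```ruby
# Hypothetical in-memory stand-in for the triggers table.
class TriggerQueue
  Entry = Struct.new(:due_at, :item, :action)

  def initialize
    @entries = []
  end

  # Record that item.send(action) should happen once due_at has passed,
  # then fire anything that is already due.
  def post_action(due_at, item, action = :came_due, now = Time.now)
    @entries << Entry.new(due_at, item, action)
    fire(now)
  end

  # Run and discard every entry whose due time has passed.
  def fire(now = Time.now)
    due, @entries = @entries.partition { |entry| entry.due_at < now }
    due.each { |entry| entry.item.send(entry.action) }
    true
  end

  # Number of entries still waiting to fire.
  def pending
    @entries.size
  end
end
```

Passing the current time in explicitly makes it easy to check that an entry posted for a future time stays pending until a later call to fire.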

We’ll have to do some fancy footwork to make sidebars work as
pending_items, but there’s virtue in making it happen. For instance,
an aggregation sidebar could check to see if anything had changed in
its target feed and only trigger a cache flush if there was something
new.
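
One way the aggregation sidebar check might look, as a plain Ruby sketch: the sidebar remembers a digest of the last feed body it saw and only flushes when the digest changes. The class layout and the two injected callables (one to fetch the feed, one to flush the cache) are assumptions made for illustration, not existing typo code.

```ruby
require 'digest/md5'

# Hypothetical sidebar acting as a pending item. fetch_feed and
# flush_cache are callables standing in for the real HTTP fetch and
# cache sweep, neither of which is shown here.
class AggregationSidebar
  def initialize(fetch_feed, flush_cache)
    @fetch_feed  = fetch_feed
    @flush_cache = flush_cache
    @last_digest = nil
  end

  # Called by a trigger when the sidebar comes due. Flushes the cache
  # only if the feed body differs from the last one seen, and returns
  # true if a flush happened.
  def came_due
    digest = Digest::MD5.hexdigest(@fetch_feed.call)
    return false if digest == @last_digest
    @last_digest = digest
    @flush_cache.call
    true
  end
end
```

A real version would also need to post its next trigger from came_due, which is left out here.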

Note too that, with this interface, the cron option is easy – the
commandline to handle everything is:

/typo_installation/script/runner -e production 'Trigger.fire'

Which seems pretty cute to me.

Thoughts? What have I missed?

  1. I would really appreciate it if anyone who can be bothered would go
    through their typo logs and quickly check for the relative
    frequency of non search engine crawler hits on any index pages
    after the first. I’m guessing that there’s a serious power law
    curve in effect here.

  2. Using the ‘created_at’ field, which should really, really, really
    be published_at or some such – overloading ‘created_at’ in this
    way is simply confusing.