Forum: Mongrel Non-Cache Handling for Bots

Posted by Mike Papper (Guest)
on 2009-06-11 00:54
(Received via mailing list)
Hello,

When a bot visits our page, we want to create a response that is different
from the response for a human. In particular, we want to limit the
hierarchies of menus so that the bot doesn't think there are too many
tags/keywords for the site (and thus they don't get indexed). We also do
not display ads for bots.

We cache most of our pages using standard rails caching.

The problem is that when a bot visits the site, it will get the standard
cached page with the incorrect menus. We want to work around this so that
bots do not get cached pages. (Yes, it would be good if the bots got a
bot-specific cached page, but let's keep it simple for now.)

We tried the following:
1) Using Apache rewrite to add /robot to the URL, so that Mongrel will never
find the cached page on disk. The problem here is that all the links in the
page then have /robot in front of them, because having Apache add /robot to
the URI results in the PATH_INFO as seen by Rails having /robot in it.
Another problem was that we effectively duplicated the set of rules in
routes.rb for the cases with /robot as a prefix.

We have 2 ideas:
1) When Apache detects a bot, send the request to a non-caching webserver.
Does anyone know of one?
2) I edited the Mongrel source code in mongrel-1.1.4/lib/mongrel/rails.rb
and added this kind of thing to the process method:

do_not_cache = KNOWN_ROBOT_AGENTS.detect { |b| user_agent.downcase.include?(b) } if user_agent

And then used that variable in the tests for @files.can_serve(...)

This works, but we still want Mongrel to serve static files as cached (so
the rules above can take care of this too; it just gets more complicated to
check for /stylesheets, /images, etc.).
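
A minimal sketch of that check in plain Ruby, outside Mongrel, may make the idea clearer. The contents of KNOWN_ROBOT_AGENTS are not shown in the post, so the list below is a placeholder, and STATIC_PREFIXES and the three method names are hypothetical names chosen for illustration. It does not touch Mongrel's own code; it only shows the decision "is this a bot, and is the path something other than a static asset?".

```ruby
# Placeholder list: the post does not show what KNOWN_ROBOT_AGENTS contains.
KNOWN_ROBOT_AGENTS = %w[googlebot slurp msnbot].freeze

# Hypothetical list of directories whose files should always be served as-is.
STATIC_PREFIXES = %w[/stylesheets /images /javascripts].freeze

# True when the User-Agent header contains one of the known bot markers.
def robot_agent?(user_agent)
  return false if user_agent.nil?
  ua = user_agent.downcase
  KNOWN_ROBOT_AGENTS.any? { |bot| ua.include?(bot) }
end

# True when the request path points into one of the static directories.
def static_asset?(path)
  STATIC_PREFIXES.any? { |prefix| path.start_with?("#{prefix}/") }
end

# Bots skip the page cache, except for static assets.
def bypass_page_cache?(user_agent, path)
  robot_agent?(user_agent) && !static_asset?(path)
end
```

A result like bypass_page_cache?(user_agent, path_info) could then stand in for the do_not_cache variable above, so that stylesheets and images are still served from disk for bots.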

------------
Question: is there a way to plug our own logic into the Mongrel process of
handling a request? And/or can we set up a specific Mongrel to never cache
(are there options for this)?

Any ideas are appreciated,

Mike
Posted by unknown (Guest)
on 2009-06-11 03:24
(Received via mailing list)
What kind of caches are you talking about?
Are these full page caches?  The kind that get stored into /public?

My question is: instead of adding /robot to the URL when Apache
finds the robot, can you change Apache's DocumentRoot?
It seems to me that this would prevent Apache from finding the cached
page.  Also, if you actually point to another copy of your /public, you
could get the normal static pages...

I think perhaps since you are talking about changing mongrel's caching
behaviour that you aren't talking about the page caches that get stored
into /public. (Well, I'm rusty on terminology here)

--
Michael Richardson <mcr@simtone.net>
Director -- Consumer Desktop Development, Simtone Corporation, Ottawa, Canada
Personal: http://www.sandelman.ca/mcr/

SIMtone Corporation fundamentally transforms computing into simple,
secure, and very low-cost network-provisioned services pervasively
accessible by everyone.  Learn more at www.simtone.net and www.SIMtoneVDU.com
Posted by Mike Papper (Guest)
on 2009-06-11 03:49
(Received via mailing list)
I am talking about Rails' standard page-caching mechanism. Rails by
default puts full pages into public/... and if Mongrel sees them
there, it serves them (without running Rails dispatch et al). This is
fine for normal visitors but not good for the bot user agents.

Here is a new solution:
1) Set Rails to cache in public/cache
2) Use Apache rewrite to serve these files directly (if found)
3) If not found, pass to Mongrel, which will not find the cached files
either since MONGREL ONLY LOOKS IN public for cached files. Mongrel
does not honor the config.action_controller.page_cache_directory Rails
setting
4) Rails processes the request and puts the page into public/cache/...

...on the next request, apache serves from cache.
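
The lookup order in those steps can be sketched as a small Ruby simulation. This is only a model under stated assumptions, not Apache or Mongrel code: the responder method, the BOT_MARKERS list, and the rule that Apache skips the cache when the user agent looks like a bot are all hypothetical, and cached pages are assumed to be stored as <path>.html under public/cache.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical bot markers; not taken from the thread.
BOT_MARKERS = %w[googlebot slurp msnbot].freeze

def bot?(user_agent)
  ua = user_agent.to_s.downcase
  BOT_MARKERS.any? { |marker| ua.include?(marker) }
end

# Returns which layer would answer the request:
# :apache_cache, :mongrel_static or :rails.
# Simplified: no query strings and no protection against "../" in paths.
def responder(public_dir, request_path, user_agent)
  rel = request_path.sub(%r{\A/}, "")
  rel = "index" if rel.empty?

  # Step 2: Apache serves the cached page directly, but (assumed here)
  # only when the user agent does not look like a bot.
  cached = File.join(public_dir, "cache", "#{rel}.html")
  return :apache_cache if !bot?(user_agent) && File.file?(cached)

  # Step 3: Mongrel only looks directly under public/, never public/cache,
  # so it still serves stylesheets and images but finds no cached pages.
  candidates = [File.join(public_dir, rel), File.join(public_dir, "#{rel}.html")]
  return :mongrel_static if candidates.any? { |file| File.file?(file) }

  # Step 4: nothing on disk matched, so Rails renders the page.
  :rails
end
```

With a cached public/cache/posts/1.html on disk, a browser request for /posts/1 would resolve to :apache_cache in this model, while the same request from a bot falls through to :rails.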

I am working on the rewrite rules etc. for this.

Mike
Posted by unknown (Guest)
on 2009-06-11 04:22
(Received via mailing list)
>>>>> "Mike" == Mike Papper <bodaro@gmail.com> writes:
    Mike> I am talking about Rails standard page-caching mechanism. Rails by default
    Mike> puts full pages into public/... and if mongrel sees them there, it serves
    Mike> them (without running rails dispatch et al). This is fine for normal but not
    Mike> good for the bot user agents.

right, so that's what I thought you were talking about.
Only, it's not mongrel that serves up the pages, but Apache, usually.
In your apache config, you have something like:

        # Rewrite all non-static requests to cluster
        RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_FILENAME} !-f
        RewriteRule ^/(.*)$ balancer://spartan_cluster%{REQUEST_URI} [P,QSA,L]

which basically serves up any files found in /public and otherwise punts
to Mongrel. I thought that Rails put the files directly there for
Apache to use/see. (There are caveats if your Mongrel and Apache do not
share the same file system, for example because they are on different
machines.)

If you are telling me that actually mongrel does this, it's news to me.

    Mike> Here is a new solution:
    Mike> 1) set rails to cache in public/cache
    Mike> 2) Use Apache rewrite to serve these files directly (if found)
    Mike> 3) If not found, pass to mongrel which will not find the cached files either
    Mike> since MONGREL ONLY LOOKS IN public for cached files. Mongrel does not honor
    Mike> the config.action_controller.page_cache_directory rails setting
    Mike> 4) Rails processes the file and puts it into public/cache/...

    Mike> ...on the next request, apache serves from cache.

    Mike> I am working on the rewrite rules etc. for this.

So, basically have apache pick a different cache location when it sees a
robot.

--
Michael Richardson <mcr@simtone.net>
Director -- Consumer Desktop Development, Simtone Corporation, Ottawa, Canada
Personal: http://www.sandelman.ca/mcr/
Posted by Shawn Hill (shawn)
on 2009-06-11 04:45
(Received via mailing list)
"Serving up different results based on user agent may cause your site to be
perceived as deceptive and removed from the Google index."
http://www.google.com/support/webmasters/bin/answe...