Serve *only* from cache for particular user-agents

I haven’t found any ideas for this and thought I might ask here. We have a
fairly straightforward proxy_cache setup with a proxy_pass backend. We
cache documents for different lengths of time, or go to the backend for
what’s missing. My problem is we’re getting overrun with bot and spider
requests. MSN in particular started hitting us exceptionally hard yesterday
and started bringing our backend servers down. Because they’re crawling the
site from end to end, our cache is missing a lot of those pages and nginx
has to pass the requests on through.
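
For context, it’s basically the standard proxy_cache pattern, roughly like
this (a sketch only; the zone name, path, and sizes here are placeholders,
not our real values):

    proxy_cache_path  /var/cache/nginx  levels=1:2  keys_zone=pagecache:100m
                      max_size=10g  inactive=7d;

    upstream productionrupal {
        server 10.0.0.10:8080;    # placeholder address
    }

    server {
        listen 80;

        location / {
            proxy_cache        pagecache;
            proxy_cache_valid  200 301 302  30m;   # different lifetimes per status
            proxy_cache_valid  404           1m;
            proxy_pass         http://productionrupal;
        }
    }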

I’m looking for a way to match on User-Agent and say that if it matches
certain bots, only serve out of proxy_cache. Ideally I’d like the logic to
be: if it’s in the cache, serve it; if it’s not, return some 4xx error. In
the case of those user-agents, don’t go to the backend at all. Only give
them cache. My first thought was something like…

    if ($http_user_agent ~* msn-bot) {
        proxy_pass http://devnull;
    }

by making a bogus backend. But in nginx 1.4.3 (that’s what we’re running)
I get

    nginx: [emerg] "proxy_pass" directive is not allowed here

Does anyone have another idea?

Thanks,
-Rick


Hello!

On Fri, Feb 21, 2014 at 10:25:58AM -0500, rge3 wrote:

> [...] certain bots to only serve out of proxy_cache. Ideally I’d like the logic
> [...]
> nginx: [emerg] "proxy_pass" directive is not allowed here
>
> Does anyone have another idea?

The message suggests you are trying to write the snippet above at
server{} level. Moving things into a location should do the
trick.
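
In addition, something like this (untested) gives the bots a dead end to
proxy to:

    upstream devnull {
        # nothing listens on this port, so a cache miss for a bot
        # fails fast with a 502 instead of reaching the real backend
        server 127.0.0.1:1;
    }

Note also that the default proxy_cache_key includes $proxy_host, which
will differ between the two proxy_pass targets.  Setting it explicitly
in the location, e.g.

    proxy_cache_key  "$scheme$host$request_uri";

lets the bots hit entries that were cached via the real backend.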

Please make sure to read the "If Is Evil… when used in location context" article though.


Maxim D.
http://nginx.org/

Maxim D. Wrote:

> The message suggests you are trying to write the snippet above at
> server{} level. Moving things into a location should do the trick.
>
> Please make sure to read the "If Is Evil… when used in location
> context" article though.

That seems to have done it! With a location block I now have…

    location / {
        proxy_cache_valid  200 301 302  30m;

        if ($http_user_agent ~* msn-bot) {
            proxy_pass http://devnull;
        }

        if ($http_user_agent !~* msn-bot) {
            proxy_pass http://productionrupal;
        }
    }

That seems to work perfectly. But is it a safe use of “if”? Is there a
safer way to do it without an if?

Thanks for the help!
-R


Hello!

On Fri, Feb 21, 2014 at 11:46:02AM -0500, rge3 wrote:

> [...]
>     if ($http_user_agent ~* msn-bot) {
>         proxy_pass http://devnull;
>     }
>
>     if ($http_user_agent !~* msn-bot) {
>         proxy_pass http://productionrupal;
>     }
> }

The second condition can be removed; it’s redundant. Just a

 location / {
     if (...) {
        proxy_pass ...
     }

     proxy_pass ...
 }

should be enough.
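
Spelled out with the names from your config (untested; the rest of your
proxy_cache settings stay as they are):

    location / {
        proxy_cache_valid  200 301 302  30m;

        if ($http_user_agent ~* msn-bot) {
            proxy_pass http://devnull;
        }

        proxy_pass http://productionrupal;
    }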

> That seems to work perfectly. But is it a safe use of "if"? Is there a
> safer way to do it without an if?

As long as that’s the full configuration of the location, there should be no problems.


Maxim D.
http://nginx.org/

You can use the map module (ngx_http_map_module).

Ex:

map $http_user_agent $mobile {
    ~*msn-bot    http://devnull;
    default      http://productionrupal;
}
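
Then the location can just proxy to the mapped value. Something like this
should work, since devnull and productionrupal are defined as upstream{}
blocks and nginx doesn’t need a resolver to look them up:

    location / {
        proxy_cache_valid  200 301 302  30m;

        # keep an explicit proxy_cache_key if you use one, so both
        # targets share the same cache entries
        proxy_pass  $mobile;
    }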

Thanks,

Ajay K

ajay Wrote:

> You can use the map module (ngx_http_map_module).
>
> Ex:
>
> map $http_user_agent $mobile {
>     ~*msn-bot    http://devnull;
>     default      http://productionrupal;
> }

Actually that worked perfectly! Then I can do it entirely without the
‘if’.

Thanks Ajay and Maxim. I appreciate all the help!
-R


On 2/21/2014 7:25 AM, rge3 wrote:

> I haven’t found any ideas for this and thought I might ask here. We have a
> fairly straightforward proxy_cache setup with a proxy_pass backend. We
> cache documents for different lengths of time, or go to the backend for
> what’s missing. My problem is we’re getting overrun with bot and spider
> requests. MSN in particular started hitting us exceptionally hard yesterday
> and started bringing our backend servers down. Because they’re crawling the
> site from end to end, our cache is missing a lot of those pages and nginx
> has to pass the requests on through.

Are they ignoring your robots.txt?
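
If they aren’t, a Crawl-delay rule in robots.txt is worth trying before
(or alongside) the cache-only approach; msnbot honors it. A sketch, with
an arbitrary delay value:

    User-agent: msnbot
    Crawl-delay: 10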