Bot-taming: which rules should apply, and in which order?

aris · July 29, 2012, 6:20pm

I’m attempting to tame (minimize or eliminate) Yandex bot access.

I’d like to understand how the rules I apply are evaluated, and which one takes precedence.

To my site config I’ve added:

map $http_user_agent $bad_bot {
    default 0;
    ~(Yandex|YandexBot) 1;
}

map $http_referrer $bad_referrer {
    default 0;
    ~*(yandex) 1;
}

valid_referers mydomain.com *.mydomain.com localhost 127.0.0.1 [::1];

location / {
    if ($bad_bot) { return 403; }
    if ($bad_referrer) { return 403; }
    if ($invalid_referer) { return 444; }
    …
}

and

cat /robots.txt
User-agent: *
Disallow: /

cat /robot_ssl.txt
User-agent: *
Disallow: /

In my logs I see repeating ‘444’ rejections:

100.43.83.148 - - [28/Jul/2012:06:02:14 -0500] GET /robots.txt HTTP/1.1 "444" 0 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
100.43.83.148 - - [28/Jul/2012:06:06:23 -0500] GET /robots.txt HTTP/1.1 "444" 0 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"

With my rules above, I’d expect that to be a ‘403’ rejection: the user agent contains “YandexBot”, so the “$bad_bot” check, which comes first in the location block, should match.

Why am I seeing the ‘444’ instead of the ‘403’?
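
One way to narrow this down might be to log the values of the three variables next to the response status, so each request shows which check could have fired. The following is only a sketch: the format name `botdebug` and the log file path are assumptions and not part of the config above, and the `log_format` line would have to go in the http block.

```nginx
# http {} context. "botdebug" is a hypothetical format name.
# $bad_bot and $bad_referrer come from the map blocks above;
# $invalid_referer is set by valid_referers (empty when the referer is valid, "1" otherwise).
log_format botdebug '$remote_addr [$time_local] "$request" $status '
                    'bad_bot=$bad_bot bad_referrer=$bad_referrer '
                    'invalid_referer=$invalid_referer "$http_user_agent"';

# server {} context. The log path is an assumption.
access_log /var/log/nginx/botdebug.log botdebug;
```

If a Yandex request then shows up with bad_bot=1 and a ‘444’ status, that would suggest the ‘444’ is returned somewhere before the “$bad_bot” check in location / is reached.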