Forum: NGINX limit_req for spiders only

Toni Mueller (Guest)
on 2013-10-14 13:59
(Received via mailing list)
Hi,

I would like to put a brake on spiders that are hammering a site with
dynamic content generation. They should still get to see the content,
just not generate excessive load. I therefore constructed a map to
but only not generate excessive load. I therefore constructed a map to
identify spiders, which works well, and then tried to

limit_req_zone $binary_remote_addr zone=slow:10m ...;

if ($is_spider) {
    limit_req zone=slow;
}


Unfortunately, limit_req is not allowed inside "if", and I don't see an
obvious way to achieve this effect otherwise.

If you have any tips, that would be much appreciated!


Kind regards,
--Toni++
Sylvia (Guest)
on 2013-10-14 15:25
(Received via mailing list)
Hello.

Doesn't the robots.txt "Crawl-Delay" directive satisfy your needs?
Normal spiders should obey robots.txt; if they don't, they can be
banned.
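
For reference, a minimal robots.txt along those lines might look like
the following (the 10-second delay is only an example value; note that
Crawl-delay is a non-standard directive and not every crawler honours it):

  User-agent: *
  Crawl-delay: 10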

Posted at Nginx Forum:
http://forum.nginx.org/read.php?2,243670,243674#msg-243674
Toni Mueller (Guest)
on 2013-10-14 16:03
(Received via mailing list)
Hello,

On Mon, Oct 14, 2013 at 09:25:24AM -0400, Sylvia wrote:
> Doesn't the robots.txt "Crawl-Delay" directive satisfy your needs?

I already have it in there, but I don't know how long such a
directive, or any change to robots.txt for that matter, takes to take
effect. Judging by the logs, the delay between changing robots.txt and
a change in robot behaviour must be several days, as I cannot see any
effect so far.

> Normal spiders should obey robots.txt; if they don't, they can be banned.

Banning Google is not a good idea, no matter how abusive they might be,
and they incidentally operate one of the robots that keeps hammering
the site. I'd much prefer a technical solution that enforces such limits
over relying on convention.

I'd also like to limit the request frequency over an entire pool, so
that I can say "clients from this pool can make requests only at this
frequency, combined, not per client IP". It doesn't buy me anything to
limit each individual search robot to a decent frequency if I then get
hammered by 1000 search robots in parallel, each one observing the
request limit. Right?


Kind regards,
--Toni++
Francis Daly (Guest)
on 2013-10-14 16:23
(Received via mailing list)
On Mon, Oct 14, 2013 at 01:59:23PM +0200, Toni Mueller wrote:

Hi there,

This is untested, but follows the docs at
http://nginx.org/r/limit_req_zone:

> I therefore constructed a map to
> identify spiders, which works well, and then tried to
>
> limit_req_zone $binary_remote_addr zone=slow:10m ...;
>
> if ($is_spider) {
>     limit_req zone=slow;
> }
>

> If you have any tips, that would be much appreciated!

In your map, let $is_spider be empty if it is not a spider ("default",
presumably), and be something else if it is a spider (possibly
$binary_remote_addr if every client should be counted individually,
or something else if you want to group some spiders together).

Then define

  limit_req_zone $is_spider zone=slow:10m ...;

instead of what you currently have.
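
Putting it together, an untested sketch of the whole thing might look
like this (the user-agent patterns, the 30r/m rate and the burst value
are placeholders, not something from this thread):

  map $http_user_agent $is_spider {
      # empty key = request is not accounted by limit_req_zone,
      # so normal visitors are unaffected
      default                         "";
      # placeholder patterns; $binary_remote_addr gives a per-spider limit
      ~*(googlebot|bingbot|yandex)    $binary_remote_addr;
  }

  limit_req_zone $is_spider zone=slow:10m rate=30r/m;

  server {
      ...
      location / {
          limit_req zone=slow burst=5;
          ...
      }
  }

Using one fixed string (e.g. "spider") as the map value instead of
$binary_remote_addr would make all matching clients share a single
combined limit, which is the "whole pool" behaviour asked about above.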

  f
--
Francis Daly        francis@daoine.org
Toni Mueller (Guest)
on 2013-10-14 16:52
(Received via mailing list)
Hi Francis,

On Mon, Oct 14, 2013 at 03:23:03PM +0100, Francis Daly wrote:
> In your map, let $is_spider be empty if it is not a spider ("default",
> presumably), and be something else if it is a spider (possibly
> $binary_remote_addr if every client should be counted individually,
> or something else if you want to group some spiders together).

thanks a bunch! This works like a charm!


Kind regards,
--Toni++