Limit_req for spiders only

Hi,

I would like to put a brake on spiders that are hammering a site with
dynamic content generation. They should still get to see the content,
just not generate excessive load. I therefore constructed a map to
identify spiders, which works well, and then tried to

limit_req_zone $binary_remote_addr zone=slow:10m …;

if ($is_spider) {
    limit_req zone=slow;
}

Unfortunately, limit_req is not allowed inside “if”, and I don’t see an
obvious way to achieve this effect otherwise.

If you have any tips, that would be much appreciated!

Kind regards,
–Toni++

Hello.

Doesn’t the robots.txt “Crawl-Delay” directive satisfy your needs?
Normal spiders should obey robots.txt; if they don’t, they can be
banned.
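
For reference, Crawl-Delay goes into robots.txt like this (it is a
nonstandard extension; the value is the minimum number of seconds
between requests, for crawlers that honor it):

User-agent: *
Crawl-delay: 10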


Hello,

On Mon, Oct 14, 2013 at 09:25:24AM -0400, Sylvia wrote:

Doesn’t the robots.txt “Crawl-Delay” directive satisfy your needs?

I already have it in there, but I don’t know how long it takes for such
a directive, or any change to robots.txt for that matter, to take
effect. Judging by the logs, the delay between changing robots.txt and
a change in robot behaviour is at least several days, as I cannot see
any effect so far.

Normal spiders should obey robots.txt; if they don’t, they can be banned.

Banning Google is not a good idea, no matter how abusive they might be,
and they incidentally operate one of the robots that keep hammering the
site. I’d much prefer a technical solution that enforces such limits
over relying on convention.

I’d also like to limit the request frequency over an entire pool, so
that I can say “clients from this pool can make requests only at this
frequency, combined, not per client IP”. It doesn’t buy me anything if
I can limit an individual search robot to a decent frequency, but then
get hammered by 1000 search robots in parallel, each one observing the
limit. Right?
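
In pseudo-config, something like this is what I have in mind (the
shared “spiders” key is made up; every matching client would map to
the same constant value and thus draw from a single bucket):

map $http_user_agent $spider_pool {
    default               "";        # not a spider
    ~*(bot|crawl|spider)  "spiders"; # all spiders share one key, hence one bucket
}

limit_req_zone $spider_pool zone=pool:1m rate=30r/m;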

Kind regards,
–Toni++

On Mon, Oct 14, 2013 at 01:59:23PM +0200, Toni M. wrote:

Hi there,

This is untested, but follows the docs at
http://nginx.org/r/limit_req_zone:

I therefore constructed a map to
identify spiders, which works well, and then tried to

limit_req_zone $binary_remote_addr zone=slow:10m …;

if ($is_spider) {
    limit_req zone=slow;
}

If you have any tips, that would be much appreciated!

In your map, let $is_spider be empty if the client is not a spider
(“default”, presumably), and be something else if it is a spider
(possibly $binary_remote_addr if every client should be counted
individually, or something else if you want to group some spiders
together).

Then define

limit_req_zone $is_spider zone=slow:10m …;

instead of what you currently have.
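
Put together, it could look like this untested sketch (the user-agent
patterns, rate and burst values are only placeholders). Requests whose
key evaluates to an empty string are not counted against the zone, so
ordinary clients pass through unlimited:

map $http_user_agent $is_spider {
    default                          "";                   # not a spider: no limit
    ~*(googlebot|bingbot|yandexbot)  $binary_remote_addr;  # one bucket per spider IP
    # map to a constant string instead to give all spiders one shared bucket
}

limit_req_zone $is_spider zone=slow:10m rate=30r/m;

server {
    location / {
        limit_req zone=slow burst=5;
        ...
    }
}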

f

Francis D.

Hi Francis,

On Mon, Oct 14, 2013 at 03:23:03PM +0100, Francis D. wrote:

In your map, let $is_spider be empty if the client is not a spider
(“default”, presumably), and be something else if it is a spider
(possibly $binary_remote_addr if every client should be counted
individually, or something else if you want to group some spiders
together).

thanks a bunch! This works like a charm!

Kind regards,
–Toni++