Stopping bots and wget in nginx

Hi guys, I am new to the list. Is there a way to stop or block bot and
wget access to an nginx web server? Thanks.

att: alex

On Tuesday, 18 December 2007, Alexis Torres Garnica wrote:

Hi guys, I am new to the list. Is there a way to stop or block bot and
wget access to an nginx web server? Thanks.

att: alex

If by “block bots” you mean “block requests based on the User-Agent”, you
can do it by setting up something like this:

            if ($http_user_agent ~ libwww-perl ) {
                    return 400;
            }

(just an example, of course)
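
To check that a rule like this actually matches, you can forge the
User-Agent yourself and make sure the 400 comes back while a normal
browser request still goes through, for example (the hostname is only a
placeholder):

            # yourserver is a placeholder; -S prints the server response
            wget -S --user-agent="libwww-perl" http://yourserver/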

wget (and many other user agents) respects robots.txt if you place it
at /robots.txt:

http://www.robotstxt.org/orig.html
http://en.wikipedia.org/wiki/Robots.txt
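
For instance, a minimal robots.txt that asks every compliant crawler to
stay away from the whole site looks like this (purely illustrative; most
sites will only want to Disallow a few specific paths):

            # served from the document root as /robots.txt
            User-agent: *
            Disallow: /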

Of course malicious agents will ignore it and continue scraping your
site. It’s pretty hard to block these kinds of bots since they can
mimic browser requests that would be difficult to disambiguate from
normal user requests.

On Tuesday, 18 December 2007, Eden Li wrote:

wget (and many other user agents) respects robots.txt if you place it
at /robots.txt:

http://www.robotstxt.org/orig.html
http://en.wikipedia.org/wiki/Robots.txt

Of course malicious agents will ignore it and continue scraping your
site. It’s pretty hard to block these kinds of bots since they can
mimic browser requests that would be difficult to disambiguate from
normal user requests.

That’s true. But if you look carefully at the logs of a typical web site,
most of the weird URLs come from a small subset of specific user agents
(basically, scripts run by people who barely have a clue what they are
doing). While I agree that several tools respect robots.txt, those are the
“good” ones, and I see no point in stopping them. On the other side,
malicious tools that fake the user agent are really difficult to stop, and
you have to rely on a good configuration of the system. In the middle lies
a large number of hits coming from specific user agents, mostly trying to
do pretty harmless things (bounce attacks, etc.). That kind of visitor can
be kept out by a simple configuration line, and given the high rate of
such hits it can be worth using that countermeasure (naive as it is).
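
For example, a single catch-all rule along these lines is enough to keep
the noisiest scripted clients out (the agent list is only illustrative;
pick the names from your own logs). Returning 444 makes nginx close the
connection without sending any response:

        # agent names are examples only; 444 = drop connection, no reply
        if ($http_user_agent ~* "libwww-perl|WebCopier|Teleport Pro") {
                return 444;
        }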

On Thu, Dec 20, 2007 at 11:26:29AM -0600, Alexis Torres Garnica wrote:

Hi guys, and thanks for your answers. I added the robots.txt to my
/home/htdocs =) Thanks. I have these rewrite rules:

        if ($http_user_agent = "Wget/*") {
                return 403;
        }

        if ($http_user_agent = "Teleport Pro") {
                return 403;
        }

        if ($http_user_agent = "WebCopier") {
                return 403;
        }

I did a simple test with wget, but I can still download files with it. I
tried with wget, Wget, Wget* and finally Wget/*, but it is not working.

Nice day.
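
For what it’s worth, in an nginx “if” the “=” operator is an exact string
comparison, not a glob, so "Wget/*" only matches a User-Agent that is
literally that string, while wget actually sends something like
"Wget/1.10.2". A regex match with “~”, as in Fabio’s earlier example, is
probably what was intended; a sketch along those lines, folding the three
agents into one pattern:

        # "~" does a regex match; "=" would require the exact string
        if ($http_user_agent ~ "Wget/|Teleport Pro|WebCopier") {
                return 403;
        }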
