Nokogiri ip ban?

Hi there…

I’ve been playing with Ruby and Nokogiri to crawl some websites and
extract text, but after a while I realized that some of those websites
block my access while the script is running. Once they block the
access, the script keeps running (because I handle the exception), but
it isn’t getting what it’s supposed to.
After the block, if I try to access the site in a browser, I can’t, so
I guess they block the IP address, right?

I also tried using Tor, like this:

Nokogiri::HTML(open(url, :proxy => 'http://(ip_address):(port)'))

But I still have the same problem: it works in the beginning, but after
a while it stops working.

I could just run the crawler in steps, so I don’t make lots of calls to
the website at the same time, but that’s kind of boring… :)
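Running in steps can be automated with a small helper. This is only a sketch (the Throttle class and the interval value are my own choices, not anything Nokogiri provides): it forces a minimum pause between successive requests so the crawler never bursts.

```ruby
# Minimal request throttle: guarantees at least `interval` seconds
# between successive requests, so the crawler never bursts.
class Throttle
  def initialize(interval)
    @interval = interval
    @last = nil
  end

  def wait
    if @last
      elapsed = Time.now - @last
      sleep(@interval - elapsed) if elapsed < @interval
    end
    @last = Time.now
  end
end

# Usage inside a crawl loop (open-uri/Nokogiri as elsewhere in the thread):
#   throttle = Throttle.new(5)  # at least five seconds between hits
#   urls.each do |url|
#     throttle.wait
#     doc = Nokogiri::HTML(open(url))
#     # ... extract text ...
#   end
```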

Have any of you faced the same problem? Do you have a solution for it?

thanks,

Luis

Bombing a web server the way you describe is not advisable in any way.
They’re clearly not happy with what you’re doing, so either lower the
frequency of your requests or ask them directly about your needs; they
might be willing to let you run your script more often, or even give
you the raw data directly.

Andrea D.

On 23/11/2010 12:19, Luis G. wrote:

On Tue, Nov 23, 2010 at 12:19 PM, Luis G. [email protected] wrote:

Any of you face the same problem? Any of you have a solution for this?

Run your crawler in steps (don’t crawl the whole site, and only grab
what is new; that’s what the Age header is for!), and respect
robots.txt.
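Honoring robots.txt is easy enough to sketch by hand. The functions below are a deliberately minimal illustration (the names are my own; they ignore Allow rules, per-agent overrides, and wildcards, so a real crawler should use a full parser such as the robotstxt gem):

```ruby
# Deliberately minimal robots.txt check: collect the Disallow rules in
# the wildcard ("User-agent: *") section and match a path against their
# prefixes. Ignores Allow rules and wildcards -- a sketch, not a parser.
def disallowed_paths(robots_txt)
  active = false
  rules = []
  robots_txt.each_line do |line|
    line = line.sub(/#.*/, '').strip
    next if line.empty?
    field, value = line.split(':', 2).map(&:strip)
    next if value.nil?
    case field.downcase
    when 'user-agent' then active = (value == '*')
    when 'disallow'   then rules << value if active && !value.empty?
    end
  end
  rules
end

def allowed?(robots_txt, path)
  disallowed_paths(robots_txt).none? { |rule| path.start_with?(rule) }
end
```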

Otherwise, well, you get what you deserve, if you hog a server’s CPU
cycles and create a Denial of Service attack (nobody cares if it is by
accident or by design).
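The “only grab what is new” part is usually done with an HTTP conditional GET: send an If-Modified-Since header and skip any page that comes back 304 Not Modified. A sketch with Ruby’s stdlib Net::HTTP (the commented loop shows roughly where it would plug into a crawler like the one in this thread):

```ruby
require 'net/http'
require 'time'
require 'uri'

# Build a conditional GET: if the page has not changed since `last_seen`,
# the server can answer 304 Not Modified with no body to parse at all.
def conditional_get(url, last_seen)
  uri = URI.parse(url)
  req = Net::HTTP::Get.new(uri.request_uri)
  req['If-Modified-Since'] = last_seen.httpdate
  req
end

# Roughly how it plugs into the crawl loop:
#   res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
#   next if res.is_a?(Net::HTTPNotModified)  # nothing new, skip this page
#   doc = Nokogiri::HTML(res.body)
```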

Phillip G.

Though the folk I have met,
(Ah, how soon!) they forget
When I’ve moved on to some other place,
There may be one or two,
When I’ve played and passed through,
Who’ll remember my song or my face.

On Tue, Nov 23, 2010 at 2:19 PM, Luis G. [email protected] wrote:


One more thought: what are you using for the User-Agent? Some sites
block empty or known automated user-agents.

Regards,
Ammar

Hey guys… Thanks for your replies.

I thought the program I built was not that heavy for the website I’m
trying to get info from.
The thing is, I’m accessing the website to get information, but I only
access specific pages in that domain, so I’m not really crawling
everything.
I build the URL from some info I have in my DB, and once I have the
URL, I access it directly and collect the information on that specific
page. What’s more, the pages I’m accessing have just a few HTML tags,
so there’s not much info to look through. And I’m only accessing the
web pages I didn’t access before (just the new ones).

So, of course I understand that they need to protect the web server,
but I don’t think my program is really a threat :D

I’m gonna run the script in steps and on different days, like I thought
before and like you told me.

Thanks a lot for your help.

Luis

Hi Ammar

Yeah, that’s one of the reasons I asked this question, because I thought
we could solve this issue just by changing the user agent or the headers
or something… like they show here:
http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/

Anyway, right now I’m using an empty user agent, but I did try to define
a user agent, and the result was the same. I tried something like:

Nokogiri::HTML(open(url, "User-Agent" => "Ruby/#{RUBY_VERSION}"))

I also tried the user agents we can use in Mechanize ('Linux Mozilla',
for example), but nothing worked.
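For completeness, here is one hedged sketch of sending a fuller, browser-style header set. Whether this helps depends entirely on what the site filters on; BROWSER_UA and build_request below are my own example names, and the same header hash can be passed straight to open-uri as in the snippets above:

```ruby
require 'net/http'
require 'uri'

# An example browser-style User-Agent string (sites differ in what, if
# anything, they filter on -- this value is only an illustration).
BROWSER_UA = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' \
             '(KHTML, like Gecko) Chrome/90.0 Safari/537.36'

# Build a GET request carrying the headers a browser would send.
def build_request(url)
  uri = URI.parse(url)
  req = Net::HTTP::Get.new(uri.request_uri)
  req['User-Agent']      = BROWSER_UA
  req['Accept']          = 'text/html,application/xhtml+xml'
  req['Accept-Language'] = 'en-US,en;q=0.9'
  req
end

# The same headers work with open-uri, as in the thread:
#   Nokogiri::HTML(open(url, 'User-Agent' => BROWSER_UA))
```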

Luis

On Tue, Nov 23, 2010 at 2:46 PM, Luis G. [email protected] wrote:

Anyway, right now I’m using an empty user agent, but I did try to define
a user agent, and the result was the same.

It was just a guess. But, as you mentioned, your IP is being blocked,
so it’s too late to change agents now. You may have been blocked for
any reason, really: frequency of requests, user-agent, or something
else entirely. Usually such blocks are temporary (it could be a dynamic
IP), so you could try again later. But who knows how long it will take,
or whether you will be blocked again.
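If the block really is temporary, retrying with exponential backoff is a gentler pattern than retrying in a tight loop. A sketch (the with_backoff helper is my own; a real crawler would rescue OpenURI::HTTPError specifically rather than StandardError):

```ruby
# Retry a block with exponential backoff: wait longer after each failure
# instead of hammering a server that is already refusing us. A real
# crawler would rescue OpenURI::HTTPError specifically, not StandardError.
def with_backoff(max_tries = 5, base_delay = 2)
  tries = 0
  begin
    yield
  rescue StandardError
    tries += 1
    raise if tries >= max_tries
    sleep(base_delay * (2 ** (tries - 1)))
    retry
  end
end

# Usage:
#   doc = with_backoff { Nokogiri::HTML(open(url)) }
```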

Andrea’s suggestion is probably your best bet: contact the owners of
the site and request access. You might also find out why you got
blocked and avoid it in the future.

Regards,
Ammar

Actually, I was blocked before, and it lasted around 24 hours or so.
But the thing is, I’m running the crawlers on a test server, not on the
production one. And they’re not on the same network, so the IPs are
different :)

Thanks guys.

Luis