Any problems with large number (160) of upstream servers (mongrels)?

Hi,

I was trying to proxy to a large number of upstream servers (mongrels)
today, and while proxying to 80 mongrels worked great, 160 mongrels
seemed to be a different matter. When running with 160, nginx suddenly
became a strange bottleneck in the system: it wasn't using CPU, but for
some reason connections were being accepted and distributed very
slowly. When I put it back down to 80, the problem went away. All the
upstream servers are in a single upstream directive block.
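
For concreteness, the block looks roughly like this (the addresses and
ports here are made up, but the shape is the same):

    upstream mongrels {
        server 127.0.0.1:8000;
        server 127.0.0.1:8001;
        # ... and so on, one server line per mongrel, 160 entries in all
    }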

I know this is vague, but as the problem occurred in a production
environment I didn't have much time to diagnose it. I am going to see
about setting up a test server to see if I can figure anything out. If
you have any recommendations as to what to look at, it would be
greatly appreciated.

Otherwise, I am mostly just wondering if anyone has seen this before:
has anyone used more upstream servers than this without a problem? If
so, is there anything special you did in your config? My config is
mostly based on Ezra's Rails config… I have 4 worker processes and
1024 worker connections… not much else special. I am using 0.5.26,
but I don't see anything too serious in the changelog.
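
The relevant lines, for reference (everything else is close to Ezra's
defaults):

    worker_processes  4;

    events {
        worker_connections  1024;
    }

If I understand the docs right, each proxied request eats two
connections from a worker's pool (one to the client, one to the
mongrel), so this caps out at around 2,048 concurrent client requests
across the 4 workers.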

Thanks,

JD

On 11/3/07, Jonathan D. [email protected] wrote:

I was trying to proxy to a large number of upstream servers (mongrels)
today, and while proxying to 80 mongrels worked great, 160 mongrels
seemed to be a different matter.

Nginx does not reuse connections: for every request, it connects to
the upstream Mongrel, sends the request, reads the response, and then
disconnects. This probably does not scale very well to lots of
upstreams; a lot of time and system resources will be spent just on
TCP and socket bookkeeping.
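
Concretely, with a plain proxy_pass setup like the one under
discussion, the per-request cycle looks like this (a sketch; the
location and upstream names are hypothetical):

    location / {
        # for each request, the worker does a fresh connect() to one
        # of the mongrels, writes the request, reads the response, and
        # then close()s the socket; nothing stays open between requests
        proxy_pass http://mongrels;
    }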

What was the load like when you experienced the problem?

Alexander.

What was the load like when you experienced the problem?

The system load was nominal. The connection load on nginx should have
been roughly constant throughout the period, so it should have been
doing about the same amount of work, just spread across a larger list
of upstream servers.

It could have been some other factor, and at this point I am inclined
to think it was. It doesn’t make a lot of sense that a larger upstream
list would cause a problem.

Perhaps try adding a few more worker processes to start. Awaiting
Igor's response on this one :)
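
Something along these lines as a first experiment, say (the value is
just a guess):

    # hypothetical first step: double the workers so the per-request
    # connect/close work is spread across more processes
    worker_processes  8;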

~Wayne