Monthly Gateway Timeout

Hi All

I’m experiencing a very strange network problem that occurs every four
to six weeks and lasts for approximately one hour. I cannot provoke the
problem manually and, during that hour, I cannot resolve it even by
rebooting the server. Even stranger, two physically separated servers
suffer from the same problem at the same time. Both servers use nginx
as an SSL reverse proxy, and each server handles a disjoint set of
domains.

During that hour, we just see “nginx - Gateway Timeout”. After one hour,
it suddenly works again. It started around four months ago. Note that
all other network traffic is unaffected; only nginx HTTPS and HTTP.

So, what the two servers have in common:

  1. Hardware (UltraSPARC T2 Plus).
  2. OS (Solaris 10 U9 latest patch level).
  3. Time (both servers use the exact same NTP-controlled time).
  4. Switch.
  5. Firewall (I replaced the firewall four weeks ago and the error
    still appeared).
  6. nginx 0.8.46-0.8.54, same configuration but for different domains
    hosted internally on different servers, compiled against the Solaris
    OpenSSL.

I observed that, during this mysterious hour, nginx mistakenly proxies
the requests back to the original client IP on random ports instead of
to the upstream IP, and that these requests are blocked by the firewall.

Because two different machines are affected at the same time, it cannot
be resolved by restarting nginx or rebooting the whole server, and it
resolves itself after approximately one hour, my guess is that some
time-dependent error occurs in nginx.

I will replace nginx with Apache to verify that the problem actually is
nginx and not the OS, the switch, or whatever, and then wait and hope :-)

Does anyone have an idea how to locate or investigate this problem?

Kind regards,
Marc

Posted at Nginx Forum:

Hello!

On Fri, Mar 04, 2011 at 04:12:59AM -0500, Marc Kramis wrote:

> suddenly works again. It started around four months ago. Note that
> all other network traffic is unaffected; only nginx HTTPS and HTTP.
>
> So, what the two servers have in common:
>
>   1. Hardware (UltraSPARC T2 Plus).

Some endianness issue?..

> proxy IP and that these requests are blocked by the firewall.
>
> Because two different machines are affected at the same time, it
> cannot be resolved by restarting nginx or rebooting the whole server,
> and it resolves itself after approximately one hour, my guess is that
> some time-dependent error occurs in nginx.
>
> I will replace nginx with Apache to verify that the problem actually
> is nginx and not the OS, the switch, or whatever, and then wait and
> hope :-)
>
> Does anyone have an idea how to locate or investigate this problem?

You may want to follow the debugging guide on the nginx wiki and
provide your config and a debug log (if you are able to obtain one), as
well as nginx -V output. This may help to investigate the problem if
the issue is actually in nginx.
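For what it’s worth, obtaining a debug log requires an nginx binary
built with --with-debug; a minimal sketch of the relevant directives
(the file path and client address are placeholders):

```nginx
# requires nginx built with: ./configure ... --with-debug
error_log /nginx/logs/debug.log debug;

events {
    # optional: restrict debug-level logging to a single test client
    debug_connection 192.168.1.1;
}
```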

Maxim D.

Hi Marc,
can you reproduce this issue by changing the system clock back to the
“unfortunate hour”?

Also, is the wrong proxy address always the same, or does nginx try to
connect to a random address and/or port?

Best regards,
Piotr S. < [email protected] >

Hi Maxim, Hi Piotr

nginx -V:

nginx version: nginx/0.8.54
built by Sun C 5.10 SunOS_sparc Patch 141861-06 2010/07/28
TLS SNI support disabled
configure arguments: --with-cc=/opt/sunstudio12.1/bin/cc
--with-cpp=/opt/sunstudio12.1/bin/cc --with-cc-opt='-xtarget=ultraT2plus
-xO5 -I /usr/sfw/include' --with-ld-opt='-R/usr/sfw/lib -L/usr/sfw/lib'
--prefix=/nginx --user=daemon --group=daemon --with-http_ssl_module
--with-pcre=../pcre-8.12 --with-zlib=../zlib-1.2.5

Note that the bug also appeared with optimization level O3.

nginx.conf:

# --- Basic Configuration ---


user daemon daemon;
error_log /nginx/logs/error.log warn;
ssl_engine pkcs11;
worker_processes 16;

events {
worker_connections 256;
}

# --- HTTP Configuration ---


http {

log_format LOG '$remote_addr - $remote_user
[$time_local] "$request" $status $body_bytes_sent "$http_referer"
"$http_user_agent"';
access_log /nginx/logs/$host.access.log LOG;

server_tokens off;

gzip on;
gzip_vary on;
gzip_proxied any;
gzip_types text/plain text/xml text/css text/javascript
image/svg+xml application/xhtml+xml application/xml application/rss+xml
application/atom+xml application/x-javascript application/json;

client_body_buffer_size 128k;
client_max_body_size 256m;
client_body_temp_path /nginx/client_body_temp 1 2;

proxy_read_timeout 3600;
proxy_redirect off;
proxy_pass_header Set-Cookie;
proxy_temp_path /nginx/proxy_temp;

# https://foo -------------------------------------------

server {

listen                  446;
server_name             foo;

ssl                     on;
ssl_certificate         /nginx/ssl/foo.crt;
ssl_certificate_key     /nginx/ssl/foo.key;
ssl_session_cache       shared:SSL:8m;

location /bar {
  rewrite               ^/(.*)$ https://foo/bar/ permanent;
}

location /bar/ {
  proxy_pass            http://10.10.10.1:8080/bar/;
}

location / {
  rewrite               ^/(.*)$ https://foo permanent;
}

}

server {

listen                  80 default;
server_name             _;
server_name_in_redirect off;

location / {
  rewrite               ^/(.*)$ http://foo permanent;
}

}

}

The error log is full of the following error (only during the
problematic hour):

2011/03/04 08:40:28 [error] 20062#0: *507995 upstream timed out (145:
Connection timed out) while reading response header from upstream,
client: IP, server: SERVER, request: "GET URL
HTTP/1.1", upstream: "UPSTREAM", host: "HOST", referrer:
"***REFERER"

I just realized that, only during this hour, the firewall lists blocked
outgoing traffic to exactly the client IPs from the error log, on
random ports; I therefore assume that during this hour nginx mistakenly
sends the proxied request back to the client instead of to the internal
server.

Regards,
Marc


Hello!

On Sat, Mar 05, 2011 at 06:18:00AM -0500, Marc Kramis wrote:

[…]

> log_format LOG '$remote_addr - $remote_user
> [$time_local] "$request" $status $body_bytes_sent "$http_referer"
> "$http_user_agent"';
> access_log /nginx/logs/$host.access.log LOG;

Just a side note: this, along with no root specified in server{}
blocks, makes your system vulnerable to inode exhaustion attack
(i.e. attacker may create arbitrary number of log files on your
system, eventually bringing it down).
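One way to follow this advice while keeping per-host information is to
move $host into the log format and use a static file name; a sketch
based on the configuration above:

```nginx
log_format LOG '$host $remote_addr - $remote_user [$time_local] '
               '"$request" $status $body_bytes_sent '
               '"$http_referer" "$http_user_agent"';
# static path: clients can no longer cause arbitrary log files to be created
access_log /nginx/logs/access.log LOG;
```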

[…]

> proxy_read_timeout 3600;

Uhm… It’s a really big one. It’s unlikely that clients will wait
so long for an answer.

[…]

> location /bar/ {
>   proxy_pass            http://10.10.10.1:8080/bar/;

With this configuration nginx converts the ip/port to binary form once
(while reading the config) and uses that binary form when doing
connect()s. It is very unlikely that something bad happens in nginx at
this point.

[…]

> The error log is full of the following error (only during the
> problematic hour):
>
> 2011/03/04 08:40:28 [error] 20062#0: *507995 upstream timed out (145:
> Connection timed out) while reading response header from upstream,
> client: IP, server: SERVER, request: "GET URL
> HTTP/1.1", upstream: "UPSTREAM", host: "HOST", referrer:
> "***REFERER"

Words “while reading response header” in error_log suggest that
connection to upstream was established (and request was sent), but
upstream failed to generate an answer in time.

> I just realized that, only during this hour, the firewall lists
> blocked outgoing traffic to exactly the client IPs from the error
> log, on random ports; I therefore assume that during this hour nginx
> mistakenly sends the proxied request back to the client instead of to
> the internal server.

With your proxy_read_timeout it’s very likely that the corresponding
firewall states for the client connections had already expired, and
that’s why your firewall complains about nginx trying to send “504
Gateway Timeout” back to clients once it detects the timeout.

From here it looks like it’s a problem with your backend, not nginx,
and the strange firewall complaints you see are due to timeout
misconfiguration (you have to configure the timeouts on your firewall
to be at least as big as the ones in nginx).
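The exact directive depends on the firewall product, which is not named
in the thread; purely as an illustration, on a pf-based firewall the
idle timeout for established TCP states could be raised above nginx’s
3600 s like this:

```pf
# pf.conf: keep established-connection state at least as long as
# nginx's proxy_read_timeout (3600 s), with some headroom
set timeout tcp.established 7200
```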

Maxim D.

Hi Igor

It’s a 32-bit executable.

Regards,
Marc


On 04.03.2011, at 12:12, Marc Kramis wrote:

> UltraSPARC T2 Plus

How does nginx run: as 32-bit or 64-bit executable?


Igor S.
http://sysoev.ru/en/

Hi Maxim

The timeout is required because some (old) webapps take quite some time
to generate reports and the like. I will try to remove this global
setting and apply it only to the old webapps.
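Such a per-application override could look like the sketch below; the
/reports/ location and its upstream are hypothetical, the point is only
that proxy_read_timeout can be set globally and overridden per
location:

```nginx
http {
    # moderate default for interactive requests
    proxy_read_timeout 60;

    server {
        # hypothetical slow legacy webapp: allow long report generation
        location /reports/ {
            proxy_pass         http://10.10.10.1:8080/reports/;
            proxy_read_timeout 3600;
        }
    }
}
```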

Thanks for the hint about the logs. I will centralize that in a single
file.

I’m now replacing one nginx with Apache and will then wait a few weeks.
If both go down at the same time again, the problem is definitively not
in the reverse proxy.

Best regards,
Marc
