Weird issue between backend and upstream (0 bytes)

Hi list!

I have a weird issue I’ve been banging my head against without being
able to understand what’s going on.

Setup is as follows (a rough sketch of the frontend config is below):

  • 1 nginx 0.7.67 doing load balancing to two backends.
  • backend 1 with nginx 1.2.2 and php-fpm 5.3
  • backend 2 with nginx 0.7.67 and php-fpm 5.3
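
The frontend proxies roughly along these lines (the upstream name,
hostnames and ports here are placeholders, not the real config):

    upstream site_backends {
        # backend 1: nginx 1.2.2 + php-fpm 5.3
        server backend1.example.com:8059;
        # backend 2: nginx 0.7.67 + php-fpm 5.3
        server backend2.example.com:8059;
    }

    server {
        listen 80;
        server_name site.example.com;

        location / {
            proxy_pass http://site_backends;
            proxy_set_header Host $host;
        }
    }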

Some, and only some, requests are logged on the upstream with status 200
and 0 bytes returned.
The same request in the backend log shows a 200 status and ALWAYS the same
number of bytes, which differs between backends.
When this happens on a request that went to backend 1: the upstream logs
200 with 0 bytes, the backend logs 200 with 15776 bytes.
When this happens on a request that went to backend 2: the upstream logs
200 with 0 bytes, the backend logs 200 with 15670 bytes.

I’ve tried without luck to reproduce the problem myself, so I decided to
enable debug logging for all requests to this site to try to understand
why nginx is returning empty responses.
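
In case it matters, debug logging was turned on roughly like this (it
assumes an nginx binary built with --with-debug; the path is a
placeholder):

    error_log  /var/log/nginx/lb-debug.log  debug;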

This is what I see on the upstream when the error happens:
(…)
2012/08/03 13:25:32 [debug] 1546#0: *823 http proxy header: “Accept: */*”
2012/08/03 13:25:32 [debug] 1546#0: *823 http proxy header: “User-Agent:
AdsBot-Google (+http://www.google.com/adsbot.html)”
2012/08/03 13:25:32 [debug] 1546#0: *823 http proxy header:
“Accept-Encoding: gzip,deflate”
2012/08/03 13:25:32 [debug] 1546#0: *823 http proxy header:
2012/08/03 13:25:32 [debug] 1546#0: *823 http cleanup add:
0000000001A3DF10
2012/08/03 13:25:32 [debug] 1546#0: *823 get rr peer, try: 2
2012/08/03 13:25:32 [debug] 1546#0: *823 get rr peer, current: 0 8
2012/08/03 13:25:32 [debug] 1546#0: *823 socket 149
2012/08/03 13:25:32 [debug] 1546#0: *823 epoll add connection: fd:149
ev:80000005
2012/08/03 13:25:32 [debug] 1546#0: *823 connect to 176.31.64.205:8059,
fd:149 #824
2012/08/03 13:25:32 [debug] 1546#0: *823 http upstream connect: -2
2012/08/03 13:25:32 [debug] 1546#0: *823 event timer add: 149:
30000:1344011162971
2012/08/03 13:25:32 [debug] 1546#0: *823 http run request:
“/s/miracle+noodle?”
2012/08/03 13:25:32 [debug] 1546#0: *823 http upstream check client,
write
event:1, “/s/miracle+noodle”
2012/08/03 13:25:32 [debug] 1546#0: *823 http upstream recv(): -1 (11:
Resource temporarily unavailable)
2012/08/03 13:25:32 [debug] 1546#0: *823 http upstream request:
“/s/miracle+noodle?”
2012/08/03 13:25:32 [debug] 1546#0: *823 http upstream send request
handler
2012/08/03 13:25:32 [debug] 1546#0: *823 http upstream send request
2012/08/03 13:25:32 [debug] 1546#0: *823 chain writer buf fl:1 s:358
2012/08/03 13:25:32 [debug] 1546#0: *823 chain writer in:
0000000001A3DF48
2012/08/03 13:25:32 [debug] 1546#0: *823 writev: 358
2012/08/03 13:25:32 [debug] 1546#0: *823 chain writer out:
0000000000000000
2012/08/03 13:25:32 [debug] 1546#0: *823 event timer del: 149:
1344011162971
2012/08/03 13:25:32 [debug] 1546#0: *823 event timer add: 149:
120000:1344011252972
2012/08/03 13:25:33 [debug] 1546#0: *823 http run request:
“/s/miracle+noodle?”
2012/08/03 13:25:33 [debug] 1546#0: *823 http upstream check client,
write
event:0, “/s/miracle+noodle”
2012/08/03 13:25:33 [debug] 1546#0: *823 http upstream recv(): 0 (11:
Resource temporarily unavailable)
2012/08/03 13:25:33 [debug] 1546#0: *823 http run request:
“/s/miracle+noodle?”
2012/08/03 13:25:33 [debug] 1546#0: *823 http upstream check client,
write
event:1, “/s/miracle+noodle”
2012/08/03 13:25:45 [debug] 1546#0: *823 http upstream request:
“/s/miracle+noodle?”
2012/08/03 13:25:45 [debug] 1546#0: *823 http upstream process header
2012/08/03 13:25:45 [debug] 1546#0: *823 malloc: 0000000001B8DC30:16384
2012/08/03 13:25:45 [debug] 1546#0: *823 recv: fd:149 4344 of 16296
2012/08/03 13:25:45 [debug] 1546#0: *823 http proxy status 200 “200 OK”
2012/08/03 13:25:45 [debug] 1546#0: *823 http proxy header: “Server:
nginx”
2012/08/03 13:25:45 [debug] 1546#0: *823 http proxy header: “Date: Fri,
03
Aug 2012 16:24:26 GMT”
2012/08/03 13:25:45 [debug] 1546#0: *823 http proxy header:
“Content-Type:
text/html”
2012/08/03 13:25:45 [debug] 1546#0: *823 http proxy header: “Connection:
close”
2012/08/03 13:25:45 [debug] 1546#0: *823 http proxy header:
“X-Powered-By:
PHP/5.3.15-1~dotdeb.0”
2012/08/03 13:25:45 [debug] 1546#0: *823 http proxy header done
2012/08/03 13:25:45 [debug] 1546#0: *823 HTTP/1.1 200 OKM
2012/08/03 13:25:45 [debug] 1546#0: *823 finalize http upstream request:
-1
2012/08/03 13:25:45 [debug] 1546#0: *823 finalize http proxy request
2012/08/03 13:25:45 [debug] 1546#0: *823 free rr peer 2 0
2012/08/03 13:25:45 [debug] 1546#0: *823 close http upstream connection:
149
2012/08/03 13:25:45 [debug] 1546#0: *823 event timer del: 149:
1344011252972
2012/08/03 13:25:45 [debug] 1546#0: *823 http upstream cache fd: 0
2012/08/03 13:25:45 [debug] 1546#0: *823 http file cache free
2012/08/03 13:25:45 [debug] 1546#0: *823 http finalize request: -1,
“/s/miracle+noodle?” 1
2012/08/03 13:25:45 [debug] 1546#0: *823 http close request
2012/08/03 13:25:45 [debug] 1546#0: *823 http log handler
2012/08/03 13:25:45 [debug] 1546#0: *823 run cleanup: 0000000001A3DC40
2012/08/03 13:25:45 [debug] 1546#0: *823 run cleanup: 0000000001850E58
2012/08/03 13:25:45 [debug] 1546#0: *823 free: 0000000001B8DC30
2012/08/03 13:25:45 [debug] 1546#0: *823 free: 00000000018501C0, unused:
1
2012/08/03 13:25:45 [debug] 1546#0: *823 free: 0000000001A3D890, unused:
757
2012/08/03 13:25:45 [debug] 1546#0: *823 close http connection: 148
2012/08/03 13:25:45 [debug] 1546#0: *823 free: 00000000019B7C20
2012/08/03 13:25:45 [debug] 1546#0: *823 free: 0000000001B2BEF0
2012/08/03 13:25:45 [debug] 1546#0: *823 free: 0000000001B3DC00, unused:
8
2012/08/03 13:25:45 [debug] 1546#0: *823 free: 00000000019F55B0, unused:
112

The above debug log on the upstream seems to say that the backend
closed the connection right after the headers, am I right?

However, the backend debug log of another affected request, which as
always is logged as having returned that same number of bytes (which is
pretty weird), shows a different story…
After many lines like…
(…)
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe recv chain: -2
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe buf busy s:1 t:1 f:0
000000000109F1E0, pos 00000000010A2FD0, size: 49752 file: 0, size: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe buf busy s:1 t:1 f:0
000000000109F1E0, pos 00000000010AF230, size: 65456 file: 0, size: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe buf free s:0 t:1 f:0
00000000010BF1F0, pos 00000000010BF1F0, size: 80 file: 0, size: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe length: -1
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe write downstream: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe read upstream: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe buf busy s:1 t:1 f:0
000000000109F1E0, pos 00000000010A2FD0, size: 49752 file: 0, size: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe buf busy s:1 t:1 f:0
000000000109F1E0, pos 00000000010AF230, size: 65456 file: 0, size: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe buf free s:0 t:1 f:0
00000000010BF1F0, pos 00000000010BF1F0, size: 80 file: 0, size: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe length: -1
(…)

Suddenly this comes:
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe recv chain: -2
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe buf busy s:1 t:1 f:0
000000000109F1E0, pos 00000000010A2FD0, size: 49752 file: 0, size: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe buf busy s:1 t:1 f:0
000000000109F1E0, pos 00000000010AF230, size: 65456 file: 0, size: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe buf free s:0 t:1 f:0
00000000010BF1F0, pos 00000000010BF1F0, size: 49232 file: 0, size: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe length: -1
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe write downstream: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe read upstream: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe buf busy s:1 t:1 f:0
000000000109F1E0, pos 00000000010A2FD0, size: 49752 file: 0, size: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe buf busy s:1 t:1 f:0
000000000109F1E0, pos 00000000010AF230, size: 65456 file: 0, size: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe buf free s:0 t:1 f:0
00000000010BF1F0, pos 00000000010BF1F0, size: 49232 file: 0, size: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 pipe length: -1
2012/08/03 13:56:17 [debug] 1519#0: *221 event timer: 13, old:
1344013157360, new: 1344013157362
2012/08/03 13:56:17 [debug] 1519#0: *221 event timer: 12, old:
1344013037361, new: 1344013037362
2012/08/03 13:56:17 [debug] 1519#0: *221 post event 000000000106B360
2012/08/03 13:56:17 [debug] 1519#0: *221 delete posted event
000000000106B360
2012/08/03 13:56:17 [debug] 1519#0: *221 http run request:
“/s/new+balance+1500?”
2012/08/03 13:56:17 [debug] 1519#0: *221 http upstream check client,
write
event:0, “/s/new+balance+1500”
2012/08/03 13:56:17 [debug] 1519#0: *221 http upstream recv(): 0 (11:
Resource temporarily unavailable)
2012/08/03 13:56:17 [debug] 1519#0: *221 finalize http upstream request:
499
2012/08/03 13:56:17 [debug] 1519#0: *221 finalize http fastcgi request
2012/08/03 13:56:17 [debug] 1519#0: *221 free rr peer 1 0
2012/08/03 13:56:17 [debug] 1519#0: *221 close http upstream connection:
13
2012/08/03 13:56:17 [debug] 1519#0: *221 free: 0000000000FF7CD0, unused:
48
2012/08/03 13:56:17 [debug] 1519#0: *221 event timer del: 13:
1344013157360
2012/08/03 13:56:17 [debug] 1519#0: *221 reusable connection: 0
2012/08/03 13:56:17 [debug] 1519#0: *221 http upstream temp fd: -1
2012/08/03 13:56:17 [debug] 1519#0: *221 http output filter
“/s/new+balance+1500?”
2012/08/03 13:56:17 [debug] 1519#0: *221 http copy filter:
“/s/new+balance+1500?”
2012/08/03 13:56:17 [debug] 1519#0: *221 image filter
2012/08/03 13:56:17 [debug] 1519#0: *221 xslt filter body
2012/08/03 13:56:17 [debug] 1519#0: *221 http postpone filter
“/s/new+balance+1500?” 0000000000FEDA18
2012/08/03 13:56:17 [debug] 1519#0: *221 http copy filter: -1
“/s/new+balance+1500?”
2012/08/03 13:56:17 [debug] 1519#0: *221 http finalize request: -1,
“/s/new+balance+1500?” a:1, c:1

So this means the backend saw the upstream close the connection before
it was able to return all the data:
2012/08/03 13:56:17 [debug] 1519#0: *221 finalize http upstream request:
499

If that’s the case, why does it write this to the logfile:
[03/Aug/2012:13:56:17 -0300] “GET /s/new+balance+1500 HTTP/1.0” 200 15776 “-” “AdsBot-Google (+http://www.google.com/adsbot.html)” - Cache: - 200 24.535
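
For what it’s worth, to see on the load balancer side what actually came
back from the backend versus the 0 bytes sent to the client, I’m thinking
of extending the frontend’s access log with the standard upstream
variables, roughly like this (the format name and log path are made up):

    log_format lb_debug  '$remote_addr [$time_local] "$request" '
                         '$status $body_bytes_sent '
                         'upstream: $upstream_addr '
                         'status: $upstream_status '
                         'time: $upstream_response_time';

    access_log  /var/log/nginx/site-lb.access.log  lb_debug;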

I think the problem may be in the upstream, as this weird behavior
happens with both backends and they use different nginx versions, but
I’m really out of answers right now on this issue.

Any idea? Hint? Today, out of 8K requests, 2779 returned 0 bytes on
the upstream and 15776 bytes on the backend…

Thank you!!

Guzmán

Has anyone seen/had any similar issue? We are kind of stuck with this
issue :S

Regards

M

Hi…

I just made a dump from the backend’s perspective and analyzed it with
Wireshark to see what was in those bytes; I was pretty sure those bytes
would contain some error from the app with the key to fixing it… but no,
those 15776 bytes are always 15776, but they are the first 15776 bytes of
each response, so every request is different.

According to tcpdump everything flows OK, and then, after an ordinary ACK
from the upstream while the backend is sending data, the upstream
immediately sends the FIN flag and immediately after that a group of RSTs
arrives.

So clearly there’s no backend issue here; whatever happens, it happens
on the upstream, and now it looks more like an OS problem than an nginx
fault.

Does anyone have any idea why the OS may be closing & resetting nginx’s
connections?

OS: Debian Lenny, virtualized inside XenServer (Citrix free version)
Virtual cores: 2 (load: 0.10)
RAM: 2 GB (currently: 44 MB free / 1.3 GB OS cache, 310 MB buffers)

It currently shows no load, and the problem occurs on 20%-30% of the
traffic to both backends, but only for one specific URL. This main
nginx is doing load balancing for more domains, some of them with a lot
of traffic, and there’s not a single status 200 / 0 bytes request in
the logs of the other domains.

Some network stats:
netstat -anp | awk '$1~/^tcp/ { print $6 }' | sort -n | uniq -c
2 CLOSING
164 ESTABLISHED
21 FIN_WAIT1
150 FIN_WAIT2
1 LAST_ACK
9 LISTEN
4 SYN_RECV
1 SYN_SENT
293 TIME_WAIT

No custom configuration inside /etc/sysctl.conf…

I’ll gladly submit more data from the faulty nginx server if anyone
needs it, just ask.

Thanks for any hint, I’m really out of ideas

Hello!

On Thu, Aug 09, 2012 at 10:24:51AM -0300, Guzmán Brasó wrote:

> I just made a dump from the backend’s perspective and analyzed it with
> Wireshark to see what was in those bytes; I was pretty sure those bytes
> would contain some error from the app with the key to fixing it… but no,
> those 15776 bytes are always 15776, but they are the first 15776 bytes
> of each response, so every request is different.
>
> According to tcpdump everything flows OK, and then, after an ordinary
> ACK from the upstream while the backend is sending data, the upstream
> immediately sends the FIN flag and immediately after that a group of
> RSTs arrives.

This might be a result of the first several packets (e.g. the initial
buffer size) being sent before the app crashes somewhere at the end of
response generation. Have you tried looking at what happens from
the app’s point of view?

[…]

Maxim D.

Hi Maxim! Thanks for taking the time to check it out…

So for the 499 seen by the php-fpm nginx here: it’s not that the main
nginx closed the connection, but that fastcgi closed the connection?

All this time I thought it had nothing to do with the backend… there’s
no PHP warning or error on the php-fpm side when this happens; I will
try to enable debug mode in php-fpm and swim around the logs…

Thanks!

2012/08/03 13:56:17 [debug] 1519#0: *221 finalize http upstream request:
499

Hello!

On Thu, Aug 09, 2012 at 12:10:57PM -0300, Guzmán Brasó wrote:

> Hi Maxim! Thanks for taking the time to check it out…
>
> So for the 499 seen by the php-fpm nginx here: it’s not that the main
> nginx closed the connection, but that fastcgi closed the connection?
>
> All this time I thought it had nothing to do with the backend… there’s
> no PHP warning or error on the php-fpm side when this happens; I will
> try to enable debug mode in php-fpm and swim around the logs…

Ah, sorry, it looks like I’ve misunderstood what you were trying
to say, partially because of the strange usage of the word “upstream”:
from the frontend’s point of view it’s more or less a synonym for
“backend”; you probably meant to say “frontend” instead.

Looking through the debug log you’ve provided earlier suggests that
everything is actually OK. Here is a quote from the frontend logs:

2012/08/03 13:25:45 [debug] 1546#0: *823 http proxy header done
2012/08/03 13:25:45 [debug] 1546#0: *823 HTTP/1.1 200 OKM
2012/08/03 13:25:45 [debug] 1546#0: *823 finalize http upstream request: -1
2012/08/03 13:25:45 [debug] 1546#0: *823 finalize http proxy request

The interesting part is somewhere between the “HTTP…” and “finalize
…” lines, but it’s at a low level and was missed by the grep. Most
likely (assuming the request finalization happened for a reason) the
client closed the connection and this resulted in a writev() failure.
This in turn caused finalization of the request, and a close of the
connection to the upstream server (aka backend). There are 0 bytes
sent because nothing was actually sent. The fact that you see many
such log lines suggests that you’ve probably disabled client abort
checks (proxy_ignore_client_abort).
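
Just to illustrate, that check is controlled by a directive of this
form in the proxied location (it defaults to off; the upstream name
below is only an example):

    location / {
        proxy_pass http://backends;
        # with "on", nginx stops checking whether the client has closed
        # the connection and keeps the request to the backend running
        proxy_ignore_client_abort on;
    }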

On the backend you see this as “200” with some small number of bytes,
which is the first bytes it was able to send before the connection was
closed by the frontend. The 499 finalization is internal, and since
the response status sent to the client was already 200, it doesn’t
affect the access log.

Maxim D.

By the way, it’s getting worse with time…

Total PHP requests with status 200 since 6:00 AM today: 3708.

Of those, 2292 returned status 200 and received an RST from the nginx
load balancer, so more than half of them are getting caught by this
issue…

I will try to get the original logs to see if they shed some light…

Thank you!

Guzman

Hi Maxim! Once again, thank you…

Exactly what I thought… but something doesn’t make sense: the owner of
the site put some paid traffic on it and now the numbers are bigger
than before; it cannot be that hundreds of people abort the connection
exactly and precisely at the same byte. It’s always the same byte, and
it always seems random, though I’ve never been able to reproduce it
myself; I just sniffed other people’s traffic to see it.

I just checked, and there’s no mention of proxy_ignore_client_abort on
either the frontend (nginx load balancer) or the backend (nginx proxying
to php-fpm). This is really weird…

I think I still have the original debug logs that grep came from; I
will check them as soon as I can ssh into the machine.

Thank you once again!!

Guzmán

Hello!

On Thu, Aug 09, 2012 at 04:07:28PM -0300, Guzmán Brasó wrote:

> Hi Maxim! Once again, thank you…
>
> Exactly what I thought… but something doesn’t make sense: the owner of
> the site put some paid traffic on it and now the numbers are bigger
> than before; it cannot be that hundreds of people abort the connection
> exactly and precisely at the same byte. It’s always the same byte, and
> it always seems random, though I’ve never been able to reproduce it
> myself; I just sniffed other people’s traffic to see it.

As the frontend’s nginx detects a connection abort only when it tries
to send the first bytes of a response, and from the logs you’ve
provided it looks like response generation takes more than 10 seconds,
it’s highly unlikely that the abort really happens just before the
first byte. Most likely it happens during the 10 seconds in question.

> I just checked, and there’s no mention of proxy_ignore_client_abort on
> either the frontend (nginx load balancer) or the backend (nginx
> proxying to php-fpm). This is really weird…

Not really. There are a number of cases which might prevent nginx
from detecting a connection abort by a client, even if it’s not
specifically configured to ignore connection aborts. Well-known cases
include a connection close with a pending pipelined request in the
connection, or a clean close of an SSL connection (with a shutdown
alert sent by the client). In such cases nginx will be able to detect
the connection close only if the event method can provide some hint
about the pending close (kqueue only as of now).

Maxim D.


Guzmán Brasó Núñez
Senior Perl Developer / Sysadmin
Web: http://guzman.braso.info
Mobile: +598 98 674020