Proxied requests hang when DNS response has wrong ident

We have recently discovered a very rare occurrence where requests through
nginx will hang if the resolver sends a response with a mismatching
ident. We are seeing this in production with 1.7.1 and I have been able
to reproduce it with 1.7.3. The relevant parts of the config are:

resolver 10.65.255.4;

location / {
    proxy_pass http://$host.internal$request_uri;
}

So we basically proxy *.atlassian.net to *.atlassian.net.internal. The
resolver is a pdns recursor running on the same machine.

The error we see in the logs is:

2014/06/19 20:22:29 [error] 28235#0: wrong ident 57716 response for
customer.atlassian.net.internal, expect 39916
2014/06/19 20:22:29 [error] 28235#0: unexpected response for
customer.atlassian.net.internal
2014/06/19 20:22:59 [error] 28235#0: *23776286
customer.atlassian.net.internal could not be resolved (110: Operation
timed out), client: 83.244.247.165, server: *.atlassian.net,
request: "GET /plugins/ HTTP/1.1", host: "customer.atlassian.net",
referrer: "https://customer.atlassian.net/secure/Dashboard.jspa"

I have been able to reproduce this error in a test environment - this is
what I used:

  • a basic Python script pretending to be a recursive resolver, which can
    mangle the ident of a response. The resolver directive of nginx is
    pointed at this recursor. I added a delay of 100ms before sending a
    reply (based on the "Mini Fake DNS server" recipe on ActiveState Code).
  • a proxy configuration the same as above - only the resolver and
    location/proxy_pass lines were added to a default nginx config
  • a static webserver as the backend
  • GNU parallel + curl to issue concurrent requests
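For reference, the mangling resolver boils down to something like the sketch below. This is illustrative, not my exact script: the port, the fixed delay, and the mangle_every knob are made up for this example.

```python
import socket
import struct
import time

def build_response(query, mangle_ident=False):
    """Build a minimal A-record answer for a raw DNS query.

    If mangle_ident is True, flip the transaction ID so the client
    sees a response whose ident does not match its request.
    """
    ident = struct.unpack('!H', query[:2])[0]
    if mangle_ident:
        ident ^= 0xFFFF          # deliberately wrong transaction ID
    # Header: ident, flags (standard response, RA set),
    # QDCOUNT=1, ANCOUNT=1, NSCOUNT=0, ARCOUNT=0
    header = struct.pack('!HHHHHH', ident, 0x8180, 1, 1, 0, 0)
    question = query[12:]        # copy the question section verbatim
    # Answer: name pointer to offset 12, type A, class IN,
    # TTL 60, RDLENGTH 4, address 127.0.0.1
    answer = struct.pack('!HHHLH4B', 0xC00C, 1, 1, 60, 4, 127, 0, 0, 1)
    return header + question + answer

def serve(port=5353, delay=0.1, mangle_every=5):
    """Answer DNS queries over UDP, mangling the ident of every
    mangle_every-th response after a short artificial delay."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('127.0.0.1', port))
    n = 0
    while True:
        query, addr = sock.recvfrom(512)
        n += 1
        time.sleep(delay)        # simulate a slow upstream resolver
        sock.sendto(build_response(query, n % mangle_every == 0), addr)
```

I then point `resolver 127.0.0.1:5353;` at this, serve a static file from the backend, and watch the DNS traffic with tcpdump on the loopback while the concurrent requests run.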

When the ident is correct, the system behaves as expected. However, if
an ident is incorrect, AND nginx gets multiple concurrent (5) requests
for that same backend, we see all the requests hanging. Doing a tcpdump
of DNS traffic shows the first request go out and the response coming
back with the wrong ident, but no subsequent DNS requests. The critical
factor seems to be multiple incoming requests to nginx while a DNS
request is in flight.
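The concurrent-request side needs nothing fancy; GNU parallel + curl is what I used, but the same burst can be fired from Python (a sketch - the URL is whatever vhost you are testing):

```python
import concurrent.futures
import urllib.request

def hit(url, timeout=65):
    """Fetch url once; return the HTTP status, or the error if it fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except Exception as exc:
        return repr(exc)

def burst(url, n=5):
    """Fire n requests concurrently, so they all pile up behind the
    single in-flight DNS query inside nginx."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(hit, [url] * n))
```

With the broken resolver behaviour, all n calls time out together; against a healthy setup they all return 200.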

If needed I can provide all the scripts and config I used to produce the
error.

Thanks!

Pramod K.

On Tue, Jul 15, 2014 at 08:04:44PM +1000, Pramod K. wrote:

We have recently discovered a very rare occurrence where requests through
nginx will hang if the resolver sends a response with a mismatching ident.
We are seeing this in production with 1.7.1 and I have been able to
reproduce it with 1.7.3. The relevant parts of the config are:

resolver 10.65.255.4;

location / {
    proxy_pass http://$host.internal$request_uri;
}


# HG changeset patch
# User Ruslan E. <[email protected]>
# Date 1405424486 -14400
#      Tue Jul 15 15:41:26 2014 +0400
# Node ID 8a16ec3871efad5990604a21c6bc00c0c9347446
# Parent  abd460ece11e9c85d4c0c4a8e6ac46cfb5fa62b5
Resolver: fixed resend on malformed responses.

DNS request resend on malformed responses was broken in 98876ce2a7fd.

Reported by Pramod K…

diff --git a/src/core/ngx_resolver.c b/src/core/ngx_resolver.c
--- a/src/core/ngx_resolver.c
+++ b/src/core/ngx_resolver.c
@@ -1467,7 +1467,6 @@ ngx_resolver_process_a(ngx_resolver_t *r
             goto failed;
         }
 
-        rn->naddrs6 = 0;
         qident = (rn->query6[0] << 8) + rn->query6[1];
 
         break;
@@ -1482,7 +1481,6 @@ ngx_resolver_process_a(ngx_resolver_t *r
             goto failed;
         }
 
-        rn->naddrs = 0;
         qident = (rn->query[0] << 8) + rn->query[1];
     }
@@ -1507,6 +1505,8 @@ ngx_resolver_process_a(ngx_resolver_t *r
 
     case NGX_RESOLVE_AAAA:
 
+        rn->naddrs6 = 0;
+
         if (rn->naddrs == (u_short) -1) {
             goto next;
         }
@@ -1519,6 +1519,8 @@ ngx_resolver_process_a(ngx_resolver_t *r
 
     default: /* NGX_RESOLVE_A */
 
+        rn->naddrs = 0;
+
         if (rn->naddrs6 == (u_short) -1) {
             goto next;
         }
@@ -1539,6 +1541,8 @@ ngx_resolver_process_a(ngx_resolver_t *r
 
     case NGX_RESOLVE_AAAA:
 
+        rn->naddrs6 = 0;
+
         if (rn->naddrs == (u_short) -1) {
             rn->code = (u_char) code;
             goto next;
@@ -1548,6 +1552,8 @@ ngx_resolver_process_a(ngx_resolver_t *r
 
     default: /* NGX_RESOLVE_A */
 
+        rn->naddrs = 0;
+
         if (rn->naddrs6 == (u_short) -1) {
             rn->code = (u_char) code;
             goto next;
@@ -1817,6 +1823,25 @@ ngx_resolver_process_a(ngx_resolver_t *r
         }
     }
 
+    switch (qtype) {
+
+#if (NGX_HAVE_INET6)
+
+    case NGX_RESOLVE_AAAA:
+
+        if (rn->naddrs6 == (u_short) -1) {
+            rn->naddrs6 = 0;
+        }
+
+        break;
+
+#endif
+
+    default: /* NGX_RESOLVE_A */
+
+        if (rn->naddrs == (u_short) -1) {
+            rn->naddrs = 0;
+        }
+    }
+
     if (rn->naddrs != (u_short) -1
 #if (NGX_HAVE_INET6)
         && rn->naddrs6 != (u_short) -1
Hi,

On 16 Jul 2014, at 03.01, Pramod K. [email protected]
wrote:

I will get this build out to our production environment this week. Will report
back if there are any issues.

Thanks again!

Pramod


nginx mailing list
[email protected]
nginx Info Page

I’m experiencing a similar issue.

I get a couple of ident mismatch errors in the error log, and then
gradually over time I begin to see thousands of ESTABLISHED connections
stuck with no activity.
Eventually worker_connections is exhausted. I believe it could be
related to this.

I’m using package nginx-1.6.1-1.el6.ngx.x86_64, which is the latest
stable. Has this fix been ported to 1.6.x stable yet?
Is there a workaround?

Thanks,

Jason

On 15 July 2014 21:41, Ruslan E. [email protected] wrote:

diff --git a/src/core/ngx_resolver.c b/src/core/ngx_resolver.c
--- a/src/core/ngx_resolver.c
+++ b/src/core/ngx_resolver.c

Thanks for the quick response and patch, Ruslan. I have tested a build
incorporating this patch, and it behaves as expected, the resolver
retrying rather than blocking behind the first request.
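For anyone following along, my (possibly simplified) reading of the fix: rn->naddrs == (u_short) -1 is the resolver's "no valid answer yet" marker, and before the patch it was cleared before the ident check, so a response rejected for a wrong ident still left the query looking answered, and the resend logic never retried it. A toy model of that reading in Python - not nginx's actual control flow, and the names are mine:

```python
NO_ANSWER = -1  # models (u_short) -1, "query still pending"

class ResolverNode:
    def __init__(self, qident):
        self.qident = qident     # ident we sent in the query
        self.naddrs = NO_ANSWER  # sentinel until a valid answer arrives

def process_response_broken(node, ident, addrs):
    node.naddrs = 0              # pre-patch order: reset BEFORE validating
    if ident != node.qident:
        return 'rejected'        # wrong ident: response dropped...
    node.naddrs = len(addrs)     # ...but the sentinel is already gone
    return 'accepted'

def process_response_fixed(node, ident, addrs):
    if ident != node.qident:
        return 'rejected'        # reject first, leave the sentinel intact
    node.naddrs = len(addrs)     # touch state only for a valid response
    return 'accepted'

def should_resend(node):
    # the resend pass only retries queries that still look pending
    return node.naddrs == NO_ANSWER
```

In the broken version, a query that received only a wrong-ident response is never resent, so every request waiting on that name hangs until the resolver timeout, matching what I saw.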

I will get this build out to our production environment this week. Will
report back if there are any issues.

Thanks again!

Pramod

On Monday 18 August 2014 11:40:07 Jason W. wrote:
[…]

the latest stable. Has this fix been ported to 1.6.x stable
yet?

No, it hasn’t. You should use the mainline version of nginx.

wbr, Valentin V. Bartenev

On Mon, Aug 18, 2014 at 2:45 PM, Jason W. [email protected]
wrote:

There’s the fear there are significant changes from 1.6 to 1.7 that may
introduce other problems, and would need to go through some extensive
testing before we can commit. Especially as the 1.6 is labelled “stable”
and the 1.7 “mainline” (and not stable) - maybe these terms aren’t meant to
convey the meaning they appear to though. I’ll start discussion about
testing 1.7 though.

That is a shared fear, I guess, since questioning about ‘stable’ comes
up regularly. nginx guys regularly say you should not fear using the 1.7
branch in production; however, these recent days have seen a lot of bugs
popping up, some of which one could even consider critical (impacting
the availability of the Web server). Definitely not what you want in
production.

It seems the philosophy behind stable is ‘no risk, few changes’, thus
experimental stuff will only be backported after public release/testing
on the 1.7 branch.
Moreover, according to an answer given 2 weeks ago
(http://forum.nginx.org/read.php?2,252312,252329#msg-252329),
SPDY-related functionality is not due in the 1.6 branch. Maybe 1.8, once
1.7 is considered stable enough?

nginx problems do not come from the basic, well-tested/heavily-used core
functionality (the parts you use to fulfil a basic Web server job). They
come from new abilities brought in, which might compromise the whole
thing…

B. R.

On 18 Aug 2014, at 12.05, Valentin V. Bartenev [email protected] wrote:

I’m using package nginx-1.6.1-1.el6.ngx.x86_64, which is
the latest stable. Has this fix been ported to 1.6.x stable
yet?

No, it hasn’t. You should use the mainline version of nginx.

wbr, Valentin V. Bartenev

There’s the fear there are significant changes from 1.6 to 1.7 that may
introduce other problems, and would need to go through some extensive
testing before we can commit. Especially as the 1.6 is labelled “stable”
and the 1.7 “mainline” (and not stable) - maybe these terms aren’t meant
to convey the meaning they appear to though. I’ll start discussion about
testing 1.7 though.

Any ETA on when this might be back-ported, if at all?
I guess a second question is: when will 1.7 become stable?

Sorry if you’re the wrong person to ask! And thanks for being clear.

Jason

If it’s only a few lines, you may consider applying the patch manually.


On Monday 18 August 2014 13:45:46 Jason W. wrote:
[…]

There’s the fear there are significant changes from 1.6 to 1.7 that may
introduce other problems, and would need to go through some extensive
testing before we can commit. Especially as the 1.6 is labelled “stable” and
the 1.7 “mainline” (and not stable) - maybe these terms aren’t meant to
convey the meaning they appear to though. I’ll start discussion about
testing 1.7 though.
[…]

It’s a common misunderstanding about branches.
Check this out: Announcing NGINX 1.6 and 1.7 as Stable & Mainline Versions

In most cases, the only care you need to take with the mainline branch
is to read the changelog before updating.

Also, you can consider buying an nginx plus license to get official
support and thus feel more confident.

wbr, Valentin V. Bartenev

On 18 Aug 2014, at 14.22, Valentin V. Bartenev [email protected] wrote:

It’s a common misunderstanding about branches.
Check this out: Announcing NGINX 1.6 and 1.7 as Stable & Mainline Versions

In most cases, the only care you need to take with the mainline branch
is to read the changelog before updating.

Also, you can consider buying an nginx plus license to get official
support and thus feel more confident.

wbr, Valentin V. Bartenev

Thanks, that’s a perfect explanation.

The fear remains, though, that to fix a single small issue of possibly a
few changed lines, I would be (in essence) changing thousands upon
thousands of lines, adding new features, updating features, and creating
a much larger surface area for potential new bugs. Sticking with the
stable feature branch gives us what that branch is intended to do:
minimise change and the surface area for new issues.

Reading the changelog, I agree, is the best approach, but if a new
feature is added and shared code was modified to support it, that might
not always be included in the changelog. And even if the shared code is
mentioned, unless I’m an Nginx developer I likely won’t know what other
parts it affects. The moment something critical is changed and fully
described in the changelog (because mainline has updates to existing
features), we hit the blocker where we need to weigh up risk again:
upgrade, with the potential for problems but the benefit of bug fixes,
or stick with the current version and take the risk of not having bug
fixes. I believe this is the reason the stable branch exists, and I’m
grateful for it.

I guess I could brew my own version with the patch. Unfortunately, I
don’t have the resources to add package management to the list to ensure
we keep up to date with bug fixes once we leave the nginx-provided
repositories (which are great, btw - thanks!). It’s something I will
have to weigh up though; thanks for the suggestion.

Thanks for all the input. I guess the only question now, outside of
philosophical discussions of risk, is whether the Nginx team treats this
issue as a “major bug fix”. Hopefully they do and we’ll get the fix soon
in the stable branch. If not, I’ll keep testing 1.7.x with a view to
moving to it soon. And we’ll just flow into 1.8, which will be the next
stable feature branch if the product release schedule remains the same :)

Thank you again, and if the Nginx devs/contributors are reading this,
keep up the good work!

Jason