Forum: NGINX proxied requests hang when DNS response has wrong ident

D571e444a3b76876a89b6f1f1aac38c6?d=identicon&s=25 Pramod Korathota (Guest)
on 2014-07-15 12:05
(Received via mailing list)
We have recently discovered a very rare occurence when requests through
nginx will hang if the resolver sends a response with a mismatching
ident.
 We are seeing this in production with 1.7.1 and I have been able to
re-produce with 1.7.3. The relevant parts of the config are:

resolver 10.65.255.4;

location / {
        proxy_pass      http://$host.internal$request_uri;
}

So we basically proxy <customer>.atlassian.net to
<customer>.atlassian.net.internal. The resolver is a pdns recursor
running
on the same machine.

The error we see in the logs is:

2014/06/19 20:22:29 [error] 28235#0: wrong ident 57716 response for
customer.atlassian.net.internal, expect 39916
2014/06/19 20:22:29 [error] 28235#0: unexpected response for
customer.atlassian.net.internal
2014/06/19 20:22:59 [error] 28235#0: *23776286
customer.atlassian.net.internal could not be resolved (110: Operation
timed
out), client: 83.244.247.165, server: *.atlassian.net, request: "GET
/plugins/ HTTP/1.1", host: "customer.atlassian.net", referrer: "
https://customer.atlassian.net/secure/Dashboard.jspa"

I have been able to re-produce this error in a test environment - this
is
what I used:

- a basic python script pretending to be a recursive resolver, which can
mangle the ident of a response. The resolver directive of nginx is
pointed
to this recursor. I added in a delay of 100ms before sending a reply
(based
on http://code.activestate.com/recipes/491264-mini-fa...).
- A proxy configuration same as above - only the resolver and
location/proxy_pass line was added to a default nginx config
- Static webserver as the backend
- GNU parallel + curl to issue concurrent requests

When the ident is correct, the system behaves as expected. However, if
an
ident is incorrect, AND nginx gets multiple concurrent (5) requests for
that same backend, we see all the requests hanging. Doing a tcpdump for
DNS
traffic shows the first request go out, and the response coming back
with
the wrong ident, but no subsequent  dns requests. The critical factor
seems
to be multiple incoming requests to nginx, while a dns request is
in-flight.

If needed I can provide all the scripts and config I used to produce the
error.

Thanks!

Pramod Korathota
2154be8b4430488454f0a067e09863b9?d=identicon&s=25 Ruslan Ermilov (Guest)
on 2014-07-15 13:42
(Received via mailing list)
On Tue, Jul 15, 2014 at 08:04:44PM +1000, Pramod Korathota wrote:
> We have recently discovered a very rare occurence when requests through
> nginx will hang if the resolver sends a response with a mismatching ident.
>  We are seeing this in production with 1.7.1 and I have been able to
> re-produce with 1.7.3. The relevant parts of the config are:
>
> resolver 10.65.255.4;
>
> location / {
>         proxy_pass      http://$host.internal$request_uri;
> }

 .   - :

# HG changeset patch
# User Ruslan Ermilov <ru@nginx.com>
# Date 1405424486 -14400
#      Tue Jul 15 15:41:26 2014 +0400
# Node ID 8a16ec3871efad5990604a21c6bc00c0c9347446
# Parent  abd460ece11e9c85d4c0c4a8e6ac46cfb5fa62b5
Resolver: fixed resend on malformed responses.

DNS request resend on malformed responses was broken in 98876ce2a7fd.

Reported by Pramod Korathota.

diff --git a/src/core/ngx_resolver.c b/src/core/ngx_resolver.c
--- a/src/core/ngx_resolver.c
+++ b/src/core/ngx_resolver.c
@@ -1467,7 +1467,6 @@ ngx_resolver_process_a(ngx_resolver_t *r
             goto failed;
         }

-        rn->naddrs6 = 0;
         qident = (rn->query6[0] << 8) + rn->query6[1];

         break;
@@ -1482,7 +1481,6 @@ ngx_resolver_process_a(ngx_resolver_t *r
             goto failed;
         }

-        rn->naddrs = 0;
         qident = (rn->query[0] << 8) + rn->query[1];
     }

@@ -1507,6 +1505,8 @@ ngx_resolver_process_a(ngx_resolver_t *r

         case NGX_RESOLVE_AAAA:

+            rn->naddrs6 = 0;
+
             if (rn->naddrs == (u_short) -1) {
                 goto next;
             }
@@ -1519,6 +1519,8 @@ ngx_resolver_process_a(ngx_resolver_t *r

         default: /* NGX_RESOLVE_A */

+            rn->naddrs = 0;
+
             if (rn->naddrs6 == (u_short) -1) {
                 goto next;
             }
@@ -1539,6 +1541,8 @@ ngx_resolver_process_a(ngx_resolver_t *r

         case NGX_RESOLVE_AAAA:

+            rn->naddrs6 = 0;
+
             if (rn->naddrs == (u_short) -1) {
                 rn->code = (u_char) code;
                 goto next;
@@ -1548,6 +1552,8 @@ ngx_resolver_process_a(ngx_resolver_t *r

         default: /* NGX_RESOLVE_A */

+            rn->naddrs = 0;
+
             if (rn->naddrs6 == (u_short) -1) {
                 rn->code = (u_char) code;
                 goto next;
@@ -1817,6 +1823,25 @@ ngx_resolver_process_a(ngx_resolver_t *r
         }
     }

+    switch (qtype) {
+
+#if (NGX_HAVE_INET6)
+    case NGX_RESOLVE_AAAA:
+
+        if (rn->naddrs6 == (u_short) -1) {
+            rn->naddrs6 = 0;
+        }
+
+        break;
+#endif
+
+    default: /* NGX_RESOLVE_A */
+
+        if (rn->naddrs == (u_short) -1) {
+            rn->naddrs = 0;
+        }
+    }
+
     if (rn->naddrs != (u_short) -1
 #if (NGX_HAVE_INET6)
         && rn->naddrs6 != (u_short) -1
D571e444a3b76876a89b6f1f1aac38c6?d=identicon&s=25 Pramod Korathota (Guest)
on 2014-07-16 04:01
(Received via mailing list)
On 15 July 2014 21:41, Ruslan Ermilov <ru@nginx.com> wrote:



> diff --git a/src/core/ngx_resolver.c b/src/core/ngx_resolver.c
> --- a/src/core/ngx_resolver.c
> +++ b/src/core/ngx_resolver.c
>

<snip>

Thanks for the quick response and patch, Ruslan. I have tested a build
incorporating this patch, and it behaves as expected, the resolver
retrying
rather than blocking behind the first request.

I will get this build out to our  production environment this week. Will
report back if there are any issues.

Thanks again!

Pramod,
1b8822738bf9c82ce71ca9e836191ea8?d=identicon&s=25 Jason Woods (Guest)
on 2014-08-18 12:40
(Received via mailing list)
Hi,

On 16 Jul 2014, at 03.01, Pramod Korathota <pkorathota@atlassian.com>
wrote:

>
> I will get this build out to our  production environment this week. Will report
back if there are any issues.
>
> Thanks again!
>
> Pramod,
> _______________________________________________
> nginx mailing list
> nginx@nginx.org
> http://mailman.nginx.org/mailman/listinfo/nginx

I'm experiencing a similar issue.

I get a couple of ident mismatch errors in the error log, and then
gradually over time I begin to see thousands of ESTABLISHED connections
stuck with no activity.
Eventually worker_connections is exhausted. I believe it could be
related to this.

I'm using package nginx-1.6.1-1.el6.ngx.x86_64, which is the latest
stable. Has this fix been ported to 1.6.x stable yet?
Is there a workaround?

Thanks,

Jason
40b4c848b8fcd63b0cb60b9d170c3a77?d=identicon&s=25 Valentin V. Bartenev (Guest)
on 2014-08-18 13:06
(Received via mailing list)
On Monday 18 August 2014 11:40:07 Jason Woods wrote:
[..]
> the latest stable. Has this fix been ported to 1.6.x stable
> yet?

No, it hasn't.  You should use mainline version of nginx.

  wbr, Valentin V. Bartenev
1b8822738bf9c82ce71ca9e836191ea8?d=identicon&s=25 Jason Woods (Guest)
on 2014-08-18 14:46
(Received via mailing list)
On 18 Aug 2014, at 12.05, Valentin V. Bartenev <vbart@nginx.com> wrote:

>>
>> I'm using package nginx-1.6.1-1.el6.ngx.x86_64, which is
>> the latest stable. Has this fix been ported to 1.6.x stable
>> yet?
>
> No, it hasn't.  You should use mainline version of nginx.
>
> wbr, Valentin V. Bartenev

There's the fear there are significant changes from 1.6 to 1.7 that may
introduce other problems, and would need to go through some extensive
testing before we can commit. Especially as the 1.6 is labelled "stable"
and the 1.7 "mainline" (and not stable) - maybe these terms aren't meant
to convey the meaning they appear to though. I'll start discussion about
testing 1.7 though.

Any ETA on when this might be back-ported, if at all?
I guess a second question is when will 1.7 become the stable?

Sorry if you're the wrong person to ask! And thanks for being clear.

Jason
1266aa99d1601b47bbd3ec22affbb81c?d=identicon&s=25 B.R. (Guest)
on 2014-08-18 15:17
(Received via mailing list)
On Mon, Aug 18, 2014 at 2:45 PM, Jason Woods <devel@jasonwoods.me.uk>
wrote:

>
> There's the fear there are significant changes from 1.6 to 1.7 that may
> introduce other problems, and would need to go through some extensive
> testing before we can commit. Especially as the 1.6 is labelled "stable"
> and the 1.7 "mainline" (and not stable) - maybe these terms aren't meant to
> convey the meaning they appear to though. I'll start discussion about
> testing 1.7 though.
>

‚ÄčThat is a shared fear, I guess, since questioning about 'stable' comes
regularly on topic.
nginx guys regularly say you should not fear using 1.7 branch in
production, however these recent days have seen a lot of bugs popping
up,
some of them which one could even consider critical (impacting
availability
of the Web server). Definitely not what you want in production.

It seems the philosophy behind stable is 'no risk, few changes', thus
experimental stuff will only be backported after public release/testing
on
1.7 branch.
Moreover, according to an answer brought 2 weeks ago
<http://forum.nginx.org/read.php?2,252312,252329#ms...,
SPDY-related
functionalities are not due in 1.6 branch. Maybe 1.8, once 1.7 will be
considered stable enough?

nginx problems do not come from basic well-tested/heavily-used core
functionalities (the ones you use to fulfil a basic Web server job).
They
come from new abilities brought in, which might compromise the whole
thing...
---
*B. R.*
‚Äč
2974d09ac2541e892966b762aad84943?d=identicon&s=25 itpp2012 (Guest)
on 2014-08-18 15:20
(Received via mailing list)
If its only a few lines you may consider adding the patch manually.

Posted at Nginx Forum:
http://forum.nginx.org/read.php?2,251779,252648#msg-252648
40b4c848b8fcd63b0cb60b9d170c3a77?d=identicon&s=25 Valentin V. Bartenev (Guest)
on 2014-08-18 15:23
(Received via mailing list)
On Monday 18 August 2014 13:45:46 Jason Woods wrote:
[..]
> There's the fear there are significant changes from 1.6 to 1.7 that may
> introduce other problems, and would need to go through some extensive
> testing before we can commit. Especially as the 1.6 is labelled "stable" and
> the 1.7 "mainline" (and not stable) - maybe these terms aren't meant to
> convey the meaning they appear to though. I'll start discussion about
> testing 1.7 though.
[..]

It's a common misunderstanding about branches.
Check this out: http://nginx.com/blog/nginx-1-6-1-7-released/

In most cases the only care you need with the mainline branch is to read
the
changelog before update.

Also you can consider to buy nginx plus license to get the official
support
and thus feel yourself more confident.

  wbr, Valentin V. Bartenev
1b8822738bf9c82ce71ca9e836191ea8?d=identicon&s=25 Jason Woods (Guest)
on 2014-08-18 16:54
(Received via mailing list)
On 18 Aug 2014, at 14.22, Valentin V. Bartenev <vbart@nginx.com> wrote:

> It's a common misunderstanding about branches.
> Check this out: http://nginx.com/blog/nginx-1-6-1-7-released/
>
> In most cases the only care you need with the mainline branch is to read the
> changelog before update.
>
> Also you can consider to buy nginx plus license to get the official support
> and thus feel yourself more confident.
>
>  wbr, Valentin V. Bartenev

Thanks, that's a perfect explanation.

The fear remains though that to fix a single small issue that is
possibly a few lines changed, I would be (in essence) changing thousands
upon thousands of lines, adding new features, updating features, and
creating a much larger surface area for potential new bugs. Where
sticking with the stable feature branch gives us what the stable feature
branch is intended to do - minimise change and surface area for new
issues.

Reading the changelog I agree is the best approach, but if a new feature
is added and it modified shared-code to support it, this might not
always included in a change log. Plus some shared code, even if
mentioned, unless I'm an Nginx developer likely won't know what other
parts it affects. And the moment something critical is changed and fully
described in the changelog (because mainline has updates to existing
features) then we hit the blocker where we need to weigh up risk again -
upgrade with potential for problem but benefit from bug fixes, or stick
to current version and take the risk of not having bug fixes. I believe
this is the reason the stable branch exists, and I'm grateful for it.

I guess I could brew my own version with the patch. Unfortunately, I
don't have resource to add package management to the list to ensure we
keep up to date with bug fixes since we'll be leaving the nginx provided
repositories (great btw! thanks). It's something I will have to weigh up
though, thanks for the suggestion.

Thanks for all the input. I guess the only question now, outside of
philosophical discussions of risk, is whether Nginx team treat this
issue as a "major bug fix". Hopefully they do and we'll get the fix soon
in the stable branch. If not, I'll keep testing 1.7.x with the view to
move to it soon. And we'll just flow into 1.8 which will be the next
stable feature branch if the product release schedule remains the same
:-)

Thank you again, and if the Nginx devs/contributors are reading this,
keep up the good work!

Jason
Please log in before posting. Registration is free and takes only a minute.
Existing account

NEW: Do you have a Google/GoogleMail, Yahoo or Facebook account? No registration required!
Log in with Google account | Log in with Yahoo account | Log in with Facebook account
No account? Register here.