Nginx + Rails (Thin Server) - Double Bad Gateway Problem

Hi,

I'm facing a strange problem using Nginx as a reverse proxy to Ruby
Thin servers.
When someone calls an unroutable path in our Rails app, Nginx
correctly returns a 502 - Bad Gateway.
When Nginx receives the next request (just after the 502), one that
is valid and routable, it immediately returns another 502 - Bad
Gateway without passing the request to the Thin server (we can verify
this in the nginx logs and the app logs).
When I call the same valid URL again, Nginx passes the call to the
Thin server and returns to normal operation.

Nginx seems to be caching something: when this happens, it only gets
back on track on the third call.
I've searched the Internet for this issue but found nothing
conclusive…
As a workaround, I've changed the configuration to disable
proxy_next_upstream.
This resolved my problem, but I've lost the next_upstream feature
that calls the other servers when one of them fails…
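
(A possible middle ground, as an untested sketch: proxy_next_upstream
takes a list of conditions, so instead of switching it off entirely
the failover could be kept for timeouts only:)

location / {
    # Retry the next server only on a timeout; a backend that simply
    # closes the connection would no longer trigger failover through
    # the whole pool. Untested sketch, not our running configuration.
    proxy_next_upstream timeout;
    proxy_pass http://reg-prov;
}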

Here is our nginx configuration:

upstream reg-prov {
    server 127.0.0.1:5010;
    server 127.0.0.1:5011;
    server 127.0.0.1:5012;
    server 127.0.0.1:5013;
}

server {
    listen 443;
    ssl on;
    ssl_certificate /etc/nginx/conf.d/server.crt;
    ssl_certificate_key /etc/nginx/conf.d/server.key;

    server_name reg-prov.registro.systemintegration.locaweb.com.br;
    access_log /var/log/nginx/reg-prov.access.log;
    root /var/www/reg-prov/public;

    location / {
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-Proto https;
        proxy_redirect off;
        proxy_next_upstream off;

        if (-f $request_filename/index.html) {
            rewrite (.*) $1/index.html break;
        }
        if (-f $request_filename.html) {
            rewrite (.*) $1.html break;
        }
        if (!-f $request_filename) {
            proxy_pass http://reg-prov;
            break;
        }
    }
}
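
(As an aside, the three if blocks can usually be replaced with
try_files, available since nginx 0.7.27; a sketch assuming the same
document layout, with the proxy_* settings moved into a named
location:)

location / {
    # Serve the file itself, a directory index, or the .html variant;
    # otherwise hand the request over to the upstream pool.
    try_files $uri $uri/index.html $uri.html @app;
}

location @app {
    proxy_next_upstream off;
    proxy_pass http://reg-prov;
}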

Nginx version: 0.7.67-3+squeeze2

Here are the log files showing the behavior I described.

With the next upstream DISABLED (the workaround):


* Nginx Access Log *

First Call (invalid)
187.45.255.164 - - [28/Mar/2012:10:14:11 -0300] “GET
/domain_availability/disponivel.mx?ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas
HTTP/1.1” 502 173 “-” “Ruby”

Second Call (valid)
187.45.255.164 - - [28/Mar/2012:10:14:21 -0300] “GET
/domain_availability/disponivel.com?ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas
HTTP/1.1” 200 125 “-” “Ruby”


* App Log *

First Call (invalid)
[2012-03-28 10:14:11] INFO – : Started GET
“/domain_availability/disponivel.mx?ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas”
for 187.45.255.164 at 2012-03-28 10:14:11 -0300

Second Call (valid)
[2012-03-28 10:14:21] INFO – : Started GET
“/domain_availability/disponivel.com?ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas”
for 187.45.255.164 at 2012-03-28 10:14:21 -0300
[2012-03-28 10:14:21] INFO – : Processing by
DomainAvailabilityController#verisign as XML
[2012-03-28 10:14:21] INFO – : Parameters:
{“ticket”=>“ST-28035-WPQ4MirFJIKxwagfIIUe-cas”, “id”=>“disponivel.com”}
[2012-03-28 10:14:21] INFO – : Guessed service url:
“nmst.api.systemintegration.locaweb.com.br”
[2012-03-28 10:14:21] INFO – : Validate CAS URI =
https://sso.systemintegration.locaweb.com.br/cas-sys/serviceValidate?service=nmst.api.systemintegration.locaweb.com.br&ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas
[2012-03-28 10:14:21] INFO – : Ticket
“ST-28035-WPQ4MirFJIKxwagfIIUe-cas” for service
“nmst.api.systemintegration.locaweb.com.br” belonging to user “registro”
is VALID.
[2012-03-28 10:14:21] INFO – : [127.0.0.1
DomainAvailabilityController#verisign disponivel.com registro]
User-Agent: Ruby
[2012-03-28 10:14:21] INFO – : [127.0.0.1
DomainAvailabilityController#verisign disponivel.com registro] Accept:
*/*; q=0.5, application/xml
[2012-03-28 10:14:21] INFO – : [127.0.0.1
DomainAvailabilityController#verisign disponivel.com registro] Referer:
[2012-03-28 10:14:21] INFO – : Rendered
domain_availability/verisign.verisign_domain_availability.builder
(5.3ms)
[2012-03-28 10:14:21] INFO – : [127.0.0.1
DomainAvailabilityController#verisign disponivel.com registro] Response:

<?xml version="1.0" encoding="UTF-8"?>

<domain_availability>
true
</domain_availability>
[2012-03-28 10:14:21] INFO – : Completed 200 OK in 224ms (Views: 15.4ms
| ActiveRecord: 0.0ms) - Path:
/domain_availability/disponivel.com?ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas

Now with the next upstream ENABLED:


* Nginx Access Log *

First Call (invalid)
187.45.255.164 - - [28/Mar/2012:10:21:51 -0300] “GET
/domain_availability/disponivel.mx?ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas
HTTP/1.1” 502 173 “-” “Ruby”

Second Call (valid)
187.45.255.164 - - [28/Mar/2012:10:21:55 -0300] “GET
/domain_availability/disponivel.com?ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas
HTTP/1.1” 502 173 “-” “Ruby”
Nginx does not call the app; it returns the 502 directly. Here it
seems to have cached something…

Third Call (valid)
187.45.255.164 - - [28/Mar/2012:10:22:00 -0300] “GET
/domain_availability/disponivel.com?ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas
HTTP/1.1” 200 125 “-” “Ruby”
Success…


* App Log *

First Call (invalid)
[2012-03-28 10:21:51] INFO – : Started GET
“/domain_availability/disponivel.mx?ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas”
for 187.45.255.164 at 2012-03-28 10:21:51 -0300
[2012-03-28 10:21:51] INFO – : Started GET
“/domain_availability/disponivel.mx?ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas”
for 187.45.255.164 at 2012-03-28 10:21:51 -0300
[2012-03-28 10:21:51] INFO – : Started GET
“/domain_availability/disponivel.mx?ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas”
for 187.45.255.164 at 2012-03-28 10:21:51 -0300
[2012-03-28 10:21:51] INFO – : Started GET
“/domain_availability/disponivel.mx?ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas”
for 187.45.255.164 at 2012-03-28 10:21:51 -0300
Here we have 4 log entries because next_upstream is enabled: nginx
keeps trying all the servers, and only after the last one fails does
it return the error.

Second Call (valid)
Nothing appears in the app log; the app receives nothing from nginx…

Third Call (valid)
[2012-03-28 10:22:00] INFO – : Started GET
“/domain_availability/disponivel.com?ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas”
for 187.45.255.164 at 2012-03-28 10:22:00 -0300
[2012-03-28 10:22:00] INFO – : Processing by
DomainAvailabilityController#verisign as XML
[2012-03-28 10:22:00] INFO – : Parameters:
{“ticket”=>“ST-28035-WPQ4MirFJIKxwagfIIUe-cas”, “id”=>“disponivel.com”}
[2012-03-28 10:22:00] INFO – : Guessed service url:
“nmst.api.systemintegration.locaweb.com.br”
[2012-03-28 10:22:00] INFO – : Validate CAS URI =
https://sso.systemintegration.locaweb.com.br/cas-sys/serviceValidate?service=nmst.api.systemintegration.locaweb.com.br&ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas
[2012-03-28 10:22:00] INFO – : Ticket
“ST-28035-WPQ4MirFJIKxwagfIIUe-cas” for service
“nmst.api.systemintegration.locaweb.com.br” belonging to user “registro”
is VALID.
[2012-03-28 10:22:00] INFO – : [127.0.0.1
DomainAvailabilityController#verisign disponivel.com registro]
User-Agent: Ruby
[2012-03-28 10:22:00] INFO – : [127.0.0.1
DomainAvailabilityController#verisign disponivel.com registro] Accept:
*/*; q=0.5, application/xml
[2012-03-28 10:22:00] INFO – : [127.0.0.1
DomainAvailabilityController#verisign disponivel.com registro] Referer:
[2012-03-28 10:22:00] INFO – : Rendered
domain_availability/verisign.verisign_domain_availability.builder
(6.1ms)
[2012-03-28 10:22:00] INFO – : [127.0.0.1
DomainAvailabilityController#verisign disponivel.com registro] Response:

<?xml version="1.0" encoding="UTF-8"?>

<domain_availability>
true
</domain_availability>
[2012-03-28 10:22:00] INFO – : Completed 200 OK in 233ms (Views: 16.9ms
| ActiveRecord: 0.0ms) - Path:
/domain_availability/disponivel.com?ticket=ST-28035-WPQ4MirFJIKxwagfIIUe-cas

Hello!

On Wed, Mar 28, 2012 at 09:26:38AM -0400, djlebersilvestre wrote:

> When I call the same valid URL again, Nginx passes the call to the
> Thin server and returns to normal operation.
>
> Nginx seems to be caching something: when this happens, it only gets
> back on track on the third call.
> I've searched the Internet for this issue but found nothing
> conclusive…
> As a workaround, I've changed the configuration to disable
> proxy_next_upstream.
> This resolved my problem, but I've lost the next_upstream feature
> that calls the other servers when one of them fails…

Once an upstream server fails and triggers proxy_next_upstream, the
server in question is marked dead and won't get any requests until
fail_timeout passes; see

http://nginx.org/en/docs/http/ngx_http_upstream_module.html#server

It looks like in your case all servers fail to handle the request
properly and just close the connection, hence all of them are marked
down. On the next request nginx sees that all backends are down,
complains about "no live upstreams" in the error log, returns an
error, and marks all backends up again for quick recovery. This is
expected behaviour.
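
(A minimal sketch of the max_fails / fail_timeout parameters that
page describes; the values here are illustrative, not a
recommendation for this particular setup:)

upstream reg-prov {
    # After max_fails failed attempts within fail_timeout, a server
    # is considered unavailable for the next fail_timeout seconds
    # (the defaults are max_fails=1, fail_timeout=10s).
    server 127.0.0.1:5010 max_fails=3 fail_timeout=10s;
    server 127.0.0.1:5011 max_fails=3 fail_timeout=10s;
    server 127.0.0.1:5012 max_fails=3 fail_timeout=10s;
    server 127.0.0.1:5013 max_fails=3 fail_timeout=10s;
}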

The correct solution for your problem is to fix the upstream servers
to return a proper HTTP response in case of errors instead of just
closing the connection. Something like a 404 (as far as I can see,
it's appropriate in your case) would be a good choice.
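
(One way to do that in a Rails 3 app, as a sketch; the catch-all
route and the plain-text body are illustrative assumptions, not taken
from the app in question:)

# config/routes.rb -- hypothetical catch-all; keep it as the last
# route so it only matches paths no other route claimed, and answer
# with a real 404 instead of letting the app server drop the
# connection.
match '*unmatched', :to => proc { |env|
  [404, { 'Content-Type' => 'text/plain' }, ['Not Found']]
}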

Maxim D.

Thanks Maxim,

I've discovered that the configuration of the Rails systemintegration /
production environment was different from development:
when someone called a URL that wasn't routable, the server sent back
an empty response instead of a 404. (With show_exceptions disabled,
Rails re-raises the routing error instead of rendering an error page,
and Thin apparently just closed the connection, which is exactly the
upstream failure Maxim described.) When I switched the config, the
server started sending a 404 and Nginx worked correctly.

In config/environments/production.rb:

# Show exception templates (must be on to return a 404 when an
# invalid route is called)
config.action_dispatch.show_exceptions = true

Daniel Silvestre
