Nginx access logs, timestamps "jump back"

Hi there,

We’re running multiple nginx web servers on EC2 behind multiple ELBs
(load balancers). I’m seeing strange behaviour in our nginx access
logs, and it seems to coincide with brief outages spotted by our
external monitoring (Chartbeat & New Relic). I’m not sure whether I’m
on the right track investigating this strange logging … or if it’s
just a coincidence and this is normal logging for nginx.

It starts with typical nginx logging, timestamps increasing
chronologically. Then there’s a jump back in the timestamps while a
large block of ELB-only traffic is logged (health checks). After that,
logging returns to normal (increasing chronologically) and the
timestamps resume from where they left off before the ELB-only burst.

For example (details changed to protect the innocent):

1.2.3.4 - - [14/Mar/2011:06:39:08 +0000] "GET /stuff HTTP/1.1" 301 178 "-" "Jakarta Commons-HttpClient/3.1" http example.com /stuff
10.162.18.95 - - [14/Mar/2011:06:23:17 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
10.162.126.84 - - [14/Mar/2011:06:23:19 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
10.162.151.170 - - [14/Mar/2011:06:23:19 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
10.162.19.235 - - [14/Mar/2011:06:23:19 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
. . .
LOTS (~ 100)
. . .
10.162.151.170 - - [14/Mar/2011:06:38:47 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
10.162.19.235 - - [14/Mar/2011:06:38:47 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
10.162.18.95 - - [14/Mar/2011:06:39:01 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
10.162.126.84 - - [14/Mar/2011:06:39:03 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
10.162.151.170 - - [14/Mar/2011:06:39:03 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
10.162.19.235 - - [14/Mar/2011:06:39:03 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
10.162.126.84 - - [14/Mar/2011:06:39:19 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
10.162.151.170 - - [14/Mar/2011:06:39:19 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
10.162.19.235 - - [14/Mar/2011:06:39:19 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
10.162.151.170 - - [14/Mar/2011:06:39:35 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
10.162.151.170 - - [14/Mar/2011:06:39:51 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
2.3.4.5 - - [14/Mar/2011:06:39:58 +0000] "GET /stuff HTTP/1.1" 200 113 "http://example.com/stuff" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16" http example.com /stuff
3.4.5.6 - - [14/Mar/2011:06:39:59 +0000] "GET /stuff HTTP/1.1" 200 113 "http://example.com/stuff" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16" http example.com /stuff

Notice the jump back in time, then ELB-only traffic is logged, then
logging returns to normal once the ELB timestamps have caught up with
user traffic.

Should I expect all nginx log messages to increase chronologically?
What would explain these large jumps back?

We redefined our log format to use a 32k buffer. Unfortunately, I can’t
risk changing/disabling this in production to see if the behaviour
changes. Could there be more than one 32k buffer flushing at different
times?
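For context, the kind of buffered logging I mean is configured roughly like this. This is only a sketch: the format name and the format string are illustrative, not our exact production config.

```nginx
http {
    # Buffered access log: each nginx worker process keeps its own
    # private in-memory buffer for this log and writes it to disk
    # only when that buffer fills up.
    log_format  main  '$remote_addr - $remote_user [$time_local] '
                      '"$request" $status $body_bytes_sent '
                      '"$http_referer" "$http_user_agent"';

    access_log  /var/log/nginx/access.log  main  buffer=32k;
}
```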

Thanks.

Cheers,
David.

Posted at Nginx Forum:
http://forum.nginx.org/read.php?2,182969,182969#msg-182969

Hello!

On Tue, Mar 15, 2011 at 12:41:32AM -0400, assistlydavid wrote:

> while a large block of ELB-only traffic is logged (health
> checks). After that, logging returns to normal (increasing
> chronologically) and the timestamps resume from where they left
> off before the ELB-only burst.
>
> For example (details changed to protect the innocent):
>
> 1.2.3.4 - - [14/Mar/2011:06:39:08 +0000] "GET /stuff HTTP/1.1" 301 178 "-" "Jakarta Commons-HttpClient/3.1" http example.com /stuff
> 10.162.18.95 - - [14/Mar/2011:06:23:17 +0000] "GET /host HTTP/1.1" 200 0 "-" "ELB-HealthChecker/1.0" - 10.162.217.229:80 /host
>
> […]
>
> flushing at different times?
Each worker process has its own log buffer and flushes it as soon as
it fills up. So the observed behaviour is normal with log buffering
in use and low traffic.
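A toy model (plain Python, not nginx internals) of why independent per-worker buffers can make timestamps jump back in the shared log: one worker handling frequent user requests flushes often, while another worker handling only rare health checks flushes much later, dumping its older entries after the newer ones.

```python
# Toy model: two "workers" append log lines to private in-memory
# buffers and flush a full buffer to the shared log file.
BUFFER_LIMIT = 3  # lines per flush, standing in for the 32k byte limit

log_file = []                 # the shared access log, in write order
buffers = {"A": [], "B": []}  # one private buffer per worker

def log(worker, timestamp, line):
    buffers[worker].append((timestamp, line))
    if len(buffers[worker]) >= BUFFER_LIMIT:  # flush only when full
        log_file.extend(buffers[worker])
        buffers[worker].clear()

# Worker A sees frequent user requests; worker B only rare health checks.
for t in range(6):
    log("A", t, "user request")    # A flushes at t=2 and again at t=5
log("B", 0, "health check")        # buffered since t=0 ...
log("B", 3, "health check")
log("B", 6, "health check")        # ... B finally flushes here

timestamps = [t for t, _ in log_file]
print(timestamps)  # A's entries first, then B's older ones: time "jumps back"
```

Running this prints `[0, 1, 2, 3, 4, 5, 0, 3, 6]`: the log is not globally chronological even though each worker logged in order.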

Maxim D.

Hi Maxim,

Thanks for the response.

How are the workers allocated work? I’m still curious as to why I’d see
a large block of ELB-only traffic flushed to the log at the same time,
rather than a mix. Any thoughts on that?

Presumably, the chance of multiple workers flushing the same type of log
messages (ELB-only) at the same time is very low, especially when we’re
seeing constant traffic from end-users more frequently than ELB health
checks. Am I perhaps seeing all ELB-only traffic flushed from the
buffer of a single worker? Are workers chosen based on server/location
directives?

ps. Each web server in the load balancer pool is dealing with
approximately 60 simultaneous connections, approximately 6 requests per
second, and an ELB health check is performed every 30 seconds for each
of 4 ELBs. The logs show that 3 of those health checks happen in the
same second.

Cheers,
David.

Posted at Nginx Forum:
http://forum.nginx.org/read.php?2,182969,183186#msg-183186

Hello!

On Tue, Mar 15, 2011 at 10:26:16PM -0400, assistlydavid wrote:

> Hi Maxim,
>
> Thanks for the response.
>
> How are the workers allocated work? I'm still curious as to why
> I'd see a large block of ELB-only traffic flushed to the log at
> the same time, rather than a mix. Any thoughts on that?

Workers accept() connections and process requests on the connections
they have accepted.

Depending on settings (notably accept_mutex) and on the number of
already-accepted connections, the conditions under which workers call
accept() vary. With accept_mutex turned on (the default), it's likely
that on an underloaded server all or most new connections will be
accepted by the same worker.
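For reference, the directive in question lives in the events block; the snippet below just makes the default explicit.

```nginx
events {
    # Default at the time of writing: workers take turns holding the
    # accept mutex, so on a lightly loaded server one worker can end
    # up accepting most or all new connections.
    accept_mutex  on;
}
```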

> Presumably, the chance of multiple workers flushing the same
> type of log messages (ELB-only) at the same time is very low,
> especially when we're seeing constant traffic from end-users
> more frequently than ELB health checks. Am I perhaps seeing all
> ELB-only traffic flushed from the buffer of a single worker?

Most likely the health checks use a single keepalive connection, and
that's why all of the health checks go to the same worker.

And new client connections are currently accepted by another
worker.

> Are workers chosen based on server/location directives?

No.

Maxim D.

That makes perfect sense, Maxim. I hadn't even thought about the
HTTP/1.1 keepalive connections staying open.

Thanks.

David.

Posted at Nginx Forum:
http://forum.nginx.org/read.php?2,182969,183379#msg-183379
