HTTPS + large-file sending sometimes fails

GSSSSbor_Farkas · December 17, 2007, 3:21pm

hi,

i am sending large (400mb) csv files using nginx, using https.

sometimes not the whole file is served by nginx.
it simply closes the connection before the whole file is sent.

if the downloader supports resuming, then the file-download can be
resumed.

the file is served using the “X-Accel-Redirect” technique (it’s
authenticated on a proxied apache server, and then the file is served
directly by nginx)

when such problems happen, the error-log contains this:

2007/12/17 01:02:03 [crit] 21821#0: *864836 SSL_write() failed (SSL:
error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry) while
sending response to client, client: 1.2.3.4, server: www.example.com,
URL: “/some/url/to/a.csv”, upstream:
“http://internal-ip/some/auth/url/a.csv”, host: “www.example.com”

debian lenny (it has nginx 0.5.30-1)

any ideas why this is happening?

also, i checked the release-notes for 0.5.34, and it says:

"
Bugfix: the big responses may be transferred truncated if SSL and
gzip were used.
"

can this be somehow related to it?

is there a bug-database for nginx? so is there a way to check what this
bug was?

thanks,
gabor

GSSSSbor_Farkas · December 17, 2007, 3:38pm

On Mon, Dec 17, 2007 at 03:12:33PM +0100, Gábor Farkas wrote:

any ideas why is this happening?
is there a bug-database for nginx? so is there a way to check what this
bug was?

No, it is not that bug.
Could you create a debug log of this request (you do not need to use
X-Accel-Redirect for this)?

GSSSSbor_Farkas · December 17, 2007, 4:10pm

Igor S. wrote:

authenticated on a proxied apache server, and then the file is served
debian lenny (it has nginx 0.5.30-1)

any ideas why is this happening?

Could you create a debug log of this request (you do not need to use
X-Accel-Redirect for this)?

btw. now i reproduced the problem also without using “X-Accel-Redirect”.

one note: the whole thing (nginx itself, log-files,
data-files-to-be-served) are on NFS. could that be a problem?

regarding debug log output: you mean by using something like:

  events {
      debug_connection 127.0.0.1;
  }
?

i will try, but it generates VERY large debug-log-files. is there
perhaps a way how to make those debug-files smaller?

thanks,
gabor

GSSSSbor_Farkas · December 17, 2007, 4:46pm

Igor S. wrote:

if the downloader supports resuming, then the file-download can be
sending response to client, client: 1.2.3.4, server: www.example.com,

?

Yes.

i will try, but it generates VERY large debug-log-files. is there
perhaps a way how to make those debug-files smaller?

No, but probably I need the last 2000 lines only.

hi,

ok, attached.

please note, that i made several “changes” to the file:

i only took the lines that had the same “*number” text. i assume this
is what defines the steps in the same ‘request’, but i might be wrong.
i had to remove references to host-names, ip-addresses, etc.
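
The filtering described above (keep only the lines with the same “*number”, and only the tail of the log) can be sketched in Python. This is a minimal sketch, not a tool from the thread: the function name filter_connection and the regular expression are illustrative, and it assumes every log line has the "pid#tid: *number" layout seen in the error-log excerpt earlier.

```python
import re
from collections import deque

# Assumed layout, taken from the log excerpt in this thread:
#   2007/12/17 01:02:03 [crit] 21821#0: *864836 SSL_write() failed ...
# The "*864836" part is the connection number.
CONN_RE = re.compile(r"\d+#\d+: \*(\d+) ")


def filter_connection(lines, conn_id, keep_last=2000):
    """Return the last `keep_last` lines that belong to connection `conn_id`."""
    kept = deque(maxlen=keep_last)  # older matches fall off the front
    for line in lines:
        match = CONN_RE.search(line)
        if match and int(match.group(1)) == conn_id:
            kept.append(line)
    return list(kept)
```

To run it on a real file, pass the open file object as `lines`, for example `filter_connection(open("error.log"), 864836)`; the file name here is hypothetical.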

but i hope this data is enough to determine the problem.

if not, i can also clean-up the “complete” debug-log, and put it
somewhere online.

thanks,
gabor

GSSSSbor_Farkas · December 18, 2007, 10:28am

Igor S. wrote:

when such problems happen, the error-log contains this:

2007/12/17 01:02:03 [crit] 21821#0: *864836 SSL_write() failed (SSL:
error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry) while
sending response to client, client: 1.2.3.4, server: www.example.com,
URL: “/some/url/to/a.csv”, upstream:
“http://internal-ip/some/auth/url/a.csv”, host: “www.example.com”

debian lenny (it has nginx 0.5.30-1)

hi,

maybe i am completely wrong here, but:

(on ubuntu gutsy and hardy):

simply create a minimal https-serving nginx-config, serve a 200mb file,
and try to fetch it from a different computer, using a lot of
concurrent-requests (something like “ab -n 1000 -c 100”),
and you will get the mentioned error.
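
For anyone without ab at hand, the same kind of test can be approximated with a short Python script. This is a minimal sketch under stated assumptions: the function names (fetch_once, count_truncated, run) are made up for illustration, the server is assumed to send a Content-Length header, and certificate checks are switched off on the assumption that the test box uses a self-signed certificate.

```python
import ssl
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from http.client import IncompleteRead


def fetch_once(url, chunk=64 * 1024):
    """Download `url` and return (expected_bytes, received_bytes)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # assumption: self-signed test certificate
    expected, received = None, 0
    try:
        with urllib.request.urlopen(url, context=ctx) as resp:
            length = resp.headers.get("Content-Length")
            expected = int(length) if length is not None else None
            while True:
                data = resp.read(chunk)
                if not data:
                    break
                received += len(data)
    except (IncompleteRead, OSError):
        # The connection was closed early; keep whatever was counted so far.
        pass
    return expected, received


def count_truncated(results):
    """Count downloads that received fewer bytes than Content-Length promised."""
    return sum(
        1
        for expected, received in results
        if expected is not None and received < expected
    )


def run(url, total=1000, concurrency=100):
    """Fetch `url` `total` times with `concurrency` workers; return (truncated, total)."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(fetch_once, [url] * total))
    return count_truncated(results), len(results)
```

Calling `run("https://www.example.com/some/url/to/a.csv")` (a placeholder URL) returns how many of the downloads came back short. Requests that fail before any header arrives have no expected length and are not counted as truncated.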

so, basically, any https-serving is broken.

i find this very hard to believe, but i do not know what i should change
in the test.

(the packages: nginx-0.5.33, openssl-0.9.8g)

any ideas why this happens?

i attached the nginx.conf, and the site-config i used
(the nginx.conf is the default config from ubuntu)

thanks,
gabor

GSSSSbor_Farkas · December 17, 2007, 4:15pm

On Mon, Dec 17, 2007 at 03:56:39PM +0100, Gábor Farkas wrote:

“http://internal-ip/some/auth/url/a.csv”, host: “www.example.com”
btw. now i reproduced the problem also without using “X-Accel-Redirect”.

one note: the whole thing (nginx itself, log-files,
data-files-to-be-served) are on NFS. could that be a problem?

regarding debug log output: you mean by using something like:

  events {
      debug_connection 127.0.0.1;
  }
?

Yes.

i will try, but it generates VERY large debug-log-files. is there
perhaps a way how to make those debug-files smaller?

No, but probably I need the last 2000 lines only.

GSSSSbor_Farkas · December 18, 2007, 11:08am

On Tue, Dec 18, 2007 at 08:49:30PM +1100, Dave C. wrote:

problem very easily so I will now try with the static files located on
a local volume, not NFS.

No, the problem is between nginx and OpenSSL. It becomes apparent in an
NFS environment, but the problem is not in NFS. I will investigate it soon.

GSSSSbor_Farkas · December 18, 2007, 11:00am

Hi,

We’ve been seeing this problem with large (tens of megabytes) static
files being served from an NFS volume over https. We also use https in
our application for cart and login screens but have not seen any
errors of the type reported below. I tried a number of times to
recompile nginx using different ssl libraries and different proxy
configurations but ran out of time and removed the https download.

The one clue that you’ve provided, Gábor, is NFS. I can reproduce the
problem very easily so I will now try with the static files located on
a local volume, not NFS.

Cheers

Dave

GSSSSbor_Farkas · December 18, 2007, 11:09am

Dave C. wrote:

problem very easily so I will now try with the static files located on a
local volume, not NFS.

hi Dave,

i do not want to destroy your hope, but unfortunately we can also
reproduce it using local-disks (so no NFS)

gabor

GSSSSbor_Farkas · December 18, 2007, 10:24pm

On Tue, Dec 18, 2007 at 10:15:03AM +0100, Gábor Farkas wrote:

i find this very hard to believe, but i do not know what i should change
in the test.

(the packages: nginx-0.5.33, openssl-0.9.8g)

any ideas why this happens?

In the debug log I did not see anything invalid on the nginx side.
Then I looked at the OpenSSL sources, and now I suspect a bug in OpenSSL.
The attached patch may fix OpenSSL.

Could you build a patched OpenSSL version and link it statically with
nginx:

  tar zxf openssl-0.9.8g.tar.gz
  patch -d openssl-0.9.8g < bad_write_retry.txt
  tar zxf nginx-0.5.33.tar.gz
  cd nginx-0.5.33
  ./configure --with-openssl=../openssl-0.9.8g ...

GSSSSbor_Farkas · December 20, 2007, 12:58am

I don’t think this patch is correct. wpend_tot keeps track of the
user-provided buffer’s original length, so that SSL_write(3) can
detect when it’s erroneously retried with a smaller buffer.
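
The check described here can be illustrated with a toy model in Python. This is a sketch of the idea only, not OpenSSL's code: the class PendingWrite and the would_block flag are invented for the illustration, and only the length comparison mentioned above is modelled.

```python
class BadWriteRetry(Exception):
    """Stands in for the 'bad write retry' error seen in the log."""


class PendingWrite:
    """Toy model: remember the length of a write that could not complete."""

    def __init__(self):
        self.wpend_tot = 0  # length of the buffer from the interrupted call

    def write(self, buf, would_block):
        # If an earlier call is still pending, the retry must offer at
        # least as much data as that earlier call did.
        if self.wpend_tot and len(buf) < self.wpend_tot:
            raise BadWriteRetry("bad write retry")
        if would_block:
            # The socket is not writable: remember the original length
            # so that the next call can be checked against it.
            self.wpend_tot = len(buf)
            return -1
        self.wpend_tot = 0
        return len(buf)
```

In this model a first `write(data, would_block=True)` returns -1, and a retry with `data[:50]` raises BadWriteRetry, while a retry with the full `data` succeeds. Dropping the remembered length, as the patch is said to do, would remove the ability to catch the shortened retry.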

GSSSSbor_Farkas · December 20, 2007, 11:11am

Igor S. wrote:

sometimes not the whole file is served by nginx.

so, basically, any https-serving is broken.
The attached patch may fix OpenSSL.

Could you build a patched OpenSSL version and link it statically with nginx:
  tar zxf openssl-0.9.8g.tar.gz
  patch -d openssl-0.9.8g < bad_write_retry.txt
  tar zxf nginx-0.5.33.tar.gz
  cd nginx-0.5.33
  ./configure --with-openssl=../openssl-0.9.8g ...

hi,

i tried the patch, and unfortunately it did not help.

also, something i saw during testing: if you start to do a lot of
concurrent requests, and start to kill the clients (which are fetching
the file), then other requests also start to die more frequently than
normal.

thanks,
gabor

GSSSSbor_Farkas · December 20, 2007, 11:23am

On Thu, Dec 20, 2007 at 10:55:16AM +0100, Gábor Farkas wrote:

also, something i saw during testing: if you start to do a lot of
concurrent requests, and start to kill the clients (which are fetching
the file), then other requests also start to die more frequently than
normal.

Thank you, I have just reproduced the case.

GSSSSbor_Farkas · December 18, 2007, 4:33pm

Hi Gábor,

What puzzles us is the problem only occurs when downloading static
files, which is a rare occurrence, vs. checkouts and logins which
happen far more frequently.

I had tried to report this issue to the list a while ago but didn’t
do a good job of explaining the problem and didn’t pique Igor’s
interest. Hopefully your report has got his attention and it has
inspired me to dig deeper into the problem to come up with more
supporting data.

Cheers

Dave

GSSSSbor_Farkas · December 20, 2007, 11:23am

On Wed, Dec 19, 2007 at 03:48:30PM -0800, Matthew Dempsky wrote:

I don’t think this patch is correct. wpend_tot keeps track of the
user-provided buffer’s original length, so that SSL_write(3) can
detect when it’s erroneously retried with a smaller buffer.

You are right.

GSSSSbor_Farkas · December 20, 2007, 11:37am

On Thu, Dec 20, 2007 at 01:16:19PM +0300, Igor S. wrote:

On Thu, Dec 20, 2007 at 10:55:16AM +0100, Gábor Farkas wrote:

also, something i saw during testing: if you start to do a lot of
concurrent requests, and start to kill the clients (which are fetching
the file), then other requests also start to die more frequently than
normal.

Thank you, I have just reproduced the case.

The attached patch should fix the bug.

GSSSSbor_Farkas · December 20, 2007, 12:20pm

Yes - that is the use case that we discovered.

Client A starts downloading a large file

Client B starts downloading another large file

Either client A or B cancels the download, the other client gets an
abrupt connection close and this error is reported in the log

2007/12/19 23:15:51 [crit] 14852#0: *145 SSL_write() failed (SSL:
error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
error:1409F07F:SSL routines:SSL3_WRITE_P:) while sending response to
client, client: 172.16.0.44, server: staging.redbbuble.com, request:
“GET /pmicorp_calendar_20071121.zip HTTP/1.0”, host: “172.16.0.90”

Cheers

Dave

GSSSSbor_Farkas · December 20, 2007, 11:47am

On Thu, Dec 20, 2007 at 01:30:50PM +0300, Igor S. wrote:

The attached patch should fix the bug.

The complete patch.

GSSSSbor_Farkas · December 20, 2007, 2:41pm

Igor S. wrote:

the file), then other requests also start to die more frequently than

2nd, it clears and logs possible previously unhandled SSL errors so
they have no effect on a new operation.
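
Why a leftover error can hurt an unrelated connection may be easier to see with a toy model. The patch itself is not shown in this thread, so the Python sketch below is not its code. It only assumes what the OpenSSL documentation states: errors are kept in a per-thread queue, and SSL_get_error() is reliable only if that queue was empty before the I/O call. The names ErrorQueue and classify_blocked_write are invented for the illustration.

```python
class ErrorQueue:
    """Stands in for the error queue shared by all connections in a worker."""

    def __init__(self):
        self.items = []

    def push(self, message):
        self.items.append(message)

    def drain(self):
        items, self.items = self.items, []
        return items


def classify_blocked_write(queue, clear_first, log):
    """Classify a write that merely could not proceed yet.

    The write itself pushes no error; anything found in the queue was
    left behind by some earlier operation, possibly on another connection.
    """
    if clear_first:
        for stale in queue.drain():
            log.append("ignoring stale SSL error: " + stale)
    # A non-empty queue makes the harmless "try again" look like a failure.
    return "SSL_ERROR_SSL" if queue.items else "SSL_ERROR_WANT_WRITE"
```

With one stale entry in the queue, `classify_blocked_write(queue, False, log)` reports SSL_ERROR_SSL for a connection that did nothing wrong, while the same call with `clear_first=True` logs the stale entry and reports SSL_ERROR_WANT_WRITE.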

hi,

we tested your patch (not this one, but the previous one, the
"The complete patch." one), and it seems to fix the problem.

(we were fetching a 50MB file using 20 concurrent connections, and in
parallel there was one more connection that was starting and killing
its download. do you have any idea how to test it even better maybe?)

regarding this one… when will there be a new version of nginx, which
contains this patch?

or, alternatively, is there a public git/mercurial/svn/whatever repo for
nginx?

also, thanks a lot for the bugfix,

gabor

GSSSSbor_Farkas · December 20, 2007, 2:52pm

On Thu, Dec 20, 2007 at 02:31:03PM +0100, Gábor Farkas wrote:

concurrent requests, and start to kill the clients (which are fetching
on error. Actually SSL_shutdown() never returns -1.
(we were fetching a 50MB file using 20 concurrent connections, and in
parallel there was one more connection that was starting and killing
its download. do you have any idea how to test it even better maybe?)

Use real load and watch for crit and alert errors. Alerts may appear
with the last patch only.

regarding this one… when will there be a new version of nginx, which
contains this patch?

0.6.23 will be released next week.

or, alternatively, is there a public git/mercurial/svn/whatever repo for
nginx?

No, there’s no public repo.