File descriptor leak using persistent connections with Net::HTTP

Our webapp is a mashup of several suppliers that we communicate with using Net::HTTP. I thought it would improve performance if we used persistent connections, so I modified one of the suppliers to use the Net::HTTP::Persistent gem instead. Things worked great until about four hours in, when every incoming connection was rejected with a ‘Too many open files’ exception. I restarted, monitored the number of fds in use as the process ran, and noticed that every time there was an IOError (Connection reset by peer or Broken pipe), 10 more fds were used. If there was an EOFError, there was no descriptor leak.
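
For context, the supplier code went from one-shot Net::HTTP calls to roughly this pattern (a sketch with a placeholder host and path, not our real supplier endpoint):

require 'net/http/persistent'

# One client object, reused across requests, so the TCP connection to the
# supplier stays open between calls instead of being rebuilt every time.
http = Net::HTTP::Persistent.new 'supplier_client'

uri = URI.parse 'http://supplier.example.com/api/quote'
response = http.request uri   # a GET over the kept-alive connection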

I’ve examined the Net::HTTP::Persistent library and I’m fairly convinced there is no issue there. For a reset it just calls finish on the Net::HTTP instance and then start. The Net::HTTP library closes the underlying TCPSocket, so I can’t see any issue there. Regardless, I might suspect it if it were just 1 fd leaking, but I’m baffled as to why 10 fds are leaking.
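
In other words, the reset path amounts to something like this (my paraphrase of what the gem does, not its literal source):

# On a connection error the gem tears the connection down and reopens it:
def reset(http)
  http.finish    # Net::HTTP#finish closes the underlying TCPSocket
rescue IOError
  # Net::HTTP#finish raises IOError if the session isn't started; ignore
ensure
  http.start     # open a fresh TCPSocket for the next request
end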

So I’m not sure this is JRuby related, but I’m not sure who else to blame. Below is a short sample of the empirical evidence I have, listing the number of fds and the exceptions that result in the connection being reset. I was logging the number of open fds every second, but I’ve removed most of the lines where the count remained fairly consistent:

2010-07-25T18:39:48+00:00 247
2010-07-25T18:39:49+00:00 PERSIST: Resetting connection (due to Broken pipe - IOError) after 16 requests on 5482
2010-07-25T18:39:49+00:00 248
2010-07-25T18:39:50+00:00 258

2010-07-25T18:41:06+00:00 259
2010-07-25T18:41:07+00:00 PERSIST: Resetting connection (due to End of file reached - EOFError) after 3 requests on 3702
2010-07-25T18:41:07+00:00 258

2010-07-25T18:54:35+00:00 258
2010-07-25T18:54:36+00:00 PERSIST: Resetting connection (due to Connection reset by peer - IOError) after 32 requests on 3264
2010-07-25T18:54:36+00:00 269
2010-07-25T18:54:37+00:00 273
2010-07-25T18:54:38+00:00 269

2010-07-25T18:56:08+00:00 268
2010-07-25T18:56:09+00:00 PERSIST: Resetting connection (due to Broken pipe - IOError) after 13 requests on 3266
2010-07-25T18:56:09+00:00 280
2010-07-25T18:56:10+00:00 286
2010-07-25T18:56:11+00:00 279

2010-07-25T18:57:18+00:00 279
2010-07-25T18:57:19+00:00 PERSIST: Resetting connection (due to Broken pipe - IOError) after 14 requests on 5482
2010-07-25T18:57:19+00:00 289
2010-07-25T18:57:20+00:00 297
2010-07-25T18:57:21+00:00 288

2010-07-25T18:57:51+00:00 288
2010-07-25T18:57:52+00:00 PERSIST: Resetting connection (due to Broken pipe - IOError) after 20 requests on 3702
2010-07-25T18:57:52+00:00 290
2010-07-25T18:57:53+00:00 299

2010-07-25T19:01:38+00:00 297
2010-07-25T19:01:39+00:00 PERSIST: Resetting connection (due to Broken pipe - IOError) after 11 requests on 4074
2010-07-25T19:01:39+00:00 308
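
For reference, the counts above came from polling the process’s descriptor table once a second; a minimal Linux-only version of that monitor looks something like:

# Print this process's open-descriptor count once a second (reads /proc).
loop do
  count = Dir["/proc/#{Process.pid}/fd/*"].size
  puts "#{Time.now.utc.strftime('%Y-%m-%dT%H:%M:%S+00:00')} #{count}"
  sleep 1
end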

I plan on setting up a test with MRI to see if the same thing occurs, but due to lobster mini-season I won’t get to that until next week. So: free lobster dinner to whoever can give me some insight as to what’s going on. (Caveat: you have to come to St Pete, FL to claim it.)



Do you have a reduced test case you can post?


Could this be related to: http://jira.codehaus.org/browse/JRUBY-4767 ?


On Thu, Aug 5, 2010 at 11:23 AM, pardeeb wrote:

It sure looked like it could, but I ran my machine with the 1.5.2dev of JRuby and still have the same problem. I also tried creating a reduced test case but I haven’t been able to reproduce the problem with it.

Are the errors you’re seeing impacting the application in some way or
just filling up log files? Almost all of the ones you posted look like
normal “broken pipe” or “connection reset by peer” issues that would
mean the opposite end of some connection shut it down (like a browser
terminating a connection or a remote server breaking off).

I have the jmap dump, lsof output, fds-in-use-per-second output, and the connection-resets output if someone more knowledgeable than me wants to take a look. Otherwise, I’ll try to learn a little more about the tools for examining this and work on creating a successful test case.

I wouldn’t expect your errors to be related to 4767. The 4767 leak
affected servers with many live threads doing remote TCP hits, and
only “leaked” in that each thread held a selector object until it was
collected, which sometimes would hit the upper limit of how many
selectors the host JVM could handle.

For anyone interested, one anomaly I got with the 1.5.2dev version is the following error. I’m not positive it’s related to this release, but it’s one I hadn’t seen before:

I see no JRuby in this trace, but it’s interesting that you see it
with 1.5.2.dev and not with 1.5.1 (is that the case?) since I know of
no changes in 1.5.2.dev that would cause internal selector logic in
Grizzly to have errors. This one also looks like a remote socket was
closed before Grizzly could do its select on it…

- Charlie


Albert R. wrote:

Could this be related to: http://jira.codehaus.org/browse/JRUBY-4767 ?

It sure looked like it could, but I ran my machine with the 1.5.2dev of JRuby and still have the same problem. I also tried creating a reduced test case but I haven’t been able to reproduce the problem with it.

I have the jmap dump, lsof output, fds-in-use-per-second output, and the connection-resets output if someone more knowledgeable than me wants to take a look. Otherwise, I’ll try to learn a little more about the tools for examining this and work on creating a successful test case.

For anyone interested, one anomaly I got with the 1.5.2dev version is the following error. I’m not positive it’s related to this release, but it’s one I hadn’t seen before:

File: /var/www/rails/claritybase/current/log/production.log found Exception
429056: Aug 5, 2010 3:48:08 PM com.sun.grizzly.SelectorHandlerRunner handleSelectException
429057: SEVERE: doSelect IOException
429058: java.io.IOException: Bad file descriptor
429059:     at sun.nio.ch.FileDispatcher.close0(Native Method)
429060:     at sun.nio.ch.SocketDispatcher.close(SocketDispatcher.java:37)
429061:     at sun.nio.ch.SocketChannelImpl.kill(SocketChannelImpl.java:725)
429062:     at sun.nio.ch.EPollSelectorImpl.implDereg(EPollSelectorImpl.java:165)
429063:     at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:131)
429064:     at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:69)
429065:     at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
429066:     at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
429067:     at com.sun.grizzly.TCPSelectorHandler.select(TCPSelectorHandler.java:546)
429068:     at com.sun.grizzly.SelectorHandlerRunner.doSelect(SelectorHandlerRunner.java:185)
429069:     at com.sun.grizzly.SelectorHandlerRunner.run(SelectorHandlerRunner.java:130)
429070:     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
429071:     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
429072:     at java.lang.Thread.run(Thread.java:619)



Are the errors you’re seeing impacting the application in some way or
just filling up log files? Almost all of the ones you posted look like
normal “broken pipe” or “connection reset by peer” issues that would
mean the opposite end of some connection shut it down (like a browser
terminating a connection or a remote server breaking off).

It’s affecting my app in that each time I see one of those errors, 10 more fds get used up, until I reach my limit of 1024, at which point the app refuses all incoming connections.

I wouldn’t expect your errors to be related to 4767. The 4767 leak
affected servers with many live threads doing remote TCP hits, and
only “leaked” in that each thread held a selector object until it was
collected, which sometimes would hit the upper limit of how many
selectors the host JVM could handle.

Ok, it sounded similar and I was hopeful.

I see no JRuby in this trace, but it’s interesting that you see it
with 1.5.2.dev and not with 1.5.1 (is that the case?) since I know of
no changes in 1.5.2.dev that would cause internal selector logic in
Grizzly to have errors. This one also looks like a remote socket was
closed before Grizzly could do its select on it…

I wasn’t sure it was related to 1.5.2dev. It’s just an error that came up that I hadn’t seen before. It could be related to the new persistent connection code I’m using, or it might have just been a coincidence.



Ok, I think this is worth opening a bug and getting a heap dump to us. If you’re actually leaking fds, something’s not right somewhere.

- Charlie


Reposting without private data to the user list, so others can see
progress.


I know this isn’t anywhere near a fix, but if you’re facing this in production you could extend your application’s life somewhat by increasing your open-files limit (presuming you’re running on Linux here):

http://www.cs.uwaterloo.ca/~brecht/servers/openfiles.html
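
For example (4096 is an arbitrary number; the article above covers the system-wide knobs too):

# Raise the soft fd limit in the shell that launches the JRuby process
# (affects this session only):
ulimit -n 4096

# Or persistently, per user, in /etc/security/limits.conf
# ("appuser" is a placeholder account name):
#   appuser  soft  nofile  4096
#   appuser  hard  nofile  8192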


Sorry, I should have posted an update for this. A ticket was opened at http://jira.codehaus.org/browse/JRUBY-5018. There is also a bug where requests wrapped in Timeout blocks don’t expire when they should, at http://jira.codehaus.org/browse/JRUBY-5059. The jruby-ossl gem 0.7.2.dev at http://ci.jruby.org/job/jruby-ossl/34/ fixes both of these issues.
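
For reference, the Timeout case in the second ticket is the usual deadline wrapper around a request, schematically (placeholder host and path):

require 'timeout'
require 'net/http'

# Under JRUBY-5059, a slow read inside this block could run past the
# five-second deadline instead of raising Timeout::Error as expected.
Timeout.timeout(5) do
  Net::HTTP.start('supplier.example.com') do |http|
    http.get('/api/slow_call')
  end
end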



I had the same problem with JRuby.

Simply run this script 5000 times under JRuby:


require 'net/http'

# Net::HTTP#post takes a form-encoded string body, not a Hash
response = Net::HTTP.start('www.blabla.com') do |http|
  http.post('/api/call_me.php', 'firstparam=bu&secondparam=ba')
end
# puts response.body


I use ‘lsof | less’ with the End/Page Down keys to count the open files, and the count always increases. At some point I get the exception “Too many files open”.
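
A quicker way to watch the count than paging through lsof is to poll /proc from the script itself; a Linux-only sketch using the same placeholder host as above:

require 'net/http'

# Repeat the same call in-process and report this process's open fd count.
500.times do |i|
  Net::HTTP.start('www.blabla.com') do |http|
    http.post('/api/call_me.php', 'firstparam=bu&secondparam=ba')
  end
  count = Dir["/proc/#{Process.pid}/fd/*"].size
  puts "#{i + 1} requests, #{count} open fds" if (i + 1) % 50 == 0
end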

