Forum: Ruby-core [Bug #1696] http downloads are unuseably slow

Posted by Steven Hartland (Guest)
on 2009-06-27 18:10
(Received via mailing list)
Bug #1696: http downloads are unuseably slow
http://redmine.ruby-lang.org/issues/show/1696

Author: Steven Hartland
Status: Open, Priority: High
ruby -v: ruby 1.8.6 (2008-08-11 patchlevel 287) [amd64-freebsd7]

I've been using a a bit of code which use open-uri and http urls to 
download files and have found this is unuseably slow.
Example:

require 'open-uri'
open( <http uri> )

I did some benchmarking and the download of a 157MB file from the local 
machine was taking over 20 seconds and using 100% where as wget for the 
same file only took 0.7seconds.

Digging some more and profiling the code with RubyProf revealed that 
during the execution of the download over 11,000 threads where being 
created. This I've tracked down to the net/protocol module and the 
Net::BufferIO::rbuf_fill method which is using a timeout block to wrap 
the @io.sysread(1024) call. This is clearly an extremely bad way to do 
this, totally crippling the performance.

Below are benchmarks and some profiles of this issue and various quick 
fixes I tested here
== 1.8.6 core ==
      user     system      total        real
open 15.664062   6.164062  21.828125 ( 22.810859)

 %self     total     self     wait    child    calls  name
 14.04      5.27     3.00     2.27     0.00   109426 
<Class::Thread>#start (ruby_runtime:0}
 10.97     13.14     2.34     0.00    10.80   109427 
<Module::Timeout>#timeout (/usr/local/lib/ruby/1.8/timeout.rb:52}
  6.65      2.52     1.42     1.10     0.00   109426  Thread#kill 
(ruby_runtime:0}
  5.78     21.36     1.23     0.00    20.12        1 
Net::BufferedIO#read (/usr/local/lib/ruby/1.8/net/protocol.rb:79}
  5.27     14.27     1.12     0.00    13.14   109427  Object#timeout 
(/usr/local/lib/ruby/1.8/timeout.rb:92}
  5.12      1.11     1.09     0.02     0.00   109426  IO#sysread 
(ruby_runtime:0}
  4.75      2.61     1.02     0.00     1.59   109426  OpenURI::Buffer#<< 
(/usr/local/lib/ruby/1.8/open-uri.rb:297}
  3.40      3.62     0.73     0.00     2.89   109426 
Net::ReadAdapter#call_block 
(/usr/local/lib/ruby/1.8/net/protocol.rb:380}
  2.52      0.54     0.54     0.00     0.00   109435  String#slice! 
(ruby_runtime:0}
  2.30      1.33     0.49     0.00     0.84   109415 
#<Class:0x801ac6f30>#<< ((eval):1}
  2.08      0.50     0.45     0.00     0.05   109427  Class#initialize 
(ruby_runtime:0}
  2.08      0.98     0.45     0.00     0.54   109435 
Net::BufferedIO#rbuf_consume 
(/usr/local/lib/ruby/1.8/net/protocol.rb:137}
  2.01      4.05     0.43     0.00     3.62   109426 
Net::ReadAdapter#<< (/usr/local/lib/ruby/1.8/net/protocol.rb:371}
  1.94      0.41     0.41     0.00     0.00   109416  IO#write 
(ruby_runtime:0}
  1.72      1.03     0.37     0.00     0.66   109435  Class#new 
(ruby_runtime:0}
  1.50      0.32     0.32     0.00     0.00   328286  String#size 
(ruby_runtime:0}
  1.32     14.55     0.28     0.00    14.27   109426 
Net::BufferedIO#rbuf_fill (/usr/local/lib/ruby/1.8/net/protocol.rb:131}
  1.21      0.84     0.26     0.00     0.58   109417  Kernel#__send__ 
(ruby_runtime:0}
  1.13      0.24     0.24     0.00     0.00   109429  String#<< 
(ruby_runtime:0}
  0.99      0.30     0.21     0.00     0.09   109428  Fixnum#== 
(ruby_runtime:0}
  0.84      0.18     0.18     0.00     0.00   328289  Fixnum#+ 
(ruby_runtime:0}
  0.77      0.16     0.16     0.00     0.00   109427 
<Class::Class>#allocate (ruby_runtime:0}
  0.77      0.58     0.16     0.00     0.41   109415  IO#<< 
(ruby_runtime:0}
  0.73      0.28     0.16     0.00     0.12   109475  Hash#[] 
(ruby_runtime:0}
...

== 1.8.6 core + no timeout block ==
      user     system      total        real
open  1.757812   0.765625   2.523438 (  2.873194)

 %self     total     self     wait    child    calls  name
 21.62      2.63     2.63     0.00     0.00   109426  IO#sysread 
(ruby_runtime:0}
 15.72      1.91     1.91     0.00     0.00   109429  String#<< 
(ruby_runtime:0}
 13.98      1.70     1.70     0.00     0.00   109416  IO#write 
(ruby_runtime:0}
  9.56     12.16     1.16     0.00    10.99        1 
Net::BufferedIO#read (/usr/local/lib/ruby/1.8/net/protocol.rb:79}
  7.83      3.73     0.95     0.00     2.77   109426  OpenURI::Buffer#<< 
(/usr/local/lib/ruby/1.8/open-uri.rb:297}
  5.58      4.68     0.68     0.00     4.00   109426 
Net::ReadAdapter#call_block 
(/usr/local/lib/ruby/1.8/net/protocol.rb:385}
  3.66      4.99     0.45     0.00     4.55   109426 
Net::BufferedIO#rbuf_fill (/usr/local/lib/ruby/1.8/net/protocol.rb:138}
  3.34      2.38     0.41     0.00     1.97   109415 
#<Class:0x801ac82b8>#<< ((eval):1}
  3.27      0.53     0.40     0.00     0.13   109435 
Net::BufferedIO#rbuf_consume 
(/usr/local/lib/ruby/1.8/net/protocol.rb:142}
  2.95      5.04     0.36     0.00     4.68   109426 
Net::ReadAdapter#<< (/usr/local/lib/ruby/1.8/net/protocol.rb:376}
  2.25      0.27     0.27     0.00     0.00   328289  Fixnum#+ 
(ruby_runtime:0}
  1.54      0.19     0.19     0.00     0.00   328286  String#size 
(ruby_runtime:0}
  1.28      0.27     0.16     0.00     0.12   109475  Hash#[] 
(ruby_runtime:0}
  1.22      0.15     0.15     0.00     0.00   109431  Module#=== 
(ruby_runtime:0}
  1.22      1.85     0.15     0.00     1.70   109415  IO#<< 
(ruby_runtime:0}
  1.09      0.13     0.13     0.00     0.00   109435  String#slice! 
(ruby_runtime:0}
  0.96      1.97     0.12     0.00     1.85   109417  Kernel#__send__ 
(ruby_runtime:0}
  0.96      0.12     0.12     0.00     0.00   109426  String#length 
(ruby_runtime:0}
  0.96      0.12     0.12     0.00     0.00   109435  Hash#default 
(ruby_runtime:0}
  0.71      0.09     0.09     0.00     0.00   109439  Fixnum#< 
(ruby_runtime:0}
...

== 1.8.6 core + no timeout block + 10K sysread ==
      user     system      total        real
open  0.570312   1.406250   1.976562 (  2.521614)

 %self     total     self     wait    child    calls  name
 27.32      0.39     0.39     0.00     0.00    10953  String#slice! 
(ruby_runtime:0}
 18.58      0.27     0.27     0.00     0.00    10943  IO#write 
(ruby_runtime:0}
 12.02      0.17     0.17     0.00     0.00    10944  IO#sysread 
(ruby_runtime:0}
  8.74      1.42     0.12     0.00     1.30        1 
Net::BufferedIO#read (/usr/local/lib/ruby/1.8/net/protocol.rb:79}
  7.10      0.50     0.10     0.00     0.40    10944  OpenURI::Buffer#<< 
(/usr/local/lib/ruby/1.8/open-uri.rb:297}
  5.46      0.37     0.08     0.00     0.29    10942 
#<Class:0x801ac8b28>#<< ((eval):1}
  3.28      0.05     0.05     0.00     0.00    32840  String#size 
(ruby_runtime:0}
  2.73      0.55     0.04     0.00     0.52    10944 
Net::ReadAdapter#call_block 
(/usr/local/lib/ruby/1.8/net/protocol.rb:385}
  2.73      0.04     0.04     0.00     0.00    32843  Fixnum#+ 
(ruby_runtime:0}
  2.73      0.59     0.04     0.00     0.55    10944 
Net::ReadAdapter#<< (/usr/local/lib/ruby/1.8/net/protocol.rb:376}
  1.64      0.21     0.02     0.00     0.19    10944 
Net::BufferedIO#rbuf_fill (/usr/local/lib/ruby/1.8/net/protocol.rb:138}
  1.64      0.41     0.02     0.00     0.39    10953 
Net::BufferedIO#rbuf_consume 
(/usr/local/lib/ruby/1.8/net/protocol.rb:142}
  1.09      0.02     0.02     0.00     0.00    10944  String#length 
(ruby_runtime:0}
  1.09      0.02     0.02     0.00     0.00    10947  String#<< 
(ruby_runtime:0}
  1.09      0.02     0.02     0.00     0.00    10953  Hash#default 
(ruby_runtime:0}
  1.09      0.28     0.02     0.00     0.27    10942  IO#<< 
(ruby_runtime:0}
...

== 1.8.6 core + no timeout block + 100K sysread ==
      user     system      total        real
open  0.390625   0.484375   0.875000 (  1.013516)

 %self     total     self     wait    child    calls  name
 45.45      0.43     0.43     0.00     0.00     1098  IO#sysread 
(ruby_runtime:0}
 33.88      0.32     0.32     0.00     0.00     1097  IO#write 
(ruby_runtime:0}
  8.26      0.08     0.08     0.00     0.00     1107  String#slice! 
(ruby_runtime:0}
  1.65      0.45     0.02     0.00     0.44     1098 
Net::BufferedIO#rbuf_fill (/usr/local/lib/ruby/1.8/net/protocol.rb:138}
  1.65      0.02     0.02     0.00     0.00     3302  String#size 
(ruby_runtime:0}
  1.65      0.35     0.02     0.00     0.34     1098  OpenURI::Buffer#<< 
(/usr/local/lib/ruby/1.8/open-uri.rb:297}
  1.65      0.93     0.02     0.00     0.91        1 
Net::BufferedIO#read (/usr/local/lib/ruby/1.8/net/protocol.rb:79}
  0.83      0.01     0.01     0.00     0.00        1 
URI::HTTP#initialize (/usr/local/lib/ruby/1.8/uri/http.rb:77}
  0.83      0.09     0.01     0.00     0.08     1107 
Net::BufferedIO#rbuf_consume 
(/usr/local/lib/ruby/1.8/net/protocol.rb:142}
  0.83      0.01     0.01     0.00     0.00      357 
Module#method_added (ruby_runtime:0}
  0.83      0.01     0.01     0.00     0.00       37  <Class::Dir>#[] 
(ruby_runtime:0}
  0.83      0.01     0.01     0.00     0.00     1101  String#<< 
(ruby_runtime:0}
  0.83      0.36     0.01     0.00     0.35     1098 
Net::ReadAdapter#call_block 
(/usr/local/lib/ruby/1.8/net/protocol.rb:385}
  0.83      0.33     0.01     0.00     0.32     1096  IO#<< 
(ruby_runtime:0}
...

== 1.8.6 core + no timeout block + 1M sysread ==
      user     system      total        real
open  0.445312   0.335938   0.781250 (  0.913736)

 %self     total     self     wait    child    calls  name
 29.66      0.27     0.27     0.00     0.00      157  IO#sysread 
(ruby_runtime:0}
 28.81      0.27     0.27     0.00     0.00      166  String#slice! 
(ruby_runtime:0}
 28.81      0.27     0.27     0.00     0.00      156  IO#write 
(ruby_runtime:0}
  7.63      0.07     0.07     0.00     0.00      160  String#<< 
(ruby_runtime:0}
  1.69      0.30     0.02     0.00     0.28      157 
Net::ReadAdapter#call_block 
(/usr/local/lib/ruby/1.8/net/protocol.rb:385}
  0.85      0.01     0.01     0.00     0.00       37  <Class::Dir>#[] 
(ruby_runtime:0}
  0.85      0.01     0.01     0.00     0.00       98  Class#inherited 
(ruby_runtime:0}
  0.85      0.02     0.01     0.00     0.02        2 
Kernel#gem_original_require (ruby_runtime:0}
  0.85      0.28     0.01     0.00     0.27      157  OpenURI::Buffer#<< 
(/usr/local/lib/ruby/1.8/open-uri.rb:297}
...

== 1.8.6 core + no timeout block + 1M sysread + slice bypass ==
      user     system      total        real
open  0.093750   0.523438   0.617188 (  0.717748)

 %self     total     self     wait    child    calls  name
 66.67      0.23     0.23     0.00     0.00       52  IO#write 
(ruby_runtime:0}
 13.33      0.05     0.05     0.00     0.00       52  IO#sysread 
(ruby_runtime:0}
  4.44      0.02     0.02     0.00     0.00       55  String#<< 
(ruby_runtime:0}
  4.44      0.02     0.02     0.00     0.00        1 
TCPSocket#initialize (ruby_runtime:0}
  2.22      0.01     0.01     0.00     0.00      101  Hash#[] 
(ruby_runtime:0}
  2.22      0.01     0.01     0.00     0.00       10  String#slice! 
(ruby_runtime:0}
  2.22      0.02     0.01     0.00     0.01        2 
Kernel#gem_original_require (ruby_runtime:0}
  2.22      0.25     0.01     0.00     0.24       52  OpenURI::Buffer#<< 
(/usr/local/lib/ruby/1.8/open-uri.rb:297}
  2.22      0.01     0.01     0.00     0.00       37  <Class::Dir>#[] 
(ruby_runtime:0}
...

Note: The change does also raises questions about the core slice method 
not performing as well as it might for the no-op case.

== 1.8.6 core + select timeout + 1M sysread + slice bypass ==
      user     system      total        real
open  0.070312   0.226562   0.296875 (  0.898285)

 %self     total     self     wait    child    calls  name
 64.10      0.20     0.20     0.00     0.00       49  IO#write 
(ruby_runtime:0}
 15.38      0.05     0.05     0.00     0.00       49  IO#sysread 
(ruby_runtime:0}
  7.69      0.02     0.02     0.00     0.00       52  String#<< 
(ruby_runtime:0}
  5.13      0.23     0.02     0.00     0.21       49  OpenURI::Buffer#<< 
(/usr/local/lib/ruby/1.8/open-uri.rb:297}
  2.56      0.02     0.01     0.00     0.02        2 
Kernel#gem_original_require (ruby_runtime:0}
  2.56      0.01     0.01     0.00     0.00       37  <Class::Dir>#[] 
(ruby_runtime:0}
  2.56      0.02     0.01     0.00     0.01        3  Array#each 
(ruby_runtime:0}
...

So overall these changes improve the speed from 22seconds to 0.7seconds 
in my test case making ruby comparable with wget's performance.

The patch for this last run is attached and should be compatible with 
existing functionality through the use of select + raise Timeout::Error 
instead of timeout( @read_timeout ) { .. }

Hope this helps, any questions let me know
Posted by Steven Hartland (Guest)
on 2009-06-27 18:11
(Received via mailing list)
Issue #1696 has been updated by Steven Hartland.


The patch description above should read: patch for select timeout + 1M 
sysread + slice bypass
----------------------------------------
http://redmine.ruby-lang.org/issues/show/1696
Posted by Steven Hartland (Guest)
on 2009-06-27 20:48
(Received via mailing list)
Issue #1696 has been updated by Steven Hartland.

File protocol.rb.patch added

Silly bug in original patch, fixed version here, results still 
consistent with original.

It may be interesting to note that this also brings net/http speed 
inline with that of libcurl ruby gems
open.core     15.296875   1.437500  16.734375 ( 22.416853)
open.patched   0.093750   0.218750   0.312500 (  0.340683)
taf2-curb      0.171875   0.273438   0.445312 (  0.546568)

----------------------------------------
http://redmine.ruby-lang.org/issues/show/1696
Posted by Muhammad Ali (oldmoe)
on 2009-06-28 00:49
(Received via mailing list)
Net/HTTP in 1.9.2dev is already working as you described with two
exceptions:
1 - It uses nonblocking I/O
2 - It uses a 16KB buffer
The 1MB buffer you are using is too large for several reasons:
1 - You use blocking I/O calls, this works great if you are downloading 
from
the local machine. OTOH if the server you are connecting to is slow,
attempting to collect such a large buffer in a blocking manner is 
guaranteed
to cause serious delays.
2 - If multiple threads are downloading data together (which is normal 
for
crawlers) you will be allocating a lot of memory at once this way 
(mutliple
strings up to 1MB each). This will propably tax the GC.
You might want to upgrade to 1.8.7 (which implements read_nonblock) and
modify your patch for a non blocking behavior
oldmoe
Posted by Roger Pack (Guest)
on 2009-06-28 01:29
(Received via mailing list)
> Net/HTTP in 1.9.2dev is already working as you described with two
> exceptions:
> 1 - It uses nonblocking I/O
> 2 - It uses a 16KB buffer

does 1.9.2dev work as quickly then?
Posted by Muhammad Ali (oldmoe)
on 2009-06-28 03:58
(Received via mailing list)
Not sure really, never tested it.
Posted by Eero Saynatkari (roo_p)
on 2009-06-28 11:30
(Received via mailing list)
Excerpts from rogerdpack's message of Sun Jun 28 02:29:12 +0300 2009:
> > Net/HTTP in 1.9.2dev is already working as you described with two
> > exceptions:
> > 1 - It uses nonblocking I/O
> > 2 - It uses a 16KB buffer
> 
> does 1.9.2dev work as quickly then?

Whether it does or not, at least decreasing the buffer size
is absolutely necessary. The performance of a HTTP download
cannot be tuned based on a local, large file.


Eero
Posted by Steven Hartland (Guest)
on 2009-06-28 20:27
(Received via mailing list)
Issue #1696 has been updated by Steven Hartland.


As you can see from the test results a 16Kb buffer is way to small for 
high bandwidth connection, so as Eero mentioned this definitely needs to 
be documenting. N.B. I assume this has been moved to either a class 
variable or other accessible option.

None blocking support should be trivial, simply setting the flag on the 
socket should be sufficient, but with the current patch select will be 
achieving this anyway assuming sysread is actually using recv under the 
covers and returns with what ever data is available instead of waiting 
for size data. This is not clear from the docs as sysread, readpartial 
and read_nonblock contain conflicting info and in places some English 
which just doesn't make sense. If not just changing that one line should 
be sufficient.

Having a quick look at the current nightly snapshot the code:
1. It may not be ordered in the best way for performance especially on 
slower connections. I would suggest testing if inverting the order and 
allowing select to do its work before doing a none blocking read, or in 
fact a standard recv ( no need for none blocking, as its already 
guaranteed to work due to the select check ). Careful benchmarking of 
differing speed connections would be needed to confirm which is better.
2. slice is still in place, which could also still be causing an issue 
unless the underlying slice implementation has been fixed for the no-op 
case.

----------------------------------------
http://redmine.ruby-lang.org/issues/show/1696
Posted by Muhammad Ali (oldmoe)
on 2009-06-28 23:26
(Received via mailing list)
I believe if the descriptor is set to non blocking then sysread will 
return
whatever data available or raise an error if there isn't any.
But non block is critical for large buffer sizes as select will return 
the
socket as ready even if it has only one byte to read.

I would also recommend the nonblocking call happens before the select. 
This
way fast connections will not be taxed by two system calls, while slow
connections are slow anyway so they can wait for the Errno::EAGAIN 
exception
to be handled

oldmoe
Posted by Steven Hartland (Guest)
on 2009-06-29 02:14
(Received via mailing list)
Issue #1696 has been updated by Steven Hartland.


> But non block is critical for large buffer sizes as select will return the
> socket as ready even if it has only one byte to read.

This makes no difference to the socket level recv, even if the socket 
isn't
set as none blocking, it will always return what it has. It doesn't wait 
till
the data size matches size of the buffer passed, so again it depends on 
how the
lower level methods actually deal with this; which is not clear from the 
docs.

> I would also recommend the nonblocking call happens before the select. This
> way fast connections will not be taxed by two system calls, while slow
> connections are slow anyway so they can wait for the Errno::EAGAIN exception
> to be handled.

This really should be profiled instead of making assumptions. I say that 
as I
suspect exception handling is actually quite a bit heavier than a select 
call,
so if its used in any significant amount, it could cause noticeably 
higher CPU
load. It could go either way, but its always better to know and document 
why a
design decision was made than to make just make it on a supposition.

A very quick test of 10 requests for a 157MB from a high speed local 
connection
shows:
                        user     system      total        real
open.none-blocking      5.195312   4.023438   9.218750 ( 11.924775)
open.blocking-patched   2.312500   4.562500   6.875000 (  9.244287)

N.B. The none-blocking test was done using 1.8 install just replacing 
protocol.rb
with the nightly snapshot, increasing the buffer to 1MB and removing the 
rescue
type for compatibility.
----------------------------------------
http://redmine.ruby-lang.org/issues/show/1696
Posted by Tanaka Akira (Guest)
on 2009-06-29 20:18
(Received via mailing list)
In article <4a464441bf3f7_13bd3907d016634@redmine.ruby-lang.org>,
  Steven Hartland <redmine@ruby-lang.org> writes:

> This I've tracked down to the net/protocol module and the Net::BufferIO::rbuf_fill method which is using a timeout block to wrap the @io.sysread(1024) call. This is clearly an extremely bad way to do this, totally crippling the performance.

There is a reson to use timeout().  It is used to fix
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=265429

Before the fix, IO.select was used in rbuf_fill.
Posted by Eero Saynatkari (roo_p)
on 2009-06-29 20:39
(Received via mailing list)
Excerpts from Tanaka Akira's message of Mon Jun 29 21:17:58 +0300 2009:
> In article <4a464441bf3f7_13bd3907d016634@redmine.ruby-lang.org>,
>   Steven Hartland <redmine@ruby-lang.org> writes:
> 
> > This I've tracked down to the net/protocol module and the Net::BufferIO::rbuf_fill method which is using a timeout block to wrap the @io.sysread(1024) call. This is clearly an extremely bad way to do this, totally crippling the performance.
> 
> There is a reson to use timeout().  It is used to fix
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=265429
> 
> Before the fix, IO.select was used in rbuf_fill.

On the face of it, using timeout does not seem like the
optimal fix to the problem...speaking of, can anyone
reproduce this issue currently, or has it gotten fixed
otherwise in the 5 intervening years?


Eero
Posted by James Gray (bbazzarrakk)
on 2009-06-29 23:12
(Received via mailing list)
On Jun 29, 2009, at 1:38 PM, Eero Saynatkari wrote:

>> There is a reson to use timeout().  It is used to fix
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=265429
>>
>> Before the fix, IO.select was used in rbuf_fill.
>
> On the face of it, using timeout does not seem like the
> optimal fix to the problem...speaking of, can anyone
> reproduce this issue currently, or has it gotten fixed
> otherwise in the 5 intervening years?

Well, this definitely sounds like the kind of thing I am seeing.

James Edward Gray II
Posted by James Gray (bbazzarrakk)
on 2009-06-29 23:14
(Received via mailing list)
On Jun 29, 2009, at 1:17 PM, Tanaka Akira wrote:

>
> Before the fix, IO.select was used in rbuf_fill.

And Timeout does get to interrupt that?

James Edward Gray II
Posted by Eero Saynatkari (roo_p)
on 2009-06-29 23:39
(Received via mailing list)
Excerpts from James Gray's message of Tue Jun 30 00:12:23 +0300 2009:
>
> Well, this definitely sounds like the kind of thing I am seeing.

Which system are you on, and do you have some reasonably
small test case? Presumably requires recompiling without
the timeout?


Eero
Posted by James Gray (bbazzarrakk)
on 2009-06-29 23:46
(Received via mailing list)
On Jun 29, 2009, at 4:38 PM, Eero Saynatkari wrote:

> Excerpts from James Gray's message of Tue Jun 30 00:12:23 +0300 2009:
>>
>> Well, this definitely sounds like the kind of thing I am seeing.
>
> Which system are you on, and do you have some reasonably
> small test case? Presumably requires recompiling without
> the timeout?

I haven't been able to reliably reproduce this, no:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/24018

James Edward Gray II
Posted by Tanaka Akira (Guest)
on 2009-06-30 07:06
(Received via mailing list)
In article <473C3283-40AA-4079-92C0-31FA11A64BE0@grayproductions.net>,
  James Gray <james@grayproductions.net> writes:

> Well, this definitely sounds like the kind of thing I am seeing.

Do you use SSL?
Posted by James Gray (bbazzarrakk)
on 2009-06-30 16:29
(Received via mailing list)
On Jun 30, 2009, at 12:05 AM, Tanaka Akira wrote:

> In article <473C3283-40AA-4079-92C0-31FA11A64BE0@grayproductions.net>,
>  James Gray <james@grayproductions.net> writes:
>
>> Well, this definitely sounds like the kind of thing I am seeing.
>
> Do you use SSL?

Yes.

James Edward Gray II
Posted by Young Hyun (Guest)
on 2009-06-30 19:59
(Received via mailing list)
On Jun 29, 2009, at 2:12 PM, James Gray wrote:

>>>> way to do this, totally crippling the performance.
>
> Well, this definitely sounds like the kind of thing I am seeing.

For what it's worth, I've been fighting a similar problem for a while
now, trying Ruby 1.8.5, 1.8.6, and 1.8.7.  The circumstances are
similar: TCP connection with SSL, Ruby hanging on sysread, and very
rare failure (once every 3 months with persistent connections).

One possibility is that IO.select is returning a false readiness
notification, so a subsequent sysread hangs.  Linux manpages note that
select(2) "may report a socket file descriptor as 'ready for reading',
while nevertheless a subsequent read blocks" [1], and the recommended
solution is to work with nonblocking sockets.  (Incidentally, my
failures were under FreeBSD.)  However, until Ruby 1.9, SSL sockets
didn't support nonblocking I/O, so we were always susceptible to this
problem; we just lucked out because of the rarity of this failure mode.

Another possibility, and the one I have some suspicion of, is that
there's some pathological interaction with SSL; that is, IO.select
indicates readiness (and there actually are some bytes available to
read), the client invokes SSLSocket#sysread, and the SSL library does
some low-level SSL re-negotiations (which according to the SSL
protocol can happen at any time) and then blocks waiting on actual
user-level data.       Or instead of a low-level SSL re-negotiation,
perhaps the connection with the peer is lost (say, the remote box
loses power so that there's no notification of connection loss) in an
inopportune moment in SSLSocket#sysread (for example, in the middle of
reading an SSL protocol message), and SSLSocket#sysread hangs since it
has no way of detecting the connection loss (assuming SSLSocket isn't
using keepalives).

Anyway, this is all speculation, but I've had multiple failures over
the past 18 months, and a perusal of the Ruby core dump shows the
exact same pattern of a hang on select(2) called underneath in the
implementation of rb_io_wait_readable() and similar internal functions
(I'm not talking about a hang in select(2) as a direct result of a
user-level code invoking IO.select()).

I'm going to try using the nonblocking SSL operations under Ruby 1.9
to see whether it prevents my particular problem.  In general, though,
it seems safest to always use nonblocking sockets with Ruby if you're
going to use IO.select.

  --Young

   [1] http://linux.die.net/man/2/select
Please log in before posting. Registration is free and takes only a minute.
Existing account (Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
No account? Register here.