Networking: select() blocks for seconds (> timeout)

I am using Ruby (as a prototype version) to communicate with a network
of solar cells (driven by firmware), over Udp. It works most of the time
perfectly (delivering messages with 1-5 milliseconds precision, even
beyond what we need).

But once in a while, this happens (ubuntu machine) while waiting the
response (from solar cells):

result = select(( [ comm ], nil, [ comm ], 0.050))

if result.nil?
# handle 50 msec timeout
else
# data
end

All is perfect for several thousands of messages, until this: we receive
data (in the “else” branch) 5 seconds later; so, the select remained
blocked for 5 seconds without signaling the 50 msec timeout.

I controlled the Api (even going back even to Comer & Stevens books, as
the system calls map to the C ones), and I think that the code (I only
show a small part above) is correct.
Finally, I concluded that the garbage collector must be entering in
action and for a few seconds (that usually are very close to 5)
everything stops in Ruby.

We were planning to port this area to C++ in any case; but at a certain
moment the performance results were so accurate that we thought to keep
Ruby even for communication (as it allows, using a bit of
metaprogramming, to read on the fly configuration files, in other words
all that we love with Ruby, etc).

If anyone has insight on this, let me know.

(Please, only avoid answers like “if you want real time, use C, not
Ruby!”; we know that. But the point is that Ruby was a magnificent
surprise in this area, aside from what described above, and I just
wonder if there is any thought or insight in this area).

Raul P.

Raul P. wrote:

if result.nil?
# handle 50 msec timeout
else
# data
end

All is perfect for several thousands of messages, until this: we receive
data (in the “else” branch) 5 seconds later; so, the select remained
blocked for 5 seconds without signaling the 50 msec timeout.

Hi,

Here was an original discussion of the issue: (something like, select
may
have already succeeded, before a bad UDP checksum is detected by the
kernel)

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/148443

Here is a kludge-workaround I ended up using: (via timeout)

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/304785

These days, I use EventMachine.

Hope this helps,

Bill

All is perfect for several thousands of messages, until this: we receive
data (in the “else” branch) 5 seconds later; so, the select remained
blocked for 5 seconds without signaling the 50 msec timeout.

Maybe it is doing DNS lookup. You could try
BasicSocket.do_not_reverse_lookup=true

-rp

On 4/25/10, Raul P. [email protected] wrote:

All is perfect for several thousands of messages, until this: we receive
data (in the “else” branch) 5 seconds later; so, the select remained
blocked for 5 seconds without signaling the 50 msec timeout.

I controlled the Api (even going back even to Comer & Stevens books, as
the system calls map to the C ones), and I think that the code (I only
show a small part above) is correct.
Finally, I concluded that the garbage collector must be entering in
action and for a few seconds (that usually are very close to 5)
everything stops in Ruby.

I also suspect GC is your issue here. You could try to reduce the
latency cost of GC by forcing GC to happen every time thru the main
loop… This will reduce the tme it takes for any individual GC cycle
by increasing the total amount of time spent in GC. You might also try
reducing the amount of garbage (and for that matter non-garbage)
objects which your program creates.

Bill K. wrote:

… blocked for 5 seconds without signaling the 50 msec timeout.

Hi,
Here was an original discussion of the issue: (something like, select
may
have already succeeded, before a bad UDP checksum is detected by the
kernel)

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/148443


Bill

Thanks, Bill, I will look in more detail what happens between the
“select()” return, and the socket.recvfrom.

I must tell you that from other tests I did, simpler, not talking to
firmware, just 2 ruby programs sending a packet continuously to each
other inside a regular Mac (not on ubuntu, which is quite faster) I
noticed this:
a) in average the round trip of a packet was 2.6 msec (good, taking in
account that the sw prints something per packet, computes the min, avg
and max round trip, etc).
b) but the maximum roundtrip time was 70 msec (the minimum was an
astonishing 0.148msec, i.e. 148 microseconds).
c) when I increased the number of exchanges (from 1 million to 10
millions) the average time was just a bit higher (from 2.6 msec to 2.89
msec), but the maximum roundtrip time jumped to 93 msec. So the
distribution is still close to the the average, but the “queue of the
bell” (I am not sure how to say this exactly in English, but I think you
understand) expands significantly, although very low.

So, Ruby is above our hopes (“in average”) for high speed comm, but it
is not predictable (the maximum delay in the above example being 25
times the average, which would force us to set the timeouts absurdly
high). Something happens inside Ruby once in a while which freezes it;
from here, my suspicion of the garbage collector.

In any case, I will look carefully at what you mentioned and I will
repeat my tests more rigorously (I should not do them on my dear Mac,
because who knows what the os may do from time to time).

I appreciated your input very much

Raul

I also suspect GC is your issue here. You could try to reduce the
latency cost of GC by forcing GC to happen every time thru the main
loop… This will reduce the tme it takes for any individual GC cycle
by increasing the total amount of time spent in GC. You might also try
reducing the amount of garbage (and for that matter non-garbage)
objects which your program creates.

If you’re looking to reduce jitter, you could use jruby which has a
tunable GC.

http://wiki.github.com/rdp/ruby_tutorials_core/gc

GL.
-rp

On 4/27/2010 12:01 PM, Raul P. wrote:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/148443
noticed this:
understand) expands significantly, although very low.

So, Ruby is above our hopes (“in average”) for high speed comm, but it
is not predictable (the maximum delay in the above example being 25
times the average, which would force us to set the timeouts absurdly
high). Something happens inside Ruby once in a while which freezes it;
from here, my suspicion of the garbage collector.

A quick way to confirm that would be to disable garbage collection via
GC.disable(). Of course this will result in a memory leak, but is the
best way to test for sure if the GC is the problem. If it turns out
that it is, indeed the GC, you could just disable it for the critical
sections.

Walton H. wrote:

On 4/27/2010 12:01 PM, Raul P. wrote:


So, Ruby is above our hopes (“in average”) for high speed comm, but it
is not predictable (the maximum delay in the above example being 25
times the average, which would force us to set the timeouts absurdly
high). Something happens inside Ruby once in a while which freezes it;
from here, my suspicion of the garbage collector.

A quick way to confirm that would be to disable garbage collection via
GC.disable(). Of course this will result in a memory leak, but is the
best way to test for sure if the GC is the problem. If it turns out
that it is, indeed the GC, you could just disable it for the critical
sections.

Walton

excellent suggestion; I will try it as soon as I can.

While the move to C++ may not reversible for us, I still want to know
the whole Ruby story on this point; and I will report back.

Thanks

Raul

Caleb C. wrote:

On 4/25/10, Raul P. [email protected] wrote:

All is perfect for several thousands of messages, until this: we receive
data (in the “else” branch) 5 seconds later; so, the select remained
blocked for 5 seconds without signaling the 50 msec timeout.

Finally, I concluded that the garbage collector must be entering in
action and for a few seconds (that usually are very close to 5)
everything stops in Ruby.

I also suspect GC is your issue here. You could try to reduce the
latency cost of GC by forcing GC to happen every time thru the main
loop… This will reduce the tme it takes for any individual GC cycle
by increasing the total amount of time spent in GC. You might also try
reducing the amount of garbage (and for that matter non-garbage)
objects which your program creates.

Thanks Caleb

can you point out a link to how to do that (making GC to happen every
time?); I found pages which mention setting Ruby Constants, but let me
know what is the best page if you have the chance.

Regarding reducing amount of garbage, it is a good tip (that I had began
to apply); coming from compiled languages, I tend to use a variable for
each “entity”, so that the final computation looks like a clear
mathematical formula (instead of those 140 characters lines that I see
all over in Rails which compute everything right there on the line. I
begin to understand why they may be doing it… although I wonder what it
must be maintaining such a monstrous code. Hopefully, there is a good
compromise).

Thanks very much for the suggestions

Raul

I have some final results on the problem described.

First, I must correct something on my initial post; I had stated that
there was an occasional high delay between the moment of the ‘select’
call and the ‘select’ return (ie, although the timeout set in the select
was 50msec, the delay could be of 5 seconds).

Actually, tracing across all calls in that section of code, the delay
occurs between the select return and the ‘recvfrom’. The details of what
follows may be of interest to anyone using Ruby for fast communication.
Test environment:

  • pure Ruby 1.9.2 (no gems, just the ‘socket’ library) on an ubuntu
    machine (lots of memory)
  • Ruby sends 4 Udp msgs per second to a micro-controller
  • The micro (C/assembler) responds (Udp) within a 10-30 milliseconds
    range
  • So it is 4 msgs sent and 4 responses rcvd every second

This is what I saw since midnight in one of the systems (the symbol
‘<->’ means 1 msg sent and response; the symbol ‘!!’ was inserted to
grep all abnormal results):

Time as Hour:Min:Sec:Msec; the ‘delay_sel_rcv’ (the time between

return of ‘select’ and ‘recvfrom’) value is in Seconds

log from midnight; all perfect until 1:21 am

01:21:19:914: <->: !! delay_sel_rcv=10.006525661
01:21:29:928: <->: !! delay_sel_rcv=10.010217133
01:21:39:937: <->: !! delay_sel_rcv=10.004327574
01:21:49:954: <->: !! delay_sel_rcv=10.011541082
01:21:59:972: <->: !! delay_sel_rcv=10.005877574
01:22:05:973: <->: !! delay_sel_rcv= 5.998151639

then all ok unti:

02:22:27:374: <->: !! delay_sel_rcv=10.008022384
02:22:37:394: <->: !! delay_sel_rcv=10.008430684
02:22:47:401: <->: !! delay_sel_rcv=10.004019076
02:22:57:409: <->: !! delay_sel_rcv=10.005836859
02:23:07:580: <->: !! delay_sel_rcv=10.008476556
02:23:17:610: <->: !! delay_sel_rcv=10.007506338
02:23:27:642: <->: !! delay_sel_rcv=10.007311141
02:23:37:655: <->: !! delay_sel_rcv=10.008225368
02:23:47:685: <->: !! delay_sel_rcv=10.018187389

then all ok until

04:24:08:873: <->: !! delay_sel_rcv=10.006355125

We can see from the above:

  • the first 80 minutes (from midnight to 01:21) went fine
  • then we see several delays of 10 seconds, in the same minute (each 10
    seconds apart from the other)
  • for 1 hour all was pefect again, exchanging some 12,000 messages with
    perfect timing
  • then we have 9 delays of 10 seconds (again separated by 10 seconds)
  • for 1 hour all went fine again; then the cycle repeats

This pattern can only indicate (in my view) the garbage collector, which
Ruby seems to run for 10 seconds several times in the same minute or so.
I could not put the calls to GC.disable/enable (to have the final
proof), around the select/recvfrom (not to interfere with a real
experiment that was moving heavy equipment). Notice that, if it is the
GC, disabling/enabling it will only shift the problem from one area of
the communication handler to another (and thus having a similar impact
on the applications using the comm handler).

Interestingly, this problem does not happen within 1 computer; I used
the identical Ruby program but replacing the Firmware with a Ruby
simulator (with same machines, same Udp and the same binary strings
exchanged); in a test of 10 hours, I only saw occasional “delays” betwen
select and recvfrom, but in the order of 100 milliseconds, and never of
5/10 seconds.
This would seem to indicate an inefficiency in the Udp stack (when used
across computers).

My conclusion is that if you want a predictable delay (with values
spread across a ‘tight bell’ curve, not just increasing the timeout to
cope with ‘everything’), you must use (for that section of the software)
a compiled language; at least until the technology of garbage collector
changes.

I hope that this is useful to others who use Ruby for high speed
communication (and the ones working on garbage collectors).

Last note: one year ago I met in a party a JPL engineer working on the
Mars exploration program; he was admirative of Ruby, but after some
jokes on the expressivity and beauty in old and new languages, he added
that they would never use scripting languages because “we don’t want the
garbage collector to enter in action just when we should to begin to
slow down the spacecraft near Mars and miss the landing! in fact, we
don’t even use C++, as we did not find it totally predictable; so we
will still use C for years to come”.

I never knew how well I would learn to appreciate his point

Raul P.

[email protected]

On 4/27/10, Raul P. [email protected] wrote:

can you point out a link to how to do that (making GC to happen every
time?); I found pages which mention setting Ruby Constants, but let me
know what is the best page if you have the chance.

Just call GC.start inside your main loop somewhere. Like right before
or after the call to recv.

  • for 1 hour all went fine again; then the cycle repeats

This pattern can only indicate (in my view) the garbage collector, which
Ruby seems to run for 10 seconds several times in the same minute or so.

You can test your hypothesis by running it against jruby with a
“constant time” GC.
If you’re using 1.9 you could also test your hypothesis by using the
GC::Profiler
http://en.wikibooks.org/wiki/Ruby_Programming/Reference/Objects/IO/GC/Profiler

Also the “exact” 5 second delays makes me wonder if it isn’t a DNS
lookup.

Try your code with
BasicSocket.do_not_reverse_lookup=true

-r

Try your code with
BasicSocket.do_not_reverse_lookup=true

Three observations:

a) I do not think that there can be a DNS lookup between a ‘select(x, y,
z)’ and a ‘socket.recvfrom’.

It might, see
http://www.ensta.fr/~diam/ruby/online/ruby-doc-stdlib/libdoc/socket/rdoc/classes/Socket.html#M001258

Why don’t you try it and find out?

I could not try jruby.

Did it not install?

Also did you try 1.9.1 with it’s GC::Profiler?

-r

Roger P. wrote:

You can test your hypothesis by running it against jruby with a
“constant time” GC.
If you’re using 1.9 you could also test your hypothesis by using the
GC::Profiler
http://en.wikibooks.org/wiki/Ruby_Programming/Reference/Objects/IO/GC/Profiler

Also the “exact” 5 second delays makes me wonder if it isn’t a DNS
lookup.

Try your code with
BasicSocket.do_not_reverse_lookup=true

Three observations:

a) I do not think that there can be a DNS lookup between a ‘select(x, y,
z)’ and a ‘socket.recvfrom’.

I also clarify that the 2 statements are almost sequential; after the
select, you verify that the there is a ‘read condition’ and then you
issue the ‘recvfrom’. The whole section takes less than 10 lines of code
(with printouts, etc), and it should not generate any external query.

b) for the same reason I did not need a profiler (in a real system
moving machinery) to evaluate the time spent between those 2 statements.
Just computing the difference between the 2 times (‘Time.now’ when
select and recvfrom return), and then printing when the difference was >
a threshold was enough.

As I said, this happened regularly (after many thousands of messages
were exchanged with perfect timing), producing several ‘processing gaps’
of either 5 or 10 seconds every hour (approx). This occurred only when
UDP was used over Ethernet between 2 machines (with the apps inside the
same computer I never saw the problem).

c) I could not try jruby. Although Java may not be immune to perhaps
similar problems (see problems found in the Android phone).

My conclusion is that dynamic allocation and deallocation of memory for
tasks that occur every 100 msec or so is to be minimized (if possible;
else, move that task to a C program).

Still, more fundamentally, I think that we need a different garbage
collector architecture for Ruby. All the games disabling/enabling Gc,
tinkering, restructuring code, etc, are not the solution (for me).

Raul P. wrote:

a) I do not think that there can be a DNS lookup between a ‘select(x, y,
z)’ and a ‘socket.recvfrom’.

But in Ruby, recvfrom itself (unfortunately) does a reverse DNS lookup
on the IP address of the peer. You really need to turn this off.

If one DNS packet gets lost, and your resolver doesn’t retransmit for 10
seconds, then there’s your 10 second delay. It’s unlikely that the
garbage collector would (a) take so long, and (b) take such a consistent
amount of time.

Regards,

Brian.

Brian C. wrote:

Raul P. wrote:

a) I do not think that there can be a DNS lookup between a ‘select(x, y,
z)’ and a ‘socket.recvfrom’.

But in Ruby, recvfrom itself (unfortunately) does a reverse DNS lookup
on the IP address of the peer. You really need to turn this off.

You could try recv, which I don’t think does any DNS lookups. Also
setting do_not_reverse_lookup should overcome the related delay…

Roger P. wrote:

Why don’t you try it and find out?
Seconded: it’s worth a try.

Also, if you really believe this is GC-related, you can prove it by
GC.disable before the select, GC.enable after the recvfrom.

It seems highly unlikely that GC would be eating up 10 seconds.

Roger P. wrote:

Try your code with
BasicSocket.do_not_reverse_lookup=true

Three observations:

a) I do not think that there can be a DNS lookup between a ‘select(x, y,
z)’ and a ‘socket.recvfrom’.

It might, see
http://www.ensta.fr/~diam/ruby/online/ruby-doc-stdlib/libdoc/socket/rdoc/classes/Socket.html#M001258

Roger,

It is stunning to think that a ‘recvfrom’ on a socket (done after the
‘select’ signaled that there data ready to be read in the queue for that
socket) might go off doing reverse dns queries (instead of just
returning the packet).
Also, I did not see anything pointing to this in the link you provided
(I read the whole page and the only place where it mentions ‘reverse dns
queries’ is when issuing ‘gethostbyname’).

But it is true that we cannot trust anything nowadays, and it does not
harm to include a line to avoid reverse dns queries. I did (at socket
creation).

We are not running any more the experiment where this problem occurred,
so I cannot say now if this fixed it. But thanks for the tip.

Raul P.
[email protected]

Wow Raul,
You know how to write a thriller at least!

We are not running any more the experiment where this problem occurred,
so I cannot say now if this fixed it. But thanks for the tip.

Does this mean the story is unfinished?

MarkT

It is stunning to think that a ‘recvfrom’ on a socket (done after the
‘select’ signaled that there data ready to be read in the queue for that
socket) might go off doing reverse dns queries (instead of just
returning the packet).

Agreed. Thankfully with 1.9.2 this is no longer the default.

Also, I did not see anything pointing to this in the link you provided
(I read the whole page and the only place where it mentions ‘reverse dns
queries’ is when issuing ‘gethostbyname’).

Agreed.