JRuby 1.5RC3 hang with a finalizer

cremes · May 10, 2010, 5:08pm

I need some help tracking down a problem. I am testing some code (an FFI
wrapped library) that only hangs when I test a path that defines an
object finalizer. If I disable that codepath at runtime, I don’t ever
see a hang.

I’m hoping there is a way I can interrupt the JRuby process and get a
stacktrace when I see the hang. Is this possible? Or do I need to run it
under the java debugger?

Also, I only notice this hang on my linux box. The exact same code does
not hang when run under the OSX JVM.

I suspect there is something flaky with the finalizer when it gets run,
but it’s only a suspicion. This exact same code runs without incident
under MRI 1.8.7-p249 and 1.9.1-p378.

I’d appreciate any pointers on how to track this down.

cr

To unsubscribe from this list, please visit:

http://xircles.codehaus.org/manage_email

cremes · May 10, 2010, 5:13pm

Hi Chuck,

Press Control-\ on *nix or Control-Break on Windows and you’ll be able
to see the stacktraces.

Thanks,
–Vladimir

On Mon, May 10, 2010 at 5:07 PM, Chuck R. [email protected]
wrote:

cremes · May 10, 2010, 5:24pm

Vladimir,

thanks for the tip. I’ve created a gist containing the stacktraces for
the client and server after they hang.

https://gist.github.com/chuckremes/396172

client

^\2010-05-10 10:20:51
Full thread dump Java HotSpot(TM) 64-Bit Server VM (16.3-b01 mixed mode):

"ReferenceReaper" daemon prio=10 tid=0x0000000041df2800 nid=0x2b72 in Object.wait() [0x00007f9b1041d000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00007f9b1ccee248> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
	- locked <0x00007f9b1ccee248> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)

[Dump truncated here; the full text is in the gist linked above.]

gistfile2.txt

^\2010-05-10 10:20:47
Full thread dump Java HotSpot(TM) 64-Bit Server VM (16.3-b01 mixed mode):

"ReferenceReaper" daemon prio=10 tid=0x00000000417f0000 nid=0x2b8f in Object.wait() [0x00007f16501dd000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00007f165eba4ff8> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
	- locked <0x00007f165eba4ff8> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)

[Dump truncated here; the full text is in the gist linked above.]

gistfile3.txt

[cremes@box1 examples]$ jruby -v
jruby 1.5.0.RC3 (ruby 1.8.7 patchlevel 249) (2010-05-04 603f15a) (Java HotSpot(TM) 64-Bit Server VM 1.6.0_20) [amd64-java]
[cremes@box1 examples]$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

I don’t know how to interpret the trace. Does anything look amiss?

Also, is there a way to get the ruby stacktrace by interrupting the
runtime? I’d like to see where in the ruby code this sucker is getting
hung.

cr

On May 10, 2010, at 10:13 AM, Vladimir S. wrote:

cremes · May 10, 2010, 6:25pm

I’ve opened an issue with the jffi github project.

I’ve discovered a few additional details.

Turning off the JIT has no effect.
Using --client or --server has no effect.
The hang doesn’t occur under JRuby 1.4.0.
It is NOT related to the finalizer. I can reproduce the hang without
exercising that code path.

It may take a few thousand or a few tens of thousands of iterations
before the hang shows up. It is likely it is executing the hanging
function many times without error until finally it just chokes.
Obviously this makes reproducing it with a small sample pretty
difficult.

cr

On May 10, 2010, at 10:37 AM, Thomas E Enebo wrote:

cremes · May 10, 2010, 5:37pm

At least it looks like it is hanging in the same method…I am
guessing in jffi. Perhaps Wayne has some ideas on this?

-Tom

On Mon, May 10, 2010 at 10:23 AM, Chuck R. [email protected]
wrote:

–
blog: http://blog.enebo.com twitter: tom_enebo
mail: [email protected]

cremes · May 11, 2010, 8:31am

Not sure what the root cause is, but you do have a bug in
ZMQ::UnmanagedMessage#initialize().

You allocate native memory via FFI::MemoryPointer.new(), which is
managed by the ruby runtime, so when the variable holding the memory
goes out of scope, the garbage collector can free the native memory.
That’s not what you want. Also, LibZMQ::MessageDeallocator won’t
actually be freeing anything, since you don’t get the original Ruby
object back in the callback - you get an opaque pointer which doesn’t
have a #free method on it.

My recommendation is to map in malloc() and free() from libc, allocate
your pointer using LibC.malloc() to allocate the message buffer, and
use LibC.free() in your LibZMQ::MessageDeallocator callback to free
it.

Also UnmanagedMessage#data() doesn’t seem to be correct. It allocates
a new FFI::MemoryPointer … which will be filled with zeroes. You
really just want to return LibZMQ.zmq_msg_data(@struct), since that
already returns a pointer.

On 11 May 2010 01:07, Chuck R. [email protected] wrote:

cremes · May 11, 2010, 3:39pm

On May 11, 2010, at 1:30 AM, Wayne M. wrote:

My recommendation is to map in malloc() and free() from libc, allocate
your pointer using LibC.malloc() to allocate the message buffer, and
use LibC.free() in your LibZMQ::MessageDeallocator callback to free
it.

Also UnmanagedMessage#data() doesn’t seem to be correct. That is
allocating a new FFI::MemoryPointer … which will be filled with
zeroes. You really just want to just return
LibZMQ.zmq_msg_data(@struct)), since that already returns a pointer.

Wow, I pretty much got everything wrong, didn’t I?

I really appreciate you looking at the code. I’ve been flying by the
seat of my pants with this FFI stuff. I’ll add some updates to the wiki
(particularly on memory allocation) to hopefully clarify things for
future programmers.

cr

cremes · May 13, 2010, 4:34pm

Added a new page on the wiki to help retain the information gleaned from
this thread. Please check it for errors when you get a chance.

http://wiki.github.com/ffi/ffi/callbacks

cr

On May 12, 2010, at 1:24 AM, Wayne M. wrote:

cremes · May 14, 2010, 12:53am

Looks good, apart from the bit about the blocking attribute and
callbacks. It only has effect when creating a Function object for an
existing native function (which is what attach_function uses
internally). For callbacks, when they start running ruby code, the
GIL is always acquired (naturally, otherwise Bad Things Happen [tm]).

On 14 May 2010 00:33, Chuck R. [email protected] wrote:

cremes · May 14, 2010, 5:32am

Was I wrong to say that the GIL can be released for pure native function
calls? What you wrote below doesn’t seem to contradict that notion.

I’ll reword the :blocking piece, but I need to understand it better.

So, it releases the GIL under what conditions? (It appears it will
release the GIL when told to.) It should release the GIL under what
conditions?

cr

On May 13, 2010, at 5:52 PM, Wayne M. wrote:

cremes · May 14, 2010, 6:31am

Nope, that’s correct.

Basically :blocking => true in the Function options means the call
path looks like this:

ruby → ffi stub for parameter conversion → release GIL → call
actual native function → re-acquire GIL → ffi stub for result
conversion → return to ruby

Whereas without :blocking => true it is:

ruby → ffi stub for parameter conversion → actual native function →
ffi stub for result conversion → return to ruby

Callbacks on the other hand always do this:

native code → ffi callback stub →
  if thread.has_gil?
    convert parameters to ruby
    call ruby
    convert result to native

  elsif thread.is_ruby_thread?
    acquire GIL
    convert parameters to ruby
    call ruby
    convert result to native
    release GIL

  else # not a ruby thread - can’t call into ruby directly
    bundle up ffi data
    pass to ruby callback processing thread
    wait for signal from callback processing thread
  end
→ native code

i.e. :blocking => true only affects ruby-to-native calls, and has no
effect on callbacks (i.e. native-to-ruby), since callbacks always
check for and acquire the GIL if needed.

Once you’re in a ruby callback (i.e. running ruby code), then you have
the GIL, and it will only be released again once you return from the
callback, or you call a native function that is marked as :blocking =>
true and it releases the GIL.

… and that is about the point your brain implodes. The JVM handles
all this nonsense in a much saner way.

On 14 May 2010 13:30, Chuck R. [email protected] wrote:

cremes · May 12, 2010, 8:25am

A performance tweak you might want to make is to turn
LibZMQ::MessageDeallocator into a FFI::Function instance (and if
needed, change the parameter type to :pointer - I can’t remember
whether callback params will accept a FFI::Function or not. Try it
and see).

The auto-magic Proc → function conversion is handy, but introduces a
tiny bit of overhead to each call that takes a callback, since it has
to dig into the proc object and see if there is already a
FFI::Function allocated for it.

See http://ffi.github.com/api/FFI/Function.html for info on how to use
Function directly (if it’s not in the wiki)

On 11 May 2010 23:38, Chuck R. [email protected] wrote:
