Thread/sync.rb memory corruption

unknown · July 6, 2006, 3:52pm

i’m using ef (electric fence) to show the memory corruption in this
script.
to cause it corrupt memory use

CORRUPT=true ef ruby bug.rb

file: bug.rb

 require 'sync'

 class A
   def initialize
     extend Sync_m
     @observers = []
   end
   def meth
     synchronize(:EX){
       @observers.each do |o|
         if ENV['CORRUPT']
           o.notify nil
         else
           o.notify
         end
       end
     }
   end
   def add_observer o
     synchronize(:EX){
       @observers << o
     }
   end
 end

 class B
   def initialize a
     @a = a
     @a.add_observer self
   end
   def notify *a
     Thread.new{ @a.meth }
   end
 end

 a = A.new
 b = B.new a
 a.meth
 STDIN.gets

note that it is the simple passing, or not, of arguments to the notify
method
which triggers the bug. perhaps this will yield some hints.

regards.

-a

unknown · July 6, 2006, 3:58pm

Ara, I still believe the error is in the script (see my other posting).

Kind regards

robert

2006/7/6, [email protected] [email protected]:

unknown · July 6, 2006, 4:02pm

On Thu, 6 Jul 2006, Robert K. wrote:

Ara, I still believe the error is in the script (see my other posting).

Kind regards

robert

hi robert-

i never saw any post!? maybe mailing issues again? sigh…

can you re-state?

regards.

-a

unknown · July 6, 2006, 4:45pm

Could someone please confirm this can be reproduced on 1.8.5 pre1?

unknown · July 6, 2006, 5:33pm

On Thu, 6 Jul 2006, URABE Shyouhei wrote:

Could someone please confirm this can be reproduced on 1.8.5 pre1?

brains hgs 123 %> ./!$
./syncbug.rb
ruby 1.8.5 (2006-07-04) [sparc-solaris2.9]

brains hgs 124 %>

That blank line was just me hitting return, after about a second.
It exited immediately. I don’t have electric fence to hand.

this was with the build of ruby from my other posts.

[email protected] wrote:

i’m using ef (electric fence) to show the memory corruption in this
script.
to cause it corrupt memory use

CORRUPT=true ef ruby bug.rb

    Hugh

unknown · July 6, 2006, 5:49pm

On Fri, 7 Jul 2006, Hugh S. wrote:

That blank line was just me hitting return, after about a second.
It exited immediately. I don’t have electric fence to hand.

if you don’t have electric fence you’ll need to run for quite a long
time to
see the code begin to leak - pull up top while you’re running if you can
to
confirm

i’m trying to test on my box now but stable-snapshot won’t build for me.
i’m
getting

gcc -I. -I…/… -I…/…/. -I…/…/./ext/dl -DHAVE_DLFCN_H
-DHAVE_DLOPEN -DHAVE_DLCLOSE -DHAVE_DLSYM -DHAVE_DLERROR -I. -fPIC -g
-O2 -fno-defer-pop -fno-omit-frame-pointer -c cfunc.c
cfunc.c:46: warning: struct cfunc_data' declared inside parameter list cfunc.c:46: warning: its scope is only this definition or declaration, which is probably not what you want cfunc.c: In functiondlcfunc_free’:
cfunc.c:48: dereferencing pointer to incomplete type
cfunc.c:49: dereferencing pointer to incomplete type

while compiling ext/dl

anyone?

cheers.

-a

unknown · July 6, 2006, 6:00pm

On Thu, 6 Jul 2006, Robert K. wrote:

Yes, but they are still queued up against the mutex. I completely agree
with Tom’s analysis. Having an Observer trigger the notification again is
definitively a bad idea - at least if events are not queued up.

ah - but they are indeed queued - this is precisely what sync.rb is
supposed
to do. you can prove this for yourself by adding

p ‘Thread.list.size’ => Thread.list.size

in the notify method. if you do so you’ll see something like

on @ 1152201072.40667
{“Thread.list.size”=>1}
off @ 1152201072.41042
{“Thread.list.size”=>2}
on @ 1152201072.41508
{“Thread.list.size”=>3}
off @ 1152201072.4219
{“Thread.list.size”=>3}
on @ 1152201072.43386
{“Thread.list.size”=>2}
off @ 1152201072.44186
{“Thread.list.size”=>3}
on @ 1152201072.46079
{“Thread.list.size”=>2}

the number of threads running will never exceed three. the bug does not
seem
to be directly caused by sync.rb in any case. here is a more minimal
script
which will produce the error:

 require 'sync'
 class A
   def initialize
     extend Sync_m
     @observers = []
   end
   def meth
     synchronize(:EX){
       @observers.each do |o|
         if ENV['CORRUPT']
           o.notify nil
         else
           o.notify
         end
       end
     }
   end
   def add_observer o
     synchronize(:EX){
       @observers << o
     }
   end
 end

 class B
   def initialize a
     @a = a
     @a.add_observer self
   end
   def notify *a
     Thread.new{ @a.meth }
   end
 end

 a = A.new
 b = B.new a
 a.meth
 STDIN.gets

if you run it normally

ruby a.rb

it will run fine. if you run with

CORRUPT=true ruby a.rb

you will trash memory. if you have ef and do

CORRUPT=true ef ruby a.rb

electric fence will core dump almost immediately.

note that the only difference between crashing and not is calling notify
with
and argument - when that occurs the number of threads will remain
steady, but
the GC will stop collecting finished threads even though no reference to
them
is held.

regards.

-a

unknown · July 6, 2006, 5:50pm

2006/7/6, [email protected] [email protected]:

i never saw any post!? maybe mailing issues again? sigh…

can you re-state?

No prob. I’ll put you on CC - just in case.

robert

2006/7/6, [email protected] [email protected]:

So now you switch it on.
Toggle gets notified, and creates a thread that switches it off.
But when that Thread switches it off, toggle gets notified, and starts a
thread that switches it on again…
Basically it’s an infinite loop, and i suspect it’s leaking because GC
doesn’t kick in.

it would be, but the synchronize prevents two threads from entering any of the
Switch methods at a time so only thread can actually be running at once.

Yes, but they are still queued up against the mutex. I completely
agree with Tom’s analysis. Having an Observer trigger the
notification again is definitively a bad idea - at least if events are
not queued up.

Kind regards

robert

unknown · July 6, 2006, 6:23pm

On Fri, 7 Jul 2006 [email protected] wrote:

i’m trying to test on my box now but stable-snapshot won’t build for me. i’m
getting

i managed to build 1.8.5 without ext/dl or ext/ripper. re-running my
sample
program seems to work (does not core dump under electric fence) but the
memory
seems to grow without bound under top. i ran under valgrind and got

harp:~ > CORRUPT=1 valgrind --leak-check=yes
–workaround-gcc296-bugs=yes ruby-1.8.5/bin/ruby bug.rb
==1064== Memcheck, a memory error detector.
==1064== Copyright © 2002-2006, and GNU GPL’d, by Julian Seward et
al.
==1064== Using LibVEX rev 1606, a library for dynamic binary
translation.
==1064== Copyright © 2004-2006, and GNU GPL’d, by OpenWorks LLP.
==1064== Using valgrind-3.2.0, a dynamic binary instrumentation
framework.
==1064== Copyright © 2000-2006, and GNU GPL’d, by Julian Seward et
al.
==1064== For more details, rerun with: -v
==1064==

==1064==
==1064== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 17 from
1)
==1064== malloc/free: in use at exit: 398,835 bytes in 8,491 blocks.
==1064== malloc/free: 20,526 allocs, 12,035 frees, 155,280,147 bytes
allocated.
==1064== For counts of detected errors, rerun with: -v
==1064== searching for pointers to 8,491 not-freed blocks.
==1064== checked 604,052 bytes.
==1064==
==1064== 3,336 bytes in 162 blocks are possibly lost in loss record 3
of 10
==1064== at 0x401A6DE: malloc (vg_replace_malloc.c:149)
==1064== by 0x806A631: ruby_xmalloc (gc.c:121)
==1064== by 0x805F709: scope_dup (eval.c:8061)
==1064== by 0x805A1F7: rb_yield_0 (eval.c:5002)
==1064== by 0x805A5B5: rb_yield (eval.c:5035)
==1064== by 0x80B3E36: rb_ary_each (array.c:1130)
==1064== by 0x80662FA: call_cfunc (eval.c:5617)
==1064== by 0x805B886: rb_call0 (eval.c:5776)
==1064== by 0x805C1A8: rb_call (eval.c:6006)
==1064== by 0x8056B81: rb_eval (ruby.h:644)
==1064== by 0x805606C: rb_eval (eval.c:2905)
==1064== by 0x805BBDB: rb_call0 (eval.c:5912)
==1064==
==1064== LEAK SUMMARY:
==1064== definitely lost: 0 bytes in 0 blocks.
==1064== possibly lost: 3,336 bytes in 162 blocks.
==1064== still reachable: 395,499 bytes in 8,329 blocks.
==1064== suppressed: 0 bytes in 0 blocks.
==1064== Reachable blocks (those to which a pointer was found) are
not shown.
==1064== To see them, rerun with: --show-reachable=yes

if i run without the --workaround-gcc296-bugs=yes option (note that i am
not
using 296 but gcc 3.2.3 on Red Hat Enterprise Linux WS release 3 (Taroon
Update
7)) i get tons of these:

==1171== Invalid read of size 4
==1171== at 0x8062726: rb_thread_save_context (eval.c:10158)
==1171== by 0x806310E: rb_thread_schedule (eval.c:10718)
==1171== by 0x8064D3F: rb_thread_wait_other_threads (eval.c:12064)
==1171== by 0x8053F01: ruby_cleanup (eval.c:1567)
==1171== by 0x8053FD9: ruby_stop (eval.c:1606)
==1171== by 0x805403D: ruby_run (eval.c:1627)
==1171== by 0x8052450: main (main.c:46)
==1171== Address 0xBEFF8B38 is just below the stack ptr. To
suppress, use: --workaround-gcc296-bugs=yes

i don’t know if either error is ‘correct’ or not. thoughts?

-a

unknown · July 6, 2006, 6:29pm

On Thu, 6 Jul 2006, Robert K. wrote:

With queued I meant something different in this case, i.e. have a single
event queue in the obervable that is processed by a single thread only.

Well, yes. Still I think that the design is flawed because there is a
recursion via threads. If you have two listeners the # of threads will
constantly grow. Try it out! I tried it with these changes (attached).
Probably your version crashes with a single listener because threads are not
as fast GC’ed as on my machine or whatever.

agreed. this is actually the model i am using in the real code. note
that
this code is simply a minimal example which triggers the bug.
regardless of
whether it’s a bad design or not (which it is) memory should not be
corrupted
by a simply ruby script - if so i think it’s a bug.

note that it’s not that it simple gets slow that i’m showing - if you
run
under electric fence is shows actual memory corruption - same with
valgrind.

Try my changes. More and more threads queue up at the mutex.

oh i know - that’s how the real code works - with a Queue and a single
thread
consumer.

Btw, it does not crash on cygwin - neither way.

huh. suprising. do you have electric fence? you cannot make it crash
without it unless you wait a very long time.

Weird indeed.

yes - this is the issue that concerns me.

cheers.

-a

unknown · July 6, 2006, 6:19pm

2006/7/6, [email protected] [email protected]:

On Thu, 6 Jul 2006, Robert K. wrote:

Yes, but they are still queued up against the mutex. I completely agree
with Tom’s analysis. Having an Observer trigger the notification again is
definitively a bad idea - at least if events are not queued up.

ah - but they are indeed queued - this is precisely what sync.rb is supposed
to do. you can prove this for yourself by adding

With queued I meant something different in this case, i.e. have a
single event queue in the obervable that is processed by a single
thread only.

off @ 1152201072.4219
{“Thread.list.size”=>3}
on @ 1152201072.43386
{“Thread.list.size”=>2}
off @ 1152201072.44186
{“Thread.list.size”=>3}
on @ 1152201072.46079
{“Thread.list.size”=>2}

Well, yes. Still I think that the design is flawed because there is a
recursion via threads. If you have two listeners the # of threads
will constantly grow. Try it out! I tried it with these changes
(attached). Probably your version crashes with a single listener
because threads are not as fast GC’ed as on my machine or whatever.

the number of threads running will never exceed three. the bug does not seem
to be directly caused by sync.rb in any case. here is a more minimal script
which will produce the error:

Try my changes. More and more threads queue up at the mutex.

it will run fine. if you run with

CORRUPT=true ruby a.rb

you will trash memory. if you have ef and do

CORRUPT=true ef ruby a.rb

electric fence will core dump almost immediately.

Btw, it does not crash on cygwin - neither way.

note that the only difference between crashing and not is calling notify with
and argument - when that occurs the number of threads will remain steady, but
the GC will stop collecting finished threads even though no reference to them
is held.

Weird indeed.

Kind regards

robert

unknown · July 6, 2006, 6:53pm

On Fri, 7 Jul 2006, [email protected] wrote:

brains hgs 124 %>

That blank line was just me hitting return, after about a second.
It exited immediately. I don’t have electric fence to hand.

if you don’t have electric fence you’ll need to run for quite a long time to
see the code begin to leak - pull up top while you’re running if you can to
confirm

Oh! I see. I was thinking that the gets was to keep the console open
on platforms that needed it, and that it would not exit until it reached
that.
But it’s threaded…

brains hgs 141 %> ./syncbug.rb
ruby 1.8.2 (2004-12-25) [sparc-solaris2.9]
^C./syncbug.rb:125:in gets': Interrupt from ./syncbug.rb:125 brains hgs 142 %> vim !$ vim ./syncbug.rb brains hgs 143 %> ./syncbug.rb ruby 1.8.5 (2006-07-04) [sparc-solaris2.9] ^C./syncbug.rb:125:ingets’: Interrupt
from ./syncbug.rb:125
brains hgs 144 %>

I see increases in memory of the order of Megabytes within a couple of
minutes, for both cases… What else can I look at?

brains hgs 146 %> truss -fci ./syncbug.rb
ruby 1.8.5 (2006-07-04) [sparc-solaris2.9]
^C./syncbug.rb:125:in `gets’: Interrupt
from ./syncbug.rb:125
signals ------------
SIGVTALRM 5961
total: 5961

syscall seconds calls errors
read .002 16
write .000 1
open .000 7
close .000 14
brk .021 294
stat .003 42 30
getuid .000 3
fstat .000 7
getgid .000 3
ioctl .000 1
execve .000 1
poll .247 4060
sigprocmask .000 4
sigaction .000 13
sigfillset .000 1
getcontext .000 1
setcontext .513 5961
evsys .000 1
evtrapret .000 1
mmap .001 18
munmap .001 5
getrlimit .000 1
memcntl .000 4
setitimer .000 1
llseek .000 7
resolvepath .000 8
stat64 .000 11 9
fstat64 .000 3
getrlimit64 .000 1
open64 .000 7
-------- ------ ----
sys totals: .797 10497 39
usr time: 58.173
elapsed: 61.510
brains hgs 147 %>

and I got 39 errors for just truss -c.

The Process class can’t query its size. I can’t see how to extract that
from /proc either.

cfunc.c: In function `dlcfunc_free’:
cfunc.c:48: dereferencing pointer to incomplete type
cfunc.c:49: dereferencing pointer to incomplete type

while compiling ext/dl

anyone?

I don’t know about that.

cheers.

-a

    Hugh

unknown · July 6, 2006, 10:04pm

2006/7/6, Ara.T.Howard [email protected]:

agreed. this is actually the model i am using in the real code. note that
this code is simply a minimal example which triggers the bug. regardless of
whether it’s a bad design or not (which it is) memory should not be corrupted
by a simply ruby script - if so i think it’s a bug.

Hm, I’m not sure I agree. Try this one liner that will eventually
blow because too many threads are created that all wait for the mutex:

robert@fussel ~
$ ruby -r thread -e ‘m=Mutex.new; m.synchronize { loop { Thread.new {
m.synchronize { } } } }’
[FATAL] failed to allocate memory

robert@fussel ~
$

This is not a Ruby bug but a but in the code.

note that it’s not that it simple gets slow that i’m showing - if you run
under electric fence is shows actual memory corruption - same with valgrind.

Try my changes. More and more threads queue up at the mutex.

oh i know - that’s how the real code works - with a Queue and a single thread
consumer.

Good. You could also prevent notifications while notifying. Or store
notifications in a queue and loop as long in the notify method as
there are notifications in the queue.

Btw, it does not crash on cygwin - neither way.

huh. suprising. do you have electric fence? you cannot make it crash
without it unless you wait a very long time.

Please recheck that it’s not the same situation I presented above. I
still have the feeling that this might be the real reason.

Kind regards

robert

unknown · July 7, 2006, 9:36am

On Fri, Jul 07, 2006 at 05:01:25AM +0900, Robert K. wrote:

blow because too many threads are created that all wait for the mutex:

robert@fussel ~
$ ruby -r thread -e ‘m=Mutex.new; m.synchronize { loop { Thread.new {
m.synchronize { } } } }’
[FATAL] failed to allocate memory

This is an out of memory failure, with no unreleased GC’able memory in
sight. Of course this fails, and there is no alternative to this.

Memory corruption is a very different, and should not be produceable
by ruby scripts.

-JÃ¼rgen

unknown · July 7, 2006, 3:38pm

On Fri, 7 Jul 2006, Juergen S. wrote:

Memory corruption is a very different, and should not be produceable
by ruby scripts.

yes - this is exactly my point - no whether the approach is sane or not.
it
definitely should not be able to corrupt memory with a vanilla ruby
script
but, with the one i posted, it seems to be.

regards.

-a

unknown · July 10, 2006, 2:59pm

How are these results? They seem to point to memory leaks in sync.rb

    Hugh