Building Ruby with dmalloc

Has anyone managed to build Ruby with dmalloc support? I’m having
numerous problems trying to do so on MacOS X 10.4.8 and FreeBSD 6.1.
I’m trying to hunt down a memory leak in Ruby, probably in the Mutex
code. Also, fastthread 0.6.1 is crashing for me, possibly due to a
memory corruption, and I’d like to figure out what’s going on there.

–Young

Has anyone managed to build Ruby with dmalloc support? I’m having numerous
problems trying to do so on MacOS X 10.4.8 and FreeBSD 6.1. I’m trying to
hunt down a memory leak in Ruby, probably in the Mutex code. Also,

The leak in Mutex is because of the shift() method on Array. There has
been discussion of this on the list fairly recently that you should be
able to find in the archives if you are interested in looking up the
details.

Kirk H.

On Thu, 2007-01-11 at 03:54 +0900, Young H. wrote:

Has anyone managed to build Ruby with dmalloc support? I’m having
numerous problems trying to do so on MacOS X 10.4.8 and FreeBSD 6.1.
I’m trying to hunt down a memory leak in Ruby, probably in the Mutex
code. Also, fastthread 0.6.1 is crashing for me, possibly due to a
memory corruption, and I’d like to figure out what’s going on there.

It’s not quite as shiny as dmalloc, but have you tried electric fence
(libefence)? On Linux, at least, you can use it via LD_PRELOAD, without
recompiling Ruby.

Also, do you see the memory corruption under both MacOS X and FreeBSD?

-mental

On Fri, 2007-01-12 at 05:37 +0900, Young H. wrote:

On Jan 10, 2007, at 1:39 PM, MenTaLguY wrote:

I’ve given up trying to build Ruby with dmalloc support now that I’ve
learned that MacOS X has built-in support for dmalloc-like memory
debugging.

Have you gotten any useful reports from the memory debugging facility?

I don’t get these crashes if I run my program without fastthread
(but my program can’t run for long periods of time without fastthread
because of a serious memory leak, so I can’t say with 100% certainty
that these crashes don’t happen without fastthread).

Try replacing your stdlib’s thread.rb with the attached, modified
version which may mitigate the memory leak somewhat (though it won’t
offer fastthread’s performance). I’d like to be sure that the crash
doesn’t happen without fastthread.

Assuming it’s a fastthread issue, I’m a little suspicious of Queues.
Where and how does your code use them? If we can narrow it down to a
specific class, then it is easier to derive a simple test case.

-mental

On Jan 10, 2007, at 1:39 PM, MenTaLguY wrote:

It’s not quite as shiny as dmalloc, but have you tried electric fence
(libefence)? On Linux, at least, you can use it via LD_PRELOAD, without
recompiling Ruby.

I haven’t tried electric fence, but thanks for mentioning it.

I’ve given up trying to build Ruby with dmalloc support now that I’ve
learned that MacOS X has built-in support for dmalloc-like memory
debugging. These debugging features are available automatically in
the standard malloc routines, and I don’t have to do anything special
in the build process of Ruby. For the curious, more information is
available in Apple’s Tech Note TN2124:

Technical Note TN2124: Mac OS X Debugging Magic
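
For readers who want to try the same approach, here is a minimal sketch of launching a script with the malloc debugging switches turned on. The environment variable names are the ones described in Mac OS X's malloc man page and in TN2124; the post does not say which settings were actually used, so the particular selection, the values, and the helper names (malloc_debug_env, run_with_malloc_debug) are assumptions made for illustration.

```ruby
require 'rbconfig'

# Hypothetical helper: builds the environment for a malloc-debugging run.
# The variable names come from Mac OS X's malloc(3) man page; which ones
# to enable, and the values chosen, are assumptions made for this sketch.
def malloc_debug_env(guard_malloc = false)
  env = {
    "MallocScribble"       => "1",    # overwrite freed memory with 0x55
    "MallocPreScribble"    => "1",    # fill fresh allocations with 0xAA
    "MallocGuardEdges"     => "1",    # guard pages around large blocks
    "MallocCheckHeapStart" => "1",    # begin heap checks after 1 operation
    "MallocCheckHeapEach"  => "1000", # then re-check every 1000 operations
    "MallocStackLogging"   => "1"     # record allocation stacks for leaks(1)
  }
  # libgmalloc is much slower but traps overruns as they happen.
  env["DYLD_INSERT_LIBRARIES"] = "/usr/lib/libgmalloc.dylib" if guard_malloc
  env
end

# Hypothetical helper: runs a script under the current interpreter with
# the debugging environment applied (only meaningful on Mac OS X).
def run_with_malloc_debug(script, guard_malloc = false)
  system(malloc_debug_env(guard_malloc), RbConfig.ruby, script)
end

malloc_debug_env.each { |name, value| puts "#{name}=#{value}" }
```

Calling run_with_malloc_debug("g.rb") would start the server script from this thread with the switches set; on other platforms the variables are simply ignored.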

Also, do you see the memory corruption under both MacOS X and FreeBSD?

Yes, there seems to be memory corruption under both. Here’s what I
get under FreeBSD 6.1-STABLE-200607. I have two processes
communicating over SSL that abort in two different ways. The first
dies with ‘rb_gc_mark(): unknown data type’, which causes the second
to crash at exit (after I hit ^C) with ‘[BUG] Segmentation
fault’ (I’ve included the actual messages below). I don’t get these
crashes if I run my program without fastthread (but my program can’t
run for long periods of time without fastthread because of a serious
memory leak, so I can’t say with 100% certainty that these crashes
don’t happen without fastthread).

If I immediately interrupt (^C) the server process before the client
has connected to it, then I don’t get ‘[BUG] Segmentation fault’.
Threads are already running at this point (and require ‘fastthread’
has executed), but perhaps mutex operations haven’t been done yet.
Now, once a client connects, and some communication takes place
(definitely causing mutexes/fastthread to be used), I get a segfault
if I interrupt the server process. Here’s the transcript:

$ ~/ruby-1.8.5-p12/bin/ruby g.rb
Waiting for clients on port 8742…
^C./globalserver.rb:62:in `join': Interrupt
from ./globalserver.rb:62:in `join'
from g.rb:10
$ ~/ruby-1.8.5-p12/bin/ruby g.rb
Waiting for clients on port 8742…
accepted connection from 192.172.226.88
GlobalSpaceDemux: got hello from $Id: globalmux.rb,v 1.43 2006/12/13 20:51:49 youngh Exp $, protocol 1
Waiting for clients on port 8742…
^C./globalserver.rb:62:in `join': Interrupt
from ./globalserver.rb:62:in `join'
from g.rb:10
./globalserver.rb:62: [BUG] Segmentation fault
ruby 1.8.5 (2006-12-25) [i386-freebsd6.1]

Abort trap: 6 (core dumped)


building ruby:

export CFLAGS=-g # prevent building with -O2
./configure --prefix=/home/youngh/ruby-1.8.5-p12 --enable-pthread

building fastthread-0.6.1 with ‘~/ruby-1.8.5-p12/bin/ruby setup.rb’:

gcc -I. -I/home/youngh/ruby-1.8.5-p12/lib/ruby/1.8/i386-freebsd6.1 -I/home/youngh/ruby-1.8.5-p12/lib/ruby/1.8/i386-freebsd6.1 -I/home/youngh/ruby/fastthread-0.6.1/ext/fastthread -fPIC -g -c fastthread.c
gcc -shared -Wl,-soname,fastthread.so -L'/home/youngh/ruby-1.8.5-p12/lib' -Wl,-R'/home/youngh/ruby-1.8.5-p12/lib' -o fastthread.so fastthread.o -lpthread -lcrypt -lm -lc

Incidentally, I get the same or similar problems if I build ruby
without --enable-pthread.

For both crashes, the stack is corrupted: notice how rb_bug() is at
frame #97. The stack doesn’t get corrupted this badly, or at all, with
MacOS X (running on PowerPC).

$ gdb -c ruby.core
(gdb) file /home/youngh/ruby-1.8.5-p12/bin/ruby
Reading symbols from /home/youngh/ruby-1.8.5-p12/bin/ruby…done.
(gdb) bt
#0 0x2814a537 in ?? ()
#1 0x28137f71 in ?? ()
#2 0x00000000 in ?? ()
#3 0x00000004 in ?? ()
#4 0x00000006 in ?? ()
#5 0x00000005 in ?? ()
#6 0x28127c00 in ?? ()
#7 0x28127500 in ?? ()
#8 0x28127600 in ?? ()
#9 0x28127700 in ?? ()
#10 0x28127800 in ?? ()
#11 0x28127900 in ?? ()
#12 0x2810256a in ?? ()
#13 0x28127b00 in ?? ()
#14 0x28127c00 in ?? ()
#15 0x00000020 in ?? ()
#16 0x00000000 in ?? ()
#17 0x00000000 in ?? ()
#18 0x00000000 in ?? ()
#19 0x00000000 in ?? ()
#20 0x00000000 in ?? ()
#21 0x00000000 in ?? ()
#22 0x0000000d in ?? ()
#23 0x0000000d in ?? ()
#24 0x28142819 in ?? ()
#25 0x2814d4b4 in ?? ()
#26 0x083b6400 in ?? ()
#27 0xbfbfd9d4 in ?? ()

#90 0x00000258 in ?? ()
#91 0x083aefd0 in ?? ()
#92 0x00000001 in ?? ()
#93 0x2814d4b4 in ?? ()
#94 0xbfbfe230 in ?? ()
#95 0x00000002 in ?? ()
#96 0xbfbfded8 in ?? ()
#97 0x080de0be in rb_bug (fmt=0x8116000 “@?%(\025?\233???\020\b”)
at error.c:214
Previous frame inner to this frame (corrupt stack?)
(gdb)


crash in the client process:

the location given varies per run–this isn’t a YAML bug

/home/youngh/ruby-1.8.5-p12/lib/ruby/1.8/yaml/rubytypes.rb:360: [BUG]
rb_gc_mark(): unknown data type 0x20(0x83dffb0) non object
ruby 1.8.5 (2006-12-25) [i386-freebsd6.1]

Abort trap: 6 (core dumped)

–Young

On Jan 12, 2007, at 3:32 PM, MenTaLguY wrote:

On Fri, 2007-01-12 at 05:37 +0900, Young H. wrote:

On Jan 10, 2007, at 1:39 PM, MenTaLguY wrote:

I’ve given up trying to build Ruby with dmalloc support now that I’ve
learned that MacOS X has built-in support for dmalloc-like memory
debugging.

Have you gotten any useful reports from the memory debugging facility?

Yes, I have. It showed that huge amounts of memory (500MB in a matter
of minutes) were being used by the realloc() call in
rb_thread_save_context. The call stack is something like

  rb_ary_collect (or rb_ary_each in half the cases)
  rb_yield
  ...
  rb_callcc
  rb_thread_save_context
  realloc

(Incidentally, the call sequence rb_thread_schedule →
rb_thread_save_context wasn’t eating up memory.)

I got the same behavior with and without fastthread.

I finally tracked down the memory leak, and it’s in
SyncEnumerator#each rather than in any thread synchronization class.
If I refrain from using SyncEnumerator, then my program’s memory
usage holds steady at around 33MB. Sorry for the wild goose chase,
but Mutex & company definitely have a bad reputation, and they seemed
the most likely candidates. I still need to investigate why exactly
I’m getting such poor behavior from SyncEnumerator#each. I did
notice that at least one other person has had this problem and
reported it to ruby-talk:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/105936
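
Since the call stack above goes through rb_callcc, one way to sidestep the problem is to iterate in lock step without continuations at all. The following is only a sketch of that idea, not what the poster's program does: each_in_lockstep is a hypothetical helper, it assumes every input is finite and can be converted with to_a, and padding shorter inputs with nil is a choice made here.

```ruby
# Hypothetical replacement for lock-step iteration over several lists.
# It indexes into arrays directly, so no continuation (callcc) is involved.
# Assumption: every input is finite and responds to to_a.
def each_in_lockstep(*lists)
  arrays = lists.map { |list| list.to_a }
  longest = arrays.map { |array| array.size }.max || 0
  longest.times do |i|
    # Inputs that have run out contribute nil for this row.
    yield arrays.map { |array| array[i] }
  end
end

rows = []
each_in_lockstep([1, 2, 3], %w[a b]) { |row| rows << row }
p rows  # => [[1, "a"], [2, "b"], [3, nil]]
```

The trade-off is that every input is materialized as an array up front, which is fine for small collections but not for lazy or unbounded sources.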

Assuming it’s a fastthread issue, I’m a little suspicious of Queues.
Where and how does your code use them? If we can narrow it down to a
specific class, then it is easier to derive a simple test case.

Even though my memory leak problem seems to be resolved, I still want
to help you to diagnose the crash with fastthread. My app makes
heavy use of threading, so having faster thread synchronization would
be useful for me.

I can now say that the crash only appears to happen when I use
fastthread. I mentioned that there are two types of crashes, a
segfault at exit and an “rb_gc_mark(): unknown data type” error. I
can give you some more information about the former kind of crash. I
can easily reproduce it (and I’ll try to create a small program to
reproduce it later), and I’ve run my app with memory corruption
detection turned on in MacOS X’s malloc. As far as malloc is
concerned, there are NO heap corruptions, overruns, or underruns. I
even tried with MacOS X’s very aggressive libgmalloc (which puts
unwritable virtual memory pages before or after an allocated block),
and also found no heap overruns or underruns.

The segfault at exit happens when mutex objects are finalized by the
GC. Here’s what I get in GDB when I start up my app, wait for it to
do a small amount of work (just enough to exercise fastthread a bit),
and then halt it with ^C, forcing the finalizers to run (note that
the app crashes on normal exit() as well, not just when forced to
quit with SIGINT):

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000005
0x001bb2f8 in free_entries (first=0x1) at fastthread.c:74
74 next = first->next;
(gdb) bt
#0 0x001bb2f8 in free_entries (first=0x1) at fastthread.c:74
#1 0x001bb368 in finalize_list (list=0x603a74) at fastthread.c:85
#2 0x001bb870 in finalize_mutex (mutex=0x603a70) at fastthread.c:227
#3 0x001bc550 in finalize_queue (queue=0x603a70) at fastthread.c:562
#4 0x001bc5b4 in free_queue (queue=0x603a70) at fastthread.c:572
#5 0x0002c8bc in rb_gc_call_finalizer_at_exit () at gc.c:1884
#6 0x00005e5c in ruby_finalize_1 () at eval.c:1549
#7 0x00006048 in ruby_cleanup (ex=1) at eval.c:1584
#8 0x00006274 in ruby_stop (ex=6) at eval.c:1615
#9 0x00006348 in ruby_run () at eval.c:1636
#10 0x00002bdc in main (argc=2, argv=0xbffff874, envp=0xbffff880) at main.c:46
(gdb) info locals
next = (Entry *) 0x0
(gdb) up
#1 0x001bb368 in finalize_list (list=0x603a74) at fastthread.c:85
85 free_entries(list->entry_pool);
(gdb) p *list
$1 = {
entries = 0x6040b0,
last_entry = 0x0,
entry_pool = 0x1,
size = 0
}
(gdb) up
#2 0x001bb870 in finalize_mutex (mutex=0x603a70) at fastthread.c:227
227 finalize_list(&mutex->waiting);
(gdb) p *mutex
$2 = {
owner = 6308016,
waiting = {
entries = 0x6040b0,
last_entry = 0x0,
entry_pool = 0x1,
size = 0
}
}
(gdb) p/x mutex->owner
$3 = 0x6040b0
(gdb) up
#3 0x001bc550 in finalize_queue (queue=0x603a70) at fastthread.c:562
562 finalize_mutex(&queue->mutex);
(gdb) p *queue
$4 = {
mutex = {
owner = 6308016,
waiting = {
entries = 0x6040b0,
last_entry = 0x0,
entry_pool = 0x1,
size = 0
}
},
value_available = {
waiting = {
entries = 0x0,
last_entry = 0x0,
entry_pool = 0x0,
size = 0
}
},
space_available = {
waiting = {
entries = 0x0,
last_entry = 0x0,
entry_pool = 0x0,
size = 0
}
},
values = {
entries = 0x0,
last_entry = 0x0,
entry_pool = 0x0,
size = 0
},
capacity = 0
}
(gdb)

The invalid values in fastthread’s mutex object are similar to what
we’ve seen in the 2nd type of crash (“rb_gc_mark(): unknown data
type”). I’ll try to create a test program to reproduce this crash at
exit, and since the corruption appears similar, this test program
should hopefully be useful for diagnosing the 2nd type of crash as well.
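
The numbers in the gdb session above can be cross-checked with a little arithmetic. This sketch assumes a 32-bit build (4-byte pointers, as on the PowerPC Mac in question) and assumes that fastthread's Entry struct holds one VALUE-sized field followed by the next pointer; the struct layout is not shown in the thread, so treat that part as a guess that happens to fit the fault address.

```ruby
POINTER_SIZE = 4             # bytes; assumes a 32-bit build
NEXT_OFFSET  = POINTER_SIZE  # assumed: `next` follows one VALUE-sized field

# entry_pool was 0x1, so reading first->next touches 0x1 + offset.
first = 0x1
fault = first + NEXT_OFFSET
puts format("fault address: 0x%08x", fault)  # 0x00000005

# `waiting` sits one pointer-sized `owner` field into the mutex struct.
mutex_addr = 0x603a70
list_addr  = mutex_addr + POINTER_SIZE
puts format("list address:  0x%x", list_addr)  # 0x603a74

# owner printed in decimal is the same value as the `entries` pointer.
owner = 6308016
puts format("owner:         0x%x", owner)  # 0x6040b0
```

If the assumed layout is right, the fault at 0x00000005 is just the bogus entry_pool value 0x1 being dereferenced, and owner holding the same value as entries is consistent with the struct being interpreted as the wrong type.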

–Young

On Jan 15, 2007, at 7:28 AM, MenTaLguY wrote:

I believe the two crashes are related. Also, it appears that the
corruption only happens with Queues, not other uses of Mutexes. Likely
this means that some queue-specific routine expecting a queue is getting
passed a pointer to a member of the queue instead, or a routine
expecting one member is getting passed another.

The program below reproduces the problem, and surprisingly, I only
use Mutex and ConditionVariable–no Queue.

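========================================
require 'fastthread'
require 'thread'

class GlobalSpaceMux

  def initialize()
    @mutex = Mutex.new
    @condition = ConditionVariable.new
    @queue = Array.new

    # Start the consumer thread; nothing ever pushes to @queue.
    @send_thread = Thread.new(&method(:send_thread_loop))
  end

  def send_thread_loop
    loop do
      @mutex.synchronize do
        # Sleep on the condition variable until the queue has an item.
        @condition.wait(@mutex) while @queue.empty?
        @queue.shift
      end
    end
  end

end

# The main thread exits right after this, so the crash happens at exit.
x = GlobalSpaceMux.new
========================================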

$ gdb ~/ruby-1.8.5-p12/bin/ruby
(gdb) r zzz-crash5.rb
Starting program: /Users/youngh/ruby-1.8.5-p12/bin/ruby zzz-crash5.rb
Reading symbols for shared libraries … done
Reading symbols for shared libraries . done

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000005
0x001772f8 in free_entries (first=0x1) at fastthread.c:74
74 next = first->next;
(gdb) bt
#0 0x001772f8 in free_entries (first=0x1) at fastthread.c:74
#1 0x00177368 in finalize_list (list=0x434614) at fastthread.c:85
#2 0x00177870 in finalize_mutex (mutex=0x434610) at fastthread.c:227
#3 0x00178550 in finalize_queue (queue=0x434610) at fastthread.c:562
#4 0x001785b4 in free_queue (queue=0x434610) at fastthread.c:572
#5 0x0002c8bc in rb_gc_call_finalizer_at_exit () at gc.c:1884
#6 0x00005e5c in ruby_finalize_1 () at eval.c:1549
#7 0x00006048 in ruby_cleanup (ex=0) at eval.c:1584
#8 0x00006274 in ruby_stop (ex=0) at eval.c:1615
#9 0x00006348 in ruby_run () at eval.c:1636
#10 0x00002bdc in main (argc=2, argv=0xbffff780, envp=0xbffff78c) at main.c:46
(gdb)

–Young
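
The reproduction program above shows only the consumer half of the Mutex and ConditionVariable pattern. For comparison, here is a sketch of the full pattern using just the standard thread library, with a producer and a clean shutdown added. TinyMux, push, close, and the :stop sentinel are all hypothetical names introduced for illustration; none of them come from the poster's application.

```ruby
require 'thread'

# Hypothetical, self-contained variant of the consumer loop, with a
# producer side and a shutdown sentinel added for illustration.
class TinyMux
  attr_reader :sent

  def initialize
    @mutex = Mutex.new
    @condition = ConditionVariable.new
    @queue = []
    @sent = []
    @send_thread = Thread.new { send_thread_loop }
  end

  # Producer side: enqueue under the lock, then wake the consumer.
  def push(item)
    @mutex.synchronize do
      @queue << item
      @condition.signal
    end
  end

  # Ask the consumer to finish, then wait for it to exit.
  def close
    push(:stop)
    @send_thread.join
  end

  private

  def send_thread_loop
    loop do
      item = @mutex.synchronize do
        # Re-check the predicate after every wakeup.
        @condition.wait(@mutex) while @queue.empty?
        @queue.shift
      end
      break if item == :stop
      @sent << item
    end
  end
end

mux = TinyMux.new
3.times { |i| mux.push(i) }
mux.close
p mux.sent  # => [0, 1, 2]
```

Joining the consumer before exit means the thread is no longer blocked on the condition variable when the interpreter shuts down. Whether that changes the fastthread crash is not something this thread establishes.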

On Sat, 2007-01-13 at 10:13 +0900, Young H. wrote:

The invalid values in fastthread’s mutex object are similar to what
we’ve seen in the 2nd type of crash (“rb_gc_mark(): unknown data
type”). I’ll try to create a test program to reproduce this crash at
exit, and since the corruption appears similar, this test program
should hopefully be useful for diagnosing the 2nd type of crash as well.

I believe the two crashes are related. Also, it appears that the
corruption only happens with Queues, not other uses of Mutexes. Likely
this means that some queue-specific routine expecting a queue is getting
passed a pointer to a member of the queue instead, or a routine
expecting one member is getting passed another.

I’d expect the compiler to catch this sort of thing, and I didn’t see
anything obvious auditing the code by hand, but perhaps something’s
getting lost in a VALUE cast…

Anyway, it’s probably best to focus on Queue when constructing test
cases.

-mental