Hi. I created Bitmap Marking GC for Ruby2.0. Source code: https://github.com/authorNari/ruby/tree/bitmap_marking Patch: https://github.com/authorNari/patch_bag/blob/maste... In following environment, this patch works 'make check' and 'make TESTS="--gc-stress" test-all'. $ ruby -v ruby 2.0.0dev (2011-11-18 trunk 33786) [x86_64-linux] = Performance evaluation == make benchmark The result of make benchmark OPTS="-r 5" is here. https://gist.github.com/1542547 In general, it's a little bit slower. In Bitmap Marking GC, GC will need to find a bitmap for a object in a mark process. So, GC will be a little bit slow. == skkzipcode Bitmap Marking GC is copy-on-write friendly as Ruby Enterprise Edition does. http://www.rubyenterpriseedition.com/faq.html I measured a above improvement by skkzipcode which is a benchmark program. In skkzipcode, the parent process keeps many data and child processes uses data that is shared with the parent process. https://github.com/authorNari/skkzipcode (This program uses /proc/PID/smaps to profile memory usages) origin PROCESS_CNT : 5 SHARED_TOTAL: 59124 kb PRIV_TOTAL : 224892 kb REE - GC.copy_on_write_friendly = true PROCESS_CNT : 5 SHARED_TOTAL: 207720 kb PRIV_TOTAL : 164572 kb bmap - Bitmap Marking GC for Ruby 2.0 PROCESS_CNT : 5 SHARED_TOTAL: 170744 kb PRIV_TOTAL : 138336 kb * PROCESS_CNT: count of child processes * SHARED_TOTAL: total of shared memory usage of child processes (KB) * PRIV_TOTAL: total of private memory usage of child processes (KB) bmap is copy-on-write friendly!! = Implementation Let me introduce some implementation topics. * A heap block address is aligned by 16KB to find fast a bitmap. * In Linux, it uses posix_memalign() or memalign(). * In Windows, it uses _aligned_malloc(). * To avoid unnecessary writing, GC decreases to relink freelist. * GC doesn't relink objects that are linked on freelist at starting GC. * A heap slot has freelist. * I embed a struct heaps_slot to a heap block. This patch improves memory usage on programs that are using fork() in Linux. We have to use fork() when we need a real parallel performance in CRuby. And, we already have many libraries that are using fork(). (e.g. Unicorn, Resque). And, GC is a little bit slower. But, I think it's in acceptable range. I already posted this topic to ruby-dev. http://bugs.ruby-lang.org/issues/5839 Matz agreed to commit this patch to trunk. Thanks.
on 2012-01-05 12:04
on 2012-01-05 15:54
> And, GC is a little bit slower. But, I think it's in acceptable range.
Maybe you could/should make GC collector type user specifiable, to
avoid the slowdown [this is what REE does I think].
For instance on windows you don't need it.
Cheers!
on 2012-01-05 16:49
Hi,
In message "Re: [ruby-core:41919] Re: Proposal: Bitmap Marking GC"
on Thu, 5 Jan 2012 23:53:38 +0900, Roger Pack
<rogerdpack2@gmail.com> writes:
|Maybe you could/should make GC collector type user specifiable, to
|avoid the slowdown [this is what REE does I think].
No, REE uses bitmap marking all the time. Besides that, extra
complexity in GC is not a good idea, especially when we expect bitmap
marking will be the basis for further optimization.
matz.
on 2012-01-05 17:38
> No, REE uses bitmap marking all the time. I don't think it does. http://www.rubyenterpriseedition.com/documentation... Tunable would be nice, too :) > Besides that, extra > complexity in GC is not a good idea. True. As long as there's no significant slowdown.
on 2012-01-05 19:07
Hi,
In message "Re: [ruby-core:41922] Re: Proposal: Bitmap Marking GC"
on Fri, 6 Jan 2012 01:37:28 +0900, Roger Pack
<rogerdpack2@gmail.com> writes:
|
|> No, REE uses bitmap marking all the time.
|
|I don't think it does.
|
|http://www.rubyenterpriseedition.com/documentation...
I checked REE long time ago. So my memory is not reliable. Sorry.
But as far as I recall, REE's switching copy-on-write-friendliness
slows down GC.
In fact, REE's GC itself is significantly slower than stock 1.8.7.
REE + Passenger is faster just because of tcmalloc + GC parameters +
less memory consumption.
|Tunable would be nice, too :)
1.9.3 has GC tuning parameters.
|> Besides that, extra
|> complexity in GC is not a good idea.
|
|True. As long as there's no significant slowdown.
Did you see his benchmark results? I didn't see any significant
slowdown.
matz.
on 2012-01-05 21:12
>> And, GC is a little bit slower. But, I think it's in acceptable range. > > Maybe you could/should make GC collector type user specifiable, to > avoid the slowdown [this is what REE does I think]. > For instance on windows you don't need it. I disagree. all of 'selectability' is just work around. Almost people don't know GC internal so much. thus, they can't choose right gc so easily. The right way is, many developers run his new gc and join his optimization work. We definitely need only one fastest gc.
on 2012-01-05 22:30
Narihiro Nakamura <authornari@gmail.com> wrote: > And, GC is a little bit slower. But, I think it's in acceptable range. I agree the performance loss is acceptable. Any chance of integrating this with your parallel mark GC? That could improve overall performance and negate any performance loss from this. > Matz agreed to commit this patch to trunk. Great news!
on 2012-01-05 23:34
> Any chance of integrating this with your parallel mark GC? That could > improve overall performance and negate any performance loss from this. Except that the mark phase can't be run in parallel :) Actually now that I think about it you might be able to have multiple threads "ruby_gc_mark" through various children. That might be an interesting optimization. -r
on 2012-01-06 00:22
2012/1/6 Eric Wong <normalperson@yhbt.net>: > Narihiro Nakamura <authornari@gmail.com> wrote: >> And, GC is a little bit slower. But, I think it's in acceptable range. > > I agree the performance loss is acceptable. > > Any chance of integrating this with your parallel mark GC? That could > improve overall performance and negate any performance loss from this. > Yes, I'm going to implement Parallel Marking GC with Bitmap Marking :)
on 2012-01-06 08:25
> I created Bitmap Marking GC for Ruby2.0. > > Source code: https://github.com/authorNari/ruby/tree/bitmap_marking > Patch: https://github.com/authorNari/patch_bag/blob/maste... > I've tried benchmark against trunk@34217 on Windows XP with mingw32 gcc version 4.5.2 (tdm-1). I used __mingw_aligned_malloc() cause of build error with _aligned_[malloc|free]. I also found _aligned_malloc() is usable if #define __MSVCRT_VERSION__ 0x0700. Result of make benchmark OPTS="-r 5". https://gist.github.com/1569277 I skipped vm_thread_mutex2 and vm_thread_pass_flood cause of errors. Some benchmarks are not a little slower on my box. Some benchmarks are faster. Win7 might have different results. Here is _aligned_[malloc|free] timing result. https://gist.github.com/1569002 It's based on Jon Forums's work. https://github.com/jonforums/tma _aligned_[malloc|free] looks slower than plain malloc/free. What do you think of windows implementation? Thanks,
on 2012-01-06 10:33
2012/1/6 Hiroshi Shirosaki <h.shirosaki@gmail.com>: > I also found _aligned_malloc() is usable if #define __MSVCRT_VERSION__ 0x0700. > Thank you! I commit this patch to my repository of github. https://gist.github.com/1569277#file_mingw_aligned... https://github.com/authorNari/ruby/commit/5b0b56c7... > Result of make benchmark OPTS="-r 5". > https://gist.github.com/1569277 > > I skipped vm_thread_mutex2 and vm_thread_pass_flood cause of errors. > Some benchmarks are not a little slower on my box. Some benchmarks are > faster. Win7 might have different results. Thanks for your help :) > Here is _aligned_[malloc|free] timing result. > https://gist.github.com/1569002 > > It's based on Jon Forums's work. > https://github.com/jonforums/tma > > _aligned_[malloc|free] looks slower than plain malloc/free. It would not be a big problem because my patch don't call _aligned_malloc/free frequently. Thanks.
on 2012-01-08 00:05
> Thank you! I commit this patch to my repository of github. > https://gist.github.com/1569277#file_mingw_aligned... > https://github.com/authorNari/ruby/commit/5b0b56c7... Thanks for your work :) I saw commit r34225. This has __mingw_aligned_malloc but doesn't have __mingw_aligned_free. This causes a build error with mingw32 gcc. compiling ../../../ruby/gc.c ../../../ruby/gc.c: In function 'aligned_free': ../../../ruby/gc.c:1085:5: error: implicit declaration of function '_aligned_free' make: *** [gc.o] Error 1 rake aborted! Command failed with status (2): [make...]
on 2012-01-08 00:24
Narihiro Nakamura <authornari@gmail.com> wrote:
> * A heap block address is aligned by 16KB to find fast a bitmap.
Just wondering, why/how did you determine 16K alignment is optimal?
Normal page size in Linux is only 4K, so 16K seems large.
on 2012-01-08 00:58
> Narihiro Nakamura <authornari@gmail.com> wrote: >> * A heap block address is aligned by 16KB to find fast a bitmap. > > Just wondering, why/how did you determine 16K alignment is optimal? > Normal page size in Linux is only 4K, so 16K seems large. Moreover, posix_memalign() and memalign() sound bad choice. It need a few header bytes. then, some malloc implementations might allocate 32K instead of 16K. So, why can't we use mmap() directly or use "16K - a few bytes" length?
on 2012-01-08 05:14
2012/1/8 KOSAKI Motohiro <kosaki.motohiro@gmail.com>: >> Narihiro Nakamura <authornari@gmail.com> wrote: >>> * A heap block address is aligned by 16KB to find fast a bitmap. >> >> Just wondering, why/how did you determine 16K alignment is optimal? >> Normal page size in Linux is only 4K, so 16K seems large. > We defined the heap block size at following commit. https://github.com/ruby/ruby/commit/4d93af26df1c32... I've run benchmarks on different heap block size. Then, I chose 16KB because it was fastest. But, I probably should have chosen 4KB as you say. > Moreover, posix_memalign() and memalign() sound bad choice. It need > a few header bytes. then, some malloc implementations might allocate > 32K instead of 16K. > > So, why can't we use mmap() directly or use "16K - a few bytes" length? > No special reason. We should choose to use mmap() directly if it's efficient way.
on 2012-01-08 06:36
Narihiro Nakamura <authornari@gmail.com> wrote: > because it was fastest. > But, I probably should have chosen 4KB as you say. A larger (16KB) _size_ I can easily understand to be more efficient. I just don't see a reason for _alignment_ to be 16KB, too. Larger alignment increases the chance of fragmentation. I've only used posix_memalign() to avoid sharing of L1 cache lines (with alignment being only 32-128 bytes on modern CPUs). I'm not even sure if any larger alignment makes sense for the Ruby heap. > > Moreover, posix_memalign() and memalign() sound bad choice. It need > > a few header bytes. then, some malloc implementations might allocate > > 32K instead of 16K. > > > > So, why can't we use mmap() directly or use "16K - a few bytes" length? > > No special reason. > We should choose to use mmap() directly if it's efficient way. Using mmap() directly will avoid (userspace) fragmentation entirely. On newish glibc, I expect posix_memalign()+free() to be fastest, especially for fast startup and short-lived processes. So I think "16K - 2*sizeof(size_t)" may be best (but I'm too lazy to test :)
on 2012-01-27 20:38
Eric Wong <normalperson@yhbt.net> wrote: > A larger (16KB) _size_ I can easily understand to be more efficient. > I just don't see a reason for _alignment_ to be 16KB, too. Larger > alignment increases the chance of fragmentation. Btw, I took at Bug #5901 and now realize why this GC requires alignment to match size. It makes mapping a VALUE back to its Ruby heap much easier (compared to the REE GC, which needed to scan heaps).
on 2012-01-27 23:03
On 1/27/12 1:38 PM, Eric Wong wrote: > Eric Wong <normalperson@yhbt.net> wrote: >> A larger (16KB) _size_ I can easily understand to be more efficient. >> I just don't see a reason for _alignment_ to be 16KB, too. Larger >> alignment increases the chance of fragmentation. > > Btw, I took at Bug #5901 and now realize why this GC requires > alignment to match size. It makes mapping a VALUE back to its Ruby > heap much easier (compared to the REE GC, which needed to scan heaps). > All the more reason to use mmap() instead of malloc_aligned(). -- KAS
Please log in before posting. Registration is free and takes only a minute.
Existing account
(Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
Log in with Google account | Log in with Yahoo account
No account? Register here.