Forum: Ruby-core Proposal: Bitmap Marking GC

Posted by Narihiro Nakamura (Guest)
on 2012-01-05 12:04
(Received via mailing list)
Hi.

I created Bitmap Marking GC for Ruby2.0.

Source code: https://github.com/authorNari/ruby/tree/bitmap_marking
Patch: 
https://github.com/authorNari/patch_bag/blob/maste...

In following environment, this patch works 'make check' and
'make TESTS="--gc-stress" test-all'.

$ ruby -v
ruby 2.0.0dev (2011-11-18 trunk 33786) [x86_64-linux]

= Performance evaluation

== make benchmark
The result of make benchmark OPTS="-r 5" is here.
https://gist.github.com/1542547
In general, it's a little bit slower.

In Bitmap Marking GC, GC will need to find a bitmap for a object in a
mark process. So, GC will be a little bit slow.

== skkzipcode
Bitmap Marking GC is copy-on-write friendly as Ruby Enterprise
Edition does.
http://www.rubyenterpriseedition.com/faq.html

I measured a above improvement by skkzipcode which is a benchmark 
program.
In skkzipcode, the parent process keeps many data and child processes
uses data that is shared with the parent process.
https://github.com/authorNari/skkzipcode
(This program uses /proc/PID/smaps to profile memory usages)

origin
PROCESS_CNT : 5
SHARED_TOTAL: 59124     kb
PRIV_TOTAL  : 224892    kb

REE - GC.copy_on_write_friendly = true
PROCESS_CNT : 5
SHARED_TOTAL: 207720    kb
PRIV_TOTAL  : 164572    kb

bmap - Bitmap Marking GC for Ruby 2.0
PROCESS_CNT : 5
SHARED_TOTAL: 170744    kb
PRIV_TOTAL  : 138336    kb

* PROCESS_CNT: count of child processes
* SHARED_TOTAL: total of shared memory usage of child processes (KB)
* PRIV_TOTAL: total of private memory usage of child processes (KB)

bmap is copy-on-write friendly!!

= Implementation
Let me introduce some implementation topics.

* A heap block address is aligned by 16KB to find fast a bitmap.
  * In Linux, it uses posix_memalign() or memalign().
  * In Windows, it uses _aligned_malloc().
* To avoid unnecessary writing, GC decreases to relink freelist.
  * GC doesn't relink objects that are linked on freelist at starting 
GC.
* A heap slot has freelist.
* I embed a struct heaps_slot to a heap block.

This patch improves memory usage on programs that are using fork() in
Linux. We have to use fork() when we need a real parallel performance
in CRuby. And, we already have many libraries that are using fork().
(e.g. Unicorn, Resque).

And, GC is a little bit slower. But, I think it's in acceptable range.

I already posted this topic to ruby-dev.
http://bugs.ruby-lang.org/issues/5839
Matz agreed to commit this patch to trunk.

Thanks.
Posted by Roger Pack (Guest)
on 2012-01-05 15:54
(Received via mailing list)
> And, GC is a little bit slower. But, I think it's in acceptable range.

Maybe you could/should make GC collector type user specifiable, to
avoid the slowdown [this is what REE does I think].
For instance on windows you don't need it.

Cheers!
Posted by Yukihiro Matsumoto (Guest)
on 2012-01-05 16:49
(Received via mailing list)
Hi,

In message "Re: [ruby-core:41919] Re: Proposal: Bitmap Marking GC"
    on Thu, 5 Jan 2012 23:53:38 +0900, Roger Pack 
<rogerdpack2@gmail.com> writes:

|Maybe you could/should make GC collector type user specifiable, to
|avoid the slowdown [this is what REE does I think].

No, REE uses bitmap marking all the time.  Besides that, extra
complexity in GC is not a good idea, especially when we expect bitmap
marking will be the basis for further optimization.

              matz.
Posted by Roger Pack (Guest)
on 2012-01-05 17:38
(Received via mailing list)
> No, REE uses bitmap marking all the time.

I don't think it does.

http://www.rubyenterpriseedition.com/documentation...

Tunable would be nice, too :)

> Besides that, extra
> complexity in GC is not a good idea.

True.  As long as there's no significant slowdown.
Posted by Yukihiro Matsumoto (Guest)
on 2012-01-05 19:07
(Received via mailing list)
Hi,

In message "Re: [ruby-core:41922] Re: Proposal: Bitmap Marking GC"
    on Fri, 6 Jan 2012 01:37:28 +0900, Roger Pack 
<rogerdpack2@gmail.com> writes:
|
|> No, REE uses bitmap marking all the time.
|
|I don't think it does.
|
|http://www.rubyenterpriseedition.com/documentation...

I checked REE long time ago.  So my memory is not reliable.  Sorry.
But as far as I recall, REE's switching copy-on-write-friendliness
slows down GC.

In fact, REE's GC itself is significantly slower than stock 1.8.7.
REE + Passenger is faster just because of tcmalloc + GC parameters +
less memory consumption.

|Tunable would be nice, too :)

1.9.3 has GC tuning parameters.

|> Besides that, extra
|> complexity in GC is not a good idea.
|
|True.  As long as there's no significant slowdown.

Did you see his benchmark results?  I didn't see any significant
slowdown.

              matz.
Posted by KOSAKI Motohiro (Guest)
on 2012-01-05 21:12
(Received via mailing list)
>> And, GC is a little bit slower. But, I think it's in acceptable range.
>
> Maybe you could/should make GC collector type user specifiable, to
> avoid the slowdown [this is what REE does I think].
> For instance on windows you don't need it.

I disagree. all of 'selectability' is just work around. Almost people 
don't
know GC internal so much. thus, they can't choose right gc so easily.
The right way is, many developers run his new gc and join his 
optimization
work. We definitely need only one fastest gc.
Posted by Eric Wong (Guest)
on 2012-01-05 22:30
(Received via mailing list)
Narihiro Nakamura <authornari@gmail.com> wrote:
> And, GC is a little bit slower. But, I think it's in acceptable range.

I agree the performance loss is acceptable.

Any chance of integrating this with your parallel mark GC?  That could
improve overall performance and negate any performance loss from this.

> Matz agreed to commit this patch to trunk.

Great news!
Posted by Roger Pack (Guest)
on 2012-01-05 23:34
(Received via mailing list)
> Any chance of integrating this with your parallel mark GC? That could
> improve overall performance and negate any performance loss from this.

Except that the mark phase can't be run in parallel :)

Actually now that I think about it you might be able to have multiple
threads "ruby_gc_mark" through various children.  That might be an
interesting optimization.

-r
Posted by Narihiro Nakamura (Guest)
on 2012-01-06 00:22
(Received via mailing list)
2012/1/6 Eric Wong <normalperson@yhbt.net>:
> Narihiro Nakamura <authornari@gmail.com> wrote:
>> And, GC is a little bit slower. But, I think it's in acceptable range.
>
> I agree the performance loss is acceptable.
>
> Any chance of integrating this with your parallel mark GC? That could
> improve overall performance and negate any performance loss from this.
>

Yes, I'm going to implement Parallel Marking GC with Bitmap Marking :)
Posted by Hiroshi Shirosaki (Guest)
on 2012-01-06 08:25
(Received via mailing list)
> I created Bitmap Marking GC for Ruby2.0.
>
> Source code: https://github.com/authorNari/ruby/tree/bitmap_marking
> Patch: 
https://github.com/authorNari/patch_bag/blob/maste...
>

I've tried benchmark against trunk@34217 on Windows XP with mingw32
gcc version 4.5.2 (tdm-1).
I used __mingw_aligned_malloc() cause of build error with
_aligned_[malloc|free].
I also found _aligned_malloc() is usable if #define __MSVCRT_VERSION__ 
0x0700.

Result of make benchmark OPTS="-r 5".
https://gist.github.com/1569277

I skipped vm_thread_mutex2 and vm_thread_pass_flood cause of errors.
Some benchmarks are not a little slower on my box. Some benchmarks are
faster. Win7 might have different results.

Here is _aligned_[malloc|free] timing result.
https://gist.github.com/1569002

It's based on Jon Forums's work.
https://github.com/jonforums/tma

_aligned_[malloc|free] looks slower than plain malloc/free.


What do you think of windows implementation?

Thanks,
Posted by Narihiro Nakamura (Guest)
on 2012-01-06 10:33
(Received via mailing list)
2012/1/6 Hiroshi Shirosaki <h.shirosaki@gmail.com>:
> I also found _aligned_malloc() is usable if #define __MSVCRT_VERSION__ 0x0700.
>

Thank you! I commit this patch to my repository of github.
https://gist.github.com/1569277#file_mingw_aligned...
https://github.com/authorNari/ruby/commit/5b0b56c7...

> Result of make benchmark OPTS="-r 5".
> https://gist.github.com/1569277
>
> I skipped vm_thread_mutex2 and vm_thread_pass_flood cause of errors.
> Some benchmarks are not a little slower on my box. Some benchmarks are
> faster. Win7 might have different results.

Thanks for your help :)

> Here is _aligned_[malloc|free] timing result.
> https://gist.github.com/1569002
>
> It's based on Jon Forums's work.
> https://github.com/jonforums/tma
>
> _aligned_[malloc|free] looks slower than plain malloc/free.

It would not be a big problem because my patch don't call
_aligned_malloc/free frequently.


Thanks.
Posted by Hiroshi Shirosaki (Guest)
on 2012-01-08 00:05
(Received via mailing list)
> Thank you! I commit this patch to my repository of github.
> https://gist.github.com/1569277#file_mingw_aligned...
> 
https://github.com/authorNari/ruby/commit/5b0b56c7...

Thanks for your work :)
I saw commit r34225. This has __mingw_aligned_malloc but doesn't have
__mingw_aligned_free.
This causes a build error with mingw32 gcc.

compiling ../../../ruby/gc.c
../../../ruby/gc.c: In function 'aligned_free':
../../../ruby/gc.c:1085:5: error: implicit declaration of function
'_aligned_free'
make: *** [gc.o] Error 1
rake aborted!
Command failed with status (2): [make...]
Posted by Eric Wong (Guest)
on 2012-01-08 00:24
(Received via mailing list)
Narihiro Nakamura <authornari@gmail.com> wrote:
> * A heap block address is aligned by 16KB to find fast a bitmap.

Just wondering, why/how did you determine 16K alignment is optimal?
Normal page size in Linux is only 4K, so 16K seems large.
Posted by KOSAKI Motohiro (Guest)
on 2012-01-08 00:58
(Received via mailing list)
> Narihiro Nakamura <authornari@gmail.com> wrote:
>> * A heap block address is aligned by 16KB to find fast a bitmap.
>
> Just wondering, why/how did you determine 16K alignment is optimal?
> Normal page size in Linux is only 4K, so 16K seems large.

Moreover, posix_memalign() and memalign() sound bad choice. It need
a few header bytes. then, some malloc implementations might allocate
32K instead of 16K.

So, why can't we use mmap() directly or use "16K - a few bytes" length?
Posted by Narihiro Nakamura (Guest)
on 2012-01-08 05:14
(Received via mailing list)
2012/1/8 KOSAKI Motohiro <kosaki.motohiro@gmail.com>:
>> Narihiro Nakamura <authornari@gmail.com> wrote:
>>> * A heap block address is aligned by 16KB to find fast a bitmap.
>>
>> Just wondering, why/how did you determine 16K alignment is optimal?
>> Normal page size in Linux is only 4K, so 16K seems large.
>

We defined the heap block size at following commit.
https://github.com/ruby/ruby/commit/4d93af26df1c32...
I've run benchmarks on different heap block size. Then, I chose 16KB
because it was fastest.
But, I probably should have chosen 4KB as you say.

> Moreover, posix_memalign() and memalign() sound bad choice. It need
> a few header bytes. then, some malloc implementations might allocate
> 32K instead of 16K.
>
> So, why can't we use mmap() directly or use "16K - a few bytes" length?
>

No special reason.
We should choose to use mmap() directly if it's efficient way.
Posted by Eric Wong (Guest)
on 2012-01-08 06:36
(Received via mailing list)
Narihiro Nakamura <authornari@gmail.com> wrote:
> because it was fastest.
> But, I probably should have chosen 4KB as you say.

A larger (16KB) _size_ I can easily understand to be more efficient.
I just don't see a reason for _alignment_ to be 16KB, too.  Larger
alignment increases the chance of fragmentation.

I've only used posix_memalign() to avoid sharing of L1 cache lines
(with alignment being only 32-128 bytes on modern CPUs).  I'm not
even sure if any larger alignment makes sense for the Ruby heap.

> > Moreover, posix_memalign() and memalign() sound bad choice. It need
> > a few header bytes. then, some malloc implementations might allocate
> > 32K instead of 16K.
> >
> > So, why can't we use mmap() directly or use "16K - a few bytes" length?
>
> No special reason.
> We should choose to use mmap() directly if it's efficient way.

Using mmap() directly will avoid (userspace) fragmentation entirely.

On newish glibc, I expect posix_memalign()+free() to be fastest,
especially for fast startup and short-lived processes.  So I think
"16K - 2*sizeof(size_t)" may be best (but I'm too lazy to test :)
Posted by Eric Wong (Guest)
on 2012-01-27 20:38
(Received via mailing list)
Eric Wong <normalperson@yhbt.net> wrote:
> A larger (16KB) _size_ I can easily understand to be more efficient.
> I just don't see a reason for _alignment_ to be 16KB, too.  Larger
> alignment increases the chance of fragmentation.

Btw, I took at Bug #5901 and now realize why this GC requires
alignment to match size.  It makes mapping a VALUE back to its Ruby
heap much easier (compared to the REE GC, which needed to scan heaps).
Posted by Kurt Stephens (Guest)
on 2012-01-27 23:03
(Received via mailing list)
On 1/27/12 1:38 PM, Eric Wong wrote:
> Eric Wong <normalperson@yhbt.net> wrote:
>> A larger (16KB) _size_ I can easily understand to be more efficient.
>> I just don't see a reason for _alignment_ to be 16KB, too.  Larger
>> alignment increases the chance of fragmentation.
>
> Btw, I took at Bug #5901 and now realize why this GC requires
> alignment to match size.  It makes mapping a VALUE back to its Ruby
> heap much easier (compared to the REE GC, which needed to scan heaps).
>
All the more reason to use mmap() instead of malloc_aligned().
-- KAS
Please log in before posting. Registration is free and takes only a minute.
Existing account (Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
No account? Register here.