Forum: Ruby-core [Open] String#gsub shouldn't allocate so many Strings in its loop

A250061e969e82becbed01f2e7a605d0?d=identicon&s=25 unknown (Guest)
on 2014-03-25 23:56
(Received via mailing list)
Issue #9676 has been reported by Sam Rawlins.

----------------------------------------
Bug #9676: String#gsub shouldn't allocate so many Strings in its loop
https://bugs.ruby-lang.org/issues/9676

* Author: Sam Rawlins
* Status: Open
* Priority: Normal
* Assignee:
* Category:
* Target version:
* ruby -v: trunk
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN
----------------------------------------
`rb_reg_search()` allocates (dups) a String to attach to the
backreference object ( `RMATCH(match)->str = rb_str_new4(str);` ). If
#gsub has been passed 2 arguments (not Enumerator form) and the second
argument is a String, then it shouldn't make these allocations when
calling `rb_reg_search()` inside it's loop.

Here's an example:

    # gsub-allocates-too-much.rb
    require File.join(__dir__, "lib", "allocation_stats")

    def puts_object_list(name, stats)
      objects = stats.allocations.group_by(:sourcefile, :sourceline,
:class).all.
        values.flatten.map(&:object).
        map {|o| o.is_a?(String) ? "#{o.inspect}<#{o.encoding.to_s}>" :
o.inspect }
      puts "#{name} #{objects.flatten.size} new objects:"
      objects.group_by(&:hash).values.each { |ary| puts "#{ary.join(",
")}" }
    end

    slash = '/'; underscore = '_'; colon = ':' # allocate before the
trace
    str = "12:34:45:67"
    stats = AllocationStats.trace { str.gsub(colon, underscore) }
    puts '> "12:34:45:67".gsub(":", "_")'
    puts_object_list("gsub substitutes 3x times:", stats)


    $ ruby ../allocation_stats/gsub-allocates-too-much.rb
    > "12:34:45:67".gsub(":", "_")
    gsub substitutes 3x times: 12 new objects:
    "12:34:45:67"<UTF-8>, "12:34:45:67"<UTF-8>, "12:34:45:67"<UTF-8>,
"12:34:45:67"<UTF-8>
    "12_34_45_67"<UTF-8>
    ":"<ASCII-8BIT>, ":"<ASCII-8BIT>
    ":"<US-ASCII>, ":"<US-ASCII>
    #<MatchData ":">
    #<MatchData nil>
    /:/

The Strings that are copies of the original String are all unnecessary
(except one, the last).

I have a fix (attached and at [1]) that involves allocating the str
attribute of the backreference object only when necessary. In order to
do this without changing the signature of `rb_reg_search()`, this patch
changes `rb_reg_search()` to wrap a new function `rb_reg_search0()`. So
no calls to `rb_reg_search()` need to change, and `str_gsub()` changes
two calls into `rb_reg_search0()` to avoid the allocations. (I believe
String#split suffers from the same extra allocations, and can make a
similar call to `rb_reg_search0()`.)

The impact of this fix is primarily faster garbage collection. I have
two "real world" examples:

* ActiveRecord sqlite3 specs: total time in GC reduced from 11.2s to
10.4s (7% savings).
* Mail gem specs: total time in GC reduced from 0.220s to 0.215s (2%
savings).

These numbers bounced around a lot though. I'm open to better
benchmarking suggestions. I used ActiveRecord and Mail for real world
examples of #gsub, where realistic Strings are gsubbed.

[1] https://github.com/ruby/ruby/pull/578

---Files--------------------------------
ruby-578.diff (2.56 KB)
This topic is locked and can not be replied to.