Forum: Ruby Symbols garbage collector in Ruby1.9, fixed?

Iñaki Baz Castillo (Guest)
on 2009-03-30 10:10
(Received via mailing list)
Hi, in Ruby 1.8 there is an issue when adding more and more Symbols
since they remain in memory and are never removed.

I'm writing a server in Ruby that receives messages with headers (From,
To, Subject, X-Custom-Header-1...) and after parsing I store the
headers in a hash using symbols as keys:

  headers = {
    :from => "alice@aaa.com",
    :to => "bob@bbb.com",
    :"x-custom-header-1" => "Hi there"
  }

I could use strings as keys instead of symbols, but I've checked that
getting a Hash entry is ~25% faster using Symbols.

The problem is that I could receive custom headers so for each one a
new Symbol would be created. An attacker could send lots of custom
headers to fill the server memory and cause a denial of service.

Is this perhaps solved in Ruby 1.9? Any suggestions? Thanks a lot.
Iñaki Baz Castillo (Guest)
on 2009-03-30 10:17
(Received via mailing list)
2009/3/30 Iñaki Baz Castillo <ibc@aliax.net>:
> Perhaps this is solved in Ruby 1.9? any suggestion on it? Thanks a lot.

Is there any way to check if a Symbol already exists before creating it?
F. Senault (Guest)
on 2009-03-30 11:05
(Received via mailing list)
On 30 March 2009 at 10:09, Iñaki Baz Castillo wrote:

> The problem is that I could receive custom headers so for each one a
> new Symbol would be created. An attacker could send lots of custom
> headers to fill the server memory and cause a denial of service.
>
> Perhaps this is solved in Ruby 1.9? any suggestion on it? Thanks a lot.

It depends on what exactly you are trying to do with your hash.  If you
need to access a few well-known headers in your code, use symbols for
those and add another pseudo-header for the rest of the info:

USEFUL_HEADERS = [ :from, :to, :"x-mailer" ]

headers = {
  :from => "alice@aaa.com",
  :to => "bob@bbb.com",
  :"x-mailer" => "Pegasus Mail for Windows (4.50 PB1)",
  :"_custom" => {
    "x-custom-header-1" => "Hi there",
    "x-spam-scanned" => "Of course"
  }
}

(Now, you'll lose time at the parse step.  Again, depending on what
you're trying to do, it may still be worthwhile if each mail is parsed
once and each header is then accessed many times.)

Fred
Iñaki Baz Castillo (Guest)
on 2009-03-30 11:18
(Received via mailing list)
2009/3/30 F. Senault <fred@lacave.net>:
>  :"x-mailer" => "Pegasus Mail for Windows (4.50 PB1)",
>  :"_custom" => {
>    "x-custom-header-1" => "Hi there",
>    "x-spam-scanned" => "Of course"
>  }
> }
>
> (Now, you'll lose time at the parse step.  Again, depending on what
> you're trying to do, it may be efficient if each mail is parsed one time
> and, then, each header is accessed a lot of times.)

Thanks, but I prefer to store all the headers in a transparent way, so
that accessing a core, well-known header is the same as accessing a
custom, never-before-seen header:
  headers[:from]
  headers[:"x-custom-headers"]

That is, in the transport/parsing layer I cannot know which headers
will be important in the "application" layer.

A way to check whether a Symbol already exists would be enough for me,
but it doesn't work. To list all the current Symbols I can inspect
Symbol.all_symbols, but if I want to check a Symbol:
  Symbol.all_symbols.include?(:new_symbol)
this will always return true, since :new_symbol is automatically added
as soon as it is written XDDD

Thanks.
Bill Kelly (Guest)
on 2009-03-30 11:32
(Received via mailing list)
From: "Iñaki Baz Castillo" <ibc@aliax.net>
>
> A way to check if a Symbol already exist would be enought for me, but
> it doesn't work:
> To know all the current Symbols I can inspect Symbol.all_symbols, but
> if I want to check a Symbol:
>   Symbol.all_symbols.include?(:new_symbol)
> this will always return true since :new_symbol is automatically added  XDDD

potential_new_symbol = "xyzzy"
Symbol.all_symbols.map {|s| s.to_s}.include? potential_new_symbol


?


Regards,

Bill
F. Senault (Guest)
on 2009-03-30 11:38
(Received via mailing list)
On 30 March 2009 at 11:17, Iñaki Baz Castillo wrote:

> A way to check if a Symbol already exist would be enought for me, but
> it doesn't work:
> To know all the current Symbols I can inspect Symbol.all_symbols, but
> if I want to check a Symbol:
>   Symbol.all_symbols.include?(:new_symbol)

Symbol.all_symbols.find { |s| s.to_s == "string" }

But, now, you're creating strings instead...  :)

Fred
Iñaki Baz Castillo (Guest)
on 2009-03-30 11:39
(Received via mailing list)
2009/3/30 Bill Kelly <billk@cts.com>:
>
> potential_new_symbol = "xyzzy"
> Symbol.all_symbols.map {|s| s.to_s}.include? potential_new_symbol

Thanks but it is too slow:

Benchmark.realtime{ Symbol.all_symbols.map {|s| s.to_s}.include? "qwe" }
=> 0.00371980667114258

I cannot do this test for each header in each received message.

Thanks.
Bill Kelly (Guest)
on 2009-03-30 11:55
(Received via mailing list)
From: "Iñaki Baz Castillo" <ibc@aliax.net>
> >> XDDD
> >
> > potential_new_symbol = "xyzzy"
> > Symbol.all_symbols.map {|s| s.to_s}.include? potential_new_symbol
>
> Thanks but it is too slow:
>
> Benchmark.realtime{ Symbol.all_symbols.map {|s| s.to_s}.include? "qwe" }
> => 0.00371980667114258
>
> I cannot do this test for each header in each received message.

I assumed you had a plan for that.  :)

We could cache them as a hash, for rapid lookup:

  @known_symbols = Hash[ *Symbol.all_symbols.map {|s| [s.to_s,true]}.flatten ]

# Later....

  @known_symbols.include? "xyzzy"
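
A self-contained sketch of the cached lookup idea above, assuming a Set of
names rather than a Hash of true values. KNOWN_SYMBOL_NAMES and
known_symbol? are hypothetical names, not anything Ruby provides. Note that
the cache is only a snapshot: symbols created after it is built will not
appear in it.

```ruby
require 'set'

# Snapshot the names of all symbols that exist right now.
# Comparing names as Strings avoids interning the candidate as a new Symbol.
KNOWN_SYMBOL_NAMES = Set.new(Symbol.all_symbols.map { |s| s.to_s })

# Hypothetical helper: true if a symbol with this name existed
# when the snapshot was taken.
def known_symbol?(name)
  KNOWN_SYMBOL_NAMES.include?(name)
end

puts known_symbol?("to_s")                            # a core method name
puts known_symbol?("x-never-seen-header-" + "q" * 40) # built at run time
```

The Set lookup is a hash lookup, so the per-header cost no longer depends on
how many symbols the process holds.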


Regards,

Bill
Iñaki Baz Castillo (Guest)
on 2009-03-30 12:02
(Received via mailing list)
2009/3/30 Bill Kelly <billk@cts.com>:
>
>  @known_symbols = Hash[ *Symbol.all_symbols.map {|s| [s.to_s,true]}.flatten
> ]
>
> # Later....
>
>  @known_symbols.include? "xyzzy"

That sounds interesting, I'll try it.

Thanks :)
Rick Denatale (rdenatale)
on 2009-03-30 13:57
(Received via mailing list)
On Mon, Mar 30, 2009 at 4:09 AM, Iñaki Baz Castillo <ibc@aliax.net> wrote:

>    :"x-custom-header-1" => "Hi there"
>  }
>
> I could use strings as keys instead of symbols, but I've checked that
> getting a Hash entry is ~25% faster using Symbols.
>
> The problem is that I could receive custom headers so for each one a
> new Symbol would be created. An attacker could send lots of custom
> headers to fill the server memory and cause a denial of service.
>

Which is why Rails (actually activesupport), which implements
HashWithIndifferentAccess to allow strings and symbols to be used
equivalently for hash access, uses the string form in the actual hash,
forgoing the access performance in favor of safety.
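
As a rough illustration of that idea, here is a minimal sketch of a
string-keyed wrapper. StringKeyedHeaders is a hypothetical class written for
this example; it is not ActiveSupport's implementation and covers only [],
[]= and keys.

```ruby
# Hypothetical minimal wrapper, NOT ActiveSupport's implementation:
# every key is converted to a String before it touches the underlying Hash,
# so untrusted header names never become Symbols.
class StringKeyedHeaders
  def initialize
    @store = {}
  end

  def []=(key, value)
    @store[key.to_s] = value
  end

  def [](key)
    @store[key.to_s]
  end

  def keys
    @store.keys
  end
end

headers = StringKeyedHeaders.new
headers["x-custom-header-1"] = "Hi there"  # untrusted name stays a String
headers[:from] = "alice@aaa.com"           # Symbol written in our own source

puts headers[:from]     # looks up "from"
puts headers["from"]    # same entry
```

The only Symbols involved are the literals the programmer typed, so the
symbol table cannot grow with incoming data.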


--
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale
Brian Candler (candlerb)
on 2009-03-30 14:16
Iñaki Baz Castillo wrote:
> I could use strings as keys instead of symbols, but I've checked that
> getting a Hash entry is ~25% faster using Symbols.
>
> The problem is that I could receive custom headers so for each one a
> new Symbol would be created. An attacker could send lots of custom
> headers to fill the server memory and cause a denial of service.
>
> Perhaps this is solved in Ruby 1.9? any suggestion on it? Thanks a lot.

It's not "solved" in 1.9, because this is intentional and necessary
behaviour.

The important property of a symbol is that it has the same id wherever
and whenever it is used in your program, and hence it can never be
garbage-collected. This is so that it can be used for looking up method
names - foo.bar is a shortcut for foo.send(:bar)
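
A small sketch of the identity property described here. The variable names
are only for illustration.

```ruby
# The same Symbol literal always refers to the same object.
same_symbol = :bar.equal?(:bar)

# Two Strings with equal content are distinct objects.
a = "bar".dup
b = "bar".dup
same_string   = a.equal?(b)
equal_content = (a == b)

# Converting a name to a Symbol yields that one shared object.
round_trip = "bar".to_sym.equal?(:bar)

# A normal method call and an explicit send with the method's Symbol agree.
via_call = "ruby".upcase
via_send = "ruby".send(:upcase)

puts same_symbol          # true
puts same_string          # false
puts equal_content        # true
puts round_trip           # true
puts via_call == via_send # true
```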

Using symbols for hash keys is a common idiom, but arguably is abuse of
the symbol table. It's fine as long as all the keys are fixed symbol
constants in your program, but as you've observed, it causes huge
problems if your symbols are generated dynamically in response to user
data (especially from untrusted or potentially malicious sources)

The solution: use strings as keys, and beware premature optimisation.
Whilst you may have measured that "getting a Hash entry is 25% faster
using Symbols", does this really make your whole application 25% faster?
I suspect not. Maybe it makes your whole application 0.25% faster. Maybe
it makes your application slower, as each incoming String has to be
converted into a Symbol.
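
One way to check this rather than guess is to time the alternatives with
the standard Benchmark module. The following is only a sketch: the
iteration count is arbitrary, and the numbers will vary by Ruby version and
machine, so no results are claimed here.

```ruby
require 'benchmark'

iterations = 200_000  # arbitrary count, chosen only for illustration

symbol_keyed = { :from => "alice@aaa.com", :to => "bob@bbb.com" }
string_keyed = { "from" => "alice@aaa.com", "to" => "bob@bbb.com" }

# Build the String key outside the loop so that we time the lookup itself,
# not String allocation.
string_key = "from"

symbol_time = Benchmark.realtime { iterations.times { symbol_keyed[:from] } }
string_time = Benchmark.realtime { iterations.times { string_keyed[string_key] } }

# What a parser would actually pay: converting each incoming name first.
convert_time = Benchmark.realtime do
  iterations.times { symbol_keyed[string_key.to_sym] }
end

printf("symbol lookup:          %.4fs\n", symbol_time)
printf("string lookup:          %.4fs\n", string_time)
printf("to_sym + symbol lookup: %.4fs\n", convert_time)
```

The third measurement is the relevant one for incoming data, since a parser
starts with Strings and must convert them before any Symbol lookup.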

In any case, although we all want things to go "as fast as possible",
few applications have specific acceptance criteria for CPU utilisation
or response time. If your application *does* have a specific performance
criterion that you must meet, then it might be better to consider a
different language, rather than mis-using what Ruby offers. Or, taking
everything into account, including development costs, it may be more
cost-effective to choose faster hardware to meet the performance goal.

Regards,

Brian.
Iñaki Baz Castillo (Guest)
on 2009-03-30 14:41
(Received via mailing list)
2009/3/30 Brian Candler <b.candler@pobox.com>:
> It's not "solved" in 1.9, because this is intentional and necessary
> problems if your symbols are generated dynamically in response to user
> few applications have a specific acceptance criteria for CPU utilisation
> or response time. If your application *does* have a specific performance
> criterion that you must meet, then it might be better to consider a
> different language, rather than mis-using what Ruby offers. Or including
> all things like development costs, it may be more cost-effective to
> choose faster hardware to meet the performance goal.

Ok, thanks for your explanation.
James Gray (bbazzarrakk)
on 2009-03-30 14:48
(Received via mailing list)
On Mar 30, 2009, at 7:16 AM, Brian Candler wrote:

>
> It's not "solved" in 1.9, because this is intentional and necessary
> behaviour.

Dave Thomas seems to have thought this was going away:

http://pragdave.blogs.pragprog.com/pragdave/2008/0...

James Edward Gray II
David Masover (Guest)
on 2009-03-30 16:01
(Received via mailing list)
On Monday 30 March 2009 07:48:16 James Gray wrote:
> Dave Thomas seems to have thought this was going away:
>
> http://pragdave.blogs.pragprog.com/pragdave/2008/0...

That article looks like pure speculation.

Alright, yes, #to_i and #id2name and similar are gone. That makes sense:
it encapsulates things the average user really doesn't need. Theoretically,
this could allow Symbols to be implemented in the heap, if needed, or in
some way that looks nothing like the current concept of an integer.

However, the purpose of symbols, I would think, remains the same.

And given the purpose of symbols, and the dynamic nature of Ruby (it has
eval!), there's really no way you could ever garbage collect symbols.

You could implement symbols as immutable strings on the heap, and do string
comparisons between them, but that would defeat the purpose of symbols, at
least in every program I've ever written: to avoid string comparisons, and
to be generally much faster than strings.

And for that matter, if you really, really want to be digging around at
that low level, you still can:

irb(main):001:0> :foo.object_id
=> 351848
irb(main):002:0> ObjectSpace._id2ref 351848
=> :foo
Yukihiro Matsumoto (Guest)
on 2009-03-30 17:05
(Received via mailing list)
Hi,

In message "Re: Symbols garbage collector in Ruby1.9, fixed?"
    on Mon, 30 Mar 2009 21:16:00 +0900, Brian Candler
<b.candler@pobox.com> writes:

|It's not "solved" in 1.9, because this is intentional and necessary
|behaviour.

Garbage collection for Symbols is planned, but not implemented yet.
It's not an easy task.

              matz.
Kent Sibilev (Guest)
on 2009-03-30 17:07
(Received via mailing list)
On Mon, Mar 30, 2009 at 8:48 AM, James Gray <james@grayproductions.net> wrote:
>>>
>
Unfortunately, symbols are still not garbage-collected at the moment.
It seems that the only difference compared to 1.8 is that symbols are
backed by real frozen string objects instead of arrays of chars.
Julian Leviston (Guest)
on 2009-04-01 15:39
(Received via mailing list)
Is the garbage collection in 1.9 better than 1.8?

Blog: http://random8.zenunit.com/
Learn rails: http://sensei.zenunit.com/

On 31/03/2009, at 2:05 AM, Yukihiro Matsumoto <matz@ruby-lang.org>
Yukihiro Matsumoto (Guest)
on 2009-04-02 01:12
(Received via mailing list)
Hi,

In message "Re: Symbols garbage collector in Ruby1.9, fixed?"
    on Wed, 1 Apr 2009 22:37:46 +0900, Julian Leviston
<julian@coretech.net.au> writes:

|Is the garbage collection in 1.9 better than 1.8?

Yes, but slightly.  For example, it returns unused memory regions to
the OS more often than 1.8.

              matz.
Tony Arcieri (Guest)
on 2009-04-02 01:59
(Received via mailing list)
On Mon, Mar 30, 2009 at 2:09 AM, Iñaki Baz Castillo <ibc@aliax.net> wrote:

>    :"x-custom-header-1" => "Hi there"
>  }
>
> I could use strings as keys instead of symbols, but I've checked that
> getting a Hash entry is ~25% faster using Symbols.
>

Use symbols... FOR SPEED!  Unfortunately that speed comes at a price... do
you really want to globally internalize arbitrary input?  Symbols are
effectively a freeform enumeration... the reason you're running into
problems is because you're trying to enumerate arbitrary inputs.

Is this really an important bottleneck in your application?  If not, use
strings and move on.
Iñaki Baz Castillo (Guest)
on 2009-04-02 02:03
(Received via mailing list)
On Thursday 2 April 2009, Tony Arcieri wrote:

> Use symbols... FOR SPEED!  Unfortunately that speed comes at a price... you
> really want to globally internalize arbitrary input?  Symbols are
> effectively a freeform enumeration... the reason you're running into
> problems is because you're trying to enumerate arbitrary inputs.

Yes. It's a parser, so custom headers could arrive. I want to store them
in a hash like:

  headers = { :from => "alice@qweeq", :to => "bob@qweqwe" }

So after parsing the message I create these entries. The problem is that
any custom header would create a Symbol.


> Is this really an important bottleneck in your application?

I think it's important, since after parsing, the main task of the server
will be accessing some headers to read their content. But since it's at a
very early stage I cannot be sure of it.

Thanks.
Rob Biedenharn (Guest)
on 2009-04-02 06:17
(Received via mailing list)
On Apr 1, 2009, at 8:03 PM, Iñaki Baz Castillo wrote:
>
> server will be
> accessing some headers to read their content. But since it's just in
> a very
> early stage I cannot sure it.
>
> Thanks.
> --
> Iñaki Baz Castillo <ibc@aliax.net>


Just key the hash with Strings:
  headers = { 'from' => "alice@qweeq", 'to' => "bob@qweeq" }

If you really need to use symbols, perhaps add methods to a subclass
of Hash like the HashWithIndifferentAccess from Rails which mostly
eliminates the need to care whether you actually stored against a
Symbol or a String key.

There's also nothing stopping you from having both kinds of keys at
once:
  headers = { :from => "alice@qweeq", :to => "bob@qweeq", 'snack' => "raisins" }

but then you might have to "worry" about having both :to and 'to' as
keys.
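
A short sketch of that pitfall. The addresses are made up, and the
normalisation step at the end is an assumption added for illustration, not
something proposed in this thread.

```ruby
# :to and 'to' are different keys, so both entries coexist.
headers = { :to => "bob@qweeq", 'to' => "someone-else@qweeq" }

puts headers.size   # two separate entries
puts headers[:to]
puts headers['to']

# Normalising every key to a String collapses them into one entry;
# the pair inserted later overwrites the earlier one.
normalised = {}
headers.each { |key, value| normalised[key.to_s] = value }

puts normalised.size
puts normalised['to']
```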

Symbols are only faster because they are immutable and don't get
garbage collected. But I'd go with Tony and just use String all the
time.

-Rob

Rob Biedenharn    http://agileconsultingllc.com
Rob@AgileConsultingLLC.com
Rick Denatale (rdenatale)
on 2009-04-02 09:49
(Received via mailing list)
On Thu, Apr 2, 2009 at 12:17 AM, Rob Biedenharn <Rob@agileconsultingllc.com> wrote:

> but then you might have to "worry" about having both :to and 'to' as keys.
>
> Symbols are only faster because they are immutable and don't get garbage
> collected. But I'd go with Tony and just use String all the time.
>

Actually, I'm pretty sure that Symbols are faster as hash keys because
Symbol#== is O(1) while String#== is O(n), where n is the length of the
string. That said, the HashWithIndifferentAccess class in activesupport
allows either strings or symbols to be used interchangeably as the key
argument in methods like [] and []=, but it always USES the string form
as the key.

I was quite surprised when I discovered this, since I'd assumed that the
reason for using symbols was for the speed advantage, but the problem of
"memory leaks" when arbitrary keys get interned as symbols was finally
pointed out to me.

But I do in general prefer the look in source code of

   :id => 3

rather than

   'id' => 3

And when the symbols come in the source like this there's less chance of
arbitrary growth of interned symbols.

--
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale
Michael Neumann (Guest)
on 2009-04-02 10:19
(Received via mailing list)
Iñaki Baz Castillo wrote:

>
>   headers = { :from => "alice@qweeq", ":to => "bob@qweqwe }

I'd figure out which headers are very common and make them frozen
constants, like:

  FROM = "From".freeze
  TO = "To".freeze

and put references to those string "constants" as keys into the Hash. I
assume that this will be as fast as symbols when accessing the hash with
those constants, as equality testing then just needs to test for object
identity (object_id) and not for equality of the content.

  headers = {}
  headers[FROM] = "alice@qweeq"
  headers[TO] = "bob@qweqwe"

  ...

  p headers[TO]
  p headers["To"] # works as well, but should be slower


Would you like to benchmark this against using symbols?

Btw, this is the approach that for example Mongrel uses.

Regards,

  Michael
Iñaki Baz Castillo (Guest)
on 2009-04-02 11:21
(Received via mailing list)
2009/4/2 Rick DeNatale <rick.denatale@gmail.com>:

> Actually, I'm pretty sure that Symbols are faster as hash keys because
> Hash#== is O(1) while String#== is O(n) where n is the length of the string.
> That said, the HashWithIndifferentAccess class in activesupport  allows
> either strings or symbols to be used interchangeably as the key argument in
> methods like [] and []=, but it always USES the string form as the key.

Oh, then it's better just to use strings. I don't need to support
strings and symbols at the same time, I just need to make a decision.

Thanks for pointing it out.
Iñaki Baz Castillo (Guest)
on 2009-04-03 04:31
(Received via mailing list)
2009/4/2 Michael Neumann <mneumann@ntecs.de>:
>
>  headers = {}
>  headers[FROM] = "alice@qweeq"
>  headers[TO] = "bob@qweqwe"

Very interesting solution, but I would have some issues with it:

a) I receive a request with various headers; most of them are well
known but others can be custom.
When I extract the header names (after parsing) I get "From" and
"Custom-Header" strings, and I need to check whether each string belongs
to the well-known headers before storing it as FROM or
"Custom-Header". Wouldn't this check be inefficient?

b) Some well-known headers have a name like "Record-Route". The "-"
character is of course disallowed in a Ruby constant name. Using Symbols
I can write it as :"record-route".

Well, I have to think about it. Thanks a lot for all the received help.
Regards.



> Btw, this is the approach that for example Mongrel uses.

Then I must investigate how it handles case b).
Michael Neumann (Guest)
on 2009-04-03 06:11
(Received via mailing list)
Iñaki Baz Castillo wrote:

>> identity (object_id) and not for the equality of the content.
> "Custom-Header" strings, and I need to check if these strings belongs
> to well known headers or not before storing them as FROM and
> "Custom-Header". Wouldn't this check be inneficient?

No!

time ruby -e "s,t='a'*100,'a'*100;1_000_000.times{s==t}"
0.567u 0.000s 0:00.58 96.5%     5+1563k 0+0io 0pf+0w

Comparing 1 million strings of size 100 takes just half a second in the
worst case (of which around half is just method calling overhead!).

If you take more reasonable sized strings (15 characters):

time ruby -e "s,t='a'*15,'a'*15;1_000_000.times{s==t}"
0.326u 0.007s 0:00.33 96.9%     5+1588k 0+0io 0pf+0w

Compared against object id comparison (notice "s == s"):

time ruby -e "s,t='a'*15,'a'*15;1_000_000.times{s==s}"
0.265u 0.000s 0:00.26 100.0%    5+1595k 0+0io 0pf+0w

So, I wouldn't call Ruby strings inefficient. In general it is not the
lookup that hurts performance, but the memory allocation. Even if string
comparison is worst-case O(n), a key lookup in a hash is in general O(1)
regardless of whether the keys are strings or symbols (especially as the
length of the keys is usually limited).

I don't think that this lookup will be significant. If it is significant
then you're probably using the wrong language :).

> b) Some wellk wnown headers have a name like "Record-Route". The "-"
> symbol is of course dissallowed as Ruby Constant. Using Symbols I can
> use it as :"record-route".

I didn't mean constants, but "constant", i.e. frozen, values.

  FROM = 'From'.freeze
  RECORD_ROUTE = 'Record-Route'.freeze

  KNOWN_HEADERS = {
    FROM => FROM,
    RECORD_ROUTE => RECORD_ROUTE
  }

  headers = {}
  for key, value in h
    headers[ KNOWN_HEADERS[key] || key ] = value
  end
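
For reference, a runnable version of the snippet above might look like the
following sketch. The `parsed` array stands in for the undefined `h` and is
a hypothetical parser output with made-up values.

```ruby
FROM         = 'From'.freeze
RECORD_ROUTE = 'Record-Route'.freeze

# Maps a header name (compared by content) to its canonical frozen String.
KNOWN_HEADERS = {
  FROM         => FROM,
  RECORD_ROUTE => RECORD_ROUTE
}

# Hypothetical parser output: freshly allocated name/value pairs.
parsed = [
  ['From'.dup,            'alice@qweeq'],
  ['Record-Route'.dup,    'sip:proxy.example.org'],
  ['X-Custom-Header'.dup, 'Hi there']
]

headers = {}
parsed.each do |name, value|
  # Known names are replaced by the shared constant; unknown ones are kept.
  headers[KNOWN_HEADERS[name] || name] = value
end

puts headers[FROM]               # lookup via the constant
puts headers['From']             # lookup by content also works
puts headers['X-Custom-Header']  # custom headers remain plain Strings
```

Because a Hash stores an already frozen String key without copying it, the
key held for a known header is the constant itself, and no Symbol is ever
created from incoming data.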

Regards,

  Michael
Iñaki Baz Castillo (Guest)
on 2009-04-03 06:39
(Received via mailing list)
2009/4/2 Michael Neumann <mneumann@ntecs.de>:
>    headers[ KNOWN_HEADERS[key] || key ] = value
>  end

This seems a wonderful solution :)

Thanks a lot.