Why aren't string literals treated similarly to symbols?

dubstep · July 26, 2011, 7:17am

Explaining the difference between symbols and strings got me wondering
why string literals aren’t treated similarly to symbols by the
interpreter. To be clear it seems that string literals could be
allocated 1 time each, where duplicates simply reference the first
created instance.

The trick would be to make each reference to the literal a dup of the
single, hidden instance. With the copy-on-write (COW) semantics that
dup’d strings have (or should have), this should be a more
memory-efficient way to deal with programs that have many instances of a
string literal.

One place where this optimization could commonly help is in frequently
called methods or loops that use string literals directly for read-only
purposes. A developer who wants to avoid the allocation cost of string
literals as things currently stand has to be aware of the problem first,
and then find ways to push the literals into constants or other exterior
references.

For example:

1.upto(2) do
  string = "example"
  puts string
end

Unless I misunderstand the current implementation in MRI, a loop such as
the one above would needlessly create a new instance of “good” for every
pass through the loop. One possible fix would be something like:

string = "example"
1.upto(2) do
  puts string
end
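
To make the difference between the two loops concrete, here is a minimal sketch that checks object identity with `equal?`. It assumes frozen string literals are not enabled (no `# frozen_string_literal: true` magic comment); the variable names are only for illustration.

```ruby
# Collect the string produced by the literal on each pass through the loop.
# Holding on to the references keeps both objects alive for the comparison.
in_loop = []
2.times do
  string = "example"
  in_loop << string
end

# Hoisted version: one object, created once before the loop.
hoisted = "example"
outside_loop = []
2.times { outside_loop << hoisted }

literal_shared = in_loop[0].equal?(in_loop[1])
hoisted_shared = outside_loop[0].equal?(outside_loop[1])

puts "literal inside loop shares one object: #{literal_shared}"
puts "hoisted variable shares one object:    #{hoisted_shared}"
```

The first line should report false and the second true, which is the allocation difference described above.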

In this trivial case, the solution is good enough, but the general case
requires more knowledge about how to optimize for the interpreter than
should be necessary. I’m probably missing something obvious, so if
anyone knows what it is, please point it out. Maybe MRI already does
this obvious thing.

-Jeremy

jrun · July 26, 2011, 7:24am

Hi,

It’s because Ruby strings are mutable, so each evaluation of a literal
must produce a new instance, just as an array expression creates a new
instance of Array every time. Symbols, on the other hand, are immutable.

          matz.
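
A small sketch of that comparison, assuming default (non-frozen) string literals. The three helper methods are hypothetical and exist only so that each literal is evaluated twice.

```ruby
# Each method body evaluates one literal per call.
def literal_symbol
  :example
end

def literal_array
  [1, 2, 3]
end

def literal_string
  "example"
end

# equal? compares object identity, not contents.
same_symbol = literal_symbol.equal?(literal_symbol)
same_array  = literal_array.equal?(literal_array)
same_string = literal_string.equal?(literal_string)

puts "symbol literal is one object: #{same_symbol}"
puts "array literal is one object:  #{same_array}"
puts "string literal is one object: #{same_string}"
```

Only the symbol is the same object on both calls; the array and the string are fresh instances each time, so mutating one never affects the other.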

jrun · July 26, 2011, 8:48am

I had no idea Yukihiro M. is fluent in English!!!

jrun · July 26, 2011, 4:05pm

On 7/26/2011 00:24, Yukihiro M. wrote:

Hi,

It’s because Ruby strings are mutable, just like array expression
makes a new instance of Array every time. On the other hand, symbols
are immutable.

Thank you for your response, matz.

Yes, strings are mutable, so that’s why I suggested that the string
instance given to the script be a dup of the string instance
representing the actual literal. However, I realized after sleeping on
this that a dup of a string instance still requires at least a minimal
amount of object data to be allocated, so my naive suggestion wouldn’t
help very much except in the case of looping over code containing a
large string literal that dwarfs the size of the basic object structure.
That case by itself isn’t likely very common, but it looks like it’s
already handled according to Robert.
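
A rough way to see that every dup still costs one object, sketched for MRI only. It assumes `GC.stat(:total_allocated_objects)` is available (MRI 2.2 and later); the sizes and names are arbitrary, and the sketch says nothing about whether the character data is shared.

```ruby
# A largish string standing in for a big literal.
big = "x" * 100_000

before = GC.stat(:total_allocated_objects)
copies = Array.new(100) { big.dup }
after = GC.stat(:total_allocated_objects)

# Every dup is a separate String object, whatever happens to the
# underlying character data.
allocated = after - before
all_distinct = copies.none? { |copy| copy.equal?(big) }

puts "objects allocated for 100 dups: at least #{allocated}"
puts "every dup is its own object: #{all_distinct}"
```

The counter should grow by at least one per dup, which is the per-object overhead mentioned above.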

I’ll try to take a look at the code Robert linked when I get some time
to study it.

-Jeremy

jrun · July 26, 2011, 4:24pm

On Tue, 26 Jul 2011 16:28:54 +0900, Robert K. wrote:

rb_str_resize
rb_str_modify
str_independent
str_make_independent

Kind regards

robert

Just wanted to mention that RXR is a bit more convenient for looking up
things like this.
http://rxr.whitequark.org/mri/source/string.c#1761

And it has identifier search!

jrun · July 26, 2011, 9:28am

Jeremy B. wrote in post #1013041:

Explaining the difference between symbols and strings got me wondering
why string literals aren’t treated similarly to symbols by the
interpreter. To be clear it seems that string literals could be
allocated 1 time each, where duplicates simply reference the first
created instance.

Actually this is what happens with COW underneath. Still you get
multiple instances which is correct and needed (see below).

The trick would be to make each reference to the literal a dup of the
singular, hidden instance. With the COW semantics that dup’d strings
have (or should have), this should be a more memory efficient way to
deal with programs that have many instances of a string literal.

This is what happens. Note, it’s not a dup of the reference but a dup
of the instance (i.e. like CONSTANT_STRING.dup was called).
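
In plain Ruby, the observable behaviour described here can be approximated like this. It is only an analogy, not what the interpreter literally executes; `literal_like` is a hypothetical helper.

```ruby
# The hidden "template" instance, frozen so nobody can change it by accident.
CONSTANT_STRING = "foo".freeze

# Stand-in for evaluating the literal: hand out a fresh, mutable copy.
def literal_like
  CONSTANT_STRING.dup
end

first = literal_like
second = literal_like
first << "0" # changes only this copy

puts first                # foo0
puts second               # foo
puts CONSTANT_STRING      # foo
puts first.equal?(second) # false
```

Each caller gets its own instance, so a mutation through one reference is never visible through another.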

One such possibly common place that could benefit from this optimization
would be within frequently called methods or iterated loops that
directly use string literals for read-only purposes. If a developer
wants to avoid the allocation problem of string literals as things
currently stand, he/she would have to be aware of this problem first and
then find ways to push his/her literals into constants or other exterior
references.

For example:

1.upto(2) do
  string = "example"
  puts string
end

Unless I misunderstand the current implementation in MRI, a loop such as
the one above would needlessly create a new instance of “good” for every
pass through the loop.

You probably meant “example” instead of “good”. And you are wrong about
“needless”. To elaborate what Matz said: your suggestion will break
with this code

irb(main):004:0> a=[]
=> []
irb(main):005:0> 3.times {|i| a << ("foo" << i.to_s)}
=> 3
irb(main):006:0> a
=> ["foo0", "foo1", "foo2"]

With a single shared instance, it would instead behave like this:

irb(main):007:0> a=[]
=> []
irb(main):008:0> s="foo"
=> "foo"
irb(main):009:0> 3.times {|i| a << (s << i.to_s)}
=> 3
irb(main):010:0> a
=> ["foo012", "foo012", "foo012"]

One possible fix would be something like:

string = "example"
1.upto(2) do
  puts string
end

In this trivial case, the solution is good enough, but the general case
requires more knowledge about how to optimize for the interpreter than
should be necessary. I’m probably missing something obvious, so if
anyone knows what it is, please point it out. Maybe MRI already does
this obvious thing.

It’s doing COW already internally.
http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/branches/ruby_1_9_2/string.c?revision=31809&view=markup

See, e.g., lines 1984ff., then the functions

rb_str_resize
rb_str_modify
str_independent
str_make_independent
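
For readers who do not want to dig through string.c, here is a toy model of the copy-on-write idea in plain Ruby. It is not MRI's implementation: `CowString` and all of its methods are made up for illustration, and only the name `make_independent` echoes the C functions listed above.

```ruby
# Toy copy-on-write string: copies share one buffer until one of them writes.
class CowString
  def initialize(text)
    @buffer = String.new(text) # private, mutable buffer
    @shared = false
  end

  # Object#dup copies instance variables shallowly, so the copy starts out
  # pointing at the same @buffer. Both sides are then flagged as shared.
  def initialize_copy(source)
    super
    source.mark_shared
    @shared = true
  end

  # Any write first makes sure this object owns its buffer.
  def <<(other)
    make_independent if @shared
    @buffer << other
    self
  end

  def to_s
    @buffer.dup
  end

  def shares_buffer_with?(other)
    @buffer.equal?(other.buffer)
  end

  protected

  attr_reader :buffer

  def mark_shared
    @shared = true
  end

  private

  # The actual copy happens here, and only when it is needed. The toy does
  # not count sharers, so the other side may copy once more than necessary.
  def make_independent
    @buffer = @buffer.dup
    @shared = false
  end
end

literal = CowString.new("foo")
copy = literal.dup
shared_before = copy.shares_buffer_with?(literal)
copy << "0"
shared_after = copy.shares_buffer_with?(literal)

puts "shared before write: #{shared_before}"
puts "shared after write:  #{shared_after}"
puts "#{literal} / #{copy}"
```

The dup is cheap because no character data is copied until the first write, but note that it still creates a second object.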

Kind regards

robert

jrun · July 26, 2011, 4:32pm

On 7/26/2011 02:28, Robert K. wrote:

Jeremy B. wrote in post #1013041:

Explaining the difference between symbols and strings got me wondering
why string literals aren’t treated similarly to symbols by the
interpreter. To be clear it seems that string literals could be
allocated 1 time each, where duplicates simply reference the first
created instance.

Actually this is what happens with COW underneath. Still you get
multiple instances which is correct and needed (see below).

Thanks for your response, Robert.

Yes, this makes perfect sense because a string literal is mutable like
any other string. Each string literal could be modified at any time, so
the safe and possibly only solution is to treat each one as if it will
be modified and give it its own instance.

The trick would be to make each reference to the literal a dup of the
singular, hidden instance. With the COW semantics that dup’d strings
have (or should have), this should be a more memory efficient way to
deal with programs that have many instances of a string literal.

This is what happens. Note, it’s not a dup of the reference but a dup
of the instance (i.e. like CONSTANT_STRING.dup was called).

Right. Thanks for clarifying. It’s good to know that the string data
itself isn’t being duplicated, even if it may typically be rare for
anything aside from small strings to be used repeatedly as literals.

1.upto(2) do
  string = "example"
  puts string
end

Unless I misunderstand the current implementation in MRI, a loop such as
the one above would needlessly create a new instance of “good” for every
pass through the loop.

You probably meant “example” instead of “good”.

Yes. There was a little of my earlier explanation message leaking
through. Late night emails…

It will behave like this

irb(main):007:0> a=[]
=> []
irb(main):008:0> s="foo"
=> "foo"
irb(main):009:0> 3.times {|i| a << (s << i.to_s)}
=> 3
irb(main):010:0> a
=> ["foo012", "foo012", "foo012"]

My actual suggestion, whether I conveyed it clearly or not, was to work
with dups. However, even dups require allocation, so we wouldn’t
save much at all in the end. I should have slept on my suggestion a bit
before sending it out.
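
A quick sketch of the two variants side by side, assuming a shared template string named `template` (a made-up name). The first loop appends to the shared instance directly, the second appends to a dup of it.

```ruby
# Variant 1: every iteration appends to the one shared instance.
template = "foo"
shared = []
3.times { |i| shared << (template << i.to_s) }

# Variant 2: every iteration appends to its own dup of the template.
template = "foo"
copied = []
3.times { |i| copied << (template.dup << i.to_s) }

p shared # ["foo012", "foo012", "foo012"]
p copied # ["foo0", "foo1", "foo2"]
```

With dups the result matches what the plain literal gives, but each dup is still one allocation per iteration, so nothing is saved on object count.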

In my example above, though, the dup is needless because the string is
not modified, and that was the point. That code could safely be trusted
to use the original string literal instance, but handling this in the
general case where the literal might be modified would probably be
non-trivial and maybe even impossible without some cheating by the
interpreter.

anyone knows what it is, please point it out. Maybe MRI already does
str_make_independent

Thanks for the pointer here. I’ll try to set aside some time to study
this code.

I’m trying to think of a way to allow the interpreter to cheat with
regard to string literals such that until the program attempts to modify
the literal or otherwise pierce the veil (calling object_id or similar)
the program actually uses the literal’s object instance directly. This
would allow the interpreter to avoid allocating extra string objects for
literals until they actually need to be modified or attributes such as
the object identity need to be known. I’m not sure this is possible or
would even yield any performance benefits. It’s more of a curiosity
now.

Thanks again for your help.

-Jeremy

jrun · July 26, 2011, 4:45pm

Jeremy B. wrote in post #1013121:

I’m trying to think of a way to allow the interpreter to cheat with
regard to string literals such that until the program attempts to modify
the literal or otherwise pierce the veil (calling object_id or similar)
the program actually uses the literal’s object instance directly. This
would allow the interpreter to avoid allocating extra string objects for
literals until they actually need to be modified or attributes such as
the object identity need to be known. I’m not sure this is possible or
would even yield any performance benefits. It’s more of a curiosity
now.

It’s probably not worth the effort because for a short script you won’t
notice and for other applications the user still can use a constant and
freeze the String. Plus, it will be tough for the interpreter to
reliably detect when a string literal is not changed. IMHO the
current solution is good enough.
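
The constant-plus-freeze approach might look like the sketch below. `GREETING` is a made-up name, and rescuing `RuntimeError` is a deliberate choice: newer Rubies raise `FrozenError`, which is a subclass of it.

```ruby
# One frozen instance, created once when the constant is defined.
GREETING = "example".freeze

# Reading it in a loop reuses that single object.
seen = []
3.times { seen << GREETING }
all_same = seen.all? { |s| s.equal?(GREETING) }

# Trying to modify it raises instead of silently changing shared state.
mutation_rejected =
  begin
    GREETING << "!"
    false
  rescue RuntimeError
    true
  end

puts "loop reused one object: #{all_same}"
puts "mutation was rejected:  #{mutation_rejected}"
```

This makes the read-only intent explicit in the program, instead of asking the interpreter to prove that a literal is never modified.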

Thanks again for your help.

You’re welcome!

Kind regards

robert

jrun · July 26, 2011, 6:18pm

On Tue, 26 Jul 2011, Kaye Ng wrote:

I had no idea Yukihiro M. is fluent in English!!!

His Texan is passable as well…

– Matt
It’s not what I know that counts.
It’s what I can remember in time to use.

jrun · July 26, 2011, 9:23pm

On 26 Ιουλ 2011, at 9:48 π.μ., Kaye Ng wrote:

I had no idea Yukihiro M. is fluent in English!!!

You can’t be serious…

–
Panagiotis A.

blog: http://www.convalesco.org

The wise man said: “Never argue with an idiot. They bring you down to
their level and beat you with experience.”