Forum: Ruby Re: Question of reference and (sub)strings.

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
1c1e3bdfe006a22214102fcd6434a012?d=identicon&s=25 Daniel Sheppard (Guest)
on 2005-12-15 06:52
(Received via mailing list)
> Given that I only want to compute the offsets once, an
> obvious solution
> would be to construct an Array of String - each element
> representing a
> sub-string of the original... but this would double memory use.  What
> would be the best way to avoid duplicating the character
> sequences and
> causing run-time bloat?

I might be wrong - but I'm pretty sure that substrings in ruby are
created with copy-on-write. That is, when you take a substring, a new
block of memory isn't allocated to the new String, it references the
same block of memory as the original string - the allocation of a new
block of memory only occurs when one of the strings is modified.


#####################################################################################
This email has been scanned by MailMarshal, an email content filter.
#####################################################################################
5befe95e6648daec3dd5728cd36602d0?d=identicon&s=25 Robert Klemme (Guest)
on 2005-12-15 11:23
(Received via mailing list)
Daniel Sheppard wrote:
>> Given that I only want to compute the offsets once, an
>> obvious solution
>> would be to construct an Array of String - each element
>> representing a
>> sub-string of the original... but this would double memory use.  What
>> would be the best way to avoid duplicating the character
>> sequences and
>> causing run-time bloat?
>
> I might be wrong

You're not.

> - but I'm pretty sure that substrings in ruby are
> created with copy-on-write. That is, when you take a substring, a new
> block of memory isn't allocated to the new String, it references the
> same block of memory as the original string - the allocation of a new
> block of memory only occurs when one of the strings is modified.

Exactly.  It seems this would be the simplest solution.

    robert
6f37c32b62670cb9a5ac0330cb27269a?d=identicon&s=25 Steve [RubyTalk] (Guest)
on 2005-12-15 16:48
(Received via mailing list)
Robert Klemme wrote:
> Exactly.  It seems this would be the simplest solution.
>
>
That sounds like absolutely great news - a _very_ pleasant surprise. I
couldn't have hoped for anything better. (Thanks!)

I assume that if I do something like:

    # Assume offsets is a pre-computed array of positive integer
positions into the String originalstr.
    # with offsets[0]==0 and offsets[-1]==@originalstr.size
    @fields=Array.new (offsets.size-1)
    for i in 1..(offsets.size) do
      # I assume this next line is what is meant by a Ruby sub-string?
      @fields[i-1]=@originalstr[offsets[i-1]..offsets[i]]
    end

... and, assuming that @fields is exposed only as a read-only attribute,
that I can assume the memory it consumes to be independent of the length
of originalstr and dependent only upon numfields?

While I've no reason to doubt this confirmed answer, by any chance can
someone suggest a good way to demonstrate that this is the case without
resorting to either using very large strings and looking at VM usage of
the interpreter process... or resorting to reviewing the source to
Ruby's implementation?
5befe95e6648daec3dd5728cd36602d0?d=identicon&s=25 Robert Klemme (Guest)
on 2005-12-15 17:18
(Received via mailing list)
Steve [RubyTalk] wrote:
>>> strings is modified.
> positions into the String originalstr.
Care to unveil a bit of the nature of the computation that yields those
indexes?

>     # with offsets[0]==0 and offsets[-1]==@originalstr.size
>     @fields=Array.new (offsets.size-1)
>     for i in 1..(offsets.size) do
>       # I assume this next line is what is meant by a Ruby sub-string?
>       @fields[i-1]=@originalstr[offsets[i-1]..offsets[i]]
>     end
>
> .. and, assuming that @fields is exposed only as a read-only
> attribute, that I can assume the memory it consumes to be independent
> of the length of originalstr and dependent only upon numfields?

You can help keeping this read only be freezing all strings involved.

> While I've no reason to doubt this confirmed answer, by any chance can
> someone suggest a good way to demonstrate that this is the case
> without resorting to either using very large strings and looking at
> VM usage of the interpreter process... or resorting to reviewing the
> source to Ruby's implementation?

The only additional method of verification that comes to mind is to ask
Matz. :-)

Kind regards

    robert
6f37c32b62670cb9a5ac0330cb27269a?d=identicon&s=25 Steve [RubyTalk] (Guest)
on 2005-12-15 18:52
(Received via mailing list)
Robert Klemme wrote:
>>     # Assume offsets is a pre-computed array of positive integer
>> positions into the String originalstr.
>>
> Care to unveil a bit of the nature of the computation that yields those
> indexes?
>
It's not really relevant to the question I was asking - but I've no
problem saying more.  I've a domain specific (order-preserving and
extensible) 'type-system' which is imposed over otherwise opaque data
structures.  Given an instance of a 'type-signature' and a pointer, it
is possible to determine the number of bytes which represent each 'typed
value' - and (significantly) list construction drops out as being the
concatenation of the value representations and type-signatures.  The
type signatures range in complexity from the simplest constant 'N-bytes
interpreted as a natural number' through sentinel encodings (Null
terminated strings on steroids) and (in principle - if not frequently in
practice) arbitrary computation ranging over named integer values
occurring 'earlier' in the list.
At the moment I'm toying with the idea that I can memory-map the values
(using a C-implemented module) and do the computations on the mapped
values in Ruby - having presented opaque values and 'type-signatures' as
String objects to Ruby.  I expect that typical computations may involve
matching regular expressions; doing arithmetic; computing various hashes
and summations etc.  At the moment I'm concentrating on establishing if
Ruby is a suitable tool for the task at hand.
>>
> You can help keeping this read only be freezing all strings involved.
>
Yes - that sounds a good idea to me.
>> While I've no reason to doubt this confirmed answer, by any chance can
>> someone suggest a good way to demonstrate that this is the case
>> without resorting to either using very large strings and looking at
>> VM usage of the interpreter process... or resorting to reviewing the
>> source to Ruby's implementation?
>>
> The only additional method of verification that comes to mind is to ask
> Matz. :-)
>
Hmmm - a lack of profiling tools might prove something of a stumbling
block... I'll need to have a careful think about that.  Rather than
wanting to check up on fellow Rubyists, I really want to periodically
check that I make no invalid assumptions as I work forwards from this
basis towards an implementation.  I don't want to find out only after I
think I've finished that a resource leak or extravagant resource demands
will require a re-write before the software can be used against real
data.

Steve
A4327ebc799afb6896c5b224d76f9b9e?d=identicon&s=25 Ron Jeffries (Guest)
on 2005-12-16 00:20
(Received via mailing list)
On Thu, 15 Dec 2005 14:50:59 +0900, "Daniel Sheppard"
<daniels@pronto.com.au>
wrote:

>I might be wrong - but I'm pretty sure that substrings in ruby are
>created with copy-on-write. That is, when you take a substring, a new
>block of memory isn't allocated to the new String, it references the
>same block of memory as the original string - the allocation of a new
>block of memory only occurs when one of the strings is modified.

I asked about this a week or two ago, as part of my Extended Set Theory
thing,
and the impression I got from the answers then was that sub /arrays/ are
copy on
write but that sub /strings/ were not.

I wonder how one might confirm this, other than by asking matz or
reading the
compiler and libraries ...
5befe95e6648daec3dd5728cd36602d0?d=identicon&s=25 Robert Klemme (Guest)
on 2005-12-16 09:58
(Received via mailing list)
Steve [RubyTalk] wrote:
> structures.  Given an instance of a 'type-signature' and a pointer, it
> values (using a C-implemented module) and do the computations on the
> mapped values in Ruby - having presented opaque values and
> 'type-signatures' as String objects to Ruby.  I expect that typical
> computations may involve matching regular expressions; doing
> arithmetic; computing various hashes and summations etc.  At the
> moment I'm concentrating on establishing if Ruby is a suitable tool
> for the task at hand.

Certainly.  I don't know now complex your type calculations are.  I'd
probably do something along the line of:

class SpecialData

  def initialize(s)
    @data = s.frozen? ? s : s.dup.freeze
    @meta_data = calculate_meta(@data)
  end

  def size() @meta_data.size end

  def get(field_index)
    @meta_data[field_index].extract(@data)
  end

private

  def calculate_meta(data)
    md = []
    # your complex calculation here, sample:
    md << StringMeta.new(0..-1)
    md
  end

  class BaseMeta
    attr_accessor :range
    def initialize(range) @range = range end
    def base_extract(str) str[self.range].freeze end
  end

  class StringMeta < BaseMeta
    alias :extract :base_extract
  end

  class IntMeta < BaseMeta
    def extract(s) base_extract(s).unpack("i")[0] end
  end
end


>>> numfields?
>> The only additional method of verification that comes to mind is to
>> ask Matz. :-)
>>
> Hmmm - a lack of profiling tools might prove something of a stumbling
> block... I'll need to have a careful think about that.  Rather than
> wanting to check up on fellow Rubyists, I really want to periodically
> check that I make no invalid assumptions as I work forwards from this
> basis towards an implementation.  I don't want to find out only after
> I think I've finished that a resource leak or extravagant resource
> demands will require a re-write before the software can be used
> against real data.

Then just look at the sources.

Kind regards

    robert
6f37c32b62670cb9a5ac0330cb27269a?d=identicon&s=25 Steve [RubyTalk] (Guest)
on 2005-12-16 11:10
(Received via mailing list)
Ron Jeffries wrote:
>> I might be wrong - but I'm pretty sure that substrings in ruby are
>> created with copy-on-write. That is, when you take a substring, a new
>> block of memory isn't allocated to the new String, it references the
>> same block of memory as the original string - the allocation of a new
>> block of memory only occurs when one of the strings is modified.
>>
> I asked about this a week or two ago, as part of my Extended Set Theory thing,
> and the impression I got from the answers then was that sub /arrays/ are copy on
> write but that sub /strings/ were not.
>
Ah-ha... Hmmm - that suggestion "puts the cat among the pigeons" - as
far as I can tell, I can't have an array of bytes represented in
contiguous memory... (arrays are always arrays of arbitrarily typed
objects - aren't they?) - so if copy on write is limited to arrays, it
is back to the drawing board for me.  I think arrays of objects would
necessarily be copy on write as a consequence of the Ruby model of
references... Conversely String values are something entirely different
- while substrings could be done  'copy on write' - it would necessarily
use an entirely separate mechanism.  It is obvious that whole strings
behave as if they are "copy on write."
> I wonder how one might confirm this, other than by asking matz or reading the
> compiler and libraries ...
I'm coming towards the opinion that I will need to "ask Matz" - though I
hope to have exhausted all the other avenues before I resort to bugging
him in person.  I realise that I could (in principle) read the source
code for the ruby interpreter - but, ideally, I'd also like to know how
it is intended to work (in order that I can be confident that the
details will not change in a future release.)  I also presume it would
take me quite a while to get to grips with the interpreter source.
Dd76a12d66f843de5c5f8782668e7127?d=identicon&s=25 Mauricio Fernandez (Guest)
on 2005-12-16 11:34
(Received via mailing list)
On Fri, Dec 16, 2005 at 08:17:40AM +0900, Ron Jeffries wrote:
> and the impression I got from the answers then was that sub /arrays/ are copy on
> write but that sub /strings/ were not.
>
> I wonder how one might confirm this, other than by asking matz or reading the
> compiler and libraries ...

Just the following should do:

-----------------------------------------------------------------------------
VALUE
rb_str_substr(str, beg, len)
    VALUE str;
    long beg, len;
{
    VALUE str2;

    if (len < 0) return Qnil;
    if (beg > RSTRING(str)->len) return Qnil;
    if (beg < 0) {
	beg += RSTRING(str)->len;
	if (beg < 0) return Qnil;
    }
    if (beg + len > RSTRING(str)->len) {
	len = RSTRING(str)->len - beg;
    }
    if (len < 0) {
	len = 0;
    }
    if (len == 0) {
	str2 = rb_str_new5(str,0,0);
    }
    else if (len > sizeof(struct RString)/2 &&
	beg + len == RSTRING(str)->len && !FL_TEST(str, STR_ASSOC)) {
/*      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*/
	str2 = rb_str_new3(rb_str_new4(str));
	RSTRING(str2)->ptr += RSTRING(str2)->len - len;
	RSTRING(str2)->len = len;
    }
    else {
	str2 = rb_str_new5(str, RSTRING(str)->ptr+beg, len);
    }
    OBJ_INFECT(str2, str);

    return str2;
}
-----------------------------------------------------------------------------



... (optionally) ...




-----------------------------------------------------------------------------
static VALUE
str_new3(klass, str)
    VALUE klass, str;
{
    VALUE str2 = str_alloc(klass);

    RSTRING(str2)->len = RSTRING(str)->len;
    RSTRING(str2)->ptr = RSTRING(str)->ptr;
    RSTRING(str2)->aux.shared = str;
    FL_SET(str2, ELTS_SHARED);
    OBJ_INFECT(str2, str);

    return str2;
}

VALUE
rb_str_new3(str)
    VALUE str;
{
    return str_new3(rb_obj_class(str), str);
}

static VALUE
str_new4(klass, str)
    VALUE klass, str;
{
    VALUE str2 = str_alloc(klass);

    RSTRING(str2)->len = RSTRING(str)->len;
    RSTRING(str2)->ptr = RSTRING(str)->ptr;
    if (FL_TEST(str, ELTS_SHARED)) {
	FL_SET(str2, ELTS_SHARED);
	RSTRING(str2)->aux.shared = RSTRING(str)->aux.shared;
    }
    else {
	FL_SET(str, ELTS_SHARED);
	RSTRING(str)->aux.shared = str2;
    }

    return str2;
}

VALUE
rb_str_new4(orig)
    VALUE orig;
{
    VALUE klass, str;

    if (OBJ_FROZEN(orig)) return orig;
    klass = rb_obj_class(orig);
    if (FL_TEST(orig, ELTS_SHARED) && (str = RSTRING(orig)->aux.shared)
&& klass == RBASIC(str)->klass) {
	long ofs;
	ofs = RSTRING(str)->len - RSTRING(orig)->len;
	if (ofs > 0) {
	    str = str_new3(klass, str);
	    RSTRING(str)->ptr += ofs;
	    RSTRING(str)->len -= ofs;
	}
    }
    else if (FL_TEST(orig, STR_ASSOC)) {
	str = str_new(klass, RSTRING(orig)->ptr, RSTRING(orig)->len);
    }
    else {
	str = str_new4(klass, orig);
    }
    OBJ_INFECT(str, orig);
    OBJ_FREEZE(str);
    return str;
}

VALUE
rb_str_new5(obj, ptr, len)
    VALUE obj;
    const char *ptr;
    long len;
{
    return str_new(rb_obj_class(obj), ptr, len);
}
A4327ebc799afb6896c5b224d76f9b9e?d=identicon&s=25 Ron Jeffries (Guest)
on 2005-12-16 19:49
(Received via mailing list)
On Fri, 16 Dec 2005 19:33:16 +0900, Mauricio Fernandez <mfp@acm.org>
wrote:

>ust the following should do:
>
>-----------------------------------------------------------------------------
>VALUE
>rb_str_substr(str, beg, len)
>    VALUE str;
>    long beg, len;
>{
>    VALUE str2;
> ... and so on ...
....

Easy for you to say, Maruicio ... but whanging a little C code in there
is out
of my current scope. ;->

I'd just like to know what's really going on, and am hoping, unfairly,
that
someone out there knows or will figure it out.

Thanks,
A4327ebc799afb6896c5b224d76f9b9e?d=identicon&s=25 Ron Jeffries (Guest)
on 2005-12-16 19:55
(Received via mailing list)
On Fri, 16 Dec 2005 19:09:17 +0900, "Steve [RubyTalk]"
<steve_rubytalk@shic.co.uk> wrote:

>I'm coming towards the opinion that I will need to "ask Matz" - though I
>hope to have exhausted all the other avenues before I resort to bugging
>him in person.

I'm with you ... when you find out, please be sure to let me know!
This topic is locked and can not be replied to.