Re: Question of reference and (sub)strings


#1

Given that I only want to compute the offsets once, an
obvious solution
would be to construct an Array of String - each element
representing a
sub-string of the original… but this would double memory use. What
would be the best way to avoid duplicating the character
sequences and
causing run-time bloat?

I might be wrong - but I’m pretty sure that substrings in ruby are
created with copy-on-write. That is, when you take a substring, a new
block of memory isn’t allocated to the new String, it references the
same block of memory as the original string - the allocation of a new
block of memory only occurs when one of the strings is modified.

#####################################################################################
This email has been scanned by MailMarshal, an email content filter.
#####################################################################################


#2

Daniel S. wrote:

Given that I only want to compute the offsets once, an
obvious solution
would be to construct an Array of String - each element
representing a
sub-string of the original… but this would double memory use. What
would be the best way to avoid duplicating the character
sequences and
causing run-time bloat?

I might be wrong

You’re not.

  • but I’m pretty sure that substrings in ruby are
    created with copy-on-write. That is, when you take a substring, a new
    block of memory isn’t allocated to the new String, it references the
    same block of memory as the original string - the allocation of a new
    block of memory only occurs when one of the strings is modified.

Exactly. It seems this would be the simplest solution.

robert

#3

Robert K. wrote:

Exactly. It seems this would be the simplest solution.

That sounds like absolutely great news - a very pleasant surprise. I
couldn’t have hoped for anything better. (Thanks!)

I assume that if I do something like:

# Assume offsets is a pre-computed array of positive integer

positions into the String originalstr.
# with offsets[0]==0 and offsets[-1]==@originalstr.size
@fields=Array.new (offsets.size-1)
for i in 1…(offsets.size) do
# I assume this next line is what is meant by a Ruby sub-string?
@fields[i-1]=@originalstr[offsets[i-1]…offsets[i]]
end

… and, assuming that @fields is exposed only as a read-only attribute,
that I can assume the memory it consumes to be independent of the length
of originalstr and dependent only upon numfields?

While I’ve no reason to doubt this confirmed answer, by any chance can
someone suggest a good way to demonstrate that this is the case without
resorting to either using very large strings and looking at VM usage of
the interpreter process… or resorting to reviewing the source to
Ruby’s implementation?


#4

Steve [RubyTalk] wrote:

strings is modified.
positions into the String originalstr.
Care to unveil a bit of the nature of the computation that yields those
indexes?

# with offsets[0]==0 and offsets[-1]==@originalstr.size
@fields=Array.new (offsets.size-1)
for i in 1..(offsets.size) do
  # I assume this next line is what is meant by a Ruby sub-string?
  @fields[i-1]=@originalstr[offsets[i-1]..offsets[i]]
end

… and, assuming that @fields is exposed only as a read-only
attribute, that I can assume the memory it consumes to be independent
of the length of originalstr and dependent only upon numfields?

You can help keeping this read only be freezing all strings involved.

While I’ve no reason to doubt this confirmed answer, by any chance can
someone suggest a good way to demonstrate that this is the case
without resorting to either using very large strings and looking at
VM usage of the interpreter process… or resorting to reviewing the
source to Ruby’s implementation?

The only additional method of verification that comes to mind is to ask
Matz. :slight_smile:

Kind regards

robert

#5

On Thu, 15 Dec 2005 14:50:59 +0900, “Daniel S.”
removed_email_address@domain.invalid
wrote:

I might be wrong - but I’m pretty sure that substrings in ruby are
created with copy-on-write. That is, when you take a substring, a new
block of memory isn’t allocated to the new String, it references the
same block of memory as the original string - the allocation of a new
block of memory only occurs when one of the strings is modified.

I asked about this a week or two ago, as part of my Extended Set Theory
thing,
and the impression I got from the answers then was that sub /arrays/ are
copy on
write but that sub /strings/ were not.

I wonder how one might confirm this, other than by asking matz or
reading the
compiler and libraries …


#6

Robert K. wrote:

# Assume offsets is a pre-computed array of positive integer

positions into the String originalstr.

Care to unveil a bit of the nature of the computation that yields those
indexes?

It’s not really relevant to the question I was asking - but I’ve no
problem saying more. I’ve a domain specific (order-preserving and
extensible) ‘type-system’ which is imposed over otherwise opaque data
structures. Given an instance of a ‘type-signature’ and a pointer, it
is possible to determine the number of bytes which represent each ‘typed
value’ - and (significantly) list construction drops out as being the
concatenation of the value representations and type-signatures. The
type signatures range in complexity from the simplest constant ‘N-bytes
interpreted as a natural number’ through sentinel encodings (Null
terminated strings on steroids) and (in principle - if not frequently in
practice) arbitrary computation ranging over named integer values
occurring ‘earlier’ in the list.
At the moment I’m toying with the idea that I can memory-map the values
(using a C-implemented module) and do the computations on the mapped
values in Ruby - having presented opaque values and ‘type-signatures’ as
String objects to Ruby. I expect that typical computations may involve
matching regular expressions; doing arithmetic; computing various hashes
and summations etc. At the moment I’m concentrating on establishing if
Ruby is a suitable tool for the task at hand.

You can help keeping this read only be freezing all strings involved.

Yes - that sounds a good idea to me.

While I’ve no reason to doubt this confirmed answer, by any chance can
someone suggest a good way to demonstrate that this is the case
without resorting to either using very large strings and looking at
VM usage of the interpreter process… or resorting to reviewing the
source to Ruby’s implementation?

The only additional method of verification that comes to mind is to ask
Matz. :slight_smile:

Hmmm - a lack of profiling tools might prove something of a stumbling
block… I’ll need to have a careful think about that. Rather than
wanting to check up on fellow Rubyists, I really want to periodically
check that I make no invalid assumptions as I work forwards from this
basis towards an implementation. I don’t want to find out only after I
think I’ve finished that a resource leak or extravagant resource demands
will require a re-write before the software can be used against real
data.

Steve


#7

Steve [RubyTalk] wrote:

structures. Given an instance of a ‘type-signature’ and a pointer, it
values (using a C-implemented module) and do the computations on the
mapped values in Ruby - having presented opaque values and
‘type-signatures’ as String objects to Ruby. I expect that typical
computations may involve matching regular expressions; doing
arithmetic; computing various hashes and summations etc. At the
moment I’m concentrating on establishing if Ruby is a suitable tool
for the task at hand.

Certainly. I don’t know now complex your type calculations are. I’d
probably do something along the line of:

class SpecialData

def initialize(s)
@data = s.frozen? ? s : s.dup.freeze
@meta_data = calculate_meta(@data)
end

def size() @meta_data.size end

def get(field_index)
@meta_data[field_index].extract(@data)
end

private

def calculate_meta(data)
md = []
# your complex calculation here, sample:
md << StringMeta.new(0…-1)
md
end

class BaseMeta
attr_accessor :range
def initialize(range) @range = range end
def base_extract(str) str[self.range].freeze end
end

class StringMeta < BaseMeta
alias :extract :base_extract
end

class IntMeta < BaseMeta
def extract(s) base_extract(s).unpack(“i”)[0] end
end
end

numfields?
The only additional method of verification that comes to mind is to
ask Matz. :slight_smile:

Hmmm - a lack of profiling tools might prove something of a stumbling
block… I’ll need to have a careful think about that. Rather than
wanting to check up on fellow Rubyists, I really want to periodically
check that I make no invalid assumptions as I work forwards from this
basis towards an implementation. I don’t want to find out only after
I think I’ve finished that a resource leak or extravagant resource
demands will require a re-write before the software can be used
against real data.

Then just look at the sources.

Kind regards

robert

#8

Ron J. wrote:

I might be wrong - but I’m pretty sure that substrings in ruby are
created with copy-on-write. That is, when you take a substring, a new
block of memory isn’t allocated to the new String, it references the
same block of memory as the original string - the allocation of a new
block of memory only occurs when one of the strings is modified.

I asked about this a week or two ago, as part of my Extended Set Theory thing,
and the impression I got from the answers then was that sub /arrays/ are copy on
write but that sub /strings/ were not.

Ah-ha… Hmmm - that suggestion “puts the cat among the pigeons” - as
far as I can tell, I can’t have an array of bytes represented in
contiguous memory… (arrays are always arrays of arbitrarily typed
objects - aren’t they?) - so if copy on write is limited to arrays, it
is back to the drawing board for me. I think arrays of objects would
necessarily be copy on write as a consequence of the Ruby model of
references… Conversely String values are something entirely different

  • while substrings could be done ‘copy on write’ - it would necessarily
    use an entirely separate mechanism. It is obvious that whole strings
    behave as if they are “copy on write.”

I wonder how one might confirm this, other than by asking matz or reading the
compiler and libraries …
I’m coming towards the opinion that I will need to “ask Matz” - though I
hope to have exhausted all the other avenues before I resort to bugging
him in person. I realise that I could (in principle) read the source
code for the ruby interpreter - but, ideally, I’d also like to know how
it is intended to work (in order that I can be confident that the
details will not change in a future release.) I also presume it would
take me quite a while to get to grips with the interpreter source.


#9

On Fri, Dec 16, 2005 at 08:17:40AM +0900, Ron J. wrote:

and the impression I got from the answers then was that sub /arrays/ are copy on
write but that sub /strings/ were not.

I wonder how one might confirm this, other than by asking matz or reading the
compiler and libraries …

Just the following should do:


VALUE
rb_str_substr(str, beg, len)
VALUE str;
long beg, len;
{
VALUE str2;

if (len < 0) return Qnil;
if (beg > RSTRING(str)->len) return Qnil;
if (beg < 0) {
beg += RSTRING(str)->len;
if (beg < 0) return Qnil;
}
if (beg + len > RSTRING(str)->len) {
len = RSTRING(str)->len - beg;
}
if (len < 0) {
len = 0;
}
if (len == 0) {
str2 = rb_str_new5(str,0,0);
}
else if (len > sizeof(struct RString)/2 &&
beg + len == RSTRING(str)->len && !FL_TEST(str, STR_ASSOC)) {

/* ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*/
str2 = rb_str_new3(rb_str_new4(str));
RSTRING(str2)->ptr += RSTRING(str2)->len - len;
RSTRING(str2)->len = len;
}
else {
str2 = rb_str_new5(str, RSTRING(str)->ptr+beg, len);
}
OBJ_INFECT(str2, str);

return str2;

}

… (optionally) …


static VALUE
str_new3(klass, str)
VALUE klass, str;
{
VALUE str2 = str_alloc(klass);

RSTRING(str2)->len = RSTRING(str)->len;
RSTRING(str2)->ptr = RSTRING(str)->ptr;
RSTRING(str2)->aux.shared = str;
FL_SET(str2, ELTS_SHARED);
OBJ_INFECT(str2, str);

return str2;

}

VALUE
rb_str_new3(str)
VALUE str;
{
return str_new3(rb_obj_class(str), str);
}

static VALUE
str_new4(klass, str)
VALUE klass, str;
{
VALUE str2 = str_alloc(klass);

RSTRING(str2)->len = RSTRING(str)->len;
RSTRING(str2)->ptr = RSTRING(str)->ptr;
if (FL_TEST(str, ELTS_SHARED)) {
FL_SET(str2, ELTS_SHARED);
RSTRING(str2)->aux.shared = RSTRING(str)->aux.shared;
}
else {
FL_SET(str, ELTS_SHARED);
RSTRING(str)->aux.shared = str2;
}

return str2;

}

VALUE
rb_str_new4(orig)
VALUE orig;
{
VALUE klass, str;

if (OBJ_FROZEN(orig)) return orig;
klass = rb_obj_class(orig);
if (FL_TEST(orig, ELTS_SHARED) && (str = RSTRING(orig)->aux.shared) 

&& klass == RBASIC(str)->klass) {
long ofs;
ofs = RSTRING(str)->len - RSTRING(orig)->len;
if (ofs > 0) {
str = str_new3(klass, str);
RSTRING(str)->ptr += ofs;
RSTRING(str)->len -= ofs;
}
}
else if (FL_TEST(orig, STR_ASSOC)) {
str = str_new(klass, RSTRING(orig)->ptr, RSTRING(orig)->len);
}
else {
str = str_new4(klass, orig);
}
OBJ_INFECT(str, orig);
OBJ_FREEZE(str);
return str;
}

VALUE
rb_str_new5(obj, ptr, len)
VALUE obj;
const char *ptr;
long len;
{
return str_new(rb_obj_class(obj), ptr, len);
}


#10

On Fri, 16 Dec 2005 19:09:17 +0900, “Steve [RubyTalk]”
removed_email_address@domain.invalid wrote:

I’m coming towards the opinion that I will need to “ask Matz” - though I
hope to have exhausted all the other avenues before I resort to bugging
him in person.

I’m with you … when you find out, please be sure to let me know!


#11

On Fri, 16 Dec 2005 19:33:16 +0900, Mauricio F. removed_email_address@domain.invalid
wrote:

ust the following should do:


VALUE
rb_str_substr(str, beg, len)
VALUE str;
long beg, len;
{
VALUE str2;
… and so on …

Easy for you to say, Maruicio … but whanging a little C code in there
is out
of my current scope. ;->

I’d just like to know what’s really going on, and am hoping, unfairly,
that
someone out there knows or will figure it out.

Thanks,