Imagine comparing using C++ and Ruby to manipulate some strings. I would like to confirm some assumptions I am making about in-memory efficiency with Ruby. (This explains why the example seems contrived and I don't want to be told to tackle the problem a different way, such as splitting up the input as it is read from file/sockets etc... :-) ) Assume I've got a large in-memory sequence of bytes (this would be represented in C(++) using a malloc block) and in ruby as a String. I have a pre-defined (non-trivial, potentially computationally expensive) function which calculates a sequence of offsets into the String subject to some arbitrary criteria... and subsequently I wish to reference sub-strings (i.e. strings between two successive offsets) as if they were independent of the original string (though, of course, each having a fixed length.) N.B. This could be done 'cheaply' using pointers into the original string if using C/C++. Given that I only want to compute the offsets once, an obvious solution would be to construct an Array of String - each element representing a sub-string of the original... but this would double memory use. What would be the best way to avoid duplicating the character sequences and causing run-time bloat? By corollary, if I had a large number of Strings, what would be the most memory efficient way to represent their concatenation? If I had n mK stings, would I need another n*mK contiguous block of memory to represent their concatenation?
on 2005-12-14 16:33
on 2005-12-15 06:34
On 12/14/05, Steve [RubyTalk] <firstname.lastname@example.org> wrote: > > Given that I only want to compute the offsets once, an obvious solution > would be to construct an Array of String - each element representing a > sub-string of the original... but this would double memory use. What > would be the best way to avoid duplicating the character sequences and > causing run-time bloat? You don't need to create the substrings until you need them: Here's a test which only allocates memory for the substrings as needed. As far as I can tell, the substrings are released back to the GC as soon as you are done with them. class IndexedString < String def initialize s, &indexer super s @offsets = yield self self end def substring n (0...(@offsets.size-1)) === n ? self[@offsets[n]...@offsets[n+1]] : nil end end def build_index s indexes =  while (n=s.index('.',indexes.last+1)) do indexes << n+1 end indexes << s.length+1 end s = IndexedString.new( (("TESTINGTESTING!!"*1024)+".")*100) do |str| build_index(str) end #= +1600K s.substring 10; #+16K s.substring 95; #+16K > By corollary, if I had a large number of Strings, what would be the most > memory efficient way to represent their concatenation? If I had n mK > stings, would I need another n*mK contiguous block of memory to > represent their concatenation? Not necessarily: You can build a class which takes an array of strings and returns "superstrings" only as needed. Here are my passing test cases, but the class still needs a little work before I'll post it: def test_MegaString small_strings = %w( this is a collection of small strings) mega = MegaString.new(small_strings) assert_equal("t",mega) assert_equal("this",mega[0,4]) assert_equal("hisisaco",mega[1,8]) assert_equal("ollect",mega[8,6]) assert_equal("a", mega[6...7]) assert_equal("col", mega[7..10]) assert_equal("lstrings", mega[23,-1]) assert_equal("lstrings", mega[23,230]) assert_equal("thisisacollectionofsmallstrings", mega.to_s) end -Adam