We have a hybrid representation that converts content lazily as needed.
The code that’s currently checked in is a basic implementation I coded
in a day before RailsConf, so it is not thoroughly tested and has a
bunch of bugs I already know about. I’m working on some improvements
right now.
Here’s the checkin comment that explains briefly how it works. Note that
some details are subject to change:
A new implementation for Ruby MutableString and Ruby regular expression
wrappers.
This is just the first pass, w/o optimizations and w/o encodings
(Default system encoding is used for all strings).
Many improvements and adjustments will come in the future, and some
hacks will be removed.
Basic architecture:
MutableString holds a Content and an Encoding. Content is an abstract
class that has three subclasses:
- StringContent
  [...] string. This is the default representation for strings coming
  from CLR methods and for Ruby string literals.
  [...] content representation will cause implicit conversion of the
  representation to StringBuilderContent.
  [...] BinaryContent using the Encoding stored on the owning
  MutableString.
- StringBuilderContent
  [...] Unicode string.
  [...] BinaryContent representation.
  [...] unnecessary copying), we may consider replacing it with a
  resizable char[].
- BinaryContent
  [...] StringBuilderContent representation.
  [...] very well. We should replace it with a resizable byte[].
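To make the architecture above more concrete, here is a minimal sketch
in Python (not the actual IronRuby C# code). The three class names come
from the checkin comment; everything else (the append and set_byte
methods, the switches counter, the use of Python str, list and bytearray
as stand-ins for the .NET types) is an assumption made for illustration.

```python
class StringContent:
    """Immutable text; stands in for an immutable .NET string."""
    def __init__(self, text: str):
        self.text = text


class StringBuilderContent:
    """Mutable text; stands in for a StringBuilder."""
    def __init__(self, text: str):
        self.chars = list(text)


class BinaryContent:
    """Mutable raw bytes."""
    def __init__(self, data: bytes):
        self.data = bytearray(data)


class MutableString:
    """Hypothetical owner of a content object plus an encoding."""
    def __init__(self, text: str, encoding: str = "utf-8"):
        self.encoding = encoding
        self.content = StringContent(text)  # default representation
        self.switches = 0                   # representation changes so far

    def _to_text(self) -> str:
        c = self.content
        if isinstance(c, StringContent):
            return c.text
        if isinstance(c, StringBuilderContent):
            return "".join(c.chars)
        return c.data.decode(self.encoding)

    def _as_builder(self) -> StringBuilderContent:
        # Convert lazily, only when a textual mutation needs it.
        if not isinstance(self.content, StringBuilderContent):
            self.content = StringBuilderContent(self._to_text())
            self.switches += 1
        return self.content

    def _as_binary(self) -> BinaryContent:
        # Convert lazily, using the encoding stored on the owner.
        if not isinstance(self.content, BinaryContent):
            data = self._to_text().encode(self.encoding)
            self.content = BinaryContent(data)
            self.switches += 1
        return self.content

    def append(self, text: str) -> "MutableString":
        """A textual mutation (assumed example operation)."""
        self._as_builder().chars.extend(text)
        return self

    def set_byte(self, index: int, value: int) -> "MutableString":
        """A binary mutation (assumed example operation)."""
        self._as_binary().data[index] = value
        return self

    def __str__(self) -> str:
        return self._to_text()
```

In this sketch a string starts as StringContent, becomes
StringBuilderContent on the first append, and becomes BinaryContent on
the first set_byte. Each change costs a full copy of the content.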
The content representation changes based on the operations that are
performed on the mutable string. There is currently no limit on the
number of content type switches, so if one alternates binary and textual
operations, a conversion will take place for each one of them. Although
this shouldn’t be a common case, we may consider adding some counters
and keeping the representation binary or textual based on their values.
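One possible reading of the counter idea, sketched in Python. Nothing
like this exists in the checked-in code as far as this message says; the
SwitchPolicy class, its max_switches budget and the "pin to the more
frequently used kind" rule are all assumptions, just one way such
counters could work.

```python
class SwitchPolicy:
    """Hypothetical policy that limits representation switches."""

    def __init__(self, max_switches: int = 4):
        self.max_switches = max_switches
        self.switches = 0
        self.ops = {"text": 0, "binary": 0}  # operations seen per kind
        self.pinned = None                   # set once the budget runs out

    def next_representation(self, op_kind: str, current: str) -> str:
        """Return the representation to use for an operation of op_kind."""
        self.ops[op_kind] += 1
        if self.pinned is not None:
            return self.pinned
        if op_kind == current:
            return current
        if self.switches >= self.max_switches:
            # Budget exhausted: keep whichever kind has been used more.
            self.pinned = max(self.ops, key=self.ops.get)
            return self.pinned
        self.switches += 1
        return op_kind
```

Once the representation is pinned, an operation of the other kind would
have to work on a temporary converted copy instead of switching the
stored content, which is the trade-off such counters would be tuning.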
The design assumes that the operations implemented by library methods
are of two kinds, textual and binary, and that data once treated as
text is not usually treated as raw binary data later. Any text in the
IronRuby runtime is represented as a sequence of 16-bit Unicode
characters (the standard .NET representation). Any binary data treated
as text is converted to this representation, regardless of the encoding
used for its stored representation in the file. The encoding is
remembered in the MutableString instance and the original representation
can always be recreated. Not all Unicode characters fit into 16 bits,
therefore some exotic ones are represented by a pair of characters
(surrogates). If there is such a character in the string, some
operations (e.g. indexing) might not be precise anymore - the n-th item
in the char[] isn’t the n-th Unicode character in the string. We believe
this imprecision is not a real-world issue and is worth the performance
gain and implementation simplicity.
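The indexing imprecision can be seen in a small Python example that
mimics the 16-bit representation by encoding to UTF-16. The helper name
utf16_units and the sample character are chosen just for illustration.

```python
import struct


def utf16_units(text: str) -> list:
    """Return the 16-bit code units of text, like a .NET char[]."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the 16-bit range), 'b'
text = "a\U0001D11Eb"
units = utf16_units(text)

print(len(text))       # number of Unicode characters
print(len(units))      # number of 16-bit units
print(hex(units[2]))   # second half of the surrogate pair, not 'b'
```

The string has three Unicode characters but four 16-bit units, so the
item at index 2 of the unit array is the low surrogate 0xDD1E rather
than the third character 'b'.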
Tomas