Marshal formats

For certain Unicode operations I need to store and read a fairly large
amount of data. Currently I’m using Marshal.dump to generate the data,
but I’ve got the impression that the Marshal format differs not just
between Ruby versions, but also between platforms.

I’ve tried to use YAML and Ruby source files to store the data, but
this results in very large files and reading them takes forever.

I’ve also considered generating the data at install time, but that has
some practical problems for distribution.

Is there another marshal format I’ve missed or is there a better way to
do this?

Manfred Stienstra wrote:

For certain Unicode operations I need to store and read a fairly large
amount of data. Currently I’m using Marshal.dump to generate the data,
but I’ve got the impression that the Marshal format differs not just
between Ruby versions, but also between platforms.

Marshal format should be identical across platforms. (Otherwise, drb
wouldn’t be very useful.)

The marshal code has some degree of backwards compatibility:

$ ri Marshal | cat
--------------------------------------------------------- Class: Marshal
The marshaling library converts collections of Ruby objects into a
byte stream, allowing them to be stored outside the currently
active script. This data may subsequently be read and the original
objects reconstituted. Marshaled data has major and minor version
numbers stored along with the object information. In normal use,
marshaling can only load data written with the same major version
number and an equal or lower minor version number. If Ruby’s
``verbose'' flag is set (normally using -d, -v, -w, or --verbose)
the major and minor numbers must match exactly. Marshal versioning
is independent of Ruby’s version numbers. You can extract the
version by reading the first two bytes of marshaled data.

      str = Marshal.dump("thing")
      RUBY_VERSION   #=> "1.8.0"
      str[0]         #=> 4
      str[1]         #=> 8

Some objects cannot be dumped: if the objects to be dumped include
bindings, procedure or method objects, instances of class IO, or
singleton objects, a TypeError will be raised.

If your class has special serialization needs (for example, if you
want to serialize in some specific format), or if it contains
objects that would otherwise not be serializable, you can implement
your own serialization strategy by defining two methods, _dump and
_load:

The instance method _dump should return a String object containing
all the information necessary to reconstitute objects of this
class and all referenced objects up to a maximum depth given as an
integer parameter (a value of -1 implies that you should disable
depth checking). The class method _load should take a String and
return an object of this class.
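
For what it's worth, here is a minimal sketch of the _dump/_load pair
described in the ri text. The class name CodepointTable, its fields and
the sample mapping are made up for illustration and are not from the
thread; the only library calls used are Marshal.dump and Marshal.load.

```ruby
# Hypothetical class for illustration; not part of the original thread.
class CodepointTable
  attr_reader :name, :mapping

  def initialize(name, mapping)
    @name = name
    @mapping = mapping
    @cache = {} # derived state that does not need to be stored
  end

  # Instance method called by Marshal.dump. It must return a String.
  # depth is the remaining nesting limit (-1 means no limit), so it is
  # passed through to the nested dump.
  def _dump(depth)
    Marshal.dump([@name, @mapping], depth)
  end

  # Class method called by Marshal.load with the String from _dump.
  def self._load(data)
    name, mapping = Marshal.load(data)
    new(name, mapping)
  end
end

table = CodepointTable.new("upcase", { 0x61 => 0x41, 0x62 => 0x42 })
copy = Marshal.load(Marshal.dump(table))
```

Only the name and the mapping end up in the byte stream; the cache is
rebuilt empty by the constructor when the object is loaded again.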

Joel VanderWerf wrote:

Marshal format should be identical across platforms. (Otherwise, drb
wouldn’t be very useful.)

Oops, yes, I guess you’re right. Just remember to open the Marshal file
in binary mode on Windows (:
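
A rough sketch of what that could look like, assuming the data is a
plain Hash and using a made-up file name in the temp directory. The
"wb" and "rb" modes keep Windows from translating line endings in the
byte stream. The version check reads the first two bytes as described
in the ri text; on Ruby 1.9 and later, str[0] returns a one-character
String rather than an Integer, so the sketch uses String#bytes instead.

```ruby
require "tmpdir"

# Hypothetical sample data and file name, for illustration only.
data = { 0x00E9 => [0x0065, 0x0301] }
path = File.join(Dir.tmpdir, "unicode_tables_#{Process.pid}.dump")

# Write in binary mode so no newline translation happens on Windows.
File.open(path, "wb") { |f| Marshal.dump(data, f) }

# The first two bytes of the stream are the major and minor version.
major, minor = File.binread(path, 2).bytes
if major != Marshal::MAJOR_VERSION || minor > Marshal::MINOR_VERSION
  raise "incompatible Marshal format #{major}.#{minor}"
end

# Read back in binary mode as well.
loaded = File.open(path, "rb") { |f| Marshal.load(f) }

File.delete(path)
```

The explicit check is optional, since Marshal.load performs its own
version check, but it makes it easy to report a clearer error before
loading a large file.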