Rationale behind the Gem package format?

unknown · June 30, 2006, 12:23pm

I was just wondering about the rationale behind the format of gem
packages. It seems rather odd that the package is a tar of

data.tar.gz
metadata.gz

Why not just have the metadata stored in with the data and not worry
about double layers? The only advantage I can figure is that it is
possible to extract the metadata without uncompressing the entire
package. Okay, but can’t a tar/gzip lib or tool do that anyway? Is
there some other reason?

Thanks,
T.

unknown · June 30, 2006, 6:50pm

On 6/30/06, [email protected] [email protected] wrote:

there some other reason?
I think it’s for the reason you mentioned. tar can extract specific
files out of the package without needing to extract the whole package,
but – if I recall correctly – the gzip compression format still
requires that the whole package be decompressed before the files could
be extracted. So compressing the metadata and actual data separately
means you only need to decompress the metadata if that’s all you’re
after.
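
To make that concrete, here is a minimal sketch of reading only the metadata from a gem-like nested archive. It assumes a Ruby recent enough to ship `rubygems/package` and `Zlib.gunzip`; the helper names (`build_tar`, `read_metadata`) and the sample contents are made up for illustration, and the metadata here is plain text rather than a real gemspec.

```ruby
require 'rubygems/package'
require 'stringio'
require 'zlib'

# Hypothetical helper: packs {name => bytes} into an in-memory tar archive.
def build_tar(entries)
  io = StringIO.new(''.b)
  Gem::Package::TarWriter.new(io) do |tar|
    entries.each do |name, bytes|
      tar.add_file_simple(name, 0o644, bytes.bytesize) { |f| f.write(bytes) }
    end
  end
  io.string
end

# Walks only the outer tar and gunzips metadata.gz.
# data.tar.gz is skipped over and never decompressed.
def read_metadata(outer_bytes)
  Gem::Package::TarReader.new(StringIO.new(outer_bytes)) do |tar|
    tar.each do |entry|
      return Zlib.gunzip(entry.read) if entry.full_name == 'metadata.gz'
    end
  end
  nil
end

# Illustrative sample: same layout as described above, invented contents.
inner = build_tar('lib/hello.rb' => "puts 'hello'\n")
sample_gem = build_tar(
  'data.tar.gz' => Zlib.gzip(inner),
  'metadata.gz' => Zlib.gzip("name: hello\nversion: 0.0.1\n")
)
puts read_metadata(sample_gem)
```

Only the small metadata.gz stream goes through the decompressor; the compressed data entry is stepped over as an opaque blob.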

Jacob F.

unknown · June 30, 2006, 11:49pm

On Fri, Jun 30, 2006 at 07:22:53PM +0900, [email protected] wrote:

there some other reason?
It’s a good solution in practice for many reasons; here’s the answer I
gave to a similar question on [ruby-core:6258]:

[...] here are the pros I can think of:
* the format is extensible because it's possible to add new entries in the
  “outer” tarball. This has proved useful already: the package originally
  just contained metadata.gz and data.tar.gz, and recently data.tar.gz.sig
  and metadata.gz.sig have been added to support signatures.
* it is easy to extract the metadata without uncompressing the whole tarball
* it’s possible to write data.tar.gz and generate the file lists and other
  information dynamically before writing metadata.gz, while data.tar.gz is
  being written. It is thus possible to store, for instance, a
  cryptographic digest of the data.tar.gz file in metadata.gz. This would
  be somewhat harder if the metadata were included in a single tarball,
  especially if we compressed it.
* it takes little time to locate metadata.gz inside the tarball (we’d have
  to go through many more entries if it were a flat tarball). While access
  is still O(n), n is the number of entries in the outer file (2
  originally, now 4) instead of the normally much more numerous data files.
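
The point about writing data.tar.gz first and recording its digest in metadata.gz can be sketched roughly as follows. This is not the RubyGems implementation: `pack_with_digest` and `digest_matches?` are hypothetical helpers, the `sha256:` line is an invented metadata layout, and SHA-256 is just one possible digest.

```ruby
require 'digest'
require 'rubygems/package'
require 'stringio'
require 'zlib'

# Hypothetical helper: packs the data first, then writes metadata that
# records a digest of the finished data.tar.gz.
def pack_with_digest(files)
  inner = StringIO.new(''.b)
  Gem::Package::TarWriter.new(inner) do |tar|
    files.each do |name, body|
      tar.add_file_simple(name, 0o644, body.bytesize) { |f| f.write(body) }
    end
  end
  data_gz = Zlib.gzip(inner.string)

  # The metadata is generated only after data.tar.gz is complete.
  # This plain-text layout is an assumption, not the real gemspec format.
  metadata = "files: #{files.keys.join(',')}\n" \
             "sha256: #{Digest::SHA256.hexdigest(data_gz)}\n"
  metadata_gz = Zlib.gzip(metadata)

  outer = StringIO.new(''.b)
  Gem::Package::TarWriter.new(outer) do |tar|
    tar.add_file_simple('data.tar.gz', 0o644, data_gz.bytesize) { |f| f.write(data_gz) }
    tar.add_file_simple('metadata.gz', 0o644, metadata_gz.bytesize) { |f| f.write(metadata_gz) }
  end
  outer.string
end

# Recomputes the digest of data.tar.gz and compares it with the recorded one.
def digest_matches?(outer_bytes)
  entries = {}
  Gem::Package::TarReader.new(StringIO.new(outer_bytes)) do |tar|
    tar.each { |entry| entries[entry.full_name] = entry.read }
  end
  recorded = Zlib.gunzip(entries.fetch('metadata.gz'))[/^sha256: (\h+)$/, 1]
  recorded == Digest::SHA256.hexdigest(entries.fetch('data.tar.gz'))
end

package = pack_with_digest('lib/a.rb' => "A = 1\n")
puts digest_matches?(package)
```

With a single flat tarball, the metadata would have to describe an archive that contains the metadata itself, which is the difficulty the bullet alludes to.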

Also, note that the code in package.rb was written carefully to avoid
having to keep the full contents of the archive (or any contained file) in
memory at any point in time (with the exception of metadata.gz, of
course). RubyGems doesn’t exploit that ability since the first thing it
does before unpacking is uncompressing all the data and storing it in an
array, but package.rb would have supported O(1) memory usage. That’s why
metadata.gz comes after data.tar.gz inside the .gem.
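
As a rough illustration of the streaming idea (not the 2006 package.rb code), the sketch below walks the outer tar, feeds the data.tar.gz entry straight into a gzip reader, and reads each inner file in fixed-size chunks. It relies on the `Gem::Package::TarReader` and `Zlib::GzipReader` classes shipped with current Ruby; `tar_of` and `stream_data_sizes` are hypothetical names, and the in-memory sample stands in for a file on disk.

```ruby
require 'rubygems/package'
require 'stringio'
require 'zlib'

# Hypothetical helper: packs {name => bytes} into an in-memory tar archive.
def tar_of(entries)
  io = StringIO.new(''.b)
  Gem::Package::TarWriter.new(io) do |tar|
    entries.each do |name, bytes|
      tar.add_file_simple(name, 0o644, bytes.bytesize) { |f| f.write(bytes) }
    end
  end
  io.string
end

# Streams every file inside data.tar.gz in chunks of chunk_size bytes and
# returns {name => byte count}. No inner file is held in memory as a whole.
def stream_data_sizes(outer_io, chunk_size = 4096)
  sizes = {}
  Gem::Package::TarReader.new(outer_io) do |outer|
    outer.each do |entry|
      next unless entry.full_name == 'data.tar.gz'

      # The outer tar entry acts as the input stream for the gzip reader.
      Zlib::GzipReader.wrap(entry) do |gz|
        Gem::Package::TarReader.new(gz).each do |file|
          sizes[file.full_name] = 0
          while (chunk = file.read(chunk_size))
            sizes[file.full_name] += chunk.bytesize
          end
        end
      end
    end
  end
  sizes
end

inner_tar = tar_of('lib/a.rb' => 'x' * 10_000, 'README' => 'hello')
outer_tar = tar_of(
  'data.tar.gz' => Zlib.gzip(inner_tar),
  'metadata.gz' => Zlib.gzip("name: demo\n")
)
sizes = stream_data_sizes(StringIO.new(outer_tar))
p sizes
```

A real unpacker would write each chunk to disk instead of counting bytes; the counting only keeps the sketch self-contained.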

Also, on [ruby-core:6251]:

The "nested tarball" format was inspired by Debian's .deb format. The
latter uses ar for the outer layer, but I saw no reason to implement
another subformat. When I originally hacked the package format for
rpa-base, I used nested zip files; I changed that to use POSIX tarballs
when I discovered that RubyZip triggered a bug in Tempfile that would
cause ruby to use over 100MB RAM to create a 300KB .zip file. That was
fixed quickly, but by then there was no reason to change the package
format again. Had this bug not been there, maybe RubyGems would be using
zipfiles now.

unknown · June 30, 2006, 9:59pm

On 6/30/06, Jacob F. [email protected] wrote:

package. Okay, but can’t a tar/gzip lib or tool do that anyway? Is
there some other reason?
I think it’s for the reason you mentioned. tar can extract specific
files out of the package without needing to extract the whole package,
but – if I recall correctly – the gzip compression format still
requires that the whole package be decompressed before the files could
be extracted. So compressing the metadata and actual data separately
means you only need to decompress the metadata if that’s all you’re
after.

It’s also the format of .deb files, more or less. (Debian files
actually have control.tar.gz and data.tar.gz.)

-austin