OT: Why are .zip files so much bigger than tar.gz files?

So I have started to offer .zip packages of my projects to make life a
little easier for Windows folks -- seeing as it’s no sweat off the
backs of Linux folks either (unzip) -- but then I noticed that zip
files are huge. I have a 1MB tar.gz that’s over 3MB as a .zip. Can
that really be right?

T.

On Feb 19, 2008 10:53 AM, Trans [email protected] wrote:

So I have started to offer .zip packages of my projects to make life a
little easier for Windows folks -- seeing as it’s no sweat off the
backs of Linux folks either (unzip) -- but then I noticed that zip
files are huge. I have a 1MB tar.gz that’s over 3MB as a .zip. Can
that really be right?

Certainly can. The ZIP algorithm isn’t as good at compressing things
as the gzip algorithm. Have a look at bz2 -- it’s typically even
better than gzip.

T.

Arlen

On Feb 18, 2008 6:53 PM, Trans [email protected] wrote:

So I have started to offer .zip packages of my projects to make life a
little easier for Windows folks -- seeing as it’s no sweat off the
backs of Linux folks either (unzip) -- but then I noticed that zip
files are huge. I have a 1MB tar.gz that’s over 3MB as a .zip. Can
that really be right?

Yes, it sounds like it can really be right. A zip archive compresses
each file individually and then adds it to the archive, which makes it
easy to extract individual files later [1]. A tar.gz archive adds all
the files to the tar and then gzips everything at once, which takes
advantage of cross-file redundancy for a better overall compression
ratio [2].

[1] ZIP (file format) - Wikipedia
[2] tar (computing) - Wikipedia; gzip - Wikipedia
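
Here is a tiny Ruby sketch of the effect, using only the standard
library’s Zlib (so it’s plain deflate rather than the real zip and tar
tools): it compresses 100 similar "source files" one at a time, the
way zip does, and then as one concatenated stream, the way tar.gz
effectively does.

  require "zlib"

  # 100 small, similar files, like a typical source tree.
  files = Array.new(100) do |i|
    "# file #{i}\nrequire 'common_header'\nclass Widget#{i} < Base\nend\n" * 20
  end

  per_file     = files.sum { |f| Zlib::Deflate.deflate(f).bytesize }
  whole_stream = Zlib::Deflate.deflate(files.join).bytesize

  puts "compressed one by one:    #{per_file} bytes"
  puts "compressed as one stream: #{whole_stream} bytes"

The single stream comes out much smaller, because the boilerplate the
files share only has to be encoded once instead of once per file.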

Personally I am using mostly .tar.bz2 these days, simply because it
compresses better than gz, even though gz is faster and nicer to your
CPU. I have around 15 GB of archivable material; if I moved this to gz
I assume it would come to around 18 GB, and when transferring over USB
(via a Ruby script, automatically) every byte that is not transferred
matters - some computers only have USB 1.1 and the transfer is
boringly slow already. But it’s mostly for archival reasons here. :)
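
If you want to check the gz/bz2 trade-off on your own data, a
throwaway Ruby script that shells out to gzip and bzip2 (both assumed
to be installed) does the job:

  require "benchmark"
  require "shellwords"

  file = ARGV.fetch(0) { abort "usage: gz_vs_bz2.rb FILE" }
  f = Shellwords.escape(file)

  Benchmark.bm(6) do |bm|
    bm.report("gzip")  { system("gzip -9 -c #{f} > #{f}.gz")   }
    bm.report("bzip2") { system("bzip2 -9 -c #{f} > #{f}.bz2") }
  end

  puts "gz:  #{File.size("#{file}.gz")} bytes"
  puts "bz2: #{File.size("#{file}.bz2")} bytes"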

PS: There recently was a comment about 7zip compressing even better
than bzip2. Unfortunately, 7zip is hardly known, so I will stick to
bzip2 for now since it’s much better supported, more widely known, and
easy to handle (and doesn’t look as tied to a single company’s
development as 7zip does).

On Feb 18, 7:22 pm, [email protected] wrote:

easy to extract individual files later [1]. A tar.gz archive adds all
the files to the tar and then gzips everything at once, which takes
advantage of cross-file redundancy for a better overall compression
ratio [2].

That’s interesting. I created a utility a while back called rtar
(recursive tar). It drills down to the bottom of each directory and
tars and compresses each directory (and each file), working its way
back up to the top. You end up with a compressed archive similar to
your explanation of zip in accessibility, but still with the overall
compression of a single tar. I thought it was pretty cool, but
basically trivial to implement. So I emailed the GNU maintainers of
tar asking them if it might make a nice option to add to tar itself.
Of course, they never responded :(
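
For the curious, a rough sketch of that idea (not the actual rtar,
just the approach as described, shelling out to GNU tar) might look
like this -- note that it rewrites the tree in place, so point it at a
scratch copy:

  require "fileutils"

  root = ARGV.fetch(0) { abort "usage: rtar.rb DIRECTORY" }

  # Deepest directories first, so children are packed before parents;
  # the top-level directory itself is handled separately at the end.
  dirs = Dir.glob(File.join(root, "**", "")) - [File.join(root, "")]
  dirs = dirs.sort_by { |d| -d.count("/") }

  dirs.each do |dir|
    parent = File.dirname(dir.chomp("/"))
    name   = File.basename(dir.chomp("/"))
    # Tar and gzip this directory from inside its parent...
    system("tar", "czf", "#{name}.tar.gz", name, chdir: parent) or
      abort "tar failed in #{parent}"
    # ...then drop the original so the parent archive only picks up
    # the already-compressed child.
    FileUtils.rm_rf(File.join(parent, name))
  end

  # Finally wrap the (now internally compressed) top level itself.
  top = root.chomp("/")
  system("tar", "czf", "#{top}.rtar.tar.gz", top) or abort "top-level tar failed"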

T.

I saw this on the front page of the programming section of reddit
tonight. Thought it fit with the conversation. Seems someone
benchmarked the 3 algorithms being discussed.

http://blogs.reucon.com/srt/2008/02/18/compression_gzip_vs_bzip2_vs_7_zip.html

On 2008-02-19 08:53, Trans wrote:

So I have started to offer .zip packages of my projects to make life a
little easier for Windows folks -- seeing as it’s no sweat off the
backs of Linux folks either (unzip) -- but then I noticed that zip
files are huge. I have a 1MB tar.gz that’s over 3MB as a .zip. Can
that really be right?

Not if done correctly. The hack is as follows:

  1. Create a zip without using compression.
  2. Zip that one into a new zip using maximum compression.

This does not deal with the issue that zip has an inferior compression
ratio compared to bz2, but it at least lets you make use of cross-file
redundancy (zip packs each file individually).
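
With Info-ZIP’s command line tools (the zip/unzip the Linux folks
already have), the two passes boil down to something like the
following, where -0 means "store only" and -9 means "maximum deflate":

  src = ARGV.fetch(0) { abort "usage: double_zip.rb DIRECTORY" }

  system("zip", "-0", "-r", "inner.zip", src)              or abort "inner zip failed"
  system("zip", "-9", "#{src.chomp('/')}.zip", "inner.zip") or abort "outer zip failed"
  File.delete("inner.zip")

The catch, of course, is that the recipient has to unzip twice to get
at the files.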

7zip and rar are superior to zip but less commonly supported.

HTH

Josef ‘Jupp’ Schugt

Trans wrote:

A zip archive compresses
each file individually and then adds it to the archive, which makes it
easy to extract individual files later [1]. A tar.gz archive adds all
the files to the tar and then gzips everything at once, which takes
advantage of cross-file redundancy for a better overall compression
ratio [2].

That’s interesting. I created a utility a while back called rtar
(recursive tar).

Microsoft’s CAB file format has a mixed approach, using a proprietary
LZ compression. Basically they compress small files together in
groups, which gives you most of the advantages of zip and tar in one.

Another approach that could be taken is to flush the compressor for
every 32kB of input (since deflate can’t repeat input from further
back anyhow), then append a manifest recording the output byte offset
of the flush point that precedes each file. To grab a file in the
middle, seek to the offset that precedes that file by two blocks,
decompress the 32kB to use as history, then continue decompressing
until you get to your file.

The nice thing is that flushing the compressor leaves the output as a
valid deflate() stream, even for decompressors that don’t know why
you’ve flushed. If the manifest looks like a normal file, a standard
tar utility could still extract the whole archive.
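
Not the real thing, but here is the flush trick in miniature with
Ruby’s Zlib (raw deflate, a full flush at each chunk boundary rather
than a strict 32kB interval, and the offsets recorded on the side
playing the role of the manifest):

  require "zlib"

  chunks  = ["alpha " * 5000, "beta " * 5000, "gamma " * 5000]
  offsets = []
  out     = "".b

  # Raw deflate (negative window bits: no zlib header or trailer), so
  # every recorded offset, including 0, is a clean restart point.
  z = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -Zlib::MAX_WBITS)
  chunks.each do |chunk|
    offsets << out.bytesize                    # this chunk's flush point
    out << z.deflate(chunk, Zlib::FULL_FLUSH)  # full flush resets the dictionary
  end
  out << z.finish

  # Thanks to the full flush, decompression can start at any recorded
  # offset without needing the bytes that came before it.
  tail = out.byteslice(offsets[2], out.bytesize - offsets[2])
  puts Zlib::Inflate.new(-Zlib::MAX_WBITS).inflate(tail)[0, 12]   # => "gamma gamma "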

I’ve done this sort of game with the zlib compressor - it’s easier
than you might think. With a bit of cunning, you can make the
resulting file rsync-able even in the face of localized changes, a
fact I discussed with Andrew Tridgell some years back. I also wanted
to add a long-range predictor to zlib so deflate could repeat blocks
from far back… you can use an rsync-style approach to do the
prediction rather than an LZ suffix tree.

Clifford H…

2008/2/19, Trans [email protected]:

So I have started to offer .zip packages of my projects to make life a
little easier for Windows folks -- seeing as it’s no sweat off the
backs of Linux folks either (unzip) -- but then I noticed that zip
files are huge. I have a 1MB tar.gz that’s over 3MB as a .zip. Can
that really be right?

The reason is that in a ZIP all entries are compressed individually,
while in a TGZ or TBZ the whole stream is compressed. The effect
shows up especially when there are many small files with similar
content, as is typical for source code.

However, I have seen ZIP files that were similarly sized - certainly
not as much difference as you have observed. This may also depend on
the compression algorithm used in the ZIP (I believe IZArc, for
example, supports three different compression algorithms for ZIP).
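
If you want to see which method a given ZIP actually used, the
two-byte compression-method field sits at offset 8 of each local file
header (signature PK\x03\x04), so a quick-and-dirty scan in plain Ruby
is enough (method codes per the ZIP spec: 0 = stored, 8 = deflated,
12 = bzip2, 14 = LZMA):

  data  = File.binread(ARGV.fetch(0) { abort "usage: zipmethods.rb ARCHIVE.zip" })
  codes = { 0 => "stored", 8 => "deflated", 12 => "bzip2", 14 => "lzma" }

  pos = 0
  while (pos = data.index("PK\x03\x04".b, pos))
    method   = data.byteslice(pos + 8, 2).unpack1("v")   # little-endian uint16
    name_len = data.byteslice(pos + 26, 2).unpack1("v")
    name     = data.byteslice(pos + 30, name_len)
    puts "#{name}: #{codes.fetch(method, "method #{method}")}"
    pos += 4
  end

(It just scans for the header signature, so it could be fooled by that
byte sequence occurring inside compressed data, but for a quick check
it does the job.)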

Kind regards

robert