How to stream or write data into a tar.gz file as if the data were from files?

I have a gazillion little files in memory (each is really just a chunk
of data, but it represents what needs to be a single file) and I need
to throw them all into a .tar.gz archive. In this case, it must be
in .tar.gz format and it must unzip into actual files–although I pity
the fellow that actually has to unzip this monstrosity.

Here are the solutions I’ve come up with so far:

  1. Not portable, extremely slow:
    write out all these “files” into a directory and make a system
    call to tar (tar -czf …)

  2. Portable but still just as slow:
    write out all these “files” into a directory and use
    archive-tar-minitar to make the archive

  3. Not portable, but fast:
    stream information into tar/gzip to create the archive (without
    ever first writing out files)

I’ve been looking around on this and the closest I’ve come is this:
tar cvf - some_directory | gzip - > some_directory.tar.gz

Note that this would still require me to write the files to a
directory (which must be avoided at all costs), but at least the
problem now is how to write data into a tar file. I’ve been googling
and still haven’t turned up anything yet.

  4. Hack archive-tar-minitar to enable me to write my data directly
    into the format. Looking at the source code, this doesn’t seem
    terribly hard, but not terribly easy either. Am I missing a method
    already written for this kind of thing?

Others?

Right now, anything resembling #3 or #4 would work for me.

My feeling is that it shouldn’t be that hard to write data into
a .tar.gz format in either linux or ruby without actually having any
files (i.e., everything in memory or streamed in).

Thanks a lot for any suggestions or ideas!

Others?

Although it’s not what you’re asking for, as you mention “zipping” maybe
you could consider rubyzip:

require 'zip/zipfilesystem'
Zip::ZipFile.open("foo.zip", Zip::ZipFile::CREATE) { |zfs|
  # open the member for writing; the default mode is read-only
  zfs.file.open("member.txt", "w") { |f| f << data }
  zfs.commit
}

zip is not tar, but it does have some advantages - in particular the
ability to get random-access to any particular member without having to
read through the whole thing from the start.

My feeling is that it shouldn’t be that hard to write data into
a .tar.gz format in either linux or ruby without actually having any
files (i.e., everything in memory or streamed in).

When reading, rubyzip lets you spool directly out of the zip. When
writing, I think that behind the scenes it spools to a tempfile, and
when you commit it then packs this into the archive.

On 15.09.2008 20:35, bwv549 wrote:

I have a gazillion little files in memory (each is really just a chunk
of data, but it represents what needs to be a single file) and I need
to throw them all into a .tar.gz archive. In this case, it must be
in .tar.gz format and it must unzip into actual files–although I pity
the fellow that actually has to unzip this monstrosity.

and still haven’t turned up anything yet.
So why then do you say “without ever first writing out files”?

I’d say #3 (the original formulation) is the one to go. Googling for
“ruby tar” quickly turned up this:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/32588

And there is zlib, which can read and write gzip streams. So, if
ruby-tar can write into any stream, you’ve got your solution.
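
A minimal sketch of that idea, using only the standard library: Zlib::GzipWriter accepts any object that responds to write, so whatever produces the tar bytes can stream straight into it. The helper name gzip_chunks and the StringIO sink are assumptions made for illustration; they are not part of ruby-tar.

```ruby
require 'zlib'
require 'stringio'

# Compress each chunk into +sink+, which can be any object that
# responds to #write (a File, a socket, a StringIO, ...).
def gzip_chunks(chunks, sink)
  gz = Zlib::GzipWriter.new(sink)
  chunks.each { |chunk| gz.write(chunk) }
  gz.finish   # write the gzip trailer but leave +sink+ open
  sink
end

# Usage: compress into memory, then read it back.
sink = StringIO.new(String.new)
gzip_chunks(["first chunk, ", "second chunk"], sink)
restored = Zlib::GzipReader.new(StringIO.new(sink.string)).read
```

Because the writer only needs write, the same call works with a File opened in 'wb' mode in place of the StringIO.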

Kind regards

robert

So why then do you say “without ever first writing out files”?

I’m just trying to show that if I can stream out a tar file, then I
can at least pipe it into gzip (on many OS’s). So, I’m really stuck
at making a tar file without actually having to write files to disk
first.

And there is zlib, which can read and write gzip streams. So, if
ruby-tar can write into any stream, you’ve got your solution.

I looked at ruby-tar (on your suggestion) but ruby-tar turns out to
not have any write capabilities.

So, I’m still looking deeper into archive-tar-minitar. I also found
‘tarruby’ (bindings to the C libtar library) in rubyforge but it seems
more difficult to hack into than minitar.

As pointed out, the difficulty here has been narrowed down to writing
tar files without having to write files out to disk first.
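
One possible way to do this with no extra gems, sketched under assumptions: RubyGems ships a tar writer, Gem::Package::TarWriter, whose add_file_simple takes a name, a mode and a size and needs only a writable stream. It is meant for building gems rather than as a general-purpose tar API, and it is not the archive-tar-minitar approach discussed in this thread. build_tgz and list_tgz are hypothetical helper names.

```ruby
require 'rubygems/package'
require 'zlib'
require 'stringio'

# files: Hash of name => String (file contents) or nil (a directory).
# sink:  any object that responds to #write.
def build_tgz(files, sink)
  gz = Zlib::GzipWriter.new(sink)
  Gem::Package::TarWriter.new(gz) do |tar|
    files.each do |name, data|
      if data.nil?
        tar.mkdir(name, 0o755)
      else
        # bytesize, not size: the tar header records a length in bytes
        tar.add_file_simple(name, 0o644, data.bytesize) { |io| io.write(data) }
      end
    end
  end
  gz.finish   # write the gzip trailer, leave +sink+ open
  sink
end

# Read an archive back into a Hash of name => contents (nil for dirs).
def list_tgz(bytes)
  tar_bytes = Zlib::GzipReader.new(StringIO.new(bytes)).read
  entries = {}
  Gem::Package::TarReader.new(StringIO.new(tar_bytes)) do |tar|
    tar.each do |entry|
      entries[entry.full_name] = entry.directory? ? nil : entry.read
    end
  end
  entries
end

# Usage: nothing touches the disk until the archive itself is written.
archive = build_tgz({ 'a_dir/dorky1' => 'my data', 'an_empty_dir' => nil },
                    StringIO.new(String.new))
```

Passing File.open('my_tar.tgz', 'wb') as the sink instead of a StringIO writes the archive straight to disk.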

Sincere thanks for the suggestions.

you could consider rubyzip:

require 'zip/zipfilesystem'
Zip::ZipFile.open("foo.zip", Zip::ZipFile::CREATE) { |zfs|
  # open the member for writing; the default mode is read-only
  zfs.file.open("member.txt", "w") { |f| f << data }
  zfs.commit
}

This is exactly what I need to be able to do, except with .tar.gz
files. I will use this solution for now, even while still searching
for (or maybe writing) the .tar.gz equivalent. Short term, this will
get me by… [even though a .tar.gz equivalent would be really nice].

Thanks!!

And Googling for “ruby tar library” turns up:

http://raa.ruby-lang.org/project/minitar/

:: which looks pretty appropriate :-)

FWIW,

This is exactly what I need to be able to do, except with .tar.gz
files. I will use this solution for now

Do test it though. I tested it streaming large files in (100MB), and
found that it created a tempfile behind the scenes. If it does this for
all files, then it may not be any more efficient than using
archive-tar-minitar.

But it does have a simple API, which is essentially the same as File and
Dir. (Although unfortunately you can’t use it to open a zipfile which is
within a zipfile :-) )

On Sep 15, 2008, at 1:38 PM, bwv549 wrote:

This is exactly what I need to be able to do, except with .tar.gz
files. I will use this solution for now, even while still searching
for (or maybe writing) the .tar.gz equivalent. Short term, this will
get me by… [even though a .tar.gz equivalent would be really nice].

Thanks!!

IO.popen 'tar cfz -', 'w+' do |pipe|

end

and just send files down the pipe

a @ http://codeforpeople.com/

Do test it though. I tested it streaming large files in (100MB), and

Yes, upon testing I saw that it was creating a bunch of temp files,
too. It’s too bad since the API is so clean! Perhaps it will be
reimplemented someday…


********************** A solution using Minitar *******************

So, I hacked on archive-tar-minitar for a while and came up with a
solution. Right now I add a class method that fits with the style of
the pack_file method (indeed, pilfers most of its code) and then I can
access it using the slightly lower level interface than ‘pack’:

require 'archive/tar/minitar'
require 'stringio'

module Archive::Tar::Minitar

  # entry may be a string (the name), or it may be a hash specifying the
  # following:
  #   :name   (REQUIRED)
  #   :mode   33188 (rw-r--r--) for files, 16877 (rwxr-xr-x) for dirs
  #           (0o100644)                   (0o40755)
  #   :uid    nil
  #   :gid    nil
  #   :mtime  Time.now
  #
  # if data == nil, then this is considered a directory!
  # (use an empty string for a normal empty file)
  # data should be something that can be opened by StringIO
  def self.pack_as_file(entry, data, outputter) #:yields action, name, stats:
    outputter = outputter.tar if outputter.kind_of?(Archive::Tar::Minitar::Output)

    stats = {}
    stats[:uid] = nil
    stats[:gid] = nil
    stats[:mtime] = Time.now

    if data.nil?
      # a directory
      stats[:size] = 4096   # is this OK???
      stats[:mode] = 16877  # rwxr-xr-x
    else
      stats[:size] = data.size
      stats[:mode] = 33188  # rw-r--r--
    end

    if entry.kind_of?(Hash)
      name = entry[:name]

      entry.each { |kk, vv| stats[kk] = vv unless vv.nil? }
    else
      name = entry
    end

    if data.nil?  # a directory
      yield :dir, name, stats if block_given?
      outputter.mkdir(name, stats)
    else          # a file
      outputter.add_file_simple(name, stats) do |os|
        stats[:current] = 0
        yield :file_start, name, stats if block_given?
        StringIO.open(data, "rb") do |ff|
          until ff.eof?
            stats[:currinc] = os.write(ff.read(4096))
            stats[:current] += stats[:currinc]
            yield :file_progress, name, stats if block_given?
          end
        end
        yield :file_done, name, stats if block_given?
      end
    end

  end
end

#####################################

Then to use it to make a .tgz file:

#####################################

require 'zlib'

file_names = ['a_dir/dorky1', 'dorky2', 'an_empty_dir']
file_data_strings = ['my data', 'my data also', nil]

tgz = Zlib::GzipWriter.new(File.open('my_tar.tgz', 'wb'))

Archive::Tar::Minitar::Output.open(tgz) do |outp|
  file_names.zip(file_data_strings) do |name, data|
    Archive::Tar::Minitar.pack_as_file(name, data, outp)
  end
end


So, not terribly pretty, but not too terrible either.

On Sep 16, 2008, at 3:30 AM, Brian C. wrote:

called
Try `tar --help' or `tar --usage' for more information.

That’s for gnu tar, maybe others work differently. However, as far as I
know, you can’t get tar to read the content of files on stdin - and
even if you could, how would you format them? That is, how would you
delimit the start and end of each file, and assign a name to each one?


sorry. i misread the OPs question. tar can only unpack to stdout,
not create from stdin.

a @ http://codeforpeople.com/

Ara Howard wrote:

IO.popen 'tar cfz -', 'w+' do |pipe|

end

and just send files down the pipe

Uh??

“tar cfz -” creates a tarfile called “z” and tries to pack a file called
“-” in it.

“tar czf - file1 file2 file3” reads the named files from disk and sends
the output to stdout.

If you don’t specify any files, then nothing is created:

$ tar -czf -
tar: Cowardly refusing to create an empty archive
Try `tar --help' or `tar --usage' for more information.

That’s for gnu tar, maybe others work differently. However, as far as I
know, you can’t get tar to read the content of files on stdin - and
even if you could, how would you format them? That is, how would you
delimit the start and end of each file, and assign a name to each one?
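
The tar format's own answer to the delimiting question is a 512-byte header in front of each member, carrying the name and the size, followed by the data padded out to a 512-byte boundary. A rough sketch of that layout, writing a ustar header by hand: ustar_header and write_tar are hypothetical helpers, and names longer than 100 bytes, directories and real ownership are all ignored here.

```ruby
require 'stringio'

BLOCK = 512

# Build one 512-byte ustar header for a regular file.
def ustar_header(name, size, mode: 0o644, mtime: Time.now.to_i)
  raise ArgumentError, "name too long for this sketch" if name.bytesize > 100
  fields = [
    name,                    # name      (100 bytes)
    format("%07o", mode),    # mode      (8), octal text
    format("%07o", 0),       # uid       (8)
    format("%07o", 0),       # gid       (8)
    format("%011o", size),   # size      (12), octal text
    format("%011o", mtime),  # mtime     (12), octal text
    " " * 8,                 # checksum  (8), spaces while summing
    "0",                     # typeflag  (1), "0" = regular file
    "",                      # linkname  (100)
    "ustar",                 # magic     (6)
    "00",                    # version   (2)
    "", "",                  # uname, gname       (32 each)
    "", "",                  # devmajor, devminor (8 each)
    "",                      # prefix    (155)
    ""                       # padding   (12)
  ]
  header = fields.pack("a100a8a8a8a12a12a8a1a100a6a2a32a32a8a8a155a12")
  # The checksum is the byte sum of the header with the checksum
  # field (offset 148) filled with spaces.
  header[148, 8] = format("%06o\0 ", header.bytes.sum)
  header
end

# Write each name => data pair as header + data + padding, then the
# end-of-archive marker (two zero-filled blocks).
def write_tar(io, files)
  files.each do |name, data|
    io.write(ustar_header(name, data.bytesize))
    io.write(data)
    io.write("\0" * ((BLOCK - data.bytesize % BLOCK) % BLOCK))
  end
  io.write("\0" * (2 * BLOCK))
  io
end

tar_io = write_tar(StringIO.new(String.new), { "foo.txt" => "This is file foo" })
```

So a process feeding tar "files" over a pipe would in effect have to speak this format already, which is what the tar libraries mentioned in this thread do.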

So, I hacked on archive-tar-minitar for a while and came up with a
solution.

You got me interested now.

I just installed the archive-tar-minitar gem and it looks pretty easy to
generate a tar file, without any patching of the library:

require 'rubygems'
require 'archive/tar/minitar'

src = {
  "foo.txt" => "This is file foo",
  "bar.txt" => "This is file bar",
}

# open in binary mode: a tar archive is not text
File.open("test.tar", "wb") do |tarfile|
  Archive::Tar::Minitar::Writer.open(tarfile) do |tar|
    src.each do |name, data|
      tar.add_file_simple(name, :size=>data.size, :mode=>0644) { |f| f.write(data) }
    end
  end
end

All I did was a quick poke around the API (gem server --daemon; launch
web browser pointing at http://localhost:8808/) and look for something
called “Writer” :-)

HTH,

Brian.

On Sep 15, 1:35 pm, bwv549 wrote:

I have a gazillion little files in memory (each is really just a chunk
of data, but it represents what needs to be a single file) and I need
to throw them all into a .tar.gz archive. In this case, it must be
in .tar.gz format and it must unzip into actual files–although I pity
the fellow that actually has to unzip this monstrosity.

This may be a little late, but better late than never.
Have you considered using #1 with a tmpfs and memory mapped files?
This isn’t exactly portable, but should be pretty fast since as far as
tar is concerned your in-memory files just look like a regular
filesystem thanks to tmpfs.
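
A rough sketch of the tmpfs half of this suggestion (memory-mapped files are not covered), under several assumptions: /dev/shm is taken to be a tmpfs mount, as it commonly is on Linux, with a fallback to the ordinary temp directory; a tar binary that understands -czf and -C is assumed to be on the PATH; and tar_via_staging is a hypothetical helper name.

```ruby
require 'tmpdir'
require 'fileutils'

# Stage in-memory files in a scratch directory (RAM-backed if
# /dev/shm is available), then let the system tar build the archive.
# files: Hash of relative name => String contents.
def tar_via_staging(files, archive_path)
  base = ['/dev/shm', Dir.tmpdir].find { |d| File.directory?(d) && File.writable?(d) }
  Dir.mktmpdir('tarstage', base) do |dir|
    files.each do |name, data|
      path = File.join(dir, name)
      FileUtils.mkdir_p(File.dirname(path))
      File.binwrite(path, data)
    end
    # -C makes the member names relative to the staging directory
    system('tar', '-czf', archive_path, '-C', dir, '.') or raise "tar failed"
  end   # the staging directory is removed when the block exits
  archive_path
end
```

Members come out prefixed with "./" because the whole staging directory is archived.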