Zlib::GzipReader and multiple compressed blobs in a single stream

luislavena · January 29, 2011, 12:10am

Hi,

I’m trying to inflate a set of concatenated gzipped blobs stored in a single
file. As it stands, Zlib::GzipReader only inflates the first blob. It
appears that the unused instance method would return the remaining data,
ready to be passed into Zlib::GzipReader, but it yields an error:

method `method_missing' called on hidden T_STRING object

What could be going on here?

On a related note, Zlib::GzipReader#{pos,tell} returns the position in the
output stream (zstream.total_out) whereas I am looking for the position in
the input stream. I tried making zstream.total_in available but the value
appears to be 18 bytes short in my test file, that is, the next header is
found 18 bytes beyond what zstream.total_in reports.

Does anybody know how to make the library return the correct offset into the
input stream so multiple compressed blobs can be handled?

Thanks,
Jos

Jos_B · January 30, 2011, 6:29pm

On 01/28/2011 05:09 PM, Jos B. wrote:

Hi,

I’m trying to inflate a set of concatenated gzipped blobs stored in a single
file. As it stands, Zlib::GzipReader only inflates the first blob. It
appears that the unused instance method would return the remaining data,
ready to be passed into Zlib::GzipReader, but it yields an error:

method `method_missing' called on hidden T_STRING object

What could be going on here?

I’m not sure what’s going on, but I was hoping you could solve your
problem by running something like this:

File.open('gzipped.blobs') do |f|
  begin
    loop do
      Zlib::GzipReader.open(f) do |gz|
        puts gz.read
      end
    end
  rescue Zlib::GzipFile::Error
    # End of file reached.
  end
end

Unfortunately, Ruby 1.8 doesn’t appear to support passing anything other
than a file name to Zlib::GzipReader.open, and Ruby 1.9 seems to always
reset the file position to the beginning of the file prior to starting
extraction when you really need it to just start working from the
current position. So it doesn’t appear that you can do this with the
standard library.

As part of a ZIP library I wrote, there is a more general implementation
of a Zlib stream filter. Install the archive-zip gem and then try the
following:

gem 'archive-zip'
require 'archive/support/zlib'

File.open('gzipped.blobs') do |f|
  until f.eof? do
    Zlib::ZReader.open(f, 15 + 16) do |gz|
      gz.delegate_read_size = 1
      puts gz.read
    end
  end
end

This isn’t super efficient because we have to hack the
delegate_read_size to be 1 byte in order to ensure that the trailing
gzip data isn’t sucked into the read buffer of the current ZReader
instance and hence lost between iterations. It shouldn’t be too bad
though since the File object should be handling its own buffering.

BTW, I wrote some pretty detailed documentation for Zlib::ZReader. It
should explain what the 15 + 16 is all about in the open method in case
you need to tweak things for your own streams.
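
For readers without that documentation to hand, here is a minimal sketch of what a window_bits value such as 15 + 16 means to zlib itself. It uses the standard library's Zlib::Inflate rather than Zlib::ZReader, on the assumption that ZReader hands the value through to zlib in the same way. The gzip_blob and inflate_with helpers are hypothetical names made up for this illustration.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: build one gzip blob in memory.
def gzip_blob(data)
  buffer = StringIO.new(String.new)
  writer = Zlib::GzipWriter.new(buffer)
  writer.write(data)
  writer.close
  buffer.string
end

# Hypothetical helper: inflate +blob+ using an explicit window_bits value.
def inflate_with(window_bits, blob)
  inflater = Zlib::Inflate.new(window_bits)
  inflater.inflate(blob)
ensure
  inflater.close if inflater
end

gzip_data = gzip_blob('hello gzip')
zlib_data = Zlib::Deflate.deflate('hello zlib')

# 15 is the largest window size (Zlib::MAX_WBITS); adding 16 tells zlib to
# expect gzip framing (header and trailer) around the deflate data.
puts inflate_with(15 + 16, gzip_data)

# Adding 32 instead asks zlib to detect gzip or zlib framing by itself.
puts inflate_with(15 + 32, gzip_data)
puts inflate_with(15 + 32, zlib_data)
```

A negative value such as -15 selects raw deflate data with no framing at all, which is the form stored inside ZIP archives.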

On a related note, Zlib::GzipReader#{pos,tell} returns the position in the
output stream (zstream.total_out) whereas I am looking for the position in
the input stream. I tried making zstream.total_in available but the value
appears to be 18 bytes short in my test file, that is, the next header is
found 18 bytes beyond what zstream.total_in reports.

I think total_in is counting only the compressed data; however,
following the compressed data is a trailer as required for gzip blobs.
You could probably always add 18 to whatever you get, but as I noted
earlier, the implementation of GzipReader seems to always reset any file
object back to the beginning of the stream rather than start processing
it from an existing position. I can’t find any documentation listing a
way to force GzipReader to jump to any other file position after
initialization either.
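
As a rough illustration of where 18 bytes could come from: a gzip member with no optional header fields has a 10-byte header before the deflate data and an 8-byte trailer (CRC32 and uncompressed size) after it. The sketch below takes such a member apart by hand. It assumes the blob was written without a file name or comment, and the helper names are hypothetical. A blob that does carry optional fields has a longer header, so the offset would not be 18 in that case.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: build one gzip blob in memory (no name, no comment).
def gzip_blob(data)
  buffer = StringIO.new(String.new)
  writer = Zlib::GzipWriter.new(buffer)
  writer.write(data)
  writer.close
  buffer.string
end

# Hypothetical helper: split a gzip member that has no optional header
# fields into header, raw deflate data and the two trailer values.
def split_simple_gzip_member(blob)
  flags = blob.getbyte(3)
  raise ArgumentError, 'optional header fields present' unless flags.zero?

  crc32, isize = blob.byteslice(-8, 8).unpack('VV')
  {
    header: blob.byteslice(0, 10),
    deflate_data: blob.byteslice(10, blob.bytesize - 18),
    crc32: crc32,
    isize: isize
  }
end

data = 'example payload ' * 50
blob = gzip_blob(data)
parts = split_simple_gzip_member(blob)

# Inflate only the raw deflate data and ask zlib how much input it consumed.
raw = Zlib::Inflate.new(-Zlib::MAX_WBITS)
restored = raw.inflate(parts[:deflate_data])
consumed = raw.total_in
raw.close

puts "payload restored:            #{restored == data}"
puts "bytes consumed by zlib:      #{consumed}"
puts "bytes in the whole member:   #{blob.bytesize}"
puts "header plus trailer overhead: #{blob.bytesize - consumed}"
```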

Does anybody know how to make the library return the correct offset into the
input stream so multiple compressed blobs can be handled?

Hopefully, my solution will work for you because I don’t think the
current implementation in the standard library will do what you need.

-Jeremy

Jos_B · February 2, 2011, 8:47pm

Hi Jeremy,

Thanks for your reply.

On Mon, Jan 31, 2011 at 02:28:30AM +0900, Jeremy B. wrote:

On 01/28/2011 05:09 PM, Jos B. wrote:
[snip]

rescue Zlib::GzipFile::Error
# End of file reached.
end
end

I tried something like this but as you point out, it doesn’t work.

Unfortunately, Ruby 1.8 doesn’t appear to support passing anything other
than a file name to Zlib::GzipReader.open, and Ruby 1.9 seems to always
reset the file position to the beginning of the file prior to starting
extraction when you really need it to just start working from the
current position. So it doesn’t appear that you can do this with the
standard library.

That’s what it looks like, yes. Bummer.

  gz.delegate_read_size = 1
though since the File object should be handling its own buffering.

This works, but sadly it is very slow. Whereas zcat takes under a second on my
test file, this code takes about 17 seconds.

BTW, I wrote some pretty detailed documentation for Zlib::ZReader. It
should explain what the 15 + 16 is all about in the open method in case
you need to tweak things for your own streams.

Great. But I didn’t have to tweak anything; it just worked.

object back to the beginning of the stream rather than start processing
it from an existing position. I can’t find any documentation listing a
way to force GzipReader to jump to any other file position after
initialization either.

Yeah, you’d have to feed GzipReader the right part of the input stream
yourself and figure out how much it processed. Something tells me it’s not
always 18 but depends on internal buffering, which would invalidate the
assumption of a fixed offset.

Does anybody know how to make the library return the correct offset into the
input stream so multiple compressed blobs can be handled?

Hopefully, my solution will work for you because I don’t think the
current implementation in the standard library will do what you need.

It does, but it’s very slow. Sigh.

Thanks again, Jeremy.

Cheers,
Jos

Jos_B · February 3, 2011, 2:43am

On Jan 28, 2011, at 15:09, Jos B. wrote:

I’m trying to inflate a set of concatenated gzipped blobs stored in a single
file. As it stands, Zlib::GzipReader only inflates the first blob. It
appears that the unused instance method would return the remaining data,
ready to be passed into Zlib::GzipReader, but it yields an error:

method `method_missing' called on hidden T_STRING object

What could be going on here?

It’s a bug, the internal buffer that libz uses is dup’d, but this is not
enough to make it safe for use by ruby. I have filed a ticket and
attached a stupid patch:

http://redmine.ruby-lang.org/issues/show/4360

Jos_B · February 3, 2011, 6:04am

On 02/02/2011 07:33 PM, Eric H. wrote:

It’s a bug, the internal buffer that libz uses is dup’d, but this is not enough
to make it safe for use by ruby. I have filed a ticket and attached a stupid
patch:

http://redmine.ruby-lang.org/issues/show/4360

Once your fix is in place and GzipReader#unused works correctly, is
there any convenient way to take the returned string and continue
processing it along with the remaining file contents with an instance of
GzipReader?

From my testing, it appears that GzipReader.open in Ruby 1.9 always
rewinds any IO object you give it before inflating any data, so you
can’t use that method to create your instance if you need to start
reading from anywhere other than the beginning of the stream.
GzipReader.new doesn’t have that problem, but there isn’t any easy way
to make use of that unused data from the earlier processing along with
the remaining file contents. According to the documentation, you could
create an IO-like wrapper that will first feed in that unused data
followed by the real file data, and GzipReader.new should be able to use
that, but that’s a bit of a mess.
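
To make the wrapper idea concrete, here is a minimal sketch under a few assumptions: GzipReader#unused works as intended (that is, with the fix applied), and GzipReader.new only needs a read method on the object it is given. PushbackReader, inflate_members and gzip_blob are hypothetical names invented for this example and are not part of the standard library.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical wrapper: an IO-like object offering only #read, plus the
# ability to push bytes back so they are returned before any new data.
class PushbackReader
  def initialize(io)
    @io = io
    @pending = String.new
  end

  # Queue +bytes+ so that the next reads return them first.
  def unread(bytes)
    @pending = bytes.b + @pending
  end

  def read(length = nil, _outbuf = nil)
    if length.nil?
      data = @pending + @io.read.to_s.b
      @pending = String.new
      return data
    end
    return @io.read(length) if @pending.empty?

    # May return fewer bytes than requested; the caller simply reads again.
    @pending.slice!(0, length)
  end
end

# Inflate every gzip member found in +io+ and return them as an array.
def inflate_members(io)
  source = PushbackReader.new(io)
  members = []
  loop do
    gz = Zlib::GzipReader.new(source)
    members << gz.read
    unused = gz.unused
    gz.finish
    break if unused.nil? || unused.empty?

    # Hand the bytes GzipReader read but did not use to the next reader.
    source.unread(unused)
  end
  members
end

# Hypothetical helper: build one gzip blob in memory.
def gzip_blob(data)
  buffer = StringIO.new(String.new)
  writer = Zlib::GzipWriter.new(buffer)
  writer.write(data)
  writer.close
  buffer.string
end

blob = gzip_blob('first') + gzip_blob('second')
puts inflate_members(StringIO.new(blob)).inspect
```

Because the wrapper never seeks, the same loop should also cope with sources such as pipes, at the cost of holding the pushed-back bytes in memory.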

If all that really is a design limitation of GzipReader, having the
unused data isn’t very useful when attempting to inflate concatenated
gzip blobs as zcat does. You may be able to make it work with a little
judicious hacking, but it’s certainly more effort than it should be.
Maybe a ZcatReader is needed to plaster over things?

BTW, why do GzipReader.open and GzipReader.new behave so differently
with regard to the IO object you pass into them? They’re a little
closer in operation under Ruby 1.9 than they were under Ruby 1.8, but
the difference is still surprising given the idiom followed by File.open
and File.new where File.open is really just a simple wrapper around
File.new that can help ensure that File#close is called at the end of
your block.

-Jeremy

Jos_B · February 3, 2011, 11:07pm

On Thu, Feb 03, 2011 at 10:33:59AM +0900, Eric H. wrote:

It’s a bug, the internal buffer that libz uses is dup’d, but this is not
enough to make it safe for use by ruby. I have filed a ticket and attached
a stupid patch:

http://redmine.ruby-lang.org/issues/show/4360

Thanks, Eric!

Jos_B · February 3, 2011, 11:12pm

On Thu, Feb 03, 2011 at 02:03:49PM +0900, Jeremy B. wrote:

Once your fix is in place and GzipReader#unused works correctly, is
there any convenient way to take the returned string and continue
processing it along with the remaining file contents with an instance of
GzipReader?

Fwiw, with the changes just committed to trunk the following code works for me
on a file with multiple gzipped blobs:

require 'stringio'
require 'zlib'

def inflate(filename)
  File.open(filename) do |file|
    zio = StringIO.new(file.read)
    loop do
      io = Zlib::GzipReader.new zio
      puts io.read
      unused = io.unused
      io.finish
      break if unused.nil?
      zio.pos -= unused.length
    end
  end
end

inflate "gz"

Thanks,
Jos

Jos_B · February 3, 2011, 11:50pm

On 2/3/2011 3:57 PM, Jos B. wrote:

require 'zlib'
zio.pos -= unused.length

end
end
end

inflate "gz"

That’s great! How does the performance compare to zcat with your data?

BTW, this implementation does require that you have enough memory to
hold all of the gzipped file data at once. That will be a problem with
sufficiently large files or constrained resources.

-Jeremy

Jos_B · February 2, 2011, 9:15pm

On 2/2/2011 1:37 PM, Jos B. wrote:

It does, but it’s very slow. Sigh.

While I don’t think you’ll be able to make it as fast as zcat, given
that zcat is 100% native code, you might be able to take the
implementation of Zlib::ZReader and tweak it to avoid the need to read
only 1 byte at a time from the delegate stream. Doing so should speed
things up quite a bit. The existing code really isn’t very involved.
Most of the logic you would need to tweak is in the
Zlib::ZReader#unbuffered_read method, which is actually fairly short.

When @inflater reports that it has finished, it looks like you should be
able to get whatever is left in its input buffer using
@inflater.flush_next_in (from Zlib::ZStream). Then you can initialize a
new Zlib::Inflater instance and pass that remaining data as the first
input buffer to process. You would repeat this process every time the
inflater reports it has finished until the end of the delegate is
reached and there is no further data returned by flush_next_in.
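
A rough sketch of that restart loop, written against the standard library's Zlib::Inflate instead of ZReader. One deliberate difference from the description above: rather than relying on flush_next_in, whose behaviour after the end of a stream I have not verified for Zlib::Inflate, the sketch works out the leftover bytes by comparing the number of bytes fed in with the inflater's total_in. The names each_gzip_member and gzip_blob are hypothetical.

```ruby
require 'zlib'
require 'stringio'

# Yield the inflated contents of each gzip member in +io+, reading the
# input in chunks of +chunk_size+ bytes instead of one byte at a time.
def each_gzip_member(io, chunk_size = 16 * 1024)
  inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 16)
  output = String.new
  pending = String.new
  fed = 0 # bytes handed to the current inflater so far

  loop do
    if pending.empty?
      pending = io.read(chunk_size)
      break if pending.nil?
    end

    fed += pending.bytesize
    output << inflater.inflate(pending)

    if inflater.finished?
      # Bytes fed in but not consumed belong to the next member.
      leftover = fed - inflater.total_in
      pending = pending.byteslice(pending.bytesize - leftover, leftover)
      yield output
      inflater.close
      inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 16)
      output = String.new
      fed = 0
    else
      pending = String.new
    end
  end

  raise EOFError, 'input ended inside a gzip member' if fed > 0
ensure
  inflater.close if inflater && !inflater.closed?
end

# Hypothetical helper: build one gzip blob in memory.
def gzip_blob(data)
  buffer = StringIO.new(String.new)
  writer = Zlib::GzipWriter.new(buffer)
  writer.write(data)
  writer.close
  buffer.string
end

blob = gzip_blob('alpha') + gzip_blob('beta')
each_gzip_member(StringIO.new(blob)) { |member| puts member }
```

Since this only ever calls read on the input, it does not need a seekable stream, and the chunk size can be tuned freely because no bytes are lost between members.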

If I get some time this evening, I’ll look into creating a sample
implementation. No promises though.

-Jeremy

Jos_B · February 4, 2011, 1:16am

On Fri, Feb 04, 2011 at 07:38:04AM +0900, Jeremy B. wrote:

That’s great! How does the performance compare to zcat with your data?

Comparable:

% time zcat gz > /dev/null
zcat gz > /dev/null 0.29s user 0.00s system 99% cpu 0.296 total
% time ./gzr > /dev/null
./gzr > /dev/null 0.31s user 0.07s system 99% cpu 0.383 total
%

BTW, this implementation does require that you have enough memory to
hold all of the gzipped file data at once. That will be a problem with
sufficiently large files or constrained resources.

Using the file directly should avoid that. Since we have a File, we don’t need
the StringIO object:

require 'zlib'

def inflate(filename)
  File.open(filename) do |file|
    zio = file
    loop do
      io = Zlib::GzipReader.new zio
      puts io.read
      unused = io.unused
      io.finish
      break if unused.nil?
      zio.pos -= unused.length
    end
  end
end

inflate "gz"

Cheers,
Jos

Jos_B · February 4, 2011, 2:10am

On 02/03/2011 06:12 PM, Jos B. wrote:

On Fri, Feb 04, 2011 at 07:38:04AM +0900, Jeremy B. wrote:

That’s great! How does the performance compare to zcat with your data?

Comparable:

% time zcat gz > /dev/null
zcat gz > /dev/null 0.29s user 0.00s system 99% cpu 0.296 total
% time ./gzr > /dev/null
./gzr > /dev/null 0.31s user 0.07s system 99% cpu 0.383 total
%

Excellent.

def inflate(filename)
  end
end

inflate "gz"

The only case where I could see this failing now is if you were given a
non-seekable IO such as a socket or a pipe from which to read. Of
course, I apparently haven’t been thinking of solutions to these
problems myself very well, but you’ll probably figure out something
pretty quick.

-Jeremy