[rubyzip + open-uri] reading zipfiles from a url?

Hi everyone!

I’m trying to read a zipfile directly from a url. However, open-uri and
rubyzip don’t seem to cooperate very well:

require 'zip/zip'
require 'open-uri'

url = 'http://www.cibiv.at/~phuong/vien/8a375.zip'
zip_file = Zip::ZipFile.open(url)

This code raises a ZipError saying it can’t find the zip file;
Zip::ZipFile.open apparently isn’t hooked into open-uri at all. I’m
pretty new to Ruby (and programming in general). Is there any other
way to open the zipfile directly?

Writing every zipfile out to a file on disk first would be a nightmare
performance-wise: I only need a few very small files out of each
zipfile, but I have to process a few hundred zipfiles…
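
The closest thing I can think of is pulling the whole archive into
memory and handing rubyzip a StringIO instead of a filename, but I
can’t tell whether the zip/zip API accepts an IO object there (newer
rubyzip apparently exposes this as Zip::InputStream.open(io)). A
rough, untested sketch:

require 'zip/zip'
require 'open-uri'
require 'stringio'

url  = 'http://www.cibiv.at/~phuong/vien/8a375.zip'
data = open(url) { |f| f.read }   # one HTTP request, archive kept in memory

# Assumes the input stream accepts an IO; I don't know if zip/zip's does.
Zip::ZipInputStream.open(StringIO.new(data)) do |zis|
  while (entry = zis.get_next_entry)
    puts "#{entry.name} (#{entry.size} bytes)"
    # content = zis.read   # reads the current entry's bytes if needed
  end
end

Would something along those lines work?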

Thanks in advance!
Janus

-------- Original Message --------

Date: Sun, 13 Jul 2008 22:50:21 +0900
From: David M. [email protected]
To: [email protected]
Subject: Re: [rubyzip + open-uri] reading zipfiles from a url?

When I last tried this, performance was pretty terrible – no caching,
and it would fetch the file in blocks, so many separate HTTP requests.
But it would probably do what you want.

Hi —

there is also rio (http://rio.rubyforge.org/):

Copy a file from an FTP server into a local file, un-gzipping it:

rio('ftp://host/afile.gz').gzip > rio('afile')

I have no idea what its performance would be.
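
I don’t know whether rio can look inside zip archives, but the same
copy idiom should also work over HTTP (an assumption on my part,
untested), which would at least fetch each archive in a single request:

rio('http://www.cibiv.at/~phuong/vien/8a375.zip') > rio('8a375.zip')

Though that puts you back at writing the archive to disk, so it may
not help with the performance concern.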

Best regards,

Axel

On Wednesday 02 July 2008 06:49:26 Janus B. wrote:

url = 'http://www.cibiv.at/~phuong/vien/8a375.zip'
zip_file = Zip::ZipFile.open(url)

Maybe there is a right way to do this…

I’m going to argue that it would be difficult at best. Zip files store
metadata, such as compressed file locations, at the end of the file.
After that, you’d want to seek somewhere in the middle. Since open-uri
is probably meant to issue a single, straightforward HTTP request
(that is, ask for the whole file, from beginning to end), I’m not sure
this would work well.

But that’s just an educated guess.
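
If the server honours HTTP Range requests, you could in principle
fetch just the tail of the file, where the central directory lives,
and then issue further ranged requests for the entries you want; but
rubyzip has no hook for that, so you’d be parsing the zip structures
yourself. A rough, untested sketch of the first step (the 64 KB figure
is just a safe upper bound on how far the end-of-central-directory
record can sit from the end of the file):

require 'net/http'
require 'uri'

uri = URI.parse('http://www.cibiv.at/~phuong/vien/8a375.zip')

# Ask for only the last 64 KB of the archive; a server that supports
# ranges answers 206 Partial Content with just those bytes.
req = Net::HTTP::Get.new(uri.request_uri)
req['Range'] = 'bytes=-65536'

res  = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
tail = res.body   # check res.code == '206'; a '200' means you got the whole file

From there you would hunt for the end-of-central-directory signature
("PK\x05\x06") in tail and read the entry offsets out of the central
directory, which is exactly the kind of bookkeeping a library should
be doing for you.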

Depending on how portable you need this to be, you might consider the
FUSE-based HTTPFS:

http://httpfs.sourceforge.net/

When I last tried this, performance was pretty terrible – no caching,
and it would fetch the file in blocks, so many separate HTTP requests.
But it would probably do what you want.

On 13/07/2008, David M. [email protected] wrote:

When I last tried this, performance was pretty terrible – no caching,
and it would fetch the file in blocks, so many separate HTTP requests.
But it would probably do what you want.

This is how FUSE is designed: the caching is supposed to happen in the
kernel’s upper layers, and the requests are passed on exactly as
received from the kernel, so there is nothing httpfs can do about the
granularity. In practice the kernel requests chunks of up to ~16k, but
probably only when the application does large block reads.

I tried keep-alive, which should speed up subsequent requests.
However, the sockets can then hang, and it takes time to detect that;
still, that should only happen when there are network problems anyway.

It is very nice for mounting CD or DVD images, and extracting a single
file from a zip could also be faster that way. If you want all the
files anyway, it’s probably better to just download the zip.

Also, last time I looked, the httpfs at SF was broken: there was a bug
around the SSL #ifdefs in read/write. You may need to define or
undefine USE_SSL, or fix the code.

FUSE is theoretically portable to *BSD, but this code is not, because
it relies on undefined behaviour of directory operations to keep the
underlying directory visible after the mount.

Thanks

Michal