Hi Jeremy,

Thanks for your reply.

On Mon, Jan 31, 2011 at 02:28:30AM +0900, Jeremy Bopp wrote:
> On 01/28/2011 05:09 PM, Jos Backus wrote:
[snip]
> > Hi,
> > 
> > I'm trying to inflate a set of concatenated gzipped blobs stored in a single
> > file. As it stands, Zlib::GzipReader only inflates the first blob. It
> > appears that the unused instance method would return the remaining data,
> > ready to be passed into Zlib::GzipReader, but it yields an error:
> > 
> > method `method_missing' called on hidden T_STRING object
> > 
> > What could be going on here?
> 
> I'm not sure what's going on, but I was hoping you could solve your
> problem by running something like this:
> 
> File.open('gzipped.blobs') do |f|
>   begin
>     loop do
>       Zlib::GzipReader.open(f) do |gz|
>         puts gz.read
>       end
>     end
>   rescue Zlib::GzipFile::Error
>     # End of file reached.
>   end
> end

I tried something like this, but as you point out, it doesn't work.

> Unfortunately, Ruby 1.8 doesn't appear to support passing anything other
> than a file name to Zlib::GzipReader.open, and Ruby 1.9 seems to always
> reset the file position to the beginning of the file prior to starting
> extraction when you really need it to just start working from the
> current position.  So it doesn't appear that you can do this with the
> standard library.
 
That's what it looks like, yes. Bummer.
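For reference, the shape I was attempting looks like the sketch below: read one member, grab the leftover compressed bytes via GzipReader#unused, and restart on them. This is what the docs suggest, and it's exactly where the hidden T_STRING error from my first message bites, so treat it as the intent rather than a confirmed fix (the `each_member` helper name is mine):

```ruby
require 'zlib'
require 'stringio'

# Sketch only: walk concatenated gzip members by recovering, after each
# member, the compressed bytes GzipReader read past that member's trailer.
def each_member(data)
  until data.nil? || data.empty?
    io = StringIO.new(data)
    gz = Zlib::GzipReader.new(io)
    yield gz.read
    leftover = gz.unused.to_s       # buffered bytes past this member's trailer
    gz.finish                       # finish, not close: leave io readable
    data = leftover + io.read.to_s  # plus whatever GzipReader never buffered
  end
end
```

GzipReader#unused only returns what was read ahead into its internal buffer, so the remainder still sitting in the underlying IO has to be appended as well.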

> As part of a ZIP library I wrote, there is a more general implementation
> of a Zlib stream filter.  Install the archive-zip gem and then try the
> following:
> 
> gem 'archive-zip'
> require 'archive/support/zlib'
> 
> File.open('gzipped.blobs') do |f|
>   until f.eof? do
>     Zlib::ZReader.open(f, 15 + 16) do |gz|
>       gz.delegate_read_size = 1
>       puts gz.read
>     end
>   end
> end
> 
> 
> This isn't super efficient because we have to hack the
> delegate_read_size to be 1 byte in order to ensure that the trailing
> gzip data isn't sucked into the read buffer of the current ZReader
> instance and hence lost between iterations.  It shouldn't be too bad
> though since the File object should be handling its own buffering.

This works, but sadly it is very slow. Whereas zcat takes under a second on my
test file, this code takes about 17 seconds.

> BTW, I wrote some pretty detailed documentation for Zlib::ZReader.  It
> should explain what the 15 + 16 is all about in the open method in case
> you need to tweak things for your own streams.

Great. But I didn't have to tweak anything; it just worked :)

> > On a related note, Zlib::GzipReader#{pos,tell} returns the position in the
> > output stream (zstream.total_out) whereas I am looking for the position in
> > the input stream. I tried making zstream.total_in available but the value
> > appears to be 18 bytes short in my test file, that is, the next header is
> > found 18 bytes beyond what zstream.total_in reports.
> 
> I think total_in is counting only the compressed data; however,
> following the compressed data is a trailer as required for gzip blobs.
> You could probably always add 18 to whatever you get, but as I noted
> earlier, the implementation of GzipReader seems to always reset any file
> object back to the beginning of the stream rather than start processing
> it from an existing position.  I can't find any documentation listing a
> way to force GzipReader to jump to any other file position after
> initialization either.

Yeah, you'd have to feed GzipReader the right part of the input stream
yourself and figure out how much it processed. Something tells me the offset
isn't always 18 but depends on internal buffering, which would invalidate the
assumption of a fixed offset.
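One way to sidestep GzipReader entirely (a sketch, assuming the blobs fit in memory, and with `inflate_members` being my own name for the helper) is to drop down to Zlib::Inflate with the gzip wrapper enabled: once a member's trailer has been verified, total_in reports exactly how many compressed bytes that member consumed, header and trailer included, so no fixed 18-byte fudge is needed:

```ruby
require 'zlib'

# Sketch: inflate concatenated gzip members with Zlib::Inflate, using
# total_in to learn how many compressed bytes (header + trailer included)
# the finished member consumed, then restarting on the remainder.
def inflate_members(data)
  out = []
  until data.empty?
    zi = Zlib::Inflate.new(Zlib::MAX_WBITS + 16)  # +16: expect a gzip wrapper
    out << zi.inflate(data)       # stops at this member's end; rest is untouched
    consumed = zi.total_in        # valid once the trailer has been checked
    zi.close
    data = data[consumed..-1].to_s
  end
  out
end
```

The running total of `consumed` values would also give the input-stream offsets I was after in the first place.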

> > Does anybody know how to make the library return the correct offset into the
> > input stream so multiple compressed blobs can be handled?
> 
> Hopefully, my solution will work for you because I don't think the
> current implementation in the standard library will do what you need.
 
It does, but it's very slow. Sigh.

Thanks again, Jeremy.

Cheers,
Jos
-- 
Jos Backus
jos at catnook.com