Bug #4278: Ruby Zlib::GzipReader Consistently Fails When Large Uncompressed File Size
http://redmine.ruby-lang.org/issues/show/4278

Author: Andrew R Jackson
Status: Open, Priority: Normal
Category: ext, Target version: 1.9.1
ruby -v: ruby 1.9.1p429 (2010-07-02 revision 28523) [x86_64-linux]

PROBLEM:

If a .gz file was created from an input larger than 2^32-1 bytes (e.g. a 4.3GB file was gzipped), then Ruby's Zlib::GzipReader will consistently fail with "invalid compressed data -- length error".

This is due to a bug in ext/zlib.c that comes with Ruby. Specifically, the bug is in gzfile_check_footer().

The same .gz file decompresses correctly with the gunzip command, so this is a Ruby-specific bug.


SCOPE:

This bug has been verified to affect both 1.9.1p429 and various patch levels for Ruby 1.8.*.

It is likely to affect older versions as well.


TO REPLICATE:

On a 64-bit OS with LARGE_FILE support:
* create a raw file that is larger than 4.2GB
* gzip the raw file
* in Ruby, use Zlib::GzipReader to loop over each line; it will fail as it approaches 4.2GB of uncompressed bytes read
* verify that gunzip of the resulting .gz file works just fine
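
The Ruby step above could be sketched roughly as follows. The helper name count_uncompressed_bytes is hypothetical, not something from the original report; it only uses Zlib::GzipReader.open and each_line from Ruby's standard zlib extension, and rescues the length error quoted in the PROBLEM section.

```ruby
require "zlib"

# Hypothetical helper (name is an assumption, not part of Ruby's zlib):
# streams a .gz file line by line and returns the number of uncompressed
# bytes read. On an affected Ruby, a file whose original size exceeded
# 2^32-1 bytes is expected to raise Zlib::GzipFile::LengthError instead.
def count_uncompressed_bytes(gz_path)
  bytes = 0
  Zlib::GzipReader.open(gz_path) do |gz|
    gz.each_line { |line| bytes += line.bytesize }
  end
  bytes
rescue Zlib::GzipFile::LengthError => e
  # Report how far we got before the footer check failed, then re-raise.
  warn "length error after #{bytes} uncompressed bytes: #{e.message}"
  raise
end
```

Calling count_uncompressed_bytes on the gzipped >4.2GB file should reproduce the failure, while a small .gz file simply returns its uncompressed size.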


CAUSE: gzfile_check_footer() in ext/zlib.c

Within this function, gzfile_get32() is used to get the length field from the gzip footer (aka trailer).

The length field in the gzip footer is a 4-byte field that "contains the size of the original (uncompressed) input data modulo 2^32".
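
To see what the footer actually stores, the last 8 bytes of a .gz file can be read directly. This is a minimal sketch assuming a single-member gzip file; the helper name gzip_trailer is hypothetical. The two fields are the CRC32 followed by the length, both little-endian unsigned 32-bit values.

```ruby
require "zlib"

# Hypothetical helper: returns [crc32, length] from the final 8 bytes of a
# single-member .gz file. "V" unpacks a little-endian unsigned 32-bit value.
def gzip_trailer(gz_path)
  File.open(gz_path, "rb") do |f|
    f.seek(-8, IO::SEEK_END)   # the trailer is the last 8 bytes of the file
    f.read(8).unpack("VV")
  end
end
```

For a file created from more than 2^32-1 bytes of input, the second value is expected to be the original size with the upper bits dropped, not the true size.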

But "z.stream.total_out" is a uLong that tracks the total uncompressed bytes streamed out. It is maintained by zlib itself as its routines are used to stream out uncompressed data from the gzip file.
* uLong is zlib's typedef for unsigned long, which is 8 bytes on 64-bit Linux
* thus, it can keep track of more than 4.2GB of streamed uncompressed output
* total_out - "total nb of bytes output so far"

Thus, I believe it is incorrect to compare "length" and "z.stream.total_out" as is done in gzfile_check_footer().
* This will work only when: (a) less than 4.2GB are streamed from the file or (b) the original file was <4.2GB
* It fails when there is more than 4.2GB of uncompressed data streamed
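
The mismatch can be illustrated with plain arithmetic. The sketch below is not the C code from ext/zlib.c; it only models the two quantities in Ruby, using a made-up size of 4,300,000,000 bytes and a hypothetical helper name.

```ruby
# Number of distinct values a 4-byte footer field can hold.
FOOTER_MODULUS = 2**32

# Hypothetical helper: the value a gzip writer stores in the 4-byte length
# field for a given uncompressed size.
def stored_footer_length(uncompressed_size)
  uncompressed_size % FOOTER_MODULUS
end

total_out = 4_300_000_000                 # what a 64-bit counter would report
length    = stored_footer_length(total_out)

puts length                # 5032704
puts total_out == length   # false, so a direct comparison raises the error
```

For any size below 2^32 the two values are identical, which is why the problem only appears with very large inputs.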


static void
gzfile_check_footer(struct gzfile *gz)
{
    unsigned long crc, length;

/* ...SNIP...following will always call rb_raise when uncompressed data was >4.2GB */

    if (gz->z.stream.total_out != length) {
        rb_raise(cLengthError, "invalid compressed data -- length error");
    }
}

SOLUTION:

I think it would be correct to compare: z.stream.total_out __modulo 2^32__ AGAINST length
* I don't think it is correct to compare them directly without the modulo operation, which would fail when total_out > 4.2GB
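
As a rough model of the proposed comparison (in Ruby rather than C, and not an actual patch to ext/zlib.c), the check could reduce total_out modulo 2^32 before comparing. The method name footer_length_ok? is hypothetical; in C the same reduction would presumably be done by masking total_out with 0xFFFFFFFF.

```ruby
# Mask that keeps only the low 32 bits, i.e. value modulo 2^32.
LOW_32_BITS = 0xFFFF_FFFF

# Hypothetical model of the proposed check: compare the stream's byte
# counter, reduced modulo 2^32, against the 4-byte footer length.
def footer_length_ok?(total_out, footer_length)
  (total_out & LOW_32_BITS) == footer_length
end
```

With this form, small files behave exactly as before, while a stream of 2^32 + 5 bytes is accepted when the footer says 5.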

