On Fri, Jun 30, 2006 at 07:22:53PM +0900, transfire / gmail.com wrote:
> I was just wondering about the rationale behind the format of gem
> packages. It seems rather odd that the package is a tar of
> 
>   data.tar.gz
>   metadata.gz
> 
> Why not just have the metadata stored in with the data and not worry
> about double layers? The only advantage I can figure is that it is
> possible to extract the metadata without uncompressing the entire
> package. Okay, but can't a tar/gzip lib or tool do that anyway? Is
> there some other reason?

It's a good solution in practice for many reasons; here's the answer I gave to
a similar question on [ruby-core:6258]:

    [...] here are the pros I can think of:
    * the format is extensible because it's possible to add new entries in the
      "outer" tarball. This has proved useful already: the package originally
      just contained metadata.gz and data.tar.gz, and recently data.tar.gz.sig
      and metadata.gz.sig have been added to support signatures.
    * it is easy to extract the metadata without uncompressing the whole
      tarball
    * it's possible to write data.tar.gz and generate the file lists and other
      information dynamically before writing metadata.gz, while data.tar.gz is
      being written. It is thus possible to store for instance a
      cryptographic digest of the data.tar.gz file in metadata.gz. This would
      be somewhat harder if the metadata were included in a single tarball,
      especially if we compressed it.
    * it takes little time to locate metadata.gz inside the tarball (we'd have
      to go through many more entries if it were a flat tarball). While access
      is still O(n), n is the number of entries in the outer file (2
      originally, now 4) instead of the normally much more numerous data
      files.

    Also, note that the code in package.rb was written carefully to avoid
    having to keep the full contents of the archive (or any contained file) in
    memory at any point in time (with the exception of metadata.gz, of
    course). RubyGems doesn't exploit that ability since the first thing it
    does before unpacking is uncompressing all the data and storing it in an
    array, but package.rb would have supported O(1) memory usage. That's why
    metadata.gz comes after data.tar.gz inside the .gem.
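
To make the ordering and lookup points above concrete, here is a rough sketch. It is not the actual package.rb code: build_toy_gem and read_toy_metadata are hypothetical names, the "files:" / "data_sha256:" metadata layout is invented for the example (real gems store a YAML gemspec), and SHA-256 is just one possible digest. It assumes the Gem::Package::TarWriter and TarReader classes that ship with current RubyGems, plus Zlib.gzip/Zlib.gunzip from Ruby 2.4 or later. Unlike package.rb, the toy keeps everything in memory for brevity.

```ruby
require 'digest'
require 'rubygems/package'
require 'stringio'
require 'zlib'

# Build a toy outer tarball: data.tar.gz is produced first, so its digest
# can be recorded in metadata.gz, which is written after it.
# The metadata layout is made up for this example.
def build_toy_gem(files)
  # Inner tarball holding the package's files.
  inner = StringIO.new(''.b)
  Gem::Package::TarWriter.new(inner) do |tar|
    files.each do |name, body|
      tar.add_file_simple(name, 0o644, body.bytesize) { |io| io.write(body) }
    end
  end
  data_tar_gz = Zlib.gzip(inner.string)

  # The metadata is generated only once data.tar.gz is complete.
  metadata = "files: #{files.keys.join(',')}\n" \
             "data_sha256: #{Digest::SHA256.hexdigest(data_tar_gz)}\n"
  metadata_gz = Zlib.gzip(metadata)

  # Outer tarball: just two entries, data.tar.gz followed by metadata.gz.
  outer = StringIO.new(''.b)
  Gem::Package::TarWriter.new(outer) do |tar|
    tar.add_file_simple('data.tar.gz', 0o644, data_tar_gz.bytesize) do |io|
      io.write(data_tar_gz)
    end
    tar.add_file_simple('metadata.gz', 0o644, metadata_gz.bytesize) do |io|
      io.write(metadata_gz)
    end
  end
  outer.string
end

# Walk only the outer entries. data.tar.gz is skipped over, never
# decompressed; only metadata.gz is read and gunzipped.
def read_toy_metadata(gem_bytes)
  Gem::Package::TarReader.new(StringIO.new(gem_bytes)).each do |entry|
    return Zlib.gunzip(entry.read) if entry.full_name == 'metadata.gz'
  end
  nil
end
```

Calling read_toy_metadata(build_toy_gem('README' => "hi\n")) returns the two metadata lines after looking at only the outer entries, however many files the inner tarball holds.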

Also, on [ruby-core:6251]:

    The "nested tarball" format was inspired by Debian's .deb format. The
    latter uses ar for the outer layer, but I saw no reason to implement
    another subformat. When I originally hacked the package format for
    rpa-base, I used nested zip files; I changed that to use POSIX tarballs
    when I discovered that RubyZip triggered a bug in Tempfile that would
    cause ruby to use over 100MB RAM to create a 300KB .zip file. That was
    fixed quickly, but by then there was no reason to change the package
    format again. Had this bug not been there, maybe RubyGems would be using
    zipfiles now :-)

-- 
Mauricio Fernandez  -   http://eigenclass.org   -  singular Ruby