[sorry for the very late reply; I left this message in +postponed and forgot
about it. I'm posting it because there has been no discussion about the role
of RubyGems as a format for upstream releases and the associated requirements]

[summary: under some circumstances, a .gem-only release leads to information
loss in the sense that we cannot create and distribute modified
versions, repackage or continue the development. It is possible to avoid that,
while addressing the repackaging issue, without affecting RubyGems'
functionality]

On Thu, Sep 29, 2005 at 09:46:45AM +0900, Jim Weirich wrote:
> On Wednesday 28 September 2005 07:35 pm, Mauricio Fernández wrote:
> > *  [... gem contents depending on gem software ... particularly require_gem
> >    and DATADIR issues ...] 
> 
> Other than these two issues, are there other kinds of gem references that cause 
> problems ... or is this the bulk of it?

These are the most important issues, I believe.
There is another more general problem I will try to describe below. It can
be very, very important in some situations, but there is a way to avoid it
which doesn't involve changes in the existing RubyGems functionality.

> It is important to differentiate between the gems runtime and the gems package 
> format.  AFAIK, there is nothing in the package format that makes it any 
> worse than, say a tar file.  Yes, there are things to make it even easier 
> (e.g. more metadata, better identification of arch dependent vs arch 
> independent files), but it seems to me that starting with a gem should be 
> easier than starting with a raw tar file.

As I wrote previously (please forgive the self-quotation):

"While the .gem package format now allows for easy extraction of its contents,
the contents themselves might depend on RubyGems, and the way RubyGems
installs them, to execute correctly. It's all about the contents."

So, fundamentally, the .gem *format* is essentially equivalent to a
tarball. Indeed, I hacked it to be a tarball! [1] ;-) So, as I said,
it's the contents that matter, and how we use them. On the basis of the
package format alone, repackagers might have a slight preference
for .tar.gz because it doesn't require any changes to existing tools,
but that matters very little compared to the actual contents of the
archive.

> Take the rake project for example.  Rake is distributed as both a gem and a 
> tar file.  Given that the same software is delivered in both package formats, 
> would you prefer to start with the tar file?  And if so, why?  What specific 
> changes would make it easier for repackagers?

I've been looking at Rake and the way it is packaged. It looks like a
very good example of how we'd like all .gem packages to be :-)
Indeed, it features two important properties:
(1) the source code contained in the .gem and the tarball is the same, and  
    the latter can be used to install & run on a system without depending on
    RubyGems (i.e. no direct dependency on RubyGems and no assumptions
    regarding the directory layout)
(2) idempotence relative to the packaging process

So far, we've been talking about (1). I will now try to explain (2), why I
think we should care, and how it relates to what we've discussed before.

(2) means the .gem includes everything needed to rebuild the .gem. In other
words, that the .gem archive is essentially equivalent to the upstream,
pristine sources, as found (in this case) in RubyForge's CVS repository.
Since the tarball includes the very same files as the .gem, given the
tarball, I can build the gem, which when unpacked can "build" the
tarball, and so on ad infinitum.
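
One way to check this property by hand is to unpack the .gem and the tarball
into two directories and compare them file by file. The following is only a
minimal sketch of that comparison; the helper names tree_manifest and
tree_diff are made up for illustration and are not part of RubyGems or Rake.

```ruby
require "digest"

# Map every regular file under +root+ (path relative to +root+) to the
# SHA-256 digest of its contents.
def tree_manifest(root)
  manifest = {}
  Dir.glob("**/*", File::FNM_DOTMATCH, base: root).each do |rel|
    full = File.join(root, rel)
    next unless File.file?(full)   # skip directories and "." entries
    manifest[rel] = Digest::SHA256.file(full).hexdigest
  end
  manifest
end

# Describe how two unpacked trees differ. If all three lists are empty,
# both archives carry the same files with the same contents.
def tree_diff(first_root, second_root)
  a = tree_manifest(first_root)
  b = tree_manifest(second_root)
  {
    only_in_first:  (a.keys - b.keys).sort,
    only_in_second: (b.keys - a.keys).sort,
    changed:        (a.keys & b.keys).reject { |k| a[k] == b[k] }.sort,
  }
end
```

Calling tree_diff("rake-from-gem", "rake-from-tarball") on two such
directories (the directory names are placeholders) lists files present on
only one side and files whose contents differ.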

You might ask, why care about idempotence at all? It is desirable
because it means that, for the package we're considering, the .gem
archive is self-sufficient. In other words, if all other tarballs and
the CVS repository (or even the author, God forbid) were to disappear,
anybody could still take rake-0.6.0.gem, unpack its contents, and work
with that just like the upstream author (hi Jim :). This includes
being able to release derived versions in .gem and .tar.gz format. [In
practice, we cannot work exactly like the original author because we
lose access to the revision history stored in his VCS, but having the
pristine sources he was working with goes a long way in that direction.]

So, idempotence of a .gem file (I'm abusing the language; maybe
I should rather talk of "fixed point-ness" of the .gem ;) is certainly
desirable, but how much does it matter in practice? It could be argued
it actually doesn't, as long as a tarball which can be used to generate
the .gem is available somewhere. However, in practice things like these
happen (I've often bumped into them):
* people releasing only in .gem format, without pointing to an online
  repository
* .gem packages that don't include the Rakefile/gemspec used to build
  the .gem
* source code being processed before the .gem is generated (code
  generation, fixing internal references, etc.)
* assumptions in the code that can be attributed to RubyGems' 1-gem-1-dir
  install layout, which require modifications to the software if it is to
  be installed differently

When the first condition holds, we run into problems depending on which
of the others also apply. Indeed, it can become difficult to:
* create new versions of the software with small modifications (including, but
  not limited to, backports of security fixes) in RubyGems format
* repackage the software (possibly modified) so it can be installed and used
  without depending on RubyGems
* continue the development of the software from some given version (branching,
  maintenance...)

for anybody but the original author.

To sum up, if the "RubyGems packaging process" is not idempotent, we
can lose the information required at later stages to continue the
development or create and distribute derivative versions.
This is not something RubyGems, the software, is solely to blame for.
But it's something RubyGems can prevent easily.

This is what I think we can do to prevent such problems:
(1) make sure that the source code contained in the .gem is the same as we'd have
    in a separate, non-RubyGems-dependent tarball, meaning that it can be
    installed and used without relying on RubyGems; in practical terms, that it
    will run when unpacked in sitelibdir, that there are no DATADIR problems,
    etc.
(2) either
    (a) make sure that .gem files can always be rebuilt from the data contained
        in the .gem itself
    or
    (b) ensure that the data used to generate the .gem originally is available

(1) is what we've been talking about all the time.
(2)(a) would require deep changes in RubyGems: a generic build phase would
have to be added, and RubyGems would operate with "ports" (pristine sources +
RubyGems build info) in addition to packages. This is doable (I already wrote
such a thing once), but it's probably more than the current RubyGems team
would feel inclined to do.
I think (2)(b) is the way to go. Some functionality can be added to
RubyGems so that an independently installable tarball is generated
during the "gemification" process, and something like gem lint would
help upstream developers make sure that their code hasn't become hard to
repackage.
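
To make the "gem lint" idea a bit more concrete, here is a rough sketch of
the kind of check it could run over an unpacked source tree. No such command
is assumed to exist; lint_source_tree and its messages are hypothetical. It
only covers two of the problems listed above (a missing Rakefile/gemspec and
uses of require_gem); DATADIR and install-layout assumptions would need more
than a text scan.

```ruby
# Hypothetical lint pass over an unpacked source tree. Returns a list of
# human-readable warnings; an empty list means none of the checks fired.
def lint_source_tree(root)
  warnings = []

  # The build description has to ship with the sources, otherwise the
  # .gem cannot be rebuilt from its own contents.
  has_build_file = File.file?(File.join(root, "Rakefile")) ||
                   !Dir.glob("*.gemspec", base: root).empty?
  unless has_build_file
    warnings << "no Rakefile or .gemspec found: the .gem cannot be rebuilt from its contents"
  end

  # Flag non-comment lines that call require_gem, since they tie the code
  # to the RubyGems runtime.
  Dir.glob("**/*.rb", base: root).sort.each do |rel|
    File.foreach(File.join(root, rel)).with_index(1) do |line, lineno|
      next if line.lstrip.start_with?("#")
      if line.match?(/\brequire_gem\b/)
        warnings << "#{rel}:#{lineno}: require_gem ties the code to the RubyGems runtime"
      end
    end
  end

  warnings
end
```

Running lint_source_tree on the directory obtained by unpacking a .gem and
printing the result with puts would give upstream authors a quick list of
things to fix before a release.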

> BTW, let me just add that I am a Debian user and one of the things that drew 
> me to the distro was its packaging system.  I certainly don't want to do 
> anything to discourage the packaging of ruby software (gem or otherwise) as 
> debian packages.  If someone wants to manage their system entirely as debian 
> packages, more power to them ... and I hope that gemified ruby software is 
> readily available in that format (and that goes for RPMs and FreeBSD ports as 
> well).  

It is reassuring to hear that, especially since some people have been
expressing their disdain towards repackagers and their work in recent
threads [I think such contempt originates from painful experiences and
lack of understanding of their work]. We can work together to solve the
problems experienced with RubyGems as it stands right now, and that won't
require any loss of functionality.

[1] One can extract the data with 
   
 tar Oxf foo-0.0.1.gem data.tar.gz | (mkdir foo-0.0.1 && tar -C foo-0.0.1 -zxf -)

The "nested tarball" format was inspired by Debian's .deb format. The latter
uses ar for the outer layer, but I saw no reason to implement another
subformat. When I originally hacked the package format for rpa-base,
I used nested zip files; I changed that to use POSIX tarballs when I
discovered that RubyZip triggered a bug in Tempfile that would cause
ruby to use over 100MB RAM to create a 300KB .zip file. That was fixed
quickly, but by then there was no reason to change the package format
again. Had this bug not been there, maybe RubyGems would be using zipfiles now :-)

-- 
Mauricio Fernandez