[Synopsis: a C library I work with uses opaque integer handles to
refer to internal toolkit objects.  It also requires correct
deallocation order for some of the objects.  I can write a
wrapper layer for the C implementation of Python to have it do the
correct automatic garbage collection, but can't figure out how to
use Ruby for the same task, because finalization order isn't
guaranteed and because Ruby assumes C extension types are always
accessed through pointers.]

Hello,

  I posted this on c.l.py but Matz asked that I repost it here.
This is the more appropriate group, but I'm a long-time Python
developer and c.l.py people nearly always do a good job of describing
the pros and cons of different languages.  Plus, posting here meant
I needed to reread the FAQ and the back newsgroup postings.

  The topic was C/C++ integration.  From my admittedly poor
understanding of Ruby, I don't follow how I could use it for a
system I worked on called PyDaylight.

  The Daylight toolkit is a library for chemical informatics.
It contains data types like "molecule", "atom", "bond", "pattern"
and "reaction."  It is written in C but exposes a consistent
API for both C and Fortran programmers.  This API uses opaque
object handles to refer to internal objects.  These are represented
as integers - starting with 1 - because Fortran doesn't have a
pointer data type.  The internal data model is object oriented,
but it is hidden behind that API.

  For example, using the SWIG'ged Python interface to the C code
(this is from memory, commentary on the right)

>>> from dayswig_python import *
>>> dt_smilin("CO")                 # Create a molecule
1                                   #   the molecule handle is 1
>>> dt_typename(1)                  # Get the toolkit's name for this type
'molecule'
>>> dt_stream(1, TYP_ATOM)          # Create an iterator over the atoms
2                                   #   this is a new object
>>> dt_next(2)                      # Get the first object in the iterator
3                                   #   another new object
>>> dt_typename(3)                  # What is it?
'atom'                              #   an atom
>>> dt_symbol(3)                    # What kind of atom is it?
'C'                                 #   carbon
>>> dt_next(2)                      # Next atom
4
>>> dt_symbol(4)
'O'                                 #   is an oxygen
>>> dt_next(2)                      # Next atom?
0                                   #   nope, finished with the iteration
>>> dt_dealloc(2)                   # Remove the iterator
1
>>> dt_dealloc(1)                   # Remove the molecule
1
>>>
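To make the handle model concrete, here is a toy sketch of a
toolkit-side handle table.  It is only an illustration: the
HandleTable class, its method names, and the 0 return value for an
unknown handle are my assumptions, not the Daylight internals.  It is
written for Python 3.

```python
class HandleTable:
    """Toy model of a toolkit-side table; all names are hypothetical."""

    def __init__(self):
        self._objects = {}     # handle -> (type name, internal data)
        self._next = 1         # handles start at 1; 0 means "no object"

    def alloc(self, typename, payload):
        handle = self._next
        self._next += 1
        self._objects[handle] = (typename, payload)
        return handle

    def typename(self, handle):
        return self._objects[handle][0]

    def dealloc(self, handle):
        # Assumed convention: 1 on success, 0 for an unknown handle.
        return 1 if self._objects.pop(handle, None) is not None else 0


table = HandleTable()
mol = table.alloc("molecule", bytearray(1024))  # payload stands in for C data
print(mol, table.typename(mol))                 # the caller only holds an int
print(table.dealloc(mol), table.dealloc(mol))   # the second call fails
```

The caller never sees anything but a small integer, so nothing on the
scripting side reveals how much memory sits behind it.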

The interface layer I wrote in Python hides this low-level interface
to allow the following

>>> from daylight import Smiles
>>> for atom in Smiles.smilin("CO"):
...     print atom.typename, atom.symbol
...
atom C
atom O
>>>

It does this by:
  - wrapping the integer handle inside of a class instance, as in

      class dayobject:
          def __init__(self, handle):
              self.handle = handle
          def __int__(self):
              return int(self.handle)
          ...
      class Atom(dayobject):
          ...

      atom = Atom(3)     # where 3 is an atom handle

    The __int__ method allows a dayobject instance to be coerced into
    the value expected by the SWIG interface.  The int(self.handle)
    is needed for reasons discussed below.

  - converting attribute lookup to function calls via a getattr hook,
      which lets me do 'atom.symbol' instead of 'dt_symbol(atom.handle)'
      (In Python, the getattr hook lets the instance define how to
      resolve attribute lookups if the attribute isn't otherwise found.
      Ruby has a similar method, if memory serves.)

  - converting the toolkit iterator model to a Python one, either
      by direct conversion to a list (eg, bond.atoms returns a list
      with two Atom instances) or through a lazy interface (eg,
      an iterator through all the compounds in a database).  I
      know Ruby does iterators well.

  - doing the appropriate garbage collection.
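The getattr hook from the second item can be sketched as follows.
This is a minimal illustration, not the PyDaylight source: the
FakeToolkit stub and the values it returns are made up to stand in
for the real dayswig_python module, and the code targets Python 3.

```python
class FakeToolkit:
    """Hypothetical stub standing in for the SWIG'ged dt_* functions."""

    def dt_symbol(self, handle):
        return {3: "C", 4: "O"}[handle]

    def dt_typename(self, handle):
        return "atom"


toolkit = FakeToolkit()


class dayobject:
    def __init__(self, handle):
        self.handle = handle

    def __int__(self):
        return int(self.handle)

    def __getattr__(self, name):
        # Called only when normal lookup fails, so 'handle' never gets here.
        func = getattr(toolkit, "dt_" + name, None)
        if func is None:
            raise AttributeError(name)
        return func(int(self))


class Atom(dayobject):
    pass


atom = Atom(3)                       # where 3 is an atom handle
print(atom.typename, atom.symbol)    # resolved through dt_typename, dt_symbol
```

That covers the attribute lookup.  The last item in the list, garbage
collection, is the hard one.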

I'm not sure how well Ruby handles this last part.  Let me
explain in even more detail.

The lifetime of a toolkit object may be dependent on another
object (its parent).  For example, the lifetime of an atom is
dependent on the molecule.  If the molecule is deallocated, then
all of its atoms are deallocated.  (If an atom is deallocated,
it is deleted from the molecule, but the molecule persists.)
The lifetime doesn't even always depend on the object type.  For
example, the molecule may be created on its own, or may be part
of a "reaction" data type.  (In the reaction "[OH-] + [H+] -> [H2O]"
there are three molecules.)

The only place which knows the lifetime of the object is the
function used to create it.  By reading the documentation and
experimenting I found which functions return a handle that the
caller owns.  Those handles I wrap in a new object, something like:

  class smart_ptr:
    def __init__(self, handle):
      self.handle = handle
    def __int__(self):
      return self.handle
    def __del__(self):
      dayswig_python.dt_dealloc(self.handle)

Here's where you see why dayobject's __int__ calls int(self.handle) -
if it's a smart_ptr, it still needs to be converted into an integer.

Consider the molecule.  If I create a molecule from scratch
then I return a Molecule wrapping a smart_ptr wrapping the handle

   def smilin(smiles):
      mol_handle = dayswig_python.dt_smilin(smiles)
      return Molecule(smart_ptr(mol_handle))

If instead I return a molecule which is a component of a reaction
I use:

      mol_handle = ... # code not shown because it's too complicated
                       # and irrelevant to this discussion
      return Molecule(mol_handle)
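The difference between the two cases can be seen in a small
self-contained sketch.  The dt_dealloc below is a fake which only
records its argument, and the handle values 1 and 5 are arbitrary;
both are assumptions for illustration.  It is written for Python 3
and relies on CPython's reference counting.

```python
deallocated = []              # handles passed to the fake dt_dealloc, in order


def dt_dealloc(handle):
    """Hypothetical stand-in for dayswig_python.dt_dealloc."""
    deallocated.append(handle)
    return 1


class smart_ptr:
    def __init__(self, handle):
        self.handle = handle

    def __int__(self):
        return self.handle

    def __del__(self):
        dt_dealloc(self.handle)


class Molecule:
    def __init__(self, handle):
        self.handle = handle          # either a plain int or a smart_ptr

    def __int__(self):
        return int(self.handle)


owned = Molecule(smart_ptr(1))        # created from scratch: we own handle 1
borrowed = Molecule(5)                # part of a reaction: the reaction owns it
print(int(owned), int(borrowed))      # both coerce to plain integers

del borrowed                          # plain int, nothing to deallocate
print(deallocated)
del owned                             # smart_ptr ref count hits 0 under CPython
print(deallocated)
```

Under CPython the first list printed is empty and the second holds
only handle 1.  The borrowed handle is never passed to dt_dealloc.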

This approach works because of the __del__ method, which is how
Python does finalization.  In the C implementation, it is called
when the reference count goes to 0.  (It is not called when
the garbage collector finds and removes non-accessible cycles.)

This lets me use Python's garbage collector for all toolkit objects.

Things get trickier.  Some objects manage their own lifetime
but are also dependent on another object.  One such is the
'MatchObject' used for substructure searches.  If the molecule
is deleted, the toolkit invalidates all of the handles used
in any MatchObject related to the molecule.

Consider

  class MatchObject(daylight.dayobject):
    def __init__(self, mol, smarts, match_handle, flags):
      daylight.dayobject.__init__(self, match_handle)
      self.mol = mol
      ...
    def __del__(self):
      del self.handle
      del self.mol

This saves both the handle for the MatchObject ('match_handle')
*and* a reference to the molecule ('mol').  Because it keeps a
reference to the molecule, that molecule will not be garbage
collected until all of the MatchObjects are also removed.  My
tests confirm this behaviour.

But notice that the finalizer is careful about the order in
which objects are deleted.  The match object is removed before
deleting the molecule.  If the order were reversed, the ref count
for the molecule would go to 0 first, so the molecule would be
dt_dealloc'ed by the smart_ptr's __del__.  The toolkit would then
invalidate all of the match objects associated with the molecule.

Python doesn't know the match object handles are invalid.  When
the 'del self.handle' occurs, the smart_ptr for the match
object calls its __del__, which tries to deallocate the
invalidated match object.  This is not allowed by the toolkit,
although thankfully it returns an error message rather than
core dumping.  By deleting the objects in the correct order
I ensure the library calls are done as needed by the toolkit.
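Here is a runnable sketch of the two orderings.  The fake toolkit
(the 'live' table, the 'depends_on' map, the log messages) and the
BadMatchObject class are invented for this illustration; only the
shape of MatchObject.__del__ follows the code above.  It targets
Python 3 and depends on CPython running __del__ as soon as the ref
count reaches 0.

```python
live = {}                    # fake handle table: handle -> type name
depends_on = {2: 1}          # match handle 2 is invalidated with molecule 1
log = []


def dt_dealloc(handle):
    """Hypothetical stand-in: freeing a parent invalidates its dependents."""
    if handle not in live:
        log.append("error: handle %d is invalid" % handle)
        return 0
    del live[handle]
    for child, parent in list(depends_on.items()):
        if parent == handle:
            live.pop(child, None)
    log.append("dealloc %d" % handle)
    return 1


class smart_ptr:
    def __init__(self, handle):
        self.handle = handle

    def __del__(self):
        dt_dealloc(self.handle)


class Molecule:
    def __init__(self, handle):
        self.handle = handle


class MatchObject:
    def __init__(self, mol, match_handle):
        self.handle = match_handle
        self.mol = mol

    def __del__(self):
        del self.handle      # free the match object first
        del self.mol         # then release the molecule


class BadMatchObject(MatchObject):
    def __del__(self):
        del self.mol         # wrong order: the molecule goes first
        del self.handle


def run(match_class):
    live.clear()
    live.update({1: "molecule", 2: "match"})
    del log[:]
    mol = Molecule(smart_ptr(1))
    match = match_class(mol, smart_ptr(2))
    del mol                  # still referenced by the match object
    del match                # the finalizer runs here under CPython
    return list(log)


print(run(MatchObject))      # ['dealloc 2', 'dealloc 1']
print(run(BadMatchObject))   # ['dealloc 1', 'error: handle 2 is invalid']
```

The only difference between the two classes is the order of the two
del statements, which is exactly what I cannot see how to express
when the finalization order is left to the collector.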


I understand that Ruby also has a way to do finalization for
an instance, but I'm concerned about several things.

 1. The size of a toolkit object can be large - up to 64K atoms,
which is a couple of MB.  Because Ruby doesn't know anything
about that memory, how does it know when to do garbage collection?
Under the C implementation of Python, this object is gc'ed when
it is no longer needed - when the ref count hits 0.

It can't be subtle and search memory for pointer-like values
because the toolkit stores a table mapping the integer handles to
the internal pointer.

 2. The Ruby way to do finalization seems to be with the
'ObjectSpace.define_finalizer' method, which associates a finalizer
with each object instance.  Does that mean each instance I
create needs to be registered?  (As compared to Python where
I define the finalization in the class definition - not with
each instance.)

 3. That lets me implement the 'smart_ptr' behaviour, but I
still don't understand how to define finalization relationships
between two Ruby objects, so that I can guarantee one object is
removed before the other.  In pure Ruby code this isn't needed,
but I need to match the semantics expected by the C library.  In
Python I did it by defining the order in the __del__ (specifically
in MatchObject.__del__)

I cannot find a pure Ruby solution to this problem, which means
working at the C level.  I know just how complicated the Python
code was to ensure the correct dependencies - I'm glad I could code
everything up in Python.  Is a pure Ruby solution possible?

 I read in http://www.rubycentral.com/book/ext_ruby.html how
to manage a C pointer with Data_Wrap_Struct.  This interface
allows you to tell the gc which associated objects should
be marked as "in use."  In some sense this is what I'm looking
for, except for two things:

  a. I still don't know how to tell it which objects are to be
    deleted first.  Are those mark relationships stored so the
    acyclic components are removed in the correct order?

  b. I don't have a C pointer.  All I have are integers.  I guess
    I could cast them to a pointer value, but there's always the
    chance it could collide either with a real pointer or with
    another library which also uses integer handles.


If my observations are correct, then there is a category of C
libraries which do not work well under Ruby but do work well
under the C implementation of Python.

(I say C implementation of Python because the __del__ semantics
are implementation dependent.  The Daylight toolkit is also
available in Java via JNI.  My code can talk to it using Jython.
But Jython doesn't run the __del__ methods, instead leaving
gc up to the Java runtime.  But the JVM doesn't know the internals
of the toolkit, so I end up leaking memory all over the place.)

BTW, if you wish, the source code for this Python package is available
at http://starship.python.net/crew/dalke/PyDaylight-0.7.tar.gz and
a description of some of the implementation details is available in
the Jan. 2001 Dr. Dobb's Journal.

Sincerely,

                    Andrew
                    dalke / acm.org
P.S.
  Any errors in interpretation of Ruby or Python are purely my
fault.  My background is in physics and my interest these days
is software applications development for computational chemistry
and biology.  That means I may not use the right computer science
words for certain topics and that I do not have the experience
to readily understand how Ruby works.