--------------Boundary-00H9FERK07BXTNFUOMP2H
Content-Type: text/plain;
  charsetso-8859-1"
Content-Transfer-Encoding: 8bit

I've come up with my first Ruby program (below). Actually, it's a
library module that implements the Hunt/McIlroy "diff" algorithm. It was
pretty fast to write (I more or less translated it from my Perl CPAN
module), but it may not have  been a great choice. (I'm going to post it
to RAA, so I'll have more questions about the right way to package it).

I'm a bit frustrated with the speed. On a pair of 4000 line arrays of
text, it runs on my machine in 35 seconds. My Perl code, on the other
hand, does the same job in 25 seconds. Of course, it's fast enough on
smaller arrays.

I've looked at it quite a bit, and even wrote a line profiler, but can't
manage to make it much faster.

The single biggest chunk of time is spent in the binary search routine
in replaceNextLargerWith(). I found that I could make it a tiny bit
faster by doing a single <and then using equal? to compare with
literal -1 and 1.

Anyway, I'd appreciate it if someone could give me some advice as to how
to make it more competitive with Perl's speed. I'd rather keep it pure
Ruby, instead of making a C extension.

Another question is the API. Note that there is a keyGen and a compare
method (that call hash() and eql?() respectively by default); I'd expect
people using this module with specialized data types  to override these
if necessary.

But in traverseSequences, there are up to 5 callbacks so that user code
can be called at the right times. Should this require inheritance or
overriding, or is the Ruby Way to pass in multiple Method or Proc
instances?

Thanks,
-- 
Ned Konz
currently: Stanwood, WA
email:     ned / bike-nomad.com
homepage:  http://bike-nomad.com





--------------Boundary-00H9FERK07BXTNFUOMP2H
Content-Type: text/plain;
  charsetso-8859-1";
  name
ifftest.rb"
Content-Transfer-Encoding: 8bit
Content-Disposition: attachment; filename="difftest.rb"

require 'diff'

$,   '
diff  iff.new
a  w(a b c e h j l m n p)
b  w(b c d e f j k l m r s t)
correctResult  w(b c e j l m)

result  iff.lcs(a, b)
print "a: ", a, "\n"
print "b: ", b, "\n"
print "correctResult: ", correctResult, "\n"
print "result: ", result, "\n"

result  iff.diff(a, b)
print "diff output:", result.inspect

--------------Boundary-00H9FERK07BXTNFUOMP2H
Content-Type: text/plain;
  charsetso-8859-1";
  name
iff.rb"
Content-Transfer-Encoding: 8bit
Content-Disposition: attachment; filename="diff.rb"

# Hunt/McIlroy diff algorithm for Ruby
# by Ned Konz, ned / bike-nomad.com
class Diff
  private
  # return a hash mapping keys to arrays of line numbers
  def withPositionsOfInInterval(aCollection, istart, iend)
    d  }
    (istart .. iend).each do |index|
      element  Collection[index]
      key  eyGen(element) 
      if d.has_key?(key)
        d[key].unshift(index)
      else
        d[key]  index]
      end
    end
    return d
  end

  # This is where the bulk of the time (40%) is spent in this module,
  # so try to make it fast!
  def replaceNextLargerWith(thresh, aValue, high)
    high  hresh.length - 1  if high.nil? or high 0
    # off the end?
    if high -1 or aValue > thresh[-1]
      thresh.push(aValue)
      return high + 1
    end

    # binary search for insertion point...
    low  
    while low < igh
      index  high + low) >> 1
      result  Value <thresh[index]
      if result.equal?(1)
        low  ndex + 1
      elsif result.equal?(-1)
        high  ndex - 1
      else
        return nil 
      end
    end

    # now insertion point is in low.
    thresh[low]  Value  # overwrite next larger
    return low
  end

  # This method computes the longest common subsequence in a and b.
  def longestCommonSubsequence(a, b)
    aStart  
    aFinish  .length - 1
    bStart  
    bFinish  .length - 1
    matchVector  ]

    # First we prune off any common elements at the beginning
    while aStart < Finish and
      bStart < Finish and
      compare(a[aStart], b[bStart])
      matchVector[aStart]  Start
      aStart + 
      bStart + 
    end

    # now the end
    while aStart < Finish and
      bStart < Finish and
      compare(a[aFinish], b[bFinish])
      matchVector[aFinish]  Finish
      aFinish  Finish - 1
      bFinish  Finish - 1
    end

    # Now compute the equivalence classes of positions of elements
    bMatches  ithPositionsOfInInterval(b, bStart, bFinish)
    thresh  ]
    links  ]
    (aStart .. aFinish).each do |i|
      ai  eyGen(a[i])
      if bMatches.has_key?(ai)
        k  
        bMatches[ai].each do |j|
          # optimization: most of the time this will be true
          if k and k > 0 and thresh[k] > j and thresh[k - 1] < j
            thresh[k]  
          else
            k  eplaceNextLargerWith(thresh, j, k)
          end
          unless k.nil?
            links[k]  (k > 0 ? links[k - 1] : nil), i, j]
          end
        end
      end
    end

    if thresh.length
      link  inks[thresh.length - 1]
      until link.nil?
        matchVector[link[1]]  ink[2]
        link  ink[0]
      end
    end

    return matchVector
  end

  protected
  # override these two methods to provide
  # different comparison semantics

  def keyGen(s)
    s.hash
  end

  def compare(aa,ab)
    aa.eql?(ab)
  end

  public

  def traverse_sequences(a, b, callbacks)
    # these should default to nil
    finishedACallback   allbacks['finished_a']
    finishedBCallback   allbacks['finished_b']

    # and these to a no-op
    callbacks.default  roc.new { |args| }
    matchCallback      allbacks['match']
    discardACallback   allbacks['discard_a']
    discardBCallback   allbacks['discard_b']

    matchVector  ongestCommonSubsequence(a, b)

    bi  ; ai  
    (0 ... matchVector.length).each do |ai|
      bLine  atchVector[ai]
      if bLine.nil?
        discardACallback.call(ai, bi)
      else
        while bi < bLine
          discardBCallback.call(ai, bi)
          bi + 
        end
        matchCallback.call(ai, bi)
        bi + 
      end
    end
    ai + 

    # the last entry (if any) processed was a match.

    if not finishedBCallback.nil? and ai < a.length
      finishedBCallback.call(bi)
    else
      while ai < a.length
        discardACallback.call(ai, bi)
        ai + 
      end
    end

    if not finishedACallback.nil? and bi < b.length
      finishedACallback.call(ai)
    else
      while bi < b.length
        discardBCallback.call(ai, bi)
        bi + 
      end
    end
  end

  def lcs(a, b)
    matchVector  ongestCommonSubsequence(a, b)
    retval  ]
    i  
    while i < matchVector.length
      retval.push(a[i]) unless matchVector[i].nil?
      i + 
    end
    return retval
  end

  def diff(a, b)
    retval  ]
    hunk  ]
    discard  roc.new { |ai,bi|
      hunk.push(['-', ai, a[ai]])
    }
    add  roc.new { |ai,bi|
      hunk.push(['+', bi, b[bi]])
    }
    match  roc.new { |ai,bi|
      if hunk.length > 0
        retval.push(hunk)
        hunk  ]
      end
    }
    traverse_sequences(a, b, {
      'match' match,
      'discard_a' discard,
      'discard_b' add
    })
    match.call(0, 0)
    return retval
  end
end

--------------Boundary-00H9FERK07BXTNFUOMP2H--