Hi,

I'm working on a librairie to analyse repport of the prelude IDS on a MySQL
database. I need set operations (- and intresection) on quite huge liste
(several thousand of elements) and i've discovered that the - operator seems
to be always in O(n*m) (where length of list are n and m).
In fact when the list is already sorted, this operation should be O(n+m)
which is much better.
I've realised a little Set class which ilustrate this pb, i've code a O(n+m)
solution in pure Ruby code and this solution is really faster than the Ruby
one when the length is important.
This is an example, i've created two list t1 and t2 of 10000 random number,
i apply sort! and uniq! and them i test the two algo. (source code is at the
end of mail).
cedric@lagavulin:~/projet/prelude$ ./set.rb
sorting... 0.053775
size t1:9535, t2:9523
diff... 69.653139
my new diff... 0.263852
check if we get the same result with operator <=>
ok
cedric@lagavulin:~/projet/prelude$ ruby -v
ruby 1.6.7 (2002-03-19) [i386-linux]

So the hard coded - operation spends 69s and my one 0.26 s.

So i propose few solution:
1) In the - method, test if the list are allready sorted (which is a O(n+m)
operation), them adapt the algo (ie O(n+m) or O(n*m)).
2) In the - method do a sort first (which is a O(n*ln n+m*ln m) operation)
and them do the - with the O(n+m) algo. Because the - operator is a set
operator, the order doesn't really make sense...
3) Add a Set class (like my one but hard coded) where the order doesn't make
a sense and so what we manipulate is internaly alway sorted. - operation is
alway in O(n+m).

Sorry for my english, it's not my mother language.

Regards

remark: Union operation should be in O(n+m) because list are sorted (like in
a merge sort). I did a O((n+m) ln (n+m)) because i'm lazzy but it's quite
trivial to improve that.

#!/usr/bin/ruby

class Set
  def initialize(tab)
    @set = tab.sort
    @set.uniq!
  end

  def [](i)
    return @set[i]
  end

  def length
    return @set.length
  end

  def -(aSet)
    res = []
    i = 0
    for j in 0..aSet.length-1
      while i < @set.length and @set[i] < aSet[j] do
 res << @set[i]
 i+=1
      end
      i += 1 if @set[i] == aSet[j]
    end
    res.concat(@set[i.. / set.length-1]) if i<@set.length
    return Set.new(res)
  end

  def +(aSet)
    tmp = @set+aSet.to_a
    tmp.sort!
    tmp.uniq!
    return Set.new(tmp)
  end

  def &(aSet)
    return (self-aSet)+(aSet-self)
  end

  def to_a
    return @set
  end

  def to_s
    return "{"+@set.join(', ')+"}"
  end
end


t1 = []
t2 = []
for i in 0..10000
  t1[i]=rand(100000)
  t2[i]=rand(100000)
end
print "sorting... "
time1 = Time.new
t1.sort!
t1.uniq!
t2.sort!
t2.uniq!
time2 = Time.new
print time2-time1,"\n"
puts "size t1:#{t1.length}, t2:#{t2.length}"
print "diff... "
tmp1 = t2-t1
time3 = Time.new
print time3-time2,"\n"
print "my new diff... "
tmp2 = (Set.new(t2)-Set.new(t1)).to_a
time4 = Time.new
print time4-time3,"\n"
puts "check if we get the same result with operator <=>"
puts "ok" if (tmp1 <=> tmp2) == 0



--
Cedric Foll
courriel: cedric dot foll at laposte dot net