Yusuke ENDOH wrote:
> 2010/3/22 Bill Kelly <billk / cts.com>:
>> RBTree and MultiRBTree provide functionality which, with its
>> worst-case O(log N) search, insert, and delete complexity for
>> a sorted pair associative container can't be readily duplicated
>> with Array, Hash, or Set. ?(As far as I know.)
> 
> Hash has amortized O(1) search, insert, and delete complexity,
> I think.  Indeed, it becomes O(N) at worst-case (when rehash
> occurs).  Does anyone have a concrete problem due to rehash?

Agreed: for Hash I would expect O(1) search, and amortized O(1)
insert and delete complexity.

To avoid rehash, a Hash#reserve(size) method might be nice, but,
for me, that is all separate from why I am interested in RBTree.

> I think this feature request is very tough because it can be
> substituted by Hash in many cases.  So I think you guys should
> appeal the difference.  It would be good to show some real-world
> case where Hash cannot be used and RBTree is really needed.

Some differences:

Hash is not maintained in key-sorted order.

Hash does not offer upper_bound(key) or lower_bound(key)
or bound(key1, key2) in O(log N) time.

Hash doesn't provide fast search for partial string key.


An example, indexing words in documents, and doing
partial keyword searches.

(Note: MultiRBTree#bound seems to be broken.)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
require 'rbtree'

ful_ = %w(
fulcra
fulcrum
fulcrums
fulfil
fulfill
fulfilled
fulfilling
fulfillment
fulfills
fulfilment
fulfils
full
fullback
fullbacks
fulled
fuller
fullest
fulling
fullness
fulls
fully
fulminate
fulminated
fulminates
fulminating
fulmination
fulminations
fulsome
)

multi_ = %w(
multicolored
multicultural
multiculturalism
multidimensional
multifaceted
multifarious
multifariousness
multilateral
multilingual
multimedia
multimillionaire
multimillionaires
multinational
multinationals
multiple
multiples
multiplex
multiplexed
multiplexer
multiplexers
multiplexes
multiplexing
multiplicand
multiplicands
multiplication
multiplications
multiplicative
multiplicities
multiplicity
multiplied
multiplier
multipliers
multiplies
multiply
multiplying
multiprocessing
multipurpose
multiracial
multitasking
multitude
multitudes
multitudinous
multivariate
multivitamin
multivitamins
)

dis_ = %w(
distortions
distorts
distract
distracted
distracting
distraction
distractions
distracts
distrait
distraught
distress
distressed
distresses
distressful
distressing
distressingly
distribute
distributed
distributes
distributing
distribution
distributions
distributive
distributor
distributors
district
districts
distrust
distrusted
distrustful
distrustfully
distrusting
distrusts
disturb
disturbance
disturbances
disturbed
disturbing
disturbingly
disturbs
)


doc1 = [ "foo/doc1.txt", ful_   + multi_ ]
doc2 = [ "bar/doc2.txt", multi_ + dis_   ]
doc3 = [ "baz/doc3.txt", ful_   + dis_   ]


dict = MultiRBTree.new

[doc1, doc2, doc3].each do |docpath, words|
  words.each do |w|
    dict.store(w, docpath)
  end
end


puts dict.lower_bound("mult")  # => ["multicolored", "foo/doc1.txt"]
puts dict.upper_bound("mult")  # => ["fulsome", "baz/doc3.txt"]
puts dict.bound("mult")        # <-- broken
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Note: The documentation for RBTree#bound reads:

 * call-seq:
 *   rbtree.bound(key1, key2 = key1)                      => array
 *   rbtree.bound(key1, key2 = key1) {|key, value| block} => rbtree
 *
 * Returns an array containing key-value pairs between the result of
 * MultiRBTree#lower_bound and MultiRBTree#upper_bound. If a block is
 * given it calls the block once for each pair.

So I expected dict.bound("mult") to return all elements from:

  dict.lower_bound("mult")  => ["multicolored", "foo/doc1.txt"]

through:

  dict.upper_bound("mult")  => ["fulsome", "baz/doc3.txt"]


However, #bound just retunrs []  :(


I consider this a bug.


  *  *  *


A comment:

Even if MultiRBTree#bound worked as expected, I must concede a
significant liability of MultiRBTree's API compared to C++
std::multimap, is the lack of iterators.

With std::multimap, I can find dict.lower_bound("mult"), and
then iterate over as many or as few subsequent elements in sorted
order as I choose.  (When building a typedown menu, for example,
I may only want the first 10 results.)

I suppose rbtree.bound(key1, key2, _limit_) would be one way to
provide equivalent functionality; or perhaps support for 1.9
Enumerable would be another.


  *  *  *


Anyway, issues with MultiRBTree#bound aside, other ways I've
used a Sorted Pair Associative Container include implementing
various kinds of priority queues.  (Sorted integer keys.)


  *  *  *


I do think that for RBTree and MultiRBTree to be as generally
useful as C++ std::map and std::multimap, there should be
versions of methods like bound, lower_bound, upper_bound, that
return an enumerator.

Also, I think it should be possible to unambiguously delete
a specific element from a MultiRBTree.

Currently:

  t = MultiRBTree.new
  t.store "foo", "456"
  t.store "foo", "123"
  t.delete "foo"  # <-- which is deleted?  foo/123 or foo/456 ?

It appears that MultiRBTree#delete deletes the oldest key/value
pair matching the supplied key, so it would be foo/456.

As far as I can tell, there's no way to delete foo/123 from t
without first deleting foo/456.  So that is another limitation
when compared to std::multimap.


  *  *  *


Hmm.

So it seems that even though RBTree and MultiRBTree are
internally equivalent to C++ std::map and std::multimap, the
interface exposed to the programmer is less flexible than the
C++ versions.

I think RBTree and MultiRBTree would be more useful it were
possible to obtain enumerators from bound and lower_bound,
and to be able to delete arbitrary elements in a MultiRBTree.



Sorry this email is so long.  I didn't expect to encounter
these issues.



Regards,

Bill