"Xiangrong Fang" <xrfang / hotmail.com> wrote in message
news:20030526135729.583F.XRFANG / hotmail.com...
> Hi ruby fans,
>
> I have a question about algorithm efficiency.

OK

> I have used hash in my ruby program. It worked fine. However, I found
> that the memory consumed by a hash is tremendous. While the hash file is
> stored on disk (using pstore) it is merely 100-200KB, but in memory, it
> consumes approximately 5-10 Megabytes! I don't know if it is related to
> the cache algorithm used by ruby?

I'm assuming 32 bit architecture below.

I don't know how the hash works in Ruby, or how you dumped it to disk.
But if the on-disk format is unindexed, so that it has to be loaded into
memory before it becomes useful, it can be far more compact than any
in-memory structure.
A hash can be quite memory consuming. Each entry in the hash typically
holds: the hash key, a reference to the real key, a reference to the
data, and a reference to the next collision.

This is typically 4 32-bit words, plus the actual key itself.

For efficient allocation you typically keep about 25% spare entries
(this can vary), so you actually use 5 x 32-bit words per entry in this
case.
You also need a bucket table. The bucket table could be the array of
entries directly, but it is better to have a separate table. Each
bucket table slot takes up one reference to an entry. The bucket table
is preferably more empty than full (which is why it is better not to
use the entry table directly). How much overcapacity you use in the
bucket table is a design decision, but for an at-least-half-empty table
you need two references per stored entry.
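In C-like terms, the layout sketched above might look like this (the
names and exact fields are illustrative, not Ruby's actual
implementation):

    /* One hash entry: 4 x 32-bit words on a 32-bit machine, plus
       whatever the key itself occupies. */
    typedef struct entry {
        unsigned      hashkey; /* cached hash value               */
        void         *key;     /* reference to the real key       */
        void         *data;    /* reference to the data           */
        struct entry *next;    /* reference to the next collision */
    } entry;

    entry  *entries; /* entry pool, with ~25% spare capacity   */
    entry **buckets; /* bucket table, kept at least half empty */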

So we are totalling 7 x 32-bit words per stored entry, not counting the
keys themselves (strings or whatever). Keys are typically short, even
for strings, so probably one or two 32-bit words.

A total rough estimate is therefore 8 32-bit words per stored entry in
the hash table.
If each entry's payload is 2 32-bit words in a flat, unindexed file,
the in-memory hash uses roughly 5 times that memory (8 words of
overhead plus the 2 words of data). This suggests that Ruby should use
somewhere around 0.5-1MB if your on-disk datastructure uses 100-200KB.
However, Ruby probably has a lot of extra per-object overhead for keys,
which would account for the rest of the gap.

>
> I tried to reduce this memory requirement and thought of using Binary
> Tree. Just did some test in Delphi, it seems good. I loaded an English
> dictionary of about 1M (90000+ lines) it just took about 6-8M of memory,
> seems much better than hash.

Binary trees are typically implemented as red-black trees. Each node
has a reference to the data, references to the left and right children,
and often also a reference to the parent. It also needs a bit to store
the node's color. That bit can be hidden by tagging one of the
pointers, but typically a node uses 5 32-bit words, plus the keys. If
we use 2 32-bit words for storing keys, we again end up at 7 32-bit
words per entry.
That is only marginally better than hash tables.
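As an illustrative C sketch again (not any particular library's actual
node layout):

    /* A typical red-black tree node: 5 x 32-bit words, plus keys. */
    typedef struct rbnode {
        void          *data;  /* reference to the keyed data       */
        struct rbnode *left;
        struct rbnode *right;
        struct rbnode *parent;
        int            color; /* red/black; could be folded into a
                                 tagged pointer bit instead        */
    } rbnode;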

> Now my question is, since hash is O(1) and BTree is O(LOG(N)), does it
> mean that hash is born to be less storage-efficient than BTree? How can
> I get the best of both BTree and hash?

Don't worry too much about the log(N) in B-Trees. It is not log2(N)
but logM(N), where M is in the range of 10 to 1000. For most practical
purposes the B-Tree can be viewed as close to O(1). However, the
constant factor may be somewhat larger than for hash tables. It just
means a B-Tree scales well.
A binary tree does not scale nearly as well: there you truly get
log2(N) performance.
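To put numbers on this, a quick C sketch; one million entries and a
branch factor of 100 are just example values:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double n = 1e6; /* one million entries */
        /* levels searched: log2(N) for a binary tree,
           logM(N) for a B-Tree with branch factor M = 100 */
        printf("binary tree:    ~%.1f levels\n", log(n) / log(2.0));
        printf("B-Tree (M=100): ~%.1f levels\n", log(n) / log(100.0));
        return 0; /* prints ~19.9 vs ~3.0 */
    }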

A binary tree and a B-Tree keep an order relation; a hash does not
- or rather:

Usually you do not get to keep insertion order in hashtables, but it is
in fact possible to do so without paying any significant time/space
overhead. I have written such a hashtable, but only implemented it for
32-bit keys (which can therefore be stored in-place in the hash entry).
This is not sorting, but it is extra ordering information.
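One way to get that property (a sketch of the general idea only):
allocate entries sequentially from a pool, so that the pool order is
the insertion order while the bucket table still gives O(1) lookup:

    /* Entries live in the pool in insertion order; no extra order
       field is needed.  Deletions would complicate this scheme and
       are ignored here. */
    typedef struct ientry {
        unsigned       key;  /* 32-bit key stored in-place       */
        void          *data;
        struct ientry *next; /* next collision in the same bucket */
    } ientry;

    ientry  pool[1024];   /* pool[0..used-1] is the insertion order */
    int     used;
    ientry *buckets[256]; /* hash lookup as usual */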

Searching binary trees is usually faster than most things for small
(<100) collections. This assumes the tree allocates its nodes from a
common pool, so all nodes are close to each other (for cache
performance).
Arrays are faster for very small (<10) collections (linear search, not
binary search).

A hash table generally performs very well for medium to large amounts
of data. One problem is the memory consumption, which hurts cache
efficiency. Another problem is the large bucket table with its random
access pattern, which is also bad for cache performance. However,
despite these problems, a good hash is so fast that it still manages to
beat more clever datastructures, except in the competition for size.

A B-Tree leaf entry uses [reference to key, reference to data]. The
tree also uses additional memory for internal nodes (assuming all data
is stored at leaf entries). Had the B-Tree been binary, it would use
the same amount of memory for internal nodes as for leaf storage, but
the branching is higher, so the memory consumption drops. I don't have
any exact measures here, but the cost of internal nodes is limited.
A B-Tree typically operates at about 2/3 of its allocated capacity, so
the 2 words per leaf entry become 3 32-bit words, plus internal node
overhead; a rough estimate is therefore 4 words per entry. Then you
need to add the overhead for any keys you store externally. Thus we
get to about 5 32-bit words per entry.

The best way to work with in-memory B-Trees is to ensure a large branch
factor that is still cache friendly. Furthermore, we want to scan a
B-Tree node linearly rather than with binary search, because binary
search is too slow for anything below 100 entries. A good choice for
in-memory nodes appears to be around 7-15 entries, but that depends on
the processor and the data stored. If the key is external, it may be
worth using binary search to reduce the number of comparisons.
The B-Tree generally has very good cache performance, except that it
can be hurt by external keys.
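A sketch of such a node and its linear scan; the branch factor and the
in-place 32-bit keys are assumptions for the example:

    #define BRANCH 15 /* entries per node, in the 7-15 range above */

    typedef struct btnode {
        int            nkeys;
        unsigned       keys[BRANCH];      /* in-place 32-bit keys */
        void          *data[BRANCH];      /* references to data   */
        struct btnode *child[BRANCH + 1]; /* NULL in leaf nodes   */
    } btnode;

    /* Linear scan: at this node size it beats binary search.
       Returns the index of the first key >= the search key. */
    static int find_slot(const btnode *n, unsigned key)
    {
        int i = 0;
        while (i < n->nkeys && n->keys[i] < key)
            i++;
        return i;
    }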

The Judy tree / trie is even more complex than the B-Tree, but it is
more memory efficient than B-Trees and may be slightly more cache
friendly.

The B-Tree and Judy tree both scale well to very large datastructures,
whereas a hash table gets into trouble as soon as you start thrashing
to disk. A binary tree simply stops being efficient long before memory
becomes a problem.

The Judy trie does not have a problem with external keys. I have
developed a B-Trie, which is a stack of B-Trees, each holding a part of
the key. It uses too much memory because there are many mostly-empty
B-Tree nodes, but this can be fixed by optimizing B-Trees for very
small collections.

I believe the B-Tree is easier to update than Judy trees and is better
for combined in-memory and on-disk datastructures. However, for
strictly in-memory operation the Judy tree may be better.

Another datastructure, the skip-list, has been very popular because it
is 10 times easier to implement than an efficient in-memory B-Tree. It
has been claimed to be better than B-Trees. This is not true. A
skip-list jumps all over memory and has poor cache performance; it
suffers from pointer chasing. I have done some benchmarks, and I've
seen one other person do similar investigations and reach the same
conclusion.
This does not mean you shouldn't use skip-lists. They are still useful
datastructures - they may be better than the red-black trees which are
typically used for map implementations.

B-Trees also lend themselves to efficient memory allocation. Judy
trees and skip-lists allocate all sorts of different memory sizes,
whereas a B-Tree only allocates one node size (and possibly one smaller
node size for the optimized special case of very small collections).
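That means a trivial free list on top of malloc() is all the allocator
a B-Tree really needs; a sketch (the node size is an assumed
placeholder):

    #include <stdlib.h>

    enum { NODE_SIZE = 256 }; /* assumed fixed node size in bytes */
    union slab { union slab *next; char bytes[NODE_SIZE]; };

    static union slab *free_list;

    static void *node_alloc(void)
    {
        union slab *s = free_list;
        if (s) {
            free_list = s->next;
            return s;
        }
        return malloc(sizeof(union slab));
    }

    static void node_free(void *p)
    {
        union slab *s = p; /* recycle the node onto the free list */
        s->next = free_list;
        free_list = s;
    }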

Mikkel