"Xiangrong Fang" <xrfang / hotmail.com> wrote in message news:20030526135729.583F.XRFANG / hotmail.com... > Hi ruby fans, > > I have a question about algorithm efficiency. OK > I have used hash in my ruby program. It worked fine. However, I found > that the memory consumed by a hash is tremendous. While the hash file is > stored on disk (using pstore) it is merely 100-200KB, but in memory, it > consumes apporximately 5-10 Megabytes! I don't know if it is related to > the cache algorithm used by ruby? I'm assuming 32 bit architecture below. I don't know how the hash works in Ruby, or how you dumped it on disk. But if the disk is non-indexed such that it is loaded into memory before becoming useful, you can save an awful lot of memory. A hash can be quite memory consuming. Each entry in the hash typically has: hashkey, reference to real key, reference to data, reference to next collision. This is typically 4 32bit words plus the actually key. For afficient allocation you typically have 25% extra entries but this can vary, so you actually use 5 x 32bit in this case. You also need a bucket table. A bucket table could be that array of entries directly, but it is better to have a separate table. The bucket table takes up on reference to the entry. The bucket table is preferably more empty than full (which is why it is better to not use the entry table directly. How much overcapacity you use in the bucket table is a design decision, but for at least half empty you need two references. So we are totalling 7 x 32bit per stored entry, not counting any keys (strings or whatever). Keys are typically short, even for strings, so one or two 32bit words probably. A total rough estimate is therefore 8 32 bit words per stored entry in the hash table. If you have 2 32 bit words worth of data stored in a flat file unindexed, you use 4 times that memory in a in-memory hash. This suggests that Ruby should use around 1-2MB if your on disk datastructure uses 100-200KB. However, Ruby probably has a lot of extra per-object overhead for keys. > > I tried to reduce this memory requirment and thought of using Binary > Tree. Just did some test in Delphi, it seems good. I loaded an English > dictionary of about 1M (90000+ lines) it just took about 6-8M of memory, > seems much better than hash. Binary trees are typically implemented as red-black trees. The have reference to data, reference to left and right child, often also reference to parent. And they have some bits to store its color. This can be done tagging a bit in one of the pointers, but typically a node useds 5 32 bit words and the keys. If we use 2 32 bit words for storing keys, we also end up using 7 32bit words per entry. That is only marginally better than hash tables. > Now my question is, since hash is O(1) and BTree is O(LOG(N)), does it > mean that hash is born to be less storage-efficient than BTree? How can > I get the best of both BTree and hash? Don't worry too much about the Log(N) in B-Trees. It is not log2(N) but logM(N), where M is in the range of 10 to 1000. For most practical purposes the B-Tree can be viewed as close to O(1). However, the constant factor may be somewhat larger than for hash tables. It just means a B-Tree scales well. A binary tree does not scale nearly as well. Here you truly get log(N) performance. A binary tree and a B-Tree keeps an order relation, a hash does not - or rather: Usually you do not get to keep insertion order in hashtables, but it is in fact possible to do so without paying any significant time/space overhead. I have written such a hashtable but only implemented it for 32bit keys (which therefor can be stored in-place in the hash entry. This is not sorting, but it is extra ordering information. Searching Binary trees are usually faster than most things for small (<100) items. This assumes the tree allocates tree node from a common pool so all nodes are close to each other (for cache performance). Arrays are faster for (<10) items (linear search, not binary search). A hash table generally performs very well for medium to large amounts of data. One problem is the memory consumption which hurts cache efficiency. Another problem is the large bucket table with random access pattern. This is also bad for cache performance. However, despite these problems, a good hash so fast that it still manages to beat more clever datastructures except in the competition for size. A B-Tree uses [reference to key, reference to data]. It also uses additional memory for internal tree nodes (assuming all data is stored at leaf entries). Had the B-Tree been binary, it would use the same amount of memory for internal nodes as for leaf storage, but the branching is higher, so the memory consuption drops. I don't have any exact mesures here, but the cost of internal nodes are limited. A B-Tree operates at about 2/3 of allocated memory. So you need 3 32 bit words plus internal node overhead. A rough estimate is therefore 4 entries. Then you need to add the overhead for any keys you store externally. Thus we get to about 5 bit words per entry. The best way to work with in-memory B-Trees is to ensure large branchfactor that still is cache friendly. Furthermore, we want to scan a B-Tree node linearly and not using binary search becuase it is too slow for anything below 100 entries. A good choice for in memory nodes appear to be around 7-15 entries, but that depends on processor and the data stored. If the key is external, it may be relevant to use binary seach to reduce the number of comparions. The B-Tree generally has very good cache performance, except it can be hurt be external keys. The Judy tree / trie is even more complex than the B-Tree but is more memory efficient than B-Trees and may be slightly more cache friendly. The B-Tree and Judy tree both scales well to very large datastructures, whereas a hash table gets into trouble as soon as you start trashing to disk. A Binary tree is just not as efficient long before memory becomes a problem. The Judy trie does not have a problem a problem with external keys. I have developed a B-Trie which are stacked B-Trees each holding a part of the key. It uses too much memory becuase there are many mostly empty B-Tree nodes, but this can fixed by optimizing B-Trees for very small collections. A believe the B-Tree is easier to update than Judy trees and is better for combined in-memory and on-disk datastructures. However, for strictly in-memory operation the Judy tree may be better. Another datastructure, the Skip-list datastructure has been very popular becuase it is 10 times easier to implement than an efficient in-memory B-Tree. It has been claimed to better than B-Trees. This is not true. A skip-list jumps all over the memory and has poor cache performance. It suffers from pointer chasing. I have done some benchmarks, and I 've seen one other person do some similar investigations reaching the same conclusion. This does not mean you shouldn't use skip-lists. They are still useful datastructures - they may be better than than red-black trees which are typically used for map implementations. B-Trees are very kind to efficient allocation of memory. Judy trees and skip lists allocate all sorts of different memory sizes. A B-Tree only allocates one node size (and possibly one smaller node size for the optimized special case of very small collections). Mikkel