Robert Klemme wrote:
> On Mon, Nov 12, 2012 at 10:12 PM, Charles Hixson
> <charleshixsn / earthlink.net>  wrote:
>> Brian Candler wrote:
>>> As others have said: using a Hash may be more efficient, if there are
>>> large gaps - but then you may have to sort the keys if you want to
>>> iterate over it in order. It depends on your use case.
>>>
>> Hashes are slow compared to direct indexing.
> Did you measure it?  If not, please do.  It's easy to arrive at wrong
> conclusions by assuming that what may be true in other contexts is
> true in a new context as well.
>
>> I'm expecting this array to eventually use over a GB of storage, so
> Are you talking about memory or number of instances?  I am asking
> because the two are not the same, and it may be more difficult to
> assess memory consumption than you think.
I'm talking about data size.  The number of instances would probably be 
about 1/20 of the number of items.  (The ratio varies because the items 
themselves contain Arrays of varying size.)  But please note that this 
is a rough estimate, and could be an underestimate, though I've tried 
to make it a generous one.
>> efficiency considerations are somewhat significant.  gdbm solves some
>> of these by not dealing with the entire array at once, but only with
>> the recently active instances.  But it's slower, which I was trying to
>> avoid by using an Array.  However, I don't know just how many items
>> the Array will need to hold; it could easily be in the millions, and
>> will at least be in the hundreds of thousands.
> So how do you know we are talking about "over a GB of storage"?
It's my best estimate before I actually build the thing.  I hope it's a 
generous estimate, since a smaller amount of data would be easier to 
handle, but I can't rule out having underestimated.
>
>>   So pre-allocating isn't very desirable, but
>> neither is adding instance by instance.  (Besides, they won't necessarily
>> come in order.  It could well be that I'll get the 100,000th entry
>> immediately after getting the 21st.  No problem for a hash, but if I'm going
>> to slow down to hash speed, I should probably just go directly to gdbm.)
> Again: measure before you judge.  It's really much better to base
> decisions on facts than on speculation.
>
> Cheers
>
> robert
>
> --
> remember.guy do |as, often| as.you_can - without end
> http://blog.rubybestpractices.com/
I've measured it in several languages, and have occasionally implemented 
the hash algorithm from scratch in C, with two or three different ways 
of handling collisions.  (Unicode means that I don't want to use C this 
time, though pointers were my original reason for avoiding it.)  I'll 
admit I haven't measured it yet in Ruby, but a hash cannot be fast 
relative to a direct index, even if direct indexing itself can be slow: 
a hash lookup has to compute a hash function and then resolve collisions 
on top of the final address calculation.  It's in the nature of the 
algorithm.  Hashing can be quite fast compared to a tree lookup, but not 
compared to a direct index.
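That said, a rough benchmark along these lines would settle the question 
for Ruby specifically (a minimal sketch; the absolute numbers will vary 
with Ruby version and machine):

  require 'benchmark'

  N = 1_000_000
  arr = Array.new(N) { |i| i }
  hsh = {}
  N.times { |i| hsh[i] = i }

  Benchmark.bm(7) do |b|
    b.report("Array:") { N.times { |i| arr[i] } }
    b.report("Hash:")  { N.times { |i| hsh[i] } }
  end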
FWIW, the answer I was looking for is that Arrays are automatically 
extended to the last index inserted into.  This probably implies that 
Arrays use some sort of linked block index.  The other possibility is 
that Ruby just allocates one honking big chunk of memory and copies the 
whole thing whenever it needs to resize the array, but that seems 
unlikely, and also silly, so I'll bet on a linked block index.  (N.B.: 
some languages DO just allocate large blocks of RAM for their arrays, 
but most of the ones I know that do that are compiled languages, and 
they've got reasonable grounds: big blocks are more efficient to access, 
and if you're a compiler, the code you compile can be as efficient as 
you are.  So the trade-offs are different.  Even then they try to avoid 
copying.)
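A quick irb check of the auto-extension behavior itself (this shows what 
the programmer sees, though it doesn't settle the internal-representation 
question):

  a = []
  a[100_000] = :x   # assign far past the current end
  a.size            # => 100_001
  a[99_999]         # => nil -- the gap is filled with nils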

P.S.:  Developing this in Ruby *IS* a measurement test.  If I can do it 
in Ruby, then I've guessed the right size.  (But picking a bad algorithm 
to implement wouldn't prove much of anything.)  If it doesn't work, I'll 
need to fall back to a faster language, probably D.  That's one reason I 
want to avoid gdbm: not only is RAM faster, but an in-memory Array is an 
easier conversion to another language if it turns out that I need one.
Ruby's a lot faster to develop in, so it's not a bad decision even if 
Ruby turns out to be too slow (or unable to handle the required amount 
of memory), because then I'll have a mostly written application that I 
can translate to a faster language.  (One reason for D is that there are 
lots of places where I could easily execute loops in parallel.  That 
would be harder in Ruby, Python, etc., though in most other ways they're 
easier.)
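To illustrate the parallelism point: the nearest cheap equivalent in 
Ruby is threads, but MRI's global VM lock means CPU-bound threads run 
concurrently rather than in parallel, which is exactly the limitation I 
mean.  A sketch with made-up work:

  data    = (1..1_000_000).to_a
  chunks  = data.each_slice(250_000).to_a
  threads = chunks.map do |chunk|
    # under MRI's GVL these interleave on one core rather than
    # running on four
    Thread.new { chunk.map { |x| x * x } }
  end
  results = threads.flat_map(&:value)

In D, the same loop could be spread across cores with std.parallelism's 
parallel foreach.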