Brian,


On 2011-03-20 08:31, Brian Candler wrote:
> Philip Rhoades wrote in post #988264:
>>> Your sample code looks like it's handling numeric-style data (although I
>>> realise this is just a test case for the problems you're having).
>>> Integers in the range -2^30..+2^30 (or larger on a 64-bit machine)
>>> have their values encoded within the reference, so no memory allocation
>>> is done.
>>
>>
>> Are you talking about the hash key or the hash values?
>
> Either.
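
For what it's worth, the value-encoding is easy to see in MRI (a
sketch only - the 2n+1 object_id scheme is a CRuby implementation
detail, not part of the language spec):

```ruby
# In CRuby, small integers are immediate values: the integer is encoded
# directly in the VALUE reference, so no heap object is allocated.
# One observable consequence: a small integer's object_id is derived
# from its value (2n + 1 in MRI).
puts 42.object_id   # 85 in MRI: 2*42 + 1

# Strings, by contrast, are heap objects - two equal literals are
# two distinct allocations with distinct object_ids.
a = "42"
b = "42"
puts a.object_id
puts b.object_id    # different from a.object_id
```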


Right - for the following in my test script:

h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){ Array.new( 20, rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = Array.new(2){ Array.new(1){ Array.new( 20, rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".to_i ] = Array.new(2){ Array.new(1){ Array.new( 20, rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".to_i.freeze ] = Array.new(2){ Array.new(1){ Array.new( 20, rand(100) ) } }

I get the following times:

18.350s
18.113s
  4.724s
  4.896s

So I guess I should live with the slight decrease in readability when 
searching for particular results in the JSON output file, and use ints 
instead of strings for the hash keys.
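
Something like this might work for keeping the keys reversible (a
sketch only - pack_key/unpack_key are made-up names, and it assumes
each of a, b, c, d stays below 100):

```ruby
# Pack four small index components (each assumed < 100) into a single
# integer hash key, so lookups use immediate integers rather than
# heap-allocated strings.
def pack_key(a, b, c, d)
  ((a * 100 + b) * 100 + c) * 100 + d
end

# Recover the readable "a.b.c.d" form when writing the JSON output.
def unpack_key(k)
  k, d = k.divmod(100)
  k, c = k.divmod(100)
  a, b = k.divmod(100)
  "#{a}.#{b}.#{c}.#{d}"
end

h1 = {}
h1[pack_key(1, 2, 3, 4)] = Array.new(2){ Array.new(1){ Array.new( 20, rand(100) ) } }
puts unpack_key(1020304)   # prints 1.2.3.4
```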


>> - the values in
>> the real script will all be floats . .
>
> Then they will be allocated on the heap, just like strings. I presume
> you're aware of the inherent inaccuracy of floats (in any language), and
> are OK with this.
>
>>> 1.0/2.0 == 1.0 + 1.0/2.0 - 1.0
> =>  true
>>> 1.0/10.0 == 1.0 + 1.0/10.0 - 1.0
> =>  false


I suppose I could convert them all to six- or eight-digit ints . . they 
are measures of biological diversity, and converting them backwards and 
forwards is a bit of a hassle, but maybe it is worth doing for the 
speed advantage? I will try my test script with floats and see what 
happens.
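
A fixed-point scheme might be the simplest way to do the conversion (a
sketch only - the SCALE of 10^6 is an assumption for six significant
decimal digits, and to_fixed/to_float are made-up names):

```ruby
# Store the diversity measures as scaled integers (fixed point) so the
# hash holds immediate values instead of heap-allocated floats.
SCALE = 1_000_000   # assumed precision: six decimal digits

def to_fixed(x)
  (x * SCALE).round   # round, don't truncate, to absorb float error
end

def to_float(n)
  n / SCALE.to_f
end

n = to_fixed(0.123456)
puts n             # prints 123456
puts to_float(n)   # prints 0.123456
```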


>>> Or, if you're handling a relatively small set of unique values, you
>>> could use symbols instead of strings. Each symbol reference again
>>> doesn't allocate any memory; it just points to the entry in the symbol
>>> table.
>>
>> Not sure what you mean - example?
>
> a = []
> a[0] = :foo
> a[1] = :foo
> a[2] = :foo
>
> puts a[0].object_id
> puts a[1].object_id
> puts a[2].object_id


The reason the numbers are called "seeds" is that they correspond to the 
seeds for the random number generator in the C/C++ simulation program - 
so they are all unique across the 32,000 simulations.


>>> Or you could use frozen strings and share the references.
>>>
>>> LABEL1 = "00".freeze
>>> LABEL2 = "01".freeze
>>> MAP = {LABEL1 =>   LABEL1, LABEL2=>LABEL2}
>>> a = MAP["00"]
>>> puts a.object_id
>>> puts LABEL1.object_id
>>
>>
>> I ran that code but I don't understand how it helps . .
>
> It uses less memory if you have (say) millions of identical strings.


Not the case for the keys and unlikely for the values.


> It
> may help garbage collection performance, but not much else
>
>>> Although that's more work than symbols, it might be useful depending on
>>> your use case. For example, you could replace a subset of the values you
>>> see with these frozen strings (which covers the majority of the data),
>>> whilst still allowing arbitrary other strings.
>>
>>
>> Still not clear - examples?
>
> Suppose the strings "foo" and "bar" comprise 80% of your hash keys or
> values. Then mapping them to the same frozen string means that you only
> have one instance of string "foo" and one instance of string "bar" in
> the system, instead of (say) millions of distinct strings. You can still
> use individual strings for the other 20%.


Unfortunately this doesn't correspond to my case . .


> This is really an edge optimisation though, you really shouldn't need to
> be worrying about these things - if they are significant, then perhaps
> ruby is the wrong language for the problem in hand.
>
>> The other thing that occurred to me was that on my 64-bit machine maybe
>> I could run 2-3 threads for inserting into the hash table?
>
> Noooo..... even in ruby 1.9, there is a global interpreter lock.
> Multiple threads gain you nothing really, except for threads which are
> blocked on I/O.


Right.


> Even if there were not, having multiple threads contending on the same
> hash (and controlling access via, say, a mutex) would be pretty much
> guaranteed to make performance worse not better.


OK - oh well, it was worth a thought!

Many thanks!

Regards,

Phil.
-- 
Philip Rhoades

GPO Box 3411
Sydney NSW	2001
Australia
E-mail:  phil / pricom.com.au