People,

To add to my last post:


On 2011-03-20 16:44, Philip Rhoades wrote:
> Brian,
>
>
> On 2011-03-20 08:31, Brian Candler wrote:
>> Philip Rhoades wrote in post #988264:
>>>> Your sample code looks like it's handling numeric-style data
>>>> (although I
>>>> realise this is just a test case for the problems you're having).
>>>> Integers in the range -2^30..+2^30 (or larger on a 64-bit machine)
>>>> have their values encoded within the reference, so no memory allocation
>>>> is done.
>>>
>>>
>>> Are you talking about the hash key or the hash values?
>>
>> Either.
>
>
> Right - for the following in my test script:
>
> h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){ Array.new( 20,
> rand(100) ) } }
> h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = Array.new(2){ Array.new(1){
> Array.new( 20, rand(100) ) } }
> h1[ "#{a}.#{b}.#{c}.#{d}".to_i ] = Array.new(2){ Array.new(1){
> Array.new( 20, rand(100) ) } }
> h1[ "#{a}.#{b}.#{c}.#{d}".to_i.freeze ] = Array.new(2){ Array.new(1){
> Array.new( 20, rand(100) ) } }
>
> I get the following times:
>
> 18.350s
> 18.113s
> 4.724s
> 4.896s
>
> So I guess I should live with the slight decrease of readability when
> searching for particular results in the JSON output file by using ints
> instead of strings for the hash keys.
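
One thing I noticed after posting: String#to_i only parses the leading
digits, so "#{a}.#{b}.#{c}.#{d}".to_i collapses to just the value of a -
a proper integer key would need to pack all four numbers. The allocation
effect itself is easy to see with a stripped-down comparison like this
(the loop count and key formats are just stand-ins for my real ones):

```ruby
require 'benchmark'

N = 200_000

# String keys: each interpolation allocates a fresh String on the heap.
t_str = Benchmark.realtime do
  h = {}
  N.times { |i| h["#{i}.0.0.0"] = i }
end

# Integer keys: small Integers are immediate values, so the key itself
# costs no heap allocation at all.
t_int = Benchmark.realtime do
  h = {}
  N.times { |i| h[i] = i }
end

puts format("string keys: %.3fs  integer keys: %.3fs", t_str, t_int)
```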
>
>
>>> - the values in
>>> the real script will all be floats . .
>>
>> Then they will be allocated on the heap, just like strings. I presume
>> you're aware of the inherent inaccuracy of floats (in any language), and
>> are OK with this.
>>
>>>> 1.0/2.0 == 1.0 + 1.0/2.0 - 1.0
>> => true
>>>> 1.0/10.0 == 1.0 + 1.0/10.0 - 1.0
>> => false
>
>
> I suppose I could convert them all to six- or eight-digit ints . . they
> are measures of biological diversity and converting them back and forth
> is a bit of a hassle, but maybe it is worth doing for the speed
> advantage? - I will try my test script with floats and see what happens.
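
Here is the kind of round trip I have in mind - the scale factor of one
million (six decimal places) is just my assumption about how much
precision the diversity measures need:

```ruby
SCALE = 1_000_000  # six decimal digits of precision (my assumption)

# Convert a float measure to a scaled Integer for use as a hash key/value.
def to_fixed(f)
  (f * SCALE).round
end

# Convert a scaled Integer back to the original float measure.
def from_fixed(i)
  i.to_f / SCALE
end

x = 0.734219
key = to_fixed(x)        # an immediate Integer, cheap to hash
puts key                 # => 734219
puts from_fixed(key)     # => 0.734219
```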
>
>
>>>> Or, if you're handling a relatively small set of unique values, you
>>>> could use symbols instead of strings. Each symbol reference again
>>>> doesn't allocate any memory; it just points to the entry in the symbol
>>>> table.
>>>
>>> Not sure what you mean - example?
>>
>> a = []
>> a[0] = :foo
>> a[1] = :foo
>> a[2] = :foo
>>
>> puts a[0].object_id
>> puts a[1].object_id
>> puts a[2].object_id
>
>
> The reason the numbers are called "seeds" is that they correspond to the
> seed for the random number generator in the C/C++ simulation program -
> so they are all unique for each of the 32,000 simulations.
>
>
>>>> Or you could use frozen strings and share the references.
>>>>
>>>> LABEL1 = "00".freeze
>>>> LABEL2 = "01".freeze
>>>> MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
>>>> a = MAP["00"]
>>>> puts a.object_id
>>>> puts LABEL1.object_id
>>>
>>>
>>> I ran that code but I don't understand how it helps . .
>>
>> It uses less memory if you have (say) millions of identical strings.
>
>
> Not the case for the keys and unlikely for the values.
>
>
>> It
>> may help garbage collection performance, but not much else.
>>
>>>> Although that's more work than symbols, it might be useful depending on
>>>> your use case. For example, you could replace a subset of the values
>>>> you
>>>> see with these frozen strings (which covers the majority of the data),
>>>> whilst still allowing arbitrary other strings.
>>>
>>>
>>> Still not clear - examples?
>>
>> Suppose the strings "foo" and "bar" comprise 80% of your hash keys or
>> values. Then mapping them to the same frozen string means that you only
>> have one instance of string "foo" and one instance of string "bar" in
>> the system, instead of (say) millions of distinct strings. You can still
>> use individual strings for the other 20%.
>
>
> Unfortunately this doesn't correspond to my case . .
>
>
>> This is really an edge optimisation though, you really shouldn't need to
>> be worrying about these things - if they are significant, then perhaps
>> ruby is the wrong language for the problem in hand.
>>
>>> The other thing that occurred to me was that on my 64-bit machine maybe
>>> I could run 2-3 threads for inserting into the hash table?
>>
>> Noooo..... even in ruby 1.9, there is a global interpreter lock.
>> Multiple threads gain you nothing really, except for threads which are
>> blocked on I/O.
>
>
> Right.
>
>
>> Even if there were not, having multiple threads contending on the same
>> hash (and controlling access via, say, a mutex) would be pretty much
>> guaranteed to make performance worse not better.
>
>
> OK - oh well it was worth a thought!
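
For the archives, the GIL effect Brian describes is easy to demonstrate
with a CPU-bound toy like the one below - on MRI the two-thread version
takes about the same wall-clock time as the serial one (the workload is
arbitrary, just something that burns CPU):

```ruby
require 'benchmark'

# A purely CPU-bound task: sum the integers 0...n.
def busy(n)
  s = 0
  n.times { |i| s += i }
  s
end

N = 2_000_000

# Run the task twice back to back...
serial = Benchmark.realtime { 2.times { busy(N) } }

# ...then twice "in parallel". Under the GIL only one thread runs
# Ruby code at a time, so no speed-up is expected on MRI.
threaded = Benchmark.realtime do
  threads = 2.times.map { Thread.new { busy(N) } }
  threads.each(&:join)
end

puts format("serial: %.3fs  two threads: %.3fs", serial, threaded)
```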


I ran the inner two loops the normal number of times but only one 
iteration of the outermost loop (ie 1/50th of the total) and got the 
following times with different versions of Ruby:

ruby-1.8.7-p334 [ x86_64 ]	=  31.41
ruby-1.9.2-p180 [ x86_64 ]	=  19.53
rbx-head [ ]			= 128.42
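
For anyone who wants to reproduce this, the shape of the test script is 
roughly as follows - the inner loop counts are placeholders I have chosen 
so they multiply out to the 32,000 simulations, and the key is the 
integer variant (packed from the loop indices rather than taken from a 
dotted string, since to_i would stop at the first dot):

```ruby
h1 = {}
50.times do |a|          # outermost loop (the timing above ran one iteration)
  40.times do |b|        # placeholder counts: 50 * 40 * 16 = 32,000
    16.times do |c|
      # Pack the loop indices into a single Integer key instead of
      # interpolating a String (hypothetical packing scheme).
      key = ((a * 100) + b) * 100 + c
      h1[key] = Array.new(2) { Array.new(1) { Array.new(20, rand(100)) } }
    end
  end
end
puts h1.size    # => 32000
```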

Many thanks!

Regards,

Phil.

-- 
Philip Rhoades

GPO Box 3411
Sydney NSW	2001
Australia
E-mail:  phil / pricom.com.au