Issue #16996 has been updated by Eregon (Benoit Daloze). Completely agreed, `Hash#dup` should not rehash (and it already doesn't on = TruffleRuby). ---------------------------------------- Bug #16996: Hash should avoid doing unnecessary rehash https://bugs.ruby-lang.org/issues/16996#change-86360 * Author: marcandre (Marc-Andre Lafortune) * Status: Open * Priority: Normal * Backport: 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN ---------------------------------------- Pop quiz: Which is the fastest way to get a copy of a Hash `h`? If, like me, you thought `h.dup` (of course, right?), you are actually wron= g. The fastest way is to call `h.merge`. Try it: ``` require 'benchmark/ips' lengths =3D 1..50 h =3D lengths.to_h { |i| ['x' * i, nil] } Benchmark.ips do |x| x.report("dup") { h.dup } x.report("merge") { h.merge } end ``` I get ``` Calculating ------------------------------------- dup 259.233k (=B1 9.2%) i/s - 1.285M in 5.013445s merge 944.095k (=B1 8.2%) i/s - 4.693M in 5.005315s ``` Yup, it's *3.5x faster* with this example!! Why? Because `Hash#dup` does a rehash, and `merge` does not. Pop quiz 2: which methods of `Hash` that produce a new hash do a rehash? Answer: it depends on the method and on the Ruby version ``` +---------------------------------+------+-----+-----+-----+-----+-----+---= --+-----+-----+ | Does this rehash? | head | 2.7 | 2.6 | 2.5 | 2.4 | 2.3 | 2.= 2 | 2.1 | 2.0 | +---------------------------------+------+-----+-----+-----+-----+-----+---= --+-----+-----+ | h.dup / clone | Yes | Yes | Yes | Yes | Yes | Yes | Ye= s | Yes | Yes | +---------------------------------+------+-----+-----+-----+-----+-----+---= --+-----+-----+ | h.select{true} / reject{false} | Yes | Yes | Yes | Yes | Yes | Yes | Ye= s | Yes | Yes | +---------------------------------+------+-----+-----+-----+-----+-----+---= --+-----+-----+ | h.select!{true} / reject!{false}| =D8 | =D8 | =D8 | =D8 | =D8 = | =D8 | =D8 | =D8 | =D8 | +---------------------------------+------+-----+-----+-----+-----+-----+---= --+-----+-----+ | sub_h.to_h | =D8 | =D8 | =D8 | =D8 | =D8 = | =D8 | =D8 | =D8 | =D8 | +---------------------------------+------+-----+-----+-----+-----+-----+---= --+-----+-----+ | h.merge({}) | =D8 | =D8 | =D8 | =D8 | Yes | = Yes | Yes | Yes | Yes | +---------------------------------+------+-----+-----+-----+-----+-----+---= --+-----+-----+ | h.merge | =D8 | =D8 | =D8 | n/= a | +---------------------------------+------+-----+-----+-----+-----+-----+---= --+-----+-----+ | h.transform_values(&:itself) | =D8 | =D8 | Yes | Yes | Yes | = n/a | +---------------------------------+------+-----+-----+-----+-----+-----+---= --+-----+-----+ (where `sub_h =3D Class.new(Hash).replace(h)`, =D8 =3D no rehash) ``` So in Ruby head, doing `h.merge({})` or even `h.transform_values(&:itself)`= will be much faster than `h.dup` (but slower in Ruby 2.4) (*) Notice that `select` rehashes, but `select!` doesn't, so the fastest way to= do a `select` in Ruby is... not to call select and instead to actually do = a `merge.select!`! (*) *: on hashes with non-negligible hash functions ```ruby class Hash def fast_select(&block) merge.select!(&block) # don't call dup because it's slow end end Benchmark.ips do |x| x.report("select") { h.select{true} } x.report("fast_select") { h.fast_select{true} } end ``` On my test case above, `fast_select` is *2.5x faster* than `select`. `fast_= select` will always return exactly the same result (unless the receiver nee= ded a rehash). Pop quiz 3: Is this a bug or a feature? It should be clear that no feature of Ruby should be re-implementable in Ru= by with a 3.5x / 2.5x speed gain, so many would think "of course it's a bug= ". Well, https://bugs.ruby-lang.org/issues/16121 seems to think that `Hash#dup= `'s rehash is a feature... Why? Because there is actually a test that `dup` does a rehash Why? Because a test of `Set` was failing otherwise! Commit: https://github.com/ruby/ruby/commit/a34a3c2caae4c1fbd Short discussion: http://blade.nagaokaut.ac.jp/cgi-bin/vframe.rb/ruby/ruby-= core/48040?47945-48527 Actual test: https://github.com/ruby/ruby/blob/master/test/test_set.rb#L621= -L625 Why? This test construct a `Set` that needs to be rehashed (by mutating an eleme= nt of the set after it is added), and then checks that `rehash_me =3D=3D re= hash_me.clone`. That test is bogus. It passes for obscure and undocumented reasons, and `re= hash_me.clone =3D=3D rehash_me` doesn't pass. Today, it is official that sets with elements that are later mutated must b= e `Set#reset`, so it is official that this should not be relied upon. Probably more clear is the case of `select/reject` (but I didn't check for = failing test), and even more clear that `merge` changed in Ruby 2.5 and `tr= ansform_values` in 2.7, but not a single `NEWS` file mentions the word "reh= ash". My conclusion is that Hash should avoid doing an unnecessary rehash: `dup`/= `clone`/`select`/`reject`. We probably should add a reminder in the `NEWS` = that if anyone mutates a key of a Hash, or an element of a Set and does not= call `rehash`/`reset`, improper behavior should be expected. Let's make `Hash#dup/clone/select/reject` fast please. Any objection? -- = https://bugs.ruby-lang.org/ Unsubscribe: <mailto:ruby-core-request / ruby-lang.org?subject=3Dunsubscribe> <http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>