On Sat, Jun 04, 2011 at 02:17:28AM +0900, Piotr Szotkowski wrote:
> // Apologies for the delayed reply - it takes
> // a bit to digest such a detailed response! :)

Oh, don't apologize - my fault for being way too elaborate and taking
so much of your time.

The topic got me really thinking on some concepts.

Here is an overview:

  1. Regarding coding issues only, I still don't see the difference
  between Hash and RBTree as a feature. I don't see #hash + #eql? as
  being superior to #<=> in this regard.

  The Hash API falls into the YAGNI category for users, if you ask me.

  RBTree is a good reference on how a hash can work with #<=>.

  (RBTree wasn't included because it wasn't mature enough at the
  time).

  2. I patched Ruby to warn about cases where key type mixing takes
  place. The cases that popped up didn't justify the need for Hash's
  generic behavior (though I only checked a few things).

  The result of this "experiment" however convinced me that adding
  "Ruby best practice" warnings is both valuable and easy. If only it
  were easier to turn them on and off in code...

  The interesting conclusion is that such changes don't have to be in
  the standard MRI to be useful.

  There could even be a patched "lint" version of Ruby that could
  warn about not using the shorthand hash syntax.

  3. Just for reference: maybe I didn't make myself clear, but the
  last thing I want is to have symbols compared to strings. HWIA
  handles a Rails specific case for convenience, so it doesn't count.

  The idea that '3' + 3 doesn't work is something I find very useful.
  Likewise, if an integer keyed hash didn't merge with a string keyed
  hash I would find such a case very similar.

  The PHP way would be to discover that "Array is a special case of
  Hash (implementation aside), with integers as keys, so why not
  create a general array/hash class, call it array and have one class
  less for novices?" That didn't turn out nice IMHO.

  The PHP way would be to drop symbols because they are too difficult
  to grasp - or make them coerce to one another, which is probably
  worse.

  As an analogy, my approach with making Hash more strict seems to be
  like making the hurdles more difficult to jump over, but pasting on
  them instructions about what you need to learn to clear them.

  Hardly the PHP approach if you ask me.

  So, from my point of view, slightly putting the generalized behavior
  "out of view" (but not out of reach) would get people to sit back
  and think more about design, and not just reach for what they know.

  Much better conditions for learning than debugging.
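
To illustrate point 2: the warning came from a patch to the interpreter,
which is not shown here. A rough pure-Ruby approximation could look like
the sketch below. KeyMixLint and LintedHash are made-up names, and unlike
an interpreter patch this only sees keys added through #[]= on the
subclass, not hash literals or #merge.

```ruby
# Hypothetical sketch: KeyMixLint is not part of Ruby or of the patch
# mentioned above; it only approximates the idea in plain Ruby.
module KeyMixLint
  # Collected messages, so the behaviour can be inspected without
  # having to capture stderr.
  def self.messages
    @messages ||= []
  end

  def []=(key, value)
    # Look for an existing key whose class differs from the new key's class.
    other = each_key.find { |k| k.class != key.class }
    if other
      msg = "mixed hash key types: #{other.class} and #{key.class}"
      KeyMixLint.messages << msg
      warn(msg)
    end
    super
  end
end

class LintedHash < Hash
  include KeyMixLint
end

h = LintedHash.new
h[:a] = 1
h[:b] = 2     # same key class, silent
h['c'] = 3    # warns: mixed hash key types: Symbol and String
```

The insert still goes through, so this is a lint and not a restriction.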

> I’d argue it is useful in that it’s a very simple model

By contrast, RBTree also seems simple - at least to me. Although the
name doesn't suggest how similar it is to Hash.

> Also, Ruby is not known for treating the ‘does it cause more harm
> than good’ question as a benchmark

True. With so many interesting languages popping up, syntax will
probably become more important for Ruby's success in the future.

We already have two very successful Rubies: 1.8.7 and 1.9.2. With such
a long history already, it is now easier to make harder decisions
about the language and syntax with less risk.

> Yes, but the domain is usually specific, and I don’t think
> enforcing any parts of it on all Hashes is a good idea.

I thought so too and I'm not saying it definitely is - but I still
cannot think of practical reasons why.

> (maybe that’s what you want? 'abc'.hash == :abc.hash when used
> in certain contexts? but that’d be even bigger a hack, IMHO).

No, that is the case I would like to prevent from occurring! I'm
guessing a lack of #<=> could be worked around by using #hash, #eql?
and using object_id to determine order predictably.

But I didn't really think this through and I haven't looked that
deeply into RBTree.
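
A minimal sketch of that workaround, as I imagine it (key_order is a
made-up helper, not anything RBTree provides): use #<=> when it gives an
answer, and otherwise fall back to class name, #hash and object_id. The
fallback is only stable within one process, and I have not checked that
it stays a consistent ordering when several numeric types are mixed with
other classes.

```ruby
# Hypothetical helper: order any two keys, falling back to
# class name / #hash / object_id when #<=> returns nil
# (for example Symbol vs String).
def key_order(a, b)
  return 0 if a.eql?(b)
  cmp = a <=> b
  return cmp unless cmp.nil?
  [a.class.name.to_s, a.hash, a.object_id] <=>
    [b.class.name.to_s, b.hash, b.object_id]
end

mixed = [:b, 'a', :a, 'b'].sort { |x, y| key_order(x, y) }
# Strings sort before Symbols because "String" < "Symbol".
#=> ["a", "b", :a, :b]
```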

> (...) as I can undefine #<=> on any Hash key at a whim.

Not sure what you mean. You can undefine #hash also. Breaking things
is ok, as long as fixing them is quick and simple IMHO. Unit tests are
great for keeping broken things from leaving one's file system.

> Why are you against subclassing Hash and coming up with a NameHash
> (or MonoKeyHash)?

It still seems like treating just the symptom. And it suggests too
much duplication - the differences are just slight behavior differences.
And it feels too Java-ish for Ruby.
Maybe I feel like subclassing Hash is more work than it should be.

Consider the following as alternatives from a design perspective:

  # filter (ignore garbage) + sort, convert from array
  Hash[{z:0, a:1, 'b' =>2}.select {|k,_| k.is_a?(Symbol)}.sort]
  #=> {:a=>1, :z=>0}

  # filter (validate) + sort, convert from array
  Hash[{z:0, a:1, 'b' =>2}.each {|k,_|
    raise ArgumentError unless k.is_a?(Symbol)}.sort]
  #=> ArgumentError

And the following:

  # no filtering, always sorted, no invalid state, remains RBTree
  RBTree[{z:0, a:1, 'b' =>2}]  #=> ArgumentError
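
For readers without the rbtree extension installed, the behavior can be
imitated in plain Ruby. The sketch below is only a stand-in:
ComparisonHash is a made-up class backed by a sorted array, it is not
the rbtree gem and does not mirror its API, and it assumes a Ruby with
Array#bsearch_index.

```ruby
# Hypothetical stand-in for RBTree: a sorted array of [key, value] pairs
# ordered purely by #<=>.
class ComparisonHash
  include Enumerable

  def self.[](hash)
    hash.each_with_object(new) { |(key, value), result| result[key] = value }
  end

  def initialize
    @pairs = []
  end

  def []=(key, value)
    # Binary search for the first stored key that is >= the new key.
    index = @pairs.bsearch_index { |(k, _)| compare(k, key) >= 0 } || @pairs.size
    if index < @pairs.size && compare(@pairs[index][0], key).zero?
      @pairs[index][1] = value            # existing key: overwrite
    else
      @pairs.insert(index, [key, value])  # new key: keep the array sorted
    end
  end

  def [](key)
    pair = @pairs.bsearch { |(k, _)| compare(k, key) >= 0 }
    pair[1] if pair && compare(pair[0], key).zero?
  end

  def each(&block)
    @pairs.each(&block)
  end

  private

  # Keys that cannot be ordered against each other are rejected outright.
  def compare(a, b)
    (a <=> b) || raise(ArgumentError, "comparison of #{a.class} with #{b.class} failed")
  end
end

ComparisonHash[{z: 0, a: 1}].to_a   #=> [[:a, 1], [:z, 0]]
```

Adding 'b' => 2 to that input raises ArgumentError, because
:z <=> 'b' returns nil.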


I'd probably prefer mixins that can be included in Hash. But I'm
unsure how that would turn out. Again, refinements come to mind, but I
wonder if the current API is easily ... "refinable".

And maybe allow for optimizations.

Here is an example of what I mean:

  a = {}.add_option(inserting: {order: :sort_key, duplicates: :raise})
    .set_option(default: 'X')
    .add_option(inserting: {->(k) { !k.is_a?(Symbol) } => :raise})

  a.merge(z: 3, a: 1)  #=> {a:1, z:3}
  a.merge(z: 3, a: 1).sort  #=> {a:1, z:3} (no sorting required)

  a.keys.to_a.sort #=> [:a, :z]  (no sorting required)

  a[:foo] #=> 'X', works like block given to Hash
  a['foo'] #=> raises an ArgumentError

  [].add_option(inserting: {duplicates: :merge})  #=> effectively a Set

Refinements would minimize the need for this.

The only problem I can see now with the Ruby API is that people want
to override behavior, not methods - this makes subclassing more
difficult than it should be IMHO.

For example #[], #[]= and merge can add items, but you cannot just
override 'add_item' (st_insert() I believe).
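
A sketch of what such a single hook could look like if done by hand in a
subclass. HookedHash, SymbolKeyHash and #add_item are made-up names, and
this simplified version ignores the block form of #merge and does not
cover Hash[] or literals, which never go through #[]=.

```ruby
# Hypothetical sketch: route every insertion through one overridable method.
class HookedHash < Hash
  # Single choke point; subclasses override just this.
  def add_item(key, value)
    [key, value]
  end

  def []=(key, value)
    key, value = add_item(key, value)
    super(key, value)
  end

  def store(key, value)
    self[key] = value
  end

  def merge!(other)
    other.each { |k, v| self[k] = v }
    self
  end

  def update(other)
    merge!(other)
  end

  def merge(other)
    dup.merge!(other)
  end
end

class SymbolKeyHash < HookedHash
  def add_item(key, value)
    raise ArgumentError, "expected Symbol key, got #{key.inspect}" unless key.is_a?(Symbol)
    super
  end
end

h = SymbolKeyHash.new
h[:a] = 1
h.merge!(b: 2)
# h['c'] = 3 and h.merge('c' => 3) would both raise ArgumentError
```

Having to re-implement five methods to get one hook is exactly the extra
work I mean.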

> Again, while agreeing with both of the above, I still
> think coming up with NameHash is a much better solution
> than trying to make Hash outsmart the programmer.

If I could do {a: 3}.to_symhash I guess that would work out ok.

> I agree that in 99% of the cases all Strings share
> the same methods, but changing fundamental classes (like Hash)
> unfortunately is all about handling the edge cases.

If I had more control over what can be in a hash, I would have far
fewer edge cases to worry about. Same with other types.

> The discussion about warning a sloppy developer is similar to
> whether '1' + 2 should work, and if so, whether it should be
> 3 or '12'.

These examples are obvious errors. For Hash compatibility I proposed
just a warning, or deprecating key type mixing in Hash. But that
assumes restricting Hash is actually valuable - which I am unsure of.

> Note that Rails monkey-patches NilClass to make the errors on
> nil.<method> more obvious; maybe that’s the way to go?

It is why I preferred to hack rb_hash instead of subclassing. Simple
task and handles internal calls to rb_hash as well.

> The Ruby approach in this case is to have enough test coverage
> (ideally: upfront) so that the problem is quite obvious. ;)

Aggressive TDD is how I learned root cause analysis (I hope). Adding a
touch of Design by Contract may help reduce some unnecessary edge
cases without resorting to too much intelligence.

> I understand what you mean by the ‘experts’ remark, but I’m not sure
> that this case falls on the ‘expert’ side of the border; understanding
> how Hashes work is quite crucial

Sure, but not necessarily on the first page of a Ruby tutorial. With
disciplined TDD you actually get quite far IMHO without understanding
details. Refactoring is actually a good time for learning such things.

And warnings are a good way to focus more deeply on a given subject.

> Well, you want a particular kind of a Hash

More like just a particular behavior, but I'm otherwise nodding my
head reading your comments.

> But are the mistakes really that common? It should be doable
> to add guarding code to Hash#initialize if it’s really needed.
> You could also argue for getting Rails’ HWIA into Ruby core
> (I’m not sure whether it was proposed before or not).

Yes it was. The arguments for including it suggested exactly that -
that mistakes are common.

> Then make Hash#initialize smarter if you need.

Hash already has a block for default values. If I could define a block
called for every implicitly added item and have the block working with
#merge, it might be a good solution.

  >> a = Hash.new {|_, key| raise unless key.is_a?(Symbol)}
  >> a['a']  #=> RuntimeError
  >> a.merge(3 => 4) #=> {3=>4} (no error)

> Well, if you come up with a practical NameHash
> then I think there’s a chance it’ll end up in core.

It becomes more practical once it is in core ;)
Chicken and egg problem.

Proving it *is* practical may be a problem. Proving it wouldn't be the
easiest - if I knew how. And it would result in much shorter threads
on ruby-core...

> As for the {} syntax: (a) NameHash could have its own
> syntax sugar

It has too much in common with Hash - the {} syntax is one of the
reasons I started considering replacing Hash. As for alternatives,
what is left? ('a' => 3),  %h{a: 3}, ... ?

> or (b) as I wrote above, you can try abusing Hash#initialize to
> create NameHash if all the keys are Strings/Symbols (but I can see
> this blowing up eventually).

Actually blowing up (if I understand you correctly) is better than
silent failure and long hours of debugging through a great big
metaprogramming jungle.

I don't want the extreme where getting Ruby to interpret code becomes a
mind-bending puzzle challenge (those experienced in strongly typed
languages may find this familiar), but on the other hand I don't think
Ruby has reached the sweet spot yet.

> > {'a' => 3, :a => 3}/so  # s = strict, o = ordered

> Hm, I’d rather have Hash#/ as a valid method (say, for
> splitting Hashes into shards?) than a syntax construct.

Sure. That was just random brainstorming - but I don't really like it
myself. It is too specific. But then again - hashes and arrays won't
change dramatically over time.

> Interestingly, regular expressions reminded me of that fact
> that it might be convenient to have both Regexps and Strings
> as keys in the same Hash (for matching purposes). :)

That sounds crazy but you have a point. I wonder if after 10 years of
abusing hashes people will reinvent LISP as a result. Or everyone will
be configuring their favorite Ruby syntax upon installation.

> >> Let’s say I have a Hash with keys being instances of People,
> >> Employees and Volunteers (with Employees and Volunteers being
> >> subclasses of People). Should they all be allowed as keys in a
> >> single MonoKeyHash or not?

I'm not sure about the actual use case, so try it with RBTree and see
for yourself.
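
I have not run this against RBTree itself, but assuming any #<=>-ordered
container behaves like Array#sort, the answer depends only on whether
#<=> returns non-nil for every pair of keys. A small sketch with made-up
classes named after your example:

```ruby
# Hypothetical classes; ordering is defined once, on the base class.
class Person
  include Comparable
  attr_reader :name

  def initialize(name)
    @name = name
  end

  def <=>(other)
    other.is_a?(Person) ? name <=> other.name : nil
  end
end

class Employee < Person; end
class Volunteer < Person; end

keys = [Volunteer.new('Zoe'), Person.new('Max'), Employee.new('Ann')]
sorted = keys.sort.map(&:name)        #=> ["Ann", "Max", "Zoe"]
unrelated = Person.new('Max') <=> :max  #=> nil, so such a key would be rejected
```

So subclasses sharing the base class #<=> can coexist as keys, while
unrelated types cannot.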

> I’m still not sure what you mean by that (and what happens if I
> remove #<=> from a random key).

Not sure here either. Try it with RBTree.

> Say, I have a graph with nodes being various subclasses of Person
> and various subclasses of Event and I want the graph to track
> Person/Person and Person/Event relations - I really want to be able
> to use the various People and Event subclasses as keys in my Hash.

Yes, but why in the same hash? Why not two different hashes? What is
the common behavior between Event and Person? You are probably going
to iterate the graph in order to ... ?

You can always add a level of indirection, then your graph will become
more generic and reusable.

And effectively, you are hashing object contents - I'm not sure that
is really what you want. My intuition tells me the case you describe
is refactorable.

> Hm, I think I totally disagree - #hash and #eql? are
> the public interface of Hash, and the contract is that
> anything that implements these can be used as a Hash key.

True, but I was referring to a higher level of abstraction of an assoc
array, which Hash is intended for (but not limited to):

  a[b] = x   (association)
  a[b]       (referencing)

At this level, Hash, RBTree and even Array are all identical.

#hash and #eql? are specific to one associative array implementation.
RBTree doesn't use hashing, but serves the same purpose. The difference
is that each implementation restricts which objects can be used as keys.
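
A tiny sketch of that higher-level view (associate is a made-up helper
name): code written only against #[]= and #[] does not care which
implementation sits underneath, and only the set of acceptable keys
differs.

```ruby
# Hypothetical helper: relies only on #[]= and #[], not on #hash or #<=>.
def associate(container, key, value)
  container[key] = value   # association
  container[key]           # referencing
end

from_hash  = associate({}, :a, 1)   #=> 1
from_array = associate([], 0, 1)    #=> 1

# The key restriction is where they differ: Array accepts only integers.
array_error =
  begin
    associate([], :a, 1)
  rescue TypeError => e
    e.class
  end
#=> TypeError
```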

> I strongly disagree here; a simple model which, in addition, is fairly
> easy to explain (it’s only #hash and #eql?, really), is much better
> than a complex model carried around only for the sake of novices.

Could you say what exactly is complex? I always thought an RBTree was
simpler to understand than "Hash", which to me initially worked
"magically" and sometimes I still get hashes and identities mixed up.

By analogy, an even stricter "hash" - Array - is even less
confusing:

  a = [1,2,3]
  a[0] # => 1

  >> a[nil] # => TypeError (!)

And it is limited to integers specifically. And I don't have to run
'ri' or look into array.c to work it out.

We could discuss if a[nil] fails for common sense reasons or
implementation reasons. Maybe in the same way I'm not getting that in
practice, associative arrays are always hash based and it is obvious
to everyone but me.

> I see where PHP ended up with ‘novice-friendly’ approach and
> it’s awful

I totally agree.

> - and I strongly believe a simple and consistent model is actually
> more novice-friendly in the long run than wondering why '0' == false
> and false == null but '0' != null.

Having to handle these cases says it all (isset, isnull, etc). For me
PHP is both incredibly difficult to learn and even to put up with. The
only hope for PHP at this point would be to start undoing "novice
helping" and start generating errors and warnings.

In case of doubt, a clear common syntax is the best criterion if you
ask me.  With an error or warning you at least have a question to
start with.

I'm not sure what you mean by model and in what way it is
novice-friendly. Do you mean an easier to understand implementation? I
personally think that clear syntax wins in the long run: how intent
maps to code.

The underlying model can change many times and be as complicated as
possible and I wouldn't really care. Probably because I spend more
time in Ruby and none in C.

> > Even if it results in an overly complex parser and implementation,
> > I think only good will come from going out of one's way to make
> > Ruby users lives easier.
>
> Definitely - it’s just we disagree on what is easier (in the long
> run).

I'll be quick to correct myself: easier for users to become productive
and happy (and rich?) experts delivering valuable software.

> I agree it might be useful to have NameHash for name → object
> mappings,

I would just stick with SymbolHash and have Hash for strings.

> MonoKeyHash that keeps the keys in check

This would probably be a copy of rb_hash implementation, where type
checking is both simple and cheap. Not sure how to handle objects
though.

> and/or a #<=>-based ComparisonHash

meaning basically to help RBTree get adopted - maybe with a nicer,
less scary name

> and I encourage you to implement them and push for them
> to be included in the core

I still think I lack the necessary understanding, so I'll spend some
more time researching actual hash usage along with external libraries
in this area (AS, facets, extlib).

> I’m simply very grateful for the extremely well thought-out and
> versatile Hash we have now and I’d rather it’s not made more
> complicated (or dumbed-down) for the sake of a (granted, popular)
> single use-case.

Could you explain that using Array and RBTree as examples?
Is Array a dumbed-down hash? Is RBTree overcomplicated?

> I really like this discussion as well! Thanks for bringing this up.

Thanks for your time. Thanks to you I made a lot of new distinctions
between Ruby core concepts! Initially I wanted to contribute, but I
ended with just increasing my own knowledge for now.

If I find some interesting patterns in how Hash (or Ruby in general) is
(ab)used, I'll post them as a new thread with possible ways Ruby could
help simplify/fix things.

P.S. Looking at my reply ... I'm sure even the mail server deserves a
break after this.

--
Cezary Baginski