On Tue, May 31, 2011 at 05:55:39AM +0900, Piotr Szotkowski wrote:
> Cezary:

First of all, thanks Piotr for taking the time to discuss this.
My original ideas for solving the problem or their descriptions
sucked, but I left your comments because they still apply or provide
good examples.

I'm trying to get an idea of how the implementation decisions behind
hashes affect their general use in Ruby, and whether something could
be slightly changed in favor of improving the user's experience with
the language without too much sacrifice in other areas.

I believe Hash was designed with efficiency and speed in mind, and the
recent Hash syntax changes suggest that the current ways people use
Hash in Ruby go well beyond the scope of the original concept.

Refinements may minimize the need for changes here, but even so, I
think this is a good time to consider what Hash is used for and how
syntax changes could help users better express their ideas, instead of
being able to choose only among an array, a very, very general
associative array, or 3rd party gems that have no syntax support.

I hope I am not going overboard with this topic. I worry that the
slight changes in Hash behavior presented here could cause problems,
but I cannot think of any serious downsides, especially if only a
warning is emitted. And with such a usability upside, I must be
missing either a big flaw in the idea or a big gain from the current
behavior.

If this topic does not contribute to Ruby from the user's perspective
I am ready to drop the subject entirely.

> > I thought exactly the same thing, until I realized
> > that having keys of different types in a Hash
> > isn't really part of the general Hash concept.
>
> Why? [citation needed]

My wording isn't correct.

First, a Hash in Ruby is an associative array; I read about the concept here:

  http://en.wikipedia.org/wiki/Associative_array

And from this:

"From the perspective of a computer programmer, an associative array
can be viewed as a generalization of an array. While a regular array
maps an integer key (index) to a value of arbitrary data type, an
associative array's keys can also be arbitrarily typed. In some
programming languages, such as Python, the keys of an associative
array do not even need to be of the same type."

The type of a key can be anything, and keys within a single instance
can even be of different types. The latter is not a requirement of
every possible associative array implementation, and that is what I
meant.

It can be implementation specific: an rbtree, for example, requires
keys to be orderable. In that case you cannot have a symbol and a
string in the same associative array, because you cannot compare them.

But since Hash uses a hash table, it is possible to have a wider range
of key types, including both symbol and string together. The
implementation allows it, but my question is: is it *that* useful in
the real world? Or does it cause more harm than good?
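For illustration, a plain core Hash accepts the mix without complaint:

```ruby
# A plain core Hash happily accepts mixed key types; a Symbol and a
# String with the same characters are two distinct keys.
h = { :foo => 1, 'foo' => 2 }
p h.size    # 2: two entries, not one
p h[:foo]   # 1
p h['foo']  # 2
```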

> > { nil => 0, :foo => 1, 'foo' => 2 }
>
> > Conceptually, people expect Hash keys to be of the same type,
> > except maybe for "hacks" like that nil above that can simplify code.
>
> Well, they either do or don't, then. :)

Right. What I wrote isn't correct. I think people expect hash keys to
match a given domain in order to consider them valid, just as every
variable should hold a value within bounds or raise at the first
possible opportunity, unless the cause of a problem is otherwise
trivial to find and fix.

I don't recommend the example with nil above. Better alternatives IMHO:

  { :'' => 0, :foo => 1 }[ some_key || :'' ]

  or

  { :foo => 1 }[some_key] || 0

  or set the default in Hash

  Hash.new(0).merge( :foo => 1 )[some_key]

That is why I called it a hack - using a Hash key to get default
values.
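All three alternatives behave the same way; a minimal check, with
some_key standing in for whatever the caller passes:

```ruby
# Each spelling falls back to 0 when some_key is nil or unknown.
some_key = nil
v1 = { :'' => 0, :foo => 1 }[some_key || :'']
v2 = { :foo => 1 }[some_key] || 0
v3 = Hash.new(0).merge(:foo => 1)[some_key]
p [v1, v2, v3]  # [0, 0, 0]
```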

> Hm, IMHO 'any object can be a key, just as any object can be
> a value' is the general case, and 'I want my Strings and Symbols
> to be treated the same when they're similar, oh, and maybe with
> the nil handled separately for convenience' is the specialised case.

Exactly. The specialized case is obviously bad, but the general case
turned out not to be too great either. I am thinking about a third
solution: generic, but within a specified domain, ideally where the
differences between string and symbol stop them from unintentionally
ending up in the same Hash, without being too specialized and without
subclassing.

Even with just a warning emitted when a Hash becomes unsortable, we
would not break the associative array concept while *still* supporting
99% or more of actual real-world use cases, and without making any of
the type-specific assumptions you presented.

As a side effect, if a user writes { foo: 123 }.merge('foo' => 456),
they will get a warning instead of just a hash with two pairs.
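Today that merge goes through silently; a sketch of what the user
currently gets:

```ruby
# Symbol and String keys merge silently into two separate pairs,
# exactly the situation the proposed warning would flag.
h = { :foo => 123 }.merge('foo' => 456)
p h.size   # 2: no warning, no error
p h[:foo]  # 123: the String key did not overwrite the Symbol key
```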

Such a warning would most likely help find design flaws and make
difficult-to-debug errors less frequent when refactoring, and
hopefully encourage a better design, or at least a little more thought
about the current one.

> > In Ruby "foo" + 123 raises a TypeError. Adding a string
> > key to a symbol-keyed Hash doesn't even show a warning.
>
> I don't see why it should; as long as it still
> responds to #hash and #eql?, it's a valid Hash key.

Both methods are specific to Ruby's associative array internals, which
use a hash table. Users generally care only about their
string-vs-symbol problems, until they realize that using strings for
keys is generally not a good idea because of the problems and
debugging time involved.
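For the record, the contract Piotr mentions is easy to demonstrate:
any object that implements #hash and #eql? consistently works as a
key. The class below is made up purely for illustration:

```ruby
# A hand-rolled key type: keys that are equal must agree on both
# #hash and #eql?, which is all Hash requires of them.
class CaseInsensitiveKey
  attr_reader :s

  def initialize(s)
    @s = s
  end

  def hash
    s.downcase.hash
  end

  def eql?(other)
    other.is_a?(CaseInsensitiveKey) && s.downcase == other.s.downcase
  end
end

h = { CaseInsensitiveKey.new('Foo') => 1 }
p h[CaseInsensitiveKey.new('FOO')]  # 1: found via #hash and #eql?
```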

Implementation-wise I think Hash is great. However, its flexibility,
along with the symbol/string similarities and ever more ingenious uses
of Hash, will probably only cause more problems over time.

Example:

Python doesn't have symbols but has named arguments. In Ruby we use a
symbol-keyed Hash to simulate the latter, which is great, but if the
hash is not symbol-keyed, there is no quick, standard way to handle
that. Sure, you can ignore or raise or convert, but why handle
something you should be able to prevent?
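To make the named-arguments point concrete, here is the usual idiom
and its silent failure mode when a caller slips in a String key (greet
and its option name are invented for the example):

```ruby
# Simulating named arguments with a symbol-keyed options hash.
def greet(name, opts = {})
  greeting = opts[:greeting] || 'Hello'
  "#{greeting}, #{name}!"
end

p greet('world', :greeting => 'Hi')   # "Hi, world!"
p greet('world', 'greeting' => 'Hi')  # "Hello, world!": String key silently ignored
```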

Ignoring keys you don't know seems like a good idea, but the result is
not very helpful when debugging obscure error messages. And let's face
it: most of the Ruby code people work on is not their own.

The only people who don't need to care are the experts who already
have the right habits and understanding that allows them to avoid
problems without too much thought. The rest have to learn the
hard way.

> Hashes in Ruby serve a lot of purposes (they even maintain insertion
> order); if you want to limit their functionality, feel free to subclass.

Why do I have to subclass Hash to get a useful named arguments
equivalent in Ruby? Why would I want object instances for argument
names? Why can't I choose *not* to have them in a simple way?

The overhead and effort required to maintain and use a subclass
becomes a good enough reason to give up on writing robust code.

Which is probably what most Rubyists do.

We have RBTree and HashWithIndifferentAccess. Neither really helps in
creating good APIs, for a number of reasons:

  - HWIA is for Rails specific cases but is usually abused to avoid
    costly string/symbol mistakes

  - RBTree is a gem most people don't know about, so they stick with
    Hash anyway. It adds an ordering requirement, but that seems like
    a side effect. It was proposed for inclusion in Ruby 1.9, but I
    don't remember why it ultimately wasn't added

  - the {} notation is too convenient to lose in the case of
    subclassing, especially when Hash is used for method parameters

  - in practice, you can only use the subclass in your own code

> There's nothing preventing you from subclassing Hash to
> create StringKeyHash, SymbolKeyHash or even MonoKeyHash
> that would limit the keys' class to the first one defined.

I thought about exactly that as a way to avoid subclassing: having an
alternative to the current Hash available as a standard Ruby
collection.

But now I think the idea is too limiting to be practical. From the
user's perspective, having Hash restrict its behavior the way RBTree
does would save people a lot of grief.

If Hash changed its behavior in the way described, most of the
existing code would work as usual. Manually replacing {} with a
subclass in a large project is a waste of time. Hashes are used too
often to even consider subclassing.

Consider regular expressions: you can specify options on a regexp,
defining its behavior. Having the same for hashes could be cool:

   {'a' => 3, :a => 3}/so  # s = strict, o = ordered

As examples, we could also have:

  r = uses RBTree for the Hash (and so implies 's')

  i = indifferent access, but not recommended (actually, I personally
  wouldn't want this as an option)

> How would you treat subclasses? Let's say I have a Hash with
> keys being instances of People, Employees and Volunteers (with
> Employees and Volunteers being subclasses of People). Should
> they all be allowed as keys in a single MonoKeyHash or not?

Good example of using a Hash to associate values with (even arbitrary)
objects!

Since requiring keys to be orderable already answers the part about
what is allowed into the Hash, I'll concentrate on the case where the
items are of different types.

How about an array of objects and a hash of object id's instead?

  [ person1, person2, ...]
  { person1.object_id => some_value, ... }

Or just use the results of #hash as the keys, if it is the object
contents that matter. This makes your intention more explicit.

  { person1.hash => some_value, ... }

If you really need different types as a way of associating values
with arbitrary objects, you could create a Hash of types, with each
type mapping to its object instances:

  {
    Fixnum => { 1 => "one", 2 => "two" },
    String => { "1" => "one", "2" => "two" },
  }

Then you can use hash[some_key.class][some_key] for access if you
*really* need the current behavior.

Not much harder to handle, but you get much more control over the hash
contents. You probably need to know which types are used in the
structure anyway to handle its contents (domain).
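Spelled out as runnable code (using Integer where the example above
says Fixnum, and a hypothetical lookup helper):

```ruby
# Partitioning keys by class keeps each inner Hash homogeneous.
by_type = {
  Integer => { 1 => 'one', 2 => 'two' },
  String  => { '1' => 'one', '2' => 'two' },
}

# Look up a key in the inner Hash matching its class; nil if either
# the class or the key is unknown.
def lookup(table, key)
  (table[key.class] || {})[key]
end

p lookup(by_type, 1)    # "one"
p lookup(by_type, '2')  # "two"
```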

> What about String-only keys, but with different
> keys having their own different singleton methods?
>
> (For discussion's sake: what about if a couple of the Strings
> had redefined #hash and #eql? methods, on an instance level?)

That's relying heavily on implementation-specific details, like
counting on Ruby hashes preserving insertion order or not. That
actually did change, yes, though I don't really remember the main
reason.

#hash and #eql? are called by Hash internally - if there is a good
reason for redefining these, there is probably a good way to do it
without relying on Hash internals.

If, for some fictional reason, Ruby used an rbtree internally for
Hash, #<=> would be used instead of #hash and #eql?. Everything else
would be the same, except for the allowed key values.

> > I think the meaning of symbols and hashes are too similar for such
> > different types to be allowed as keys in the same Hash instance.
>
> But that would introduce a huge exception in the current
> very simple model. Ruby is complicated enough; IMHO we
> should strive to make it less complicated, not more.

Novice users find symbols, strings and Hashes complicated and
confusing, and changing that is my focus here. A complex model that is
easily discoverable is probably better than a simple model that
requires complex solutions from its users to do a great job.

I know it takes hard work and countless hours to keep Ruby the fun and
great language it is, and I think it pays off nevertheless. If the
goal were to create a simple language with a simple implementation, we
might have had another Java instead.

Even if it results in an overly complex parser and implementation, I
think only good can come from going out of one's way to make Ruby
users' lives easier.

Which is why I really appreciate your input, and thank you for giving
me the motivation to understand the topic and Ruby internals better.

Thanks!

--
Cezary Baginski