Issue #15940 has been updated by byroot (Jean Boussier).


> If I understand your patch correctly

Yes you do.

> I feel this is an inconsistent and confusing behavior change. Am I wrong?

I don't know if you are wrong, but at least we don't agree.

My reasoning is as follows:

  - Simple symbols (read: pure ASCII) have to be coerced into a common encoding so that `# encoding: euc-jp :foo == # encoding: iso-8859-1 :foo`
  - UTF-8 is a strict superset of ASCII. Any valid ASCII is valid UTF-8.
  - To me, simple symbols being UTF-8 encoded isn't any weirder than them being ASCII encoded.
  - UTF-8 being the default Ruby source encoding, it makes sense for it to be the default internal symbol encoding.
  - If, like most Ruby users, my source is UTF-8 encoded, then it removes one source of surprise.
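
The first two points can be checked from Ruby itself. This is only a minimal sketch using core `String` and `Symbol` methods; the variable names are illustrative and not taken from the patch:

```ruby
# The same ASCII-only bytes, tagged with two different encodings,
# standing in for literals from files with different magic comments.
euc    = "foo".dup.force_encoding(Encoding::EUC_JP)
latin1 = "foo".dup.force_encoding(Encoding::ISO_8859_1)

# Pure-ASCII symbols are coerced to one common encoding, so both
# strings map to the very same Symbol object.
same_symbol = euc.to_sym.equal?(latin1.to_sym)

# Any valid ASCII byte sequence is also a valid UTF-8 byte sequence.
utf8 = "foo".dup.force_encoding(Encoding::UTF_8)
valid_utf8 = utf8.valid_encoding? && utf8.ascii_only?

puts same_symbol
puts valid_utf8
```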


> Besides that, I am not sure if this change worth saving 147KB or even 1.4MB in the apps that might consume a few hundred GB of memory.

That is entirely your call. I personally don't see any downside to this change, which is why even a minor memory saving is welcome to me. But if you do see a downside, then I agree the saving isn't big enough to justify it.

Also, a small nitpick: the 1.4MB saving is for an app consuming hundreds of MB, not GB.


----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79358

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242

It's not uncommon for symbols to have literal string counterparts, e.g.

```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```

Since the default source encoding is UTF-8 and symbols coerce their internal fstring to ASCII when possible, the above snippet will actually keep two instances of `"name"` in the fstring registry: one in ASCII, the other in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but allows the equivalent string literals to be reused in most cases.
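
The duplication can be approximated from Ruby code. This is a rough sketch, assuming `String#-@` hands back the interned copy of a string (its documented deduplicating behavior on Ruby 2.5 and later); it does not inspect the registry directly, and the variable names are illustrative:

```ruby
# The same four bytes under the two encodings discussed above.
ascii = "name".dup.force_encoding(Encoding::US_ASCII)
utf8  = "name".dup.force_encoding(Encoding::UTF_8)

# String#-@ returns a frozen, deduplicated copy of the string.
ascii_entry = -ascii
utf8_entry  = -utf8

# A second UTF-8 "name" is expected to resolve to the same object...
utf8_again = "name".dup.force_encoding(Encoding::UTF_8)
puts utf8_entry.equal?(-utf8_again)

# ...while the US-ASCII copy is a separate object with its own encoding,
# even though the two compare equal by content.
puts ascii_entry.equal?(utf8_entry)
puts ascii_entry == utf8_entry
puts ascii_entry.encoding
puts utf8_entry.encoding
```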

The only notable behavioral change is `Symbol#to_s`.

Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`.
After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.
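
To see what the change looks like from user code, here is a small sketch. The encoding it prints depends on whether the interpreter includes the patch, so no particular value is assumed for it:

```ruby
str = :name.to_s

# US-ASCII on a Ruby without the patch, UTF-8 with it applied.
puts str.encoding

# Either way the string is ASCII-only, so it compares equal to a UTF-8
# string and concatenates cleanly with non-ASCII UTF-8 text.
same_content = (str == "name".dup.force_encoding(Encoding::UTF_8))
combined = str + "\u00e9"

puts same_content
puts combined.encoding
```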

However, there are several Ruby specs asserting this behavior, and I don't know whether they can be changed: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453

If this specification is impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g. in Ruby pseudocode:

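```ruby
def to_s
  # `fstr` stands for the symbol's internal (UTF-8) fstring
  str = fstr.dup
  # Keep reporting US-ASCII for pure-ASCII symbols, as the specs expect
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end
```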
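
The same fallback can be tried outside the interpreter. This is a sketch only: `compat_to_s` is a hypothetical helper, and its `fstr` argument is an ordinary frozen UTF-8 string standing in for the symbol's internal fstring:

```ruby
# Hypothetical stand-in for the proposed Symbol#to_s fallback.
def compat_to_s(fstr)
  str = fstr.dup
  # Only pure-ASCII content is re-tagged; other strings stay UTF-8.
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end

ascii_name = compat_to_s("name".dup.force_encoding(Encoding::UTF_8).freeze)
kanji_name = compat_to_s("\u540d\u524d".freeze)

puts ascii_name.encoding
puts kanji_name.encoding
```

Because the result is a fresh copy, re-tagging it leaves the shared fstring itself untouched.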




