Issue #15940 has been updated by duerst (Martin Dürst).


naruse (Yui NARUSE) wrote:
> Note that an incompatibility which is caused by the change of string encoding is `String#<<(integer)`.
>
> Maybe String#<<(n) should be deprecated if n > 127 and explicitly specify the encoding argument.

If I understand this correctly, the proposal is to change the encoding of Symbols from ASCII to UTF-8. So if such a symbol is converted to a String (which in itself may not be that frequent), and then an Integer is 'shifted' into that String with `<<`, then the only incompatibility that we get is that until now, it was an error to do that with a number > 127.
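
A minimal sketch of that case, assuming current CRuby behaviour for `String#<<` with an Integer argument. The two strings are built by hand with `force_encoding` to stand in for the old and new results of `Symbol#to_s`; that substitution is an assumption for illustration, not the patched method. A code point above 255 is used because values from 128 to 255 appear to be treated specially for US-ASCII receivers in CRuby (the receiver is relabelled as binary rather than raising).

```ruby
# Stand-ins for what Symbol#to_s returns before and after the proposal
# (built by hand here; this is an illustration, not the patched method).
ascii_name = "name".dup.force_encoding(Encoding::US_ASCII)
utf8_name  = "name".dup.force_encoding(Encoding::UTF_8)

codepoint = 0x3042 # HIRAGANA LETTER A, well outside the ASCII range

# US-ASCII receiver: record whether << succeeds or raises RangeError.
ascii_result =
  begin
    ascii_name << codepoint
    :appended
  rescue RangeError
    :range_error
  end

# UTF-8 receiver: the Integer is interpreted as a Unicode code point.
utf8_name << codepoint

puts ascii_result        # which branch was taken for the US-ASCII receiver
puts utf8_name.encoding  # encoding of the UTF-8 receiver after <<
puts utf8_name.length    # number of characters after <<
```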

So the overall consequence is that something that produced an error up to now doesn't produce an error anymore. I guess that's an incompatibility that we should be able to tolerate. It's much more of a problem if something that worked until now stops working, or if something that worked one way suddenly works another way.

As for explicitly specifying an encoding argument for `String#<<`, I understand that it may be the conceptually correct thing to do (we are using the Integer as a character number, so we had better know what encoding this character number is expressed in). But the encoding is already available from the string, and in most cases it will be the source encoding anyway, which will usually be UTF-8. Also, because `<<` is a binary operator, it would be difficult to add additional parameters.
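
For code that wants to name the encoding explicitly today, one existing route is `Integer#chr`, which accepts an encoding, followed by an ordinary String append. A small sketch (variable names are illustrative only):

```ruby
# Convert the Integer to a one-character String in a named encoding first,
# then append that String; no extra parameter on << is needed.
name = "name".dup.force_encoding(Encoding::US_ASCII)
char = 0x3042.chr(Encoding::UTF_8)

# The receiver is ASCII-only, so appending a UTF-8 String is compatible.
name << char

puts char.encoding
puts name.encoding
```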


----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78910

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242

It's not uncommon for symbols to have literal string counterparts, e.g.

```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```

Since the default source encoding is UTF-8, and symbols coerce their internal fstring to ASCII when possible, the above snippet will actually keep two instances of `"name"` in the fstring registry: one in ASCII, the other in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but in most cases allows the equivalent string literals to be reused.
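
The duplication can be observed from Ruby with `String#-@`, which returns a frozen, possibly pre-existing copy of a string and, as far as I understand, goes through the same interning table. In this sketch the US-ASCII variant is built by hand to stand in for a symbol's internal string (an assumption for illustration; the symbol's own fstring is not directly reachable from Ruby):

```ruby
# Two strings with identical bytes but different encodings.
utf8_name  = "name".dup.force_encoding(Encoding::UTF_8)
ascii_name = "name".dup.force_encoding(Encoding::US_ASCII)

# String#-@ returns a frozen, possibly pre-existing (interned) copy.
interned_utf8  = -utf8_name
interned_ascii = -ascii_name

puts interned_utf8 == interned_ascii       # content comparison
puts interned_utf8.equal?(interned_ascii)  # object identity
puts interned_utf8.encoding
puts interned_ascii.encoding
```

The content compares equal, yet the two interned strings keep their own encodings and cannot be the same object, which is the duplication the patch tries to avoid.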

The only notable behavioral change is `Symbol#to_s`.

Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`.
After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.

However, there are several Ruby specs asserting this behavior, and I don't know whether they can be changed: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453

If this specification is impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g. in Ruby pseudocode:

```ruby
def to_s
  str = fstr.dup
  str.force_encoding(Encoding::ASCII) if str.ascii_only?
  str
end
```
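
The pseudocode above can be tried outside the VM as a plain method. In this sketch `fstr` is passed in as an argument and the method name `legacy_to_s` is made up for illustration; neither is part of the patch.

```ruby
# Hypothetical stand-alone version of the fallback: `fstr` plays the role
# of the symbol's internal UTF-8 string.
def legacy_to_s(fstr)
  str = fstr.dup
  # Only pure-ASCII content can safely be relabelled as US-ASCII.
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end

puts legacy_to_s("name").encoding       # ASCII-only input
puts legacy_to_s("n\u00E4me").encoding  # input with a non-ASCII character
```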





-- 
https://bugs.ruby-lang.org/

Unsubscribe: <mailto:ruby-core-request / ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>