Issue #14975 has been updated by duerst (Martin Dürst).


As mentioned, the general idea of Ruby m17n is that strings that only contain ASCII bytes (8th bits are all 0) are treated as US-ASCII and can be combined with any ASCII-compatible encoding (taking on that encoding as a result).
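
A minimal sketch of that rule, as I understand it (the variable names are only for illustration): an ASCII-only string takes on the encoding of whatever ASCII-compatible string it is combined with.

```ruby
ascii  = "abc".encode(Encoding::US_ASCII)                  # 7-bit bytes only
latin1 = "\xE9".dup.force_encoding(Encoding::ISO_8859_1)   # "é" as one ISO-8859-1 byte
utf8   = "\u00e9"                                          # "é" as two UTF-8 bytes

with_latin1 = ascii + latin1   # result takes on ISO-8859-1
with_utf8   = ascii + utf8     # result takes on UTF-8

puts with_latin1.encoding
puts with_utf8.encoding
```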

The problem with this is that the encoding of the first truly non-ASCII string wins, and the second such string (assuming it's in a different encoding) produces an error. I'm not exactly sure this is consistent with the higher-level policy of failing early in case of encoding mixes, but I think it may be difficult to change.
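
To illustrate (a sketch with made-up variable names, not taken from the issue): the buffer silently adopts the encoding of the first non-ASCII piece, and only the second piece triggers the error.

```ruby
buf    = String.new("abc", encoding: Encoding::US_ASCII)
latin1 = "\xE9".dup.force_encoding(Encoding::ISO_8859_1)
utf8   = "\u00e9"

buf << latin1                      # first non-ASCII piece "wins"
encoding_after_first = buf.encoding

error = begin
  buf << utf8                      # second piece, different encoding
  nil
rescue Encoding::CompatibilityError => e
  e
end

puts encoding_after_first
puts error.class
```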

In any case, when adding stuff to a (BINARY) buffer, the right thing conceptually is to change all the pieces to BINARY, not to rely on some of the pieces (be it the first or another) to be BINARY.

The problem with that is that it either changes the appended string's encoding (with `.force_encoding 'BINARY'`) or needs another copy (with `.b`). What I always wished we had is a method that forces the encoding of a string only locally, without leaking.
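
The trade-off, sketched with illustrative variable names: `.b` leaves the original alone but hands back a new object, while `force_encoding` avoids the copy but changes the string for everybody who holds a reference to it.

```ruby
original = "\u00ff".dup            # UTF-8, unfrozen
shared   = original                # second reference to the same object

copy = original.b                  # new String object, BINARY
encoding_after_b = original.encoding   # original is untouched

original.force_encoding(Encoding::BINARY)  # no copy, but mutates in place
leaked_encoding = shared.encoding          # the change is visible via `shared`

puts encoding_after_b
puts leaked_encoding
```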

One way to realize this might be the following (Ruby as pseudocode):

```ruby
class String
  alias :old_force_encoding :force_encoding

  def force_encoding(encoding)
    if block_given?
      enc = self.encoding    # remember current encoding
      old_force_encoding encoding
      begin
        yield
      ensure
        old_force_encoding enc # set encoding back, even if the block raises
      end
    else
      old_force_encoding encoding
    end
  end
end
```

This then would be used e.g. as follows:

```ruby
b = 'a'.force_encoding('BINARY')
u = "\u00ff".force_encoding('UTF-8')   # aside: force_encoding here is a no-op,
                                       # because \u automatically produces UTF-8
u.force_encoding('BINARY') do
  b << u
end
```

That in my opinion would be the conceptually right way to do things.

What remains is that `<<` for buffers is in many cases not very efficient; it can be way more efficient to collect the `String`s to be appended in an array and then do a join. So we should not only think about this issue for `<<`/`concat`/`append`, but also for more wholesale methods.
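
As a sketch of why the wholesale methods need the same attention (the names here are mine, not from the issue): `Array#join` runs into the same compatibility check, and the obvious workaround again pays one copy per piece.

```ruby
parts = ["\xFF".b, "\u00ff"]       # BINARY non-ASCII, then UTF-8 non-ASCII

join_error = begin
  parts.join                       # same encoding check as <<
  nil
rescue Encoding::CompatibilityError => e
  e
end

# Workaround: convert every piece first, at the cost of one copy each.
joined = parts.map(&:b).join

puts join_error.class
puts joined.encoding
p joined.bytes
```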

----------------------------------------
Feature #14975: String#append without changing receiver's encoding
https://bugs.ruby-lang.org/issues/14975#change-73784

* Author: ioquatix (Samuel Williams)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
I'm not sure where this fits in, but in order to avoid garbage and superfluous function calls, is it possible that `String#<<`, `String#concat` or the (proposed) `String#append` can avoid changing the encoding of the receiver?

Right now it's very tricky to do this in a way that doesn't require extra allocations. Here is what I do:

```ruby
class Buffer < String
	BINARY = Encoding::BINARY
	
	def initialize
		super
		
		force_encoding(BINARY)
	end
	
	def << string
		if string.encoding == BINARY
			super(string)
		else
			super(string.b) # Requires extra allocation.
		end
		
		return self
	end
	
	alias concat <<
end
```

When the receiver is binary but already contains non-ASCII byte sequences, appending a UTF-8 string that also contains non-ASCII characters can fail:

```
> "Foobar".b << "Føøbar"
=> "FoobarF\xC3\xB8\xC3\xB8bar"

> "Føøbar".b << "Føøbar"
Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
```

So, it's not possible to append data, generally, and then call `force_encoding(Encoding::BINARY)`. One must ensure the string is binary before appending it.

It would be nice if there was a solution which didn't require additional allocations/copies/linear scans for what should basically be a `memcpy`.

See also: https://bugs.ruby-lang.org/issues/14033 and https://bugs.ruby-lang.org/issues/13626#note-3

There are two options to fix this:

1/ Don't change receiver encoding in any case.
2/ Apply 1, but only when receiver is using `Encoding::BINARY`



