I am currently working on porting String#scrub and String#scrub! to
Rubinius (https://github.com/rubinius/rubinius/issues/2901). Looking at
the source code of this method in MRI
(https://github.com/ruby/ruby/blob/trunk/string.c#L8022) and the
corresponding tests, I can see that the code takes several different
paths. For example, if I'm reading it correctly, it uses different
replacement values depending on the input encoding.

Since my knowledge of C and my understanding of the MRI internals are
limited, I'd like to request some clarification on the behaviour of
these methods. In particular, I'd like to know the following:

* Which default replacement value is used for which input encoding when
  no custom one is given?

* How exactly are groups of invalid sequences determined and replaced?
  It seems that in some cases two invalid characters are replaced
  separately whereas in other cases they are replaced as a group.

* When exactly is Encoding::CompatibilityError raised? Only when the
  input String and the replacement are in non-matching encodings?
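For reference, this is roughly the script I have been using to observe
the current MRI behaviour for the first and third items. It is only a
sketch: probe_scrub is a helper name I made up, and the inputs are
arbitrary examples. As far as I can tell the default replacement is
U+FFFD for UTF-8 input and "?" for US-ASCII input, but I have not
confirmed what happens for every encoding, so the last call has no
expected result attached.

```ruby
# Hypothetical helper: returns the scrubbed String, or the exception
# class if scrub raises, so both outcomes can be printed side by side.
def probe_scrub(input, *args)
  input.scrub(*args)
rescue StandardError => e
  e.class
end

# The same invalid byte, tagged with two different encodings.
utf8_input  = "a\x80b".dup.force_encoding(Encoding::UTF_8)
ascii_input = "a\x80b".dup.force_encoding(Encoding::US_ASCII)

# Default replacement (no argument, no block).
p probe_scrub(utf8_input)   # observed: "a\uFFFDb"
p probe_scrub(ascii_input)  # observed: "a?b"

# Replacement in a different, non-ASCII encoding than the input.
# The outcome here is exactly what the third question is about.
p probe_scrub(utf8_input, "\u3042".encode(Encoding::EUC_JP))
```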

To clarify the second item, consider the following snippet:

    "\xE3\x80".scrub('-') # => "-"

Here the two invalid bytes get replaced as a group, resulting in only
one instance of "-". However, in the following snippet they are
replaced separately:

    "\x80\x80".scrub('-') # => "--"

Maybe I'm not fully understanding Unicode, but it would be nice if this
behaviour were documented somewhere, as right now it's not clear
whether this is intentional or a bug.
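The grouping can be made visible with the block form of scrub, which
receives each invalid run of bytes. The sketch below assumes UTF-8
string literals; show_groups is just a name I picked. My guess is that
"\xE3" is treated as the start of a three-byte sequence with "\x80" as
a valid continuation byte, so the truncated pair counts as one unit,
whereas two bare "\x80" bytes are each a stray continuation byte. I
would appreciate confirmation that this is the intended rule.

```ruby
# Hypothetical helper: wraps every invalid run in <...> using its hex
# bytes, so the boundaries between replaced groups become visible.
def show_groups(str)
  str.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }
end

puts show_groups("\xE3\x80")  # prints "<e380>": one group of two bytes
puts show_groups("\x80\x80")  # prints "<80><80>": two separate groups
```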

The closest thing to a spec of the behaviour I could find is
https://bugs.ruby-lang.org/issues/6752, but most of it is in Japanese,
a language I sadly can't read.

Thanks for the info!