Hi,

> * In what cases are certain replacement values used when no custom one
>   is given?

Current CRuby uses:
Unicode family (UTF-8, UTF-16BE/LE, UTF-32BE/LE): U+FFFD
others: "?"
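
As a rough illustration of those defaults (a sketch only, assuming a CRuby
that ships String#scrub, i.e. 2.1 or later; the sample byte sequences are
my own picks, not taken from the test suite):

```ruby
# Default replacement when String#scrub is called without an argument.
# The sample byte sequences below are illustrative assumptions.

# UTF-8 (Unicode family): a truncated 3-byte sequence becomes U+FFFD.
utf8 = "abc\xE3\x80".dup.force_encoding(Encoding::UTF_8)
utf8_scrubbed = utf8.scrub

# UTF-16LE (Unicode family): a lone high surrogate becomes U+FFFD,
# encoded in UTF-16LE.
utf16 = "\x00\xD8".dup.force_encoding(Encoding::UTF_16LE)
utf16_scrubbed = utf16.scrub

# EUC-JP (not Unicode): a dangling lead byte becomes "?".
eucjp = "\xA4\xA2\xA4".dup.force_encoding(Encoding::EUC_JP)
eucjp_scrubbed = eucjp.scrub

p utf8_scrubbed == "abc\uFFFD"
p utf16_scrubbed == "\uFFFD".encode(Encoding::UTF_16LE)
p eucjp_scrubbed.bytes == [0xA4, 0xA2, 0x3F] # 0x3F is "?"
```

The result always stays in the encoding of the input string.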

> * How exactly are groups of invalid sequences determined and replaced?
>   It seems that in some cases two invalid characters are replaced
>   separately whereas in other cases they are replaced as a group.

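It follows the Unicode spec (section 5.22, "Best Practice for U+FFFD
Substitution"):
http://www.unicode.org/versions/Unicode6.2.0/ch05.pdf
That practice is to replace each maximal subpart of an ill-formed
subsequence with a single U+FFFD. In your examples, "\xE3\x80" is the
beginning of a 3-byte sequence that is cut short, so it is one maximal
subpart and gets one replacement. "\x80" can never start a sequence, so
each "\x80" is replaced on its own.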
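
A small sketch of the maximal-subpart rule for UTF-8, using "-" as a
visible replacement. The first two inputs are the ones from the question;
the `utf8` helper and the third input are only there for illustration:

```ruby
# Hypothetical helper: tag raw bytes as UTF-8 without validating them.
def utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

# "\xE3\x80" is the start of a 3-byte sequence that is cut short:
# the two bytes form one maximal subpart, so one replacement.
truncated = utf8("\xE3\x80").scrub("-")

# "\x80" is a continuation byte and cannot start a sequence:
# each byte is replaced on its own, so two replacements.
stray = utf8("\x80\x80").scrub("-")

# "\xF4\x80\x80" is a 4-byte sequence missing its last byte,
# followed by a valid "A": the three bytes are replaced as one unit.
mixed = utf8("A\xF4\x80\x80A").scrub("-")

p truncated # => "-"
p stray     # => "--"
p mixed     # => "A-A"
```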

> * When exactly would Encoding::CompatibilityError be raised? When both
>   the input String and replacement are in non matching encodings?

It follows this logic:

if the replacement string is broken
  raise ArgumentError
else if the coderange of the replacement is 7bit
  if the input is not ASCII compatible
    raise Encoding::CompatibilityError
  end
else
  if the encodings of the input and the replacement differ
    raise Encoding::CompatibilityError
  end
end

So an ASCII-only replacement works with any ASCII-compatible input, and
any other replacement must be in exactly the same encoding as the input.
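
For a port, the same checks could be sketched in plain Ruby roughly as
follows. This is only an approximation: `check_scrub_replacement` and
`error_class` are hypothetical names, and `valid_encoding?` / `ascii_only?`
stand in for the internal coderange flags.

```ruby
# Hypothetical helper (not part of Ruby): mirrors the checks above using
# public String and Encoding methods instead of the internal coderange.
def check_scrub_replacement(input_encoding, replacement)
  if !replacement.valid_encoding?
    # The replacement itself is a broken byte sequence.
    raise ArgumentError, "replacement must be a valid byte sequence"
  elsif replacement.ascii_only?
    # 7bit replacement: fine for any ASCII-compatible input encoding.
    unless input_encoding.ascii_compatible?
      raise Encoding::CompatibilityError,
            "incompatible encodings: #{input_encoding} and #{replacement.encoding}"
    end
  elsif input_encoding != replacement.encoding
    # Non-ASCII replacement: the encodings have to match exactly.
    raise Encoding::CompatibilityError,
          "incompatible encodings: #{input_encoding} and #{replacement.encoding}"
  end
  replacement
end

# Hypothetical helper: class of the error raised by the block, or nil.
def error_class
  yield
  nil
rescue ArgumentError, Encoding::CompatibilityError => e
  e.class
end

# Invalid inputs (scrub only looks at the replacement when the input is invalid).
broken_utf8  = "\xE3\x81".dup.force_encoding(Encoding::UTF_8)
broken_utf16 = "\x00\xD8".dup.force_encoding(Encoding::UTF_16LE)

# Sample replacements.
broken_repl = "\x81".dup.force_encoding(Encoding::UTF_8)      # invalid UTF-8
eucjp_repl  = "\xA4\xA2".dup.force_encoding(Encoding::EUC_JP) # valid, non-ASCII
ascii_repl  = "?".dup.force_encoding(Encoding::US_ASCII)      # 7bit

p error_class { broken_utf8.scrub(broken_repl) }  # => ArgumentError
p error_class { broken_utf16.scrub(ascii_repl) }  # => Encoding::CompatibilityError
p error_class { broken_utf8.scrub(eucjp_repl) }   # => Encoding::CompatibilityError
p error_class { broken_utf8.scrub(ascii_repl) }   # => nil
```

Note that String#scrub returns early when the input is already valid, so
none of these errors show up for a valid input string.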

Thanks,


2014-01-24 Yorick Peterse <yorickpeterse / gmail.com>:
> I am currently working on porting String#scrub and String#scrub! to
> Rubinius (https://github.com/rubinius/rubinius/issues/2901). Looking at
> the source code of this method in MRI
> (https://github.com/ruby/ruby/blob/trunk/string.c#L8022) and the
> corresponding tests there are several different paths the code takes.
> For example, if I'm reading it correctly it will use different
> replacement values depending on the input encoding.
>
> Since my knowledge of C and my understanding of the MRI internals are
> limited, I'd like to request some clarification on the behaviour of these
> methods. In particular, I'd like to know the following:
>
> * In what cases are certain replacement values used when no custom one
>   is given?
>
> * How exactly are groups of invalid sequences determined and replaced?
>   It seems that in some cases two invalid characters are replaced
>   separately whereas in other cases they are replaced as a group.
>
> * When exactly would Encoding::CompatibilityError be raised? When both
>   the input String and replacement are in non matching encodings?
>
> To clarify the second item, consider the following snippet:
>
>     "\xE3\x80".scrub('-') # => "-"
>
> Here the two sequences get replaced as a group, resulting in only one
> instance of "-". However, in the following snippet they are replaced
> separately:
>
>     "\x80\x80".scrub('-') # => "--"
>
> Maybe I'm not fully understanding Unicode but it would be nice if this
> behaviour was documented somewhere as right now it's not clear whether
> this is intentional or a bug.
>
> The closest thing to a spec of the behaviour I could find is
> https://bugs.ruby-lang.org/issues/6752 but most of this is in Japanese,
> a language I sadly can't read.
>
> Thanks for the info!



-- 
NARUSE, Yui  <naruse / airemix.jp>