Issue #10770 has been updated by Masaki Kagaya.


This issue comes from a discussion about mruby's behavior (https://github.com/mruby/mruby/issues/2708).

----------------------------------------
Feature #10770: chr and ord behavior for ill-formed byte sequences and surrogate code points
https://bugs.ruby-lang.org/issues/10770#change-51165

* Author: Masaki Kagaya
* Status: Open
* Priority: Normal
* Assignee: 
----------------------------------------
ord raises an error when it meets an ill-formed byte sequence, so each_char and each_codepoint behave differently.

<pre><code class="ruby">
str = "a\x80bc"
str.each_char {|c| puts c }
 # no error
str.each_codepoint {|c| puts c }
 # invalid byte sequence in UTF-8 (ArgumentError)
</code></pre>

One way to keep them consistent is to change ord to return a substitute code point such as 0xFFFD, as adopted by scrub.
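
As a rough sketch of what that substitution would look like, the snippet below emulates it with the existing scrub method. The helper name `lenient_codepoints` is hypothetical and only illustrates the proposal; it is not an existing or planned API.

```ruby
# Hypothetical helper (not part of Ruby): replace each ill-formed byte
# sequence with U+FFFD via String#scrub, then enumerate code points.
def lenient_codepoints(str)
  str.scrub("\u{FFFD}").codepoints
end

str = "a\x80bc"
p lenient_codepoints(str) # the ill-formed byte 0x80 becomes 0xFFFD (65533)
```

With this, each_char and the code point enumeration both complete without raising, although each_char still yields the raw ill-formed byte rather than the replacement character.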

Another consistency problem concerns surrogate code points. Although CRuby allows surrogate code points in Unicode literals, ord and chr don't allow them.

<pre><code class="ruby">
"\uD800".ord
 # invalid byte sequence in UTF-8 (ArgumentError)

0xD800.chr('UTF-8')
 # invalid codepoint 0xD800 in UTF-8 (RangeError)
</code></pre>

How about removing the restriction? One use case for surrogate code points is converting a 4-byte character into a pair of 3-byte characters for MySQL/MariaDB's utf8mb3.

<pre><code class="ruby">
str = "\u{1F436}" # DOG FACE
cp = str.ord

if cp >= 0x10000 # supplementary code points start at U+10000
  # http://unicode.org/faq/utf_bom.html#utf16-4
  lead = 0xD800 - (0x10000 >> 10) + (cp >> 10)
  trail = 0xDC00 + (cp & 0x3FF)
  # Currently raises RangeError; this line works only if the restriction is removed.
  ret = lead.chr('UTF-8') + trail.chr('UTF-8')
end
</code></pre>
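
For comparison, here is a minimal sketch of how the same result can be produced today without chr, by building the 3-byte sequences by hand. The helper names `three_byte_utf8` and `to_surrogate_bytes` are hypothetical, and whether a given database accepts the resulting bytes is not checked here.

```ruby
# Hypothetical helper: encode a code point in U+0800..U+FFFF using the
# 3-byte UTF-8 bit layout, without validating surrogates.
def three_byte_utf8(cp)
  bytes = [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  bytes.pack("C*").force_encoding(Encoding::UTF_8)
end

# Hypothetical helper: split a supplementary character into a surrogate
# pair and encode each half as a 3-byte sequence.
def to_surrogate_bytes(char)
  cp = char.ord
  return char if cp < 0x10000
  lead = 0xD800 - (0x10000 >> 10) + (cp >> 10)
  trail = 0xDC00 + (cp & 0x3FF)
  three_byte_utf8(lead) + three_byte_utf8(trail)
end

ret = to_surrogate_bytes("\u{1F436}") # DOG FACE
p ret.bytes.map { |b| format("%02X", b) }
```

The resulting string is tagged as UTF-8 but is not a valid UTF-8 sequence, so ord still raises on it. Allowing surrogates in chr and ord would make the manual bit manipulation unnecessary.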



-- 
https://bugs.ruby-lang.org/