Issue #14934 has been updated by duerst (Martin Dürst).


I think I have figured things out:

The patch is technically correct. While LBASE and VBASE are the values of the first actual leading and vowel jamos, the value of TBASE is one smaller than the first actual trailing jamo at 0x11A8. This is to account for the fact that the lowest value of the "trailing digit" of the Hangul syllable representation indicates the absence of a trailing jamo. So in contrast to the <= tests related to LBASE and VBASE, it is indeed technically correct to have a < comparison operator in the comparison related to TBASE.
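To make the role of TBASE concrete, here is a minimal sketch of the index arithmetic. The constant values are those of the Unicode Standard's Hangul syllable algorithm; the helper names hangul_indices and hangul_jamos are made up for this illustration and are not part of normalize.rb.

```ruby
# Constants from the Unicode Standard's Hangul syllable algorithm.
SBASE  = 0xAC00
LBASE  = 0x1100   # first leading jamo
VBASE  = 0x1161   # first vowel jamo
TBASE  = 0x11A7   # one below the first actual trailing jamo, U+11A8
VCOUNT = 21
TCOUNT = 28       # 27 trailing jamos plus the "no trailing jamo" slot

# Hypothetical helper: split a precomposed syllable into its
# leading, vowel and trailing indices.
def hangul_indices(syllable)
  s_index = syllable.ord - SBASE
  l_index = s_index / (VCOUNT * TCOUNT)
  v_index = (s_index % (VCOUNT * TCOUNT)) / TCOUNT
  t_index = s_index % TCOUNT
  [l_index, v_index, t_index]
end

# Hypothetical helper: decompose a syllable into jamos. A trailing index
# of 0 means "no trailing jamo", so nothing is emitted for it.
def hangul_jamos(syllable)
  l, v, t = hangul_indices(syllable)
  jamos = [LBASE + l, VBASE + v]
  jamos << TBASE + t if t > 0
  jamos.pack('U*')
end
```

With these definitions, U+AC00 has trailing index 0 and decomposes to two jamos, while U+AC01 has trailing index 1 and decomposes to three, the last one being U+11A8 (TBASE + 1).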

However, I have also figured out why this apparent bug doesn't actually affect Ruby. The reason is that we use regular expressions to extract "normalization runs" from the string to be normalized. We know that a U+11A7 character can never participate in a normalization operation because it is a classical Hangul Jamo not used in modern Hangul. So U+11A7 never appears in a normalization run, and there's thus no error.
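The effect of extracting runs first can be sketched with a deliberately simplified pattern. HANGUL_RUN and hangul_runs below are assumptions made for illustration only; they are not the regular expression that Ruby actually uses, which covers far more than Hangul.

```ruby
# Hypothetical, simplified pattern: one modern leading jamo, one modern
# vowel jamo, and an optional modern trailing jamo (U+11A8..U+11C2).
# U+11A7 is outside every character class here.
HANGUL_RUN = /[\u1100-\u1112][\u1161-\u1175][\u11A8-\u11C2]?/

# Hypothetical helper: return the runs that would be handed to composition.
def hangul_runs(string)
  string.scan(HANGUL_RUN)
end
```

For the input U+1100 U+1161 U+11A7, the only run found is the two-character sequence U+1100 U+1161, so a composition routine working on that run never sees U+11A7 at index 2.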

----------------------------------------
Bug #14934: Unicode: Hangul normalize bug
https://bugs.ruby-lang.org/issues/14934#change-73156

* Author: MaLin (Lin Ma)
* Status: Open
* Priority: Normal
* Assignee: duerst (Martin Dürst)
* Target version: 
* ruby -v: 
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
I was involved in fixing a similar bug in Python, and I found that Ruby has the same buggy code.

We should fix this line[1] like this:
[1] https://github.com/ruby/ruby/blob/96db72ce38b27799dd8e80ca00696e41234db6ba/lib/unicode_normalize/normalize.rb#L73

-if length>2 and 0 <= (trail=string[2].ord-TBASE) and trail < TCOUNT
+if length>2 and 0 < (trail=string[2].ord-TBASE) and trail < TCOUNT
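The difference between the two comparisons can be illustrated with a simplified sketch. This is not the code from normalize.rb: the module TrailCheckDemo, the method compose and the strict keyword are made up for this example, and the method assumes its input starts with a valid modern leading jamo and vowel jamo.

```ruby
module TrailCheckDemo
  SBASE  = 0xAC00
  LBASE  = 0x1100
  VBASE  = 0x1161
  TBASE  = 0x11A7
  VCOUNT = 21
  TCOUNT = 28

  # Compose leading + vowel (+ optional trailing) jamos into one syllable.
  # strict: true uses the patched `0 < trail` test,
  # strict: false uses the original `0 <= trail` test.
  def self.compose(string, strict:)
    syllable = SBASE +
               ((string[0].ord - LBASE) * VCOUNT + (string[1].ord - VBASE)) * TCOUNT
    consumed = 2
    if string.length > 2
      trail = string[2].ord - TBASE
      lower_ok = strict ? (0 < trail) : (0 <= trail)
      if lower_ok && trail < TCOUNT
        syllable += trail
        consumed = 3
      end
    end
    # Append whatever was not consumed by the composition.
    [syllable].pack('U') + string[consumed..-1]
  end
end
```

Called directly on U+1100 U+1161 U+11A7, the `0 <=` variant consumes U+11A7 with a trailing index of 0 and returns only U+AC00, silently dropping a character. The `0 <` variant returns U+AC00 followed by U+11A7. For a real trailing jamo such as U+11A8 both variants agree.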

-------
The demonstration code in the Unicode Standard was changed.

Before Unicode 4.1.0 (draft), the check was: TBase <= code <= TBase+TCount
see: http://www.unicode.org/reports/tr15/tr15-24.html#hangul_composition

Since Unicode 4.1.0, the check is: TBase < code < TBase+TCount, which is in line with Unicode 10.0
see: http://www.unicode.org/reports/tr15/tr15-25.html#hangul_composition

This change happened in 2005.

Please note: The normalization algorithm itself didn't change, only the demonstration code did; see this discussion[2] on that point.
[2] https://bugs.python.org/issue29456

-------
Here is some test code[3] for Python, which may be useful for this fix.
[3] https://github.com/python/cpython/commit/d134809cd3764c6a634eab7bb8995e3e2eff14d5



-- 
https://bugs.ruby-lang.org/
