Bug #2130: incorrect UTF8 encoding in CGI.unescapeHTML
http://redmine.ruby-lang.org/issues/show/2130

Author: Larry Kyrala
Status: Open, Priority: Normal
ruby -v: ruby 1.8.6 (2009-06-08 patchlevel 369) [x86_64-linux]

In CGI.unescapeHTML() in cgi.rb, hexadecimal character references are decoded as follows
(from http://stdlib.rubyonrails.org/libdoc/cgi/rdoc/classes/CGI.html#M000105):

      when /\A#x([0-9a-f]+)\z/ni then
        if $1.hex < 256
          $1.hex.chr
        else
          if $1.hex < 65536 and ($KCODE[0] == ?u or $KCODE[0] == ?U)
            [$1.hex].pack("U")

The second line should be:
        if $1.hex < 128

so that, in UTF-8 mode, code points 80-FF also go through pack("U") and the output conforms to the UTF-8 standard.
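
The proposed change can be sketched in isolation as follows. This is only an illustration: unescape_hex_ref is a hypothetical helper, not a method in cgi.rb, and it assumes UTF-8 output is always wanted, so it leaves out the $KCODE check and the 65536 limit from the original code.

```ruby
# Hypothetical helper (not part of cgi.rb): decodes the hex digits of a
# character reference such as &#xE9;, assuming UTF-8 output is wanted.
def unescape_hex_ref(hex_digits)
  code = hex_digits.hex
  if code < 128
    code.chr            # 7-bit ASCII: a single byte is already valid UTF-8
  else
    [code].pack("U")    # 0x80 and above: encode as a multi-byte UTF-8 sequence
  end
end

ascii_bytes  = unescape_hex_ref("41").unpack("C*")  # bytes for U+0041
latin1_bytes = unescape_hex_ref("e9").unpack("C*")  # bytes for U+00E9
```

With the 128 threshold, the reference for U+00E9 should come back as the two bytes C3 A9 rather than the single byte E9.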

Explanation:
The input to the unescapeHTML() method is assumed to be valid HTML. The output is apparently intended to be a valid UTF-8 Ruby string (see Array#pack("U")). However, for hex values 80-FF, pack is bypassed (the $1.hex < 256 test above) and chr emits a single raw byte instead, so these characters are unescaped incorrectly.

According to the HTML 4.01 spec, hex character references from 80-FF are valid HTML, since they conform to the "ISO 10646 hexadecimal character number H". Although such a reference is valid HTML, it is important to note that a single byte above 7F is not valid UTF-8; the code point has to be converted to its two-byte equivalent as per the UTF-8 specification (U+H). (Note that one-byte encodings from 80-FF are also not valid XML, since the XML spec requires entity encodings to be valid UTF-8 sequences.)
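
To make the one-byte versus two-byte difference concrete, here is a small sketch for the reference &#xE9; (U+00E9). The variable names are only illustrative, and the manual bit arithmetic covers just the two-byte range 80-7FF described in the UTF-8 specification.

```ruby
# Code point U+00E9, as written in the reference &#xE9;
code = 0xE9

# What the `$1.hex < 256` branch yields: one raw byte, E9.
single_byte = code.chr.unpack("C*")

# Two-byte UTF-8 layout for code points 0x80..0x7FF: 110xxxxx 10xxxxxx
manual = [0xC0 | (code >> 6), 0x80 | (code & 0x3F)]

# Array#pack("U") performs the same encoding.
packed = [code].pack("U").unpack("C*")
```

Here single_byte holds only E9, while manual and packed both hold C3 A9, which is the sequence a UTF-8 consumer expects for this character.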

Background:
I found this error while debugging a Java-based web service that returns HTML-escaped entities. The bug is partly in the web service (since it is XML-based, not HTML-based), but it led me to find the CGI.unescapeHTML bug while trying to implement a workaround. This is a borderline pedantic issue, but I figured it might help other people having this problem. Also, I might have made a mistake somewhere in my interpretation of the spec or of the intent of the code, so feel free to comment. Thanks!


References:
http://www.w3.org/TR/html401/charset.html#h-5.3.1
http://www.w3.org/TR/2008/REC-xml-20081126/#sec-external-ent
http://en.wikipedia.org/wiki/UTF-8#Description
http://en.wikipedia.org/wiki/ISO_10646
http://corelib.rubyonrails.org/classes/Array.html#M000460
http://stdlib.rubyonrails.org/libdoc/cgi/rdoc/classes/CGI.html#M000105

