Issue #7090 has been updated by stefan (Stefan Lang).


UTF-16BE is affected as well:

irb(main):003:0> s = "".force_encoding('utf-16be')
=> ""
irb(main):004:0> s << 0x20
=> "\u0000"
irb(main):005:0> s << 0x300
=> "\u0000\u0300"
----------------------------------------
Bug #7090: UTF-16LE String#<< append 0x0 for certain codepoints
https://bugs.ruby-lang.org/issues/7090#change-29807

Author: stefan (Stefan Lang)
Status: Open
Priority: Normal
Assignee: 
Category: 
Target version: 
ruby -v: ruby 1.9.3p194 (2012-04-20) [x86_64-linux]


  $ irb193 -r unicode_utils/u
  irb(main):001:0> RUBY_VERSION
  => "1.9.3"
  irb(main):002:0> s1 = "".force_encoding('utf-16le')
  => ""
  irb(main):003:0> s1 << 0x20
  => " "
  irb(main):004:0> s1 << 0x300
  => " \u0000"
  irb(main):005:0> U.debug s1
   Char | Ordinal | Sid   | General Category | UTF-8
  ------+---------+-------+------------------+-------
   " "  |      20 | SPACE | Space_Separator  | 20
   N/A  |       0 | NULL  | Control          | 00
  => nil
  irb(main):006:0> s2 = "".force_encoding('utf-8')
  => ""
  irb(main):007:0> s2 << 0x20
  => " "
  irb(main):008:0> s2 << 0x300
  => " ̀"
  irb(main):009:0> U.debug s2
   Char | Ordinal | Sid                    | General Category | UTF-8
  ------+---------+------------------------+------------------+-------
   " "  |      20 | SPACE                  | Space_Separator  | 20
   N/A  |     300 | COMBINING GRAVE ACCENT | Nonspacing_Mark  | CC 80
  => nil

IMO, the behaviour with the UTF-8 string is correct.

  $ ri193 'String#<<'
  = String#<<

  (from ruby core)
  ------------------------------------------------------------------------------
    str << integer       -> str
    str.concat(integer)  -> str
    str << obj           -> str
    str.concat(obj)      -> str
     

  ------------------------------------------------------------------------------

  Append---Concatenates the given object to str. If the object is a
  Integer, it is considered as a codepoint, and is converted to a character
  before concatenation.

    a = "hello "
    a << "world"   #=> "hello world"
    a.concat(33)   #=> "hello world!"

AFAIK, a Ruby 1.9 string can be viewed as either 1) a sequence of raw bytes,
or 2) a sequence of codepoints.
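
As a rough illustration of these two views (the helper name two_views is
made up for this example and is not part of Ruby):

```ruby
# Hypothetical helper, for illustration only: returns both views of a string.
def two_views(str)
  { :bytes => str.bytes.to_a, :codepoints => str.codepoints.to_a }
end

# SPACE followed by COMBINING GRAVE ACCENT; pack('U*') yields a UTF-8 string.
s = [0x20, 0x300].pack('U*')

views = two_views(s)
views[:bytes]       # [0x20, 0xCC, 0x80]  (view 1: raw bytes)
views[:codepoints]  # [0x20, 0x300]       (view 2: codepoints)
```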

Except perhaps in regexes, Ruby has no higher-level concept of a "character"
than a codepoint, so I don't know what "and is converted to a character
before concatenation" is supposed to mean.

If we take the sequence of codepoints view, then "str << integer" simply
appends a codepoint.

If we take the sequence of bytes view, then "str << integer" converts
the codepoint into the sequence of bytes that represents it in
str.encoding and appends that sequence of bytes.
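
A minimal sketch of the result I would expect under both views. The helpers
expected_bytes and append_codepoint are hypothetical names introduced here;
they go through pack('U') and String#encode rather than String#<<, and are
not meant to describe how Ruby implements the operator:

```ruby
# Hypothetical helper: the bytes a codepoint should contribute to a string
# in the given encoding ("sequence of bytes" view).
def expected_bytes(codepoint, encoding)
  [codepoint].pack('U').encode(encoding).bytes.to_a
end

# Hypothetical stand-in for "str << integer": returns a new string with the
# codepoint appended in str's own encoding, without using String#<<.
def append_codepoint(str, codepoint)
  str + [codepoint].pack('U').encode(str.encoding)
end

expected_bytes(0x300, 'UTF-8')     # [0xCC, 0x80]
expected_bytes(0x300, 'UTF-16LE')  # [0x00, 0x03]
expected_bytes(0x300, 'UTF-16BE')  # [0x03, 0x00]

s = ''.encode('UTF-16LE')          # empty UTF-16LE string
s = append_codepoint(s, 0x20)
s = append_codepoint(s, 0x300)
s.codepoints.to_a                  # [0x20, 0x300]
s.bytes.to_a                       # [0x20, 0x00, 0x00, 0x03]
```

With this, the UTF-16LE string ends in COMBINING GRAVE ACCENT, matching the
UTF-8 case, rather than in NULL.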


-- 
http://bugs.ruby-lang.org/