Issue #8516 has been reported by bbxiao1 (Xiao Ba).

----------------------------------------
Bug #8516: IO#readchar returns wrong codepoints when converting encoding
https://bugs.ruby-lang.org/issues/8516

Author: bbxiao1 (Xiao Ba)
Status: Open
Priority: Normal
Assignee: 
Category: 
Target version: 
ruby -v: ruby 1.9.3p429 (2013-05-15 revision 40747) [x86_64-darwin11.4.2]
Backport: 1.9.3: UNKNOWN, 2.0.0: UNKNOWN


I am trying to parse plain text files in various encodings, converting their contents to UTF-8 strings as I read them. Non-ASCII characters work fine when the file is encoded as UTF-8, but not when the file uses another encoding such as ISO-8859-1.

$ file -i utf_8.txt
utf_8.txt: text/plain; charset=utf-8

$ file -i iso_8859_1.txt
iso_8859_1.txt: text/plain; charset=iso-8859-1
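
The input files are not attached to this report. A minimal sketch for recreating them, assuming (from the output further down) that each file starts with the single line áÁð; the exact contents are a guess and not confirmed by the reporter:

```ruby
# Assumed file contents, inferred from the output in this report: á Á ð newline.
content = "\u00E1\u00C1\u00F0\n"

# "wb" writes the bytes as-is, so the explicit encode calls decide the encoding.
File.open("utf_8.txt", "wb")      { |f| f.write(content.encode(Encoding::UTF_8)) }
File.open("iso_8859_1.txt", "wb") { |f| f.write(content.encode(Encoding::ISO_8859_1)) }
```

The UTF-8 file should then hold two bytes per accented character, and the ISO-8859-1 file one byte per character.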

Code:
utf_8_file = "utf_8.txt"
iso_file = "iso_8859_1.txt"

puts "Processing #{utf_8_file}"
File.open(utf_8_file) do |io|
  line, char = "", nil

  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "Character #{char} codepoints: #{char.each_codepoint.to_a.join(', ')}"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end

  line
end
puts "\n" 
puts "Processing #{iso_file}"
File.open(iso_file) do |io|
  io.set_encoding("#{Encoding::ISO_8859_1}:#{Encoding::UTF_8}")
  line, char = "", nil

  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "Character #{char} codepoints: #{char.each_codepoint.to_a.join(', ')}"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end

  line
end

Output:
Processing utf_8.txt
Character á has 1 codepoints
Character á codepoints: 225
Character Á has 1 codepoints
Character Á codepoints: 193
Character ð has 1 codepoints
Character ð codepoints: 240
Character 
 has 1 codepoints
Character 
 codepoints: 10

Processing iso_8859_1.txt
Character á has 2 codepoints
Character á codepoints: 195, 161
SLICE FAIL
Character Á has 2 codepoints
Character Á codepoints: 195, 129
SLICE FAIL
Character ð has 2 codepoints
Character ð codepoints: 195, 176
SLICE FAIL
Character 
 has 1 codepoints
Character 
 codepoints: 10

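With the ISO-8859-1 encoded file, each_codepoint on the string returned by readchar yields the bytes of the character's UTF-8 encoding (e.g. 195, 161 for á), whereas I would expect a single codepoint (225), as with the UTF-8 file.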
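
For reference, the relation between those numbers can be checked in memory without any IO. The sketch below also shows one possible workaround, which is an assumption on my part and not something confirmed for 1.9.3p429: set only the external encoding and transcode each character with String#encode. The helper name read_line_as_utf8 is made up for illustration.

```ruby
# Byte 0xE1 is "á" in ISO-8859-1; transcoding it gives one UTF-8 character.
iso_char  = [0xE1].pack("C").force_encoding(Encoding::ISO_8859_1)
utf8_char = iso_char.encode(Encoding::UTF_8)

utf8_bytes      = utf8_char.bytes.to_a           # the two bytes seen in the report
utf8_codepoints = utf8_char.each_codepoint.to_a  # the single codepoint expected

# Hypothetical workaround: no IO-level conversion, explicit per-character encode.
# Assumes Encoding.default_internal is nil, so readchar returns ISO-8859-1 strings.
def read_line_as_utf8(path)
  chars = []
  File.open(path, "r:ISO-8859-1") do |io|
    until io.eof?
      char = io.readchar.encode(Encoding::UTF_8)
      chars << char
      break if char == "\n" || char == "\r"
    end
  end
  chars.join
end
```

Calling read_line_as_utf8("iso_8859_1.txt") should return the first line as a UTF-8 string, terminator included.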

