Issue #9028 has been reported by whitehat101 (Jeremy Ebler).

----------------------------------------
Bug #9028: Make SSLSocket Support Encodings
https://bugs.ruby-lang.org/issues/9028

Author: whitehat101 (Jeremy Ebler)
Status: Open
Priority: Normal
Assignee: drbrain (Eric Hodel)
Category: 
Target version: 
ruby -v: 1.9.3, 2.0.0-p0
Backport: 1.9.3: UNKNOWN, 2.0.0: UNKNOWN


I was working on a bug in the xmpp4r project that caused REXML exceptions when receiving UTF-8 Strings.
https://github.com/xmpp4r/xmpp4r/issues/13

The issue ended up being that SSLSocket#readline didn't always return strings with the same encoding. It gave plain ASCII strings an encoding of UTF-8, and UTF-8 strings an encoding of ASCII-8BIT. We were passing the SSLSocket directly to REXML::Parsers::SAX2Parser and REXML throws exceptions when the input is not UTF-8.
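
One possible explanation for the mixed encodings, sketched below with plain strings and no socket, is Ruby's rule for appending to a string. The sketch assumes that the read buffer starts out as an empty UTF-8 string and that binary chunks are appended to it. The `buffered` helper is hypothetical and only stands in for that buffer; this is not a confirmed diagnosis of the OpenSSL code.

```ruby
# Hypothetical stand-in for a read buffer: an empty UTF-8 string that
# receives binary (ASCII-8BIT) chunks, as a socket read might deliver them.
def buffered(chunk)
  buffer = String.new.force_encoding(Encoding::UTF_8)
  buffer << chunk
  buffer
end

ascii_line = buffered("hello\n".b)         # only 7-bit bytes
utf8_line  = buffered("caf\xC3\xA9\n".b)   # contains a multibyte character

# Appending 7-bit data keeps the receiver's encoding ...
p ascii_line.encoding == Encoding::UTF_8       # => true
# ... but appending non-ASCII binary data switches it to ASCII-8BIT.
p utf8_line.encoding == Encoding::ASCII_8BIT   # => true
```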

Our workaround was to subclass SSLSocket so that it always returns consistently encoded strings:

require 'openssl'

# Workaround: label everything read from the socket as UTF-8.
class SSLSocketUtf8 < OpenSSL::SSL::SSLSocket
  def sysread(*args)
    super.force_encoding(Encoding::UTF_8)
  end
end
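
For reference, here is a small sketch of what force_encoding does to the bytes. It only changes the encoding label and never transcodes, so it is safe only when the data really is UTF-8. The byte strings are made-up samples, and the truncated case assumes that a raw read can end in the middle of a multibyte character.

```ruby
raw = "caf\xC3\xA9".b                          # bytes as a raw read might return them
relabeled = raw.dup.force_encoding(Encoding::UTF_8)

p raw.encoding == Encoding::ASCII_8BIT         # => true
p relabeled.encoding == Encoding::UTF_8        # => true
p relabeled.bytes == raw.bytes                 # => true (bytes are untouched)
p relabeled.valid_encoding?                    # => true

# A chunk cut in the middle of a character is relabeled too, but is invalid.
truncated = "caf\xC3".b.force_encoding(Encoding::UTF_8)
p truncated.valid_encoding?                    # => false
```

This is why relabeling whole lines is less risky than relabeling arbitrary chunks.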


<whitehat101> Hello, I'm investigating some strange behavior with OpenSSL::SSL::SSLSocket and string encodings
<whitehat101> #readline returns UTF-8 encoded strings, until the string actually contains UTF-8, then it claims that the encoding is ASCII-8BIT
<whitehat101> I've been reading through the source, and I'm not sure where to try to patch it
<drbrain> whitehat101: have an example script?
<drbrain> whitehat101: can you reproduce it with #sysread?
<drbrain> if you can, the problem lies in the C code
<drbrain> if you cannot, the problem lies in the OpenSSL::Buffering module
<whitehat101> I don't have a concise example, I'm working with the xmpp4r project
<drbrain> whitehat101: look at sample/openssl/echo_*
<drbrain> you can probably make a simple example out of that
<whitehat101> I found that #sysread always returns 8BIT, but #readline usually gives UTF-8
<whitehat101> Thank you, I'll look at those
<drbrain> whitehat101: then I imagine the problem is that OpenSSL::Buffering#initialize creates a UTF-8 buffer
<drbrain> (@rbuffer)
<drbrain> I bet that # encoding: ASCII-8BIT at the very top of the file will fix it
<whitehat101> in buffering.rb?
<drbrain> in ext/openssl/lib/openssl/buffering.rb
<whitehat101> My feeling is that these functions should be returning UTF-8
<whitehat101> A patch that works for my project:
    class SSLSocketUtf8 < OpenSSL::SSL::SSLSocket
      def sysread *args
        super.force_encoding ::Encoding::UTF_8
      end
    end
<drbrain> hrm
<drbrain> they should be returning the encoding of the SSLSocket
<whitehat101> It doesn't look like SSLSocket has any support for encodings
<whitehat101> I tried setting the encoding of the TCPSocket, but it had no effect
<drbrain> since SSLSocket wraps the TCPSocket, I don't know if that has an effect on SSLSocket#sysread
<whitehat101> I'm guessing that SSLSocket has no idea what the encoding is, it just deals with bytes
<whitehat101> We're passing the SSLSocket directly to REXML::Parsers::SAX2Parser
<whitehat101> and REXML throws exceptions when the input is not UTF-8
<drbrain> possibly, since it isn't an IO subclass and doesn't seem to respond to #set_encoding
<drbrain> setting the encoding on the TCPSocket probably has no effect because SSLSocket needs to read binary data off the TCPSocket
<drbrain> the ultimate solution would be "make SSLSocket support encodings"
<whitehat101> That sounds right to me
<drbrain> a short-term fix would be "make the SSLSocket methods return a consistent encoding, regardless of correctness"
<drbrain> whitehat101: if you file a bug, maybe I'll find the time to fix it for ruby 2.1
<drbrain> you can file one here: http://bugs.ruby-lang.org/projects/ruby-trunk/issues/new
<whitehat101> That would be excellent, thanks
<whitehat101> Should I try to make an example, or just include this conversation?
<drbrain> this conversation is enough
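
A rough sketch of the short-term idea from the conversation (return a consistent encoding regardless of correctness), using StringIO in place of a real SSLSocket so that it runs without a network. The Utf8Lines module is a hypothetical helper, not part of OpenSSL, and it assumes the peer always sends UTF-8.

```ruby
require "stringio"

# Hypothetical helper: relabel every line returned by #readline as UTF-8.
module Utf8Lines
  def readline(*args)
    super.force_encoding(Encoding::UTF_8)
  end
end

bytes = "caf\xC3\xA9\n".b   # binary data standing in for socket input

# Without the helper, the line keeps the encoding of the underlying data.
plain = StringIO.new(bytes.dup).readline
p plain.encoding == Encoding::ASCII_8BIT   # => true

# With the helper mixed in, the line is consistently labeled UTF-8.
io = StringIO.new(bytes.dup).extend(Utf8Lines)
line = io.readline
p line.encoding == Encoding::UTF_8         # => true
```

Extending a single object keeps the change local to the sockets that are known to carry UTF-8 text.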



-- 
http://bugs.ruby-lang.org/