--4ea229a6_759f82cd_1124
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

To try and cut to the core of the issue: in Ruby 1.8 it was common practice to use the String class to represent both "proper strings" as well as a "bag-o-bytes". In Ruby 1.9, you can only properly use the String class to represent "proper strings". For a "bag-o-bytes" we're left with Array, but there are times when Array is not the right abstraction (e.g. reading data from a socket, identifying a start and stop token, and writing the bytes between to a file on disk). Also, the "BINARY" encoding is not the right abstraction, because you still have an object which will worry about encodings and, due to Ruby always trying to do "the right thing", bugs can be very difficult to track down. Consider:

    > a = "test".force_encoding('BINARY')
    > b = "\xFF".force_encoding('BINARY')
    > a << "test"
    "testtest"
    > b << "test"
    "\xFFtest"
    > a << "tést"
    "testtesttést"
    > b << "tést"
    Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
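
A rough way to see why `a` and `b` behave differently is to ask Ruby directly whether each pair of strings is compatible. This is only a sketch: it assumes a UTF-8 source file, and it uses `dup` so the literals themselves are left alone.

```ruby
# encoding: utf-8

a = "test".dup.force_encoding('BINARY')
b = "\xFF".dup.force_encoding('BINARY')

# An ASCII-only binary string is treated as compatible with UTF-8 ...
p a.ascii_only?                      # => true
p Encoding.compatible?(a, "tést")    # => #<Encoding:UTF-8>

# ... but one containing a high byte is not, so << raises.
p b.ascii_only?                      # => false
p Encoding.compatible?(b, "tést")    # => nil
```

So whether the append works depends on the bytes that happen to be in the buffer at that moment, not only on the encoding you tagged it with.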

What Ruby needs (IMHO) is the equivalent of Obj-C's NSData class. That is, something which can hold a contiguous span of raw bytes without encoding, but with the ability to access ranges and iterate over the data like a String. I regret that I did not recall this desire of mine for the original Ruby 2.0 feature list (I originally encountered the need for this when writing the ControlTower server for MacRuby; which, consequently, does make use of NSData). I would, however, like to propose such a class for Ruby 2.0.
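
To make the idea a little more concrete, here is a minimal sketch of what such a class might look like in plain Ruby. The name `ByteBuffer` and its methods are hypothetical (nothing like this exists in core), and it cheats by wrapping a BINARY String internally; a real implementation would presumably live in C. It also assumes `String#byteslice`, which needs 1.9.3 or later.

```ruby
# encoding: utf-8

# Hypothetical sketch only: ByteBuffer is NOT an existing Ruby class.
class ByteBuffer
  include Enumerable

  def initialize(data = "")
    # Copy the input and tag it ASCII-8BIT so no transcoding is attempted.
    @data = data.dup.force_encoding(Encoding::BINARY)
  end

  # Append the raw bytes of any string, ignoring the encoding it claims.
  def <<(other)
    @data << other.dup.force_encoding(Encoding::BINARY)
    self
  end

  def bytesize
    @data.bytesize
  end

  # Return a new ByteBuffer with `length` bytes starting at byte `offset`
  # (empty if the offset is out of range).
  def slice(offset, length)
    ByteBuffer.new(@data.byteslice(offset, length) || "")
  end

  # Iterate byte by byte, yielding Integers.
  def each(&block)
    @data.each_byte(&block)
  end

  # Hand back a copy of the bytes as a BINARY String.
  def to_s
    @data.dup
  end
end

buf = ByteBuffer.new("\xFF")
buf << "tést"        # no CompatibilityError: bytes are appended as-is
p buf.bytesize       # => 6
```

The point of the wrapper is that every append goes through the same path regardless of what the incoming string says about itself, so the result no longer depends on whether the buffer happens to be ASCII-only.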


On Friday, October 21, 2011 at 9:45 PM, Eric Hodel wrote:

> On Oct 21, 2011, at 9:43 AM, Perry Smith wrote:
> > Rails, thin, rvm, almost nothing is really and truly ruby 1.9 compliant -- not really. Not when you include all the encoding problems that are still very common in my life and I assume in the life of anyone trying to use Ruby in any serious fashion.
>  
> In order to generate correct documentation RDoc may need to transcode your source files into the output encoding you desire.
>  
> > Is there a compile time option (or can one be added) that says "I don't care!!! -- just cram the two strings together and eat your spinach!"
>  
> Mashing strings of different encodings together destroys data, but Ruby will automatically handle compatible encodings. You can concatenate a UTF-8 string with a US-ASCII string, for example.
>  
> > Because, ultimately, I've yet to find anything except ruby that actually cares.
>  
> I care, because I hate to see text like 'This â€œencodingâ€ stuff is ' on a website. (Yeah, I know fancy-quotes are an equal abomination, but it's not that hard to be aware of encodings, is it?)
>  
> On Oct 21, 2011, at 5:52 PM, Perry Smith wrote:
> > Just as good of an alternative would be to change my default to UTF-8 instead of US-ASCII.
>  
> This will not fix your problem, nor will -KU fix your problem. They'll only mask your problem.
>  
> The correct solution is to add the encoding magic comment to files that matches the expected encoding of the strings they create. Blindly forcing all strings to UTF-8 will break libraries that depend on their strings being in US-ASCII encoding.
>  
> See:
>  
> https://github.com/rdoc/rdoc/commit/ca7651a8b9e6ef32dfa56f4ca618d9cff6ba8b74
>  
> https://github.com/rdoc/rdoc/issues/63
>  
> You will need to send patches to the library maintainers to mark their required encodings correctly, or file tickets.
>  
> > My first attempt to solve this was to put a UTF-8 coding into all my ruby files. This appeared to help but upon reflection, I don't think it really did. Adding the -KU in scripts like thin's startup script helps more. In fact, I think it solves 99.99% of my problems. But when I update thin (for example) I forget to add the -KU to the script and hit errors until I add the -KU back.
>  
> Let's get concrete.
>  
> Show us an error you get when running thin without any modification and I can help you and the maintainer of thin (or whatever other library) find the appropriate changes to make for it to work correctly.
>  
> Through our combined efforts at a concrete task we may even be able to make it easier for authors to avoid such a pitfall.
>  
> > One recent saga involved memcache-client (which I've mentioned). memcache-client tries to concatenate a command, key, and Marshall'ed data. If the Marshalled data really is ASCII-8BIT, then the concatenation dies.
>  
> Marshal data is always ASCII-8BIT. If memcache-client doesn't set the encoding to US-ASCII (compatible with ASCII-8BIT) then -KU will break it:
>  
> $ ruby19 -e 'a = "text"; b = "\xFF"; a.force_encoding Encoding::US_ASCII; b.force_encoding Encoding::BINARY; p (a + b).encoding'
> #<Encoding:ASCII-8BIT>


--4ea229a6_759f82cd_1124--