To try and cut to the core of the issue: in Ruby 1.8 it was common practice to use the String class to represent both "proper strings" as well as a "bag-o-bytes". In Ruby 1.9, you can only properly use the String class to represent "proper strings". For a "bag-o-bytes" we're left with Array, but there are times when Array is not the right abstraction (e.g. reading data from a socket, identifying a start and stop token, and writing the bytes between to a file on disk; a sketch of that scenario follows the IRB session below). Also, the "BINARY" encoding is not the right abstraction, because you still have an object which will worry about encodings and, due to Ruby always trying to do "the right thing", bugs can be very difficult to track down. Consider:

    > a = "test".force_encoding('BINARY')
    > b = "\xFF".force_encoding('BINARY')
    > a << "test"
    "testtest"
    > b << "test"
    "\xFFtest"
    > a << "tst"
    "testtst"
    > b << "tst"
    Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
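
For concreteness, here is roughly what the socket scenario mentioned above looks like today, with a BINARY-encoded String standing in for the byte buffer (the host, port, token values, and file name are all made up):

    require 'socket'

    START_TOKEN = "\x02".force_encoding('BINARY')  # assumed 1-byte delimiters
    STOP_TOKEN  = "\x03".force_encoding('BINARY')

    socket = TCPSocket.new('example.com', 9000)    # hypothetical endpoint
    buffer = ''.force_encoding('BINARY')           # the "bag-o-bytes"

    begin
      loop do
        # Sockets hand back ASCII-8BIT already, but forcing BINARY on
        # everything that touches the buffer is the only way to be sure
        # << never raises Encoding::CompatibilityError.
        buffer << socket.readpartial(4096).force_encoding('BINARY')
        start = buffer.index(START_TOKEN)
        stop  = start && buffer.index(STOP_TOKEN, start + 1)
        next unless start && stop
        File.open('payload.bin', 'wb') { |f| f.write(buffer[(start + 1)...stop]) }
        break
      end
    rescue EOFError
      # connection closed before both tokens arrived
    ensure
      socket.close
    end

It works, but only for as long as every String that touches the buffer stays in ASCII-8BIT; one forgotten force_encoding and the append can blow up exactly as shown above.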

What Ruby needs (IMHO) is the equivalent of Obj-C's NSData class. That is, something which can hold a contiguous span of raw bytes without an encoding, but with the ability to access ranges and iterate over the data like a String. I regret that I did not recall this desire of mine in time for the original Ruby 2.0 feature list (I originally encountered the need for this when writing the ControlTower server for MacRuby, which, consequently, does make use of NSData). I would, however, like to propose such a class for Ruby 2.0.
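
To make that concrete, here is a very rough sketch of the sort of interface I mean, loosely modeled on NSData; the name ByteBuffer and every method on it are purely illustrative, not a concrete API proposal:

    class ByteBuffer
      def initialize(string = '')
        # Raw bytes only; no encoding is ever attached or consulted.
        @bytes = string.to_s.dup.force_encoding('BINARY')
      end

      def size
        @bytes.bytesize
      end

      def <<(other)
        # Byte-wise append; cannot raise Encoding::CompatibilityError.
        @bytes << other.to_s.dup.force_encoding('BINARY')
        self
      end

      def [](range)
        # In BINARY, String#[] indexes bytes, so ranges are byte ranges.
        ByteBuffer.new(@bytes[range] || '')
      end

      def each_byte(&block)
        @bytes.each_byte(&block)
      end

      def to_s
        # Converting back to a String is an explicit, deliberate step.
        @bytes.dup
      end
    end

With something along those lines, the failing case from the IRB session above becomes a non-event: ByteBuffer.new("\xFF") << "tëst" simply appends the five bytes of the UTF-8 representation of "tëst" and never asks what they mean.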


On Friday, October 21, 2011 at 9:45 PM, Eric Hodel wrote:

> On Oct 21, 2011, at 9:43 AM, Perry Smith wrote:
> > Rails, thin, rvm, almost nothing is really and truly ruby 1.9 compliant -- not really. Not when you include all the encoding problems that are still very common in my life and I assume in the life of anyone trying to use Ruby in any serious fashion.
>  
> In order to generate correct documentation RDoc may need to transcode your source files into the output encoding you desire.
>  
> > Is there a compile time option (or can one be added) that says "I don't care!!! -- just cram the two strings together and eat your spinach!"
>  
> Mashing strings of different encodings together destroys data, but Ruby will automatically handle compatible encodings. You can concatenate a UTF-8 string with a US-ASCII string, for example.
>  
> > Because, ultimately, I've yet to find anything except ruby that actually cares.
>  
> I care, because I hate to see text like 'This encoding stuff is กฤ' on a website. (Yeah, I know fancy-quotes are an equal abomination, but it's not that hard to be aware of encodings, is it?)
>  
> On Oct 21, 2011, at 5:52 PM, Perry Smith wrote:
> > Just as good of an alternative would be to change my default to UTF-8 instead of US-ASCII.
>  
> This will not fix your problem, nor will -KU fix your problem. They'll only mask your problem.
>  
> The correct solution is to add the encoding magic comment to files that matches the expected encoding of the strings they create. Blindly forcing all strings to UTF-8 will break libraries that depend on their strings being in US-ASCII encoding.
>  
> See:
>  
> https://github.com/rdoc/rdoc/commit/ca7651a8b9e6ef32dfa56f4ca618d9cff6ba8b74
>  
> https://github.com/rdoc/rdoc/issues/63
>  
> You will need to send patches to the library maintainers to mark their required encodings correctly, or file tickets.
>  
> > My first attempt to solve this was to put a UTF-8 coding into all my ruby files. This appeared to help but upon reflection, I don't think it really did. Adding the -KU in scripts like thin's startup script helps more. In fact, I think it solves 99.99% of my problems. But when I update thin (for example) I forget to add the -KU to the script and hit errors until I add the -KU back.
>  
> Let's get concrete.
>  
> Show us an error you get when running thin without any modification and I can help you and the maintainer of thin (or whatever other library) find the appropriate changes to make for it to work correctly.
>  
> Through our combined efforts at a concrete task we may even be able to make it easier for authors to avoid such a pitfall.
>  
> > One recent saga involved memcache-client (which I've mentioned). memcache-client tries to concatenate a command, key, and Marshalled data. If the Marshalled data really is ASCII-8BIT, then the concatenation dies.
>  
> Marshal data is always ASCII-8BIT. If memcache-client doesn't set the encoding to US-ASCII (compatible with ASCII-8BIT) then -KU will break it:
>  
> $ ruby19 -e 'a = "text"; b = "\xFF"; a.force_encoding Encoding::US_ASCII; b.force_encoding Encoding::BINARY; p (a + b).encoding'
> #<Encoding:ASCII-8BIT>