On Sunday, October 23, 2011 at 9:12 AM, Perry Smith wrote:
> 
> On Oct 22, 2011, at 9:16 PM, Joshua Ballanco wrote:
> > On Saturday, October 22, 2011 at 12:43 PM, Jon wrote:
> > >  
> > > >  What Ruby needs (IMHO), is the equivalent of Obj-C's NSData class. That is, something which can hold a contiguous span of raw bytes without encoding, but with the ability to access ranges and iterate over the data like a String. I regret that I did not recall this desire of mine for the original Ruby 2.0 feature list (I originally encountered the need for this when writing the ControlTower server for MacRuby; which, consequently, does make use of NSData). I would, however, like to propose such a class for Ruby 2.0.
> > >
> > > What's your view regarding both the `bytes` (immutable) and `bytearray` (mutable) abstractions from
> > >  
> > >   http://docs.python.org/py3k/library/functions.html#bytearray
> >
> > Yes, this sounds like a very similar idea (NSData is immutable and has an NSMutableData counterpart). I think the intro for the NSData documentation captures the motivation perfectly:
> > 
> > > NSData and its mutable subclass NSMutableData provide data objects, object-oriented wrappers for byte buffers. Data objects let simple allocated buffers (that is, data with no embedded pointers) take on the behavior of Foundation objects.
> > 
> > Basically, since the Array class in Ruby is designed to hold objects, there is an annoying amount of overhead required to use Ruby arrays to hold simple bytes (e.g. you have to manually decompose bytes on each append operation). On the other hand, since Ruby does its best to always do the right thing with encodings for String objects, it can get annoying to try and use Ruby strings to hold bytes (you never know when your BINARY string might be coerced into UTF-8).  
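(To make those two annoyances concrete, here is a quick sketch in current Ruby; the `buf` and `bin` names are mine:)

```ruby
# Pain point 1: Array holds objects, so each append has to decompose
# a value into integer bytes by hand.
buf = []
buf.concat("abc".bytes)               # expand the string into byte values
raise unless buf == [97, 98, 99]
raise unless buf.pack("C*") == "abc"  # ...and pack to get a byte string back

# Pain point 2: a BINARY-tagged String can silently lose its tag when
# combined with a string in another encoding.
bin = "abc".force_encoding("BINARY")
raise unless ("é" + bin).encoding == Encoding::UTF_8  # the BINARY tag is gone
```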
> I like and agree with this concept but I wonder if we are talking about two (subtly) different things.

We are. I apologize for derailing the thread a bit. I'll write this up as a formal feature request and we can take the discussion there.
 
> There really is a problem if I take a string encoded with UTF-8 and try to concatenate it with a string encoded with 8859-1 (or one of the more exotic character sets).  What I have never understood (and the Ruby people have tried to educate me) is why, when I say "utf-8-string" + "8859-1-string", Ruby can't just convert the latter to the encoding of the first, do the concatenation and be done with it.  So, there is a second problem.
> 
> And there is a third problem (which is probably a set of problems).  In my application, all the data actually starts off as various EBCDIC code pages. (http://bit.ly/rtTO8F).  Using ICU (http://site.icu-project.org/), I convert these to UTF-8 strings.  I store these in a PostgreSQL database (9.0.4) that is set up with UTF-8 encoding.  But STILL, frequently, something creates strings that are not UTF-8 strings.  As previously stated, I've set all my files to UTF-8 coding as well as set -KU but there are still ways for things to get botched.  And my whole point here is that what Ruby has ended up doing is making simple libraries damn near impossible to write if you really really really really want to do things properly.  Any library that concatenates strings is open to mistakes.

So, to make up for the earlier derailing, let me see if I can help with the problem you face. Let me state it a bit differently. We can organize encodings into two different kinds of hierarchies: by how many code points they encode, and by which byte sequences they accept as valid.

For example, ASCII encodes 128 code points and recognizes 128 bytes as valid for encoding. UTF-8 encodes 1,112,064 code points, but only recognizes 243 of the 256 possible byte values as valid for encoding. The ISO-8859 family of encodings each encode only 256 code points but recognize the full range of 256 bytes as valid for encoding. Ideally, when combining strings, the default should be to always move up the hierarchy of code points. That is, combining an ASCII string with an ISO-8859 string should result in an ISO-8859 string, and combining an ISO-8859 string with a UTF-8 string should result in a UTF-8 string.
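In current Ruby, the first combination already moves up the hierarchy, but the second raises instead (a sketch of MRI's compatibility behavior):

```ruby
ascii = "abc".encode("US-ASCII")
latin = "caf\xE9".force_encoding("ISO-8859-1")  # "café" in Latin-1
utf8  = "\u00E9t\u00E9"                          # "été" in UTF-8

# ASCII + ISO-8859-1: the ASCII operand is ASCII-only, so Ruby picks
# the richer encoding for the result -- the "move up" case.
raise unless (latin + ascii).encoding == Encoding::ISO_8859_1

# ISO-8859-1 + UTF-8: both operands contain non-ASCII bytes, so Ruby
# raises instead of converting up the code-point hierarchy.
begin
  latin + utf8
  raise "expected an error"
rescue Encoding::CompatibilityError
  # this is the default behavior described above
end
```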

Practically, however, if moving up the code point hierarchy involves moving *down* the byte hierarchy, then there will be conversion overhead involved. Since this changes the run-time characteristics of string operations, it seems reasonable that Ruby does not perform these conversions by default and instead tasks the programmer with checking the encoding and using something like iconv to perform the conversion. That said, I think it would be nice to be able to put String objects into an "auto iconv" mode where combining a UTF-8 and an ISO-8859 string would not be an error, but would instead invoke the encoding conversion under the covers (run-time performance be damned!).
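A rough sketch of what such a mode could look like, as a helper method (`concat_converting` is a name I made up; it just retries with Ruby's existing `String#encode` when the encodings clash):

```ruby
# Hypothetical "auto iconv" concatenation: rescue the compatibility
# error and retry after an explicit conversion.
def concat_converting(a, b)
  a + b
rescue Encoding::CompatibilityError
  a + b.encode(a.encoding)  # convert b up to a's encoding, then retry
end

latin  = "caf\xE9".force_encoding("ISO-8859-1")  # "café" in Latin-1
result = concat_converting("tr\u00E8s ", latin)  # "très " is UTF-8

raise unless result == "tr\u00E8s caf\u00E9"     # "très café"
raise unless result.encoding == Encoding::UTF_8
```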

I should also add that in thinking about the "dual hierarchies" of encodings this way, the ASCII-8BIT/BINARY encoding should (IMHO) be at the top of both hierarchies, and Ruby should never move down *either* hierarchy by default as a result of a concatenation. Unfortunately, this is not the case, making the BINARY encoding more-or-less useless (bringing me back to the tangent I got caught up on before).
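To make that complaint concrete (again a sketch of current behavior):

```ruby
binary = "\xDE\xAD\xBE\xEF".force_encoding("BINARY")  # arbitrary raw bytes

# Appending ASCII-only data keeps the BINARY tag, as you'd hope...
raise unless (binary + "ok").encoding == Encoding::ASCII_8BIT

# ...but BINARY does not sit at the top of either hierarchy: mixing it
# with a non-ASCII UTF-8 string is an error rather than a byte-wise append.
failed = begin
  binary + "r\u00E9sum\u00E9"
  false
rescue Encoding::CompatibilityError
  true
end
raise unless failed
```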