On Saturday 17 June 2006 16:58, gwtmp01 / mac.com wrote:
> On Jun 17, 2006, at 9:50 AM, Stefan Lang wrote:
> > On Saturday 17 June 2006 13:08, Juergen Strobel wrote:
> >> 2. Strings should neither have an internal encoding tag, nor an
> >> external one via $KCODE. The internal encoding should be
> >> encapsulated by the string class completely, except for a few
> >> related classes which may opt to work with the gory details for
> >> performance reasons. The internal encoding has to be decided,
> >> probably between UTF-8, UTF-16, and UTF-32 by the String class
> >> implementor.
> >
> > Full ACK. Ruby programs shouldn't need to care about the
> > *internal* string encoding. External string data is treated as
> > a sequence of bytes and is converted to Ruby strings through
> > an encoding API.
>
> I don't claim to be a Unicode expert but shouldn't the goal be to
> have Ruby work with *any* text encoding on a per-string basis? Why
> would you want to force all strings into Unicode for example in a
> context where you aren't using Unicode? (The internal encoding has
> to be....). And of course even in the Unicode world you have
> several different encodings (UTF-8, UTF-16, and so on). Juergen,
> when you say 'internal encoding' are you talking about the text
> encoding of Ruby source code?

I'm not Juergen, but since you responded to my message...

First of all, Unicode is a character set, while UTF-8, UTF-16 etc.
are encodings; that is, they specify how a Unicode character is
represented as a series of bits.

At least *I* am not talking about the encoding of Ruby source
code. The main point of the proposal is to use a single universal
character encoding for all Ruby character strings (instances of
the String class). Assuming there is an ideal character set that
is really sufficient to represent any text in this world, it could
be used to construct a String class that abstracts the underlying
representation completely away.
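
To make the character set / encoding distinction concrete, here is a
minimal sketch. It assumes a later Ruby (1.9 or newer) that has
String#encode, String#bytes and String#ord; these methods postdate
this thread, and the variable names are made up for illustration only.
It takes one Unicode character and shows that its code point stays the
same while the bytes differ per encoding.

```ruby
# Assumption: Ruby 1.9 or later (String#encode, #bytes, #ord).
# One Unicode character: U+20AC EURO SIGN.
euro = "\u20AC"

# The same character serialized with three different encodings.
utf8  = euro.encode("UTF-8")
utf16 = euro.encode("UTF-16BE")
utf32 = euro.encode("UTF-32BE")

# The character (code point) is identical in every case ...
codepoints = [utf8, utf16, utf32].map { |s| s.ord }

# ... but each encoding represents it with different bytes.
utf8_bytes  = utf8.bytes.to_a
utf16_bytes = utf16.bytes.to_a
utf32_bytes = utf32.bytes.to_a

hex = lambda { |bytes| bytes.map { |b| format("%02X", b) }.join(" ") }

puts "code points: " + codepoints.map { |cp| format("U+%04X", cp) }.join(", ")
puts "UTF-8:    " + hex.call(utf8_bytes)
puts "UTF-16BE: " + hex.call(utf16_bytes)
puts "UTF-32BE: " + hex.call(utf32_bytes)
```

A program that only ever asks for the character never needs to know
which of the three byte sequences is stored; that is the abstraction
the proposal is after.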

Consider the "float" data type you will find in most programming
languages: The programmer doesn't think in terms of the bits that
represent a floating point value. He just uses the operators
provided for floats. He can choose between different serialization
strategies if he needs to serialize floats. But the *operators*
on floats that the programming language provides don't care about
the different serialization formats; they all work on the same
internal representation. Conversion is done on IO.

Ideally, the same level of abstraction should be there for
character data. If you have a universal character set (Unicode
is an attempt at this), and an encoding for it, the programming
language can abstract the underlying String representation away.
For IO, it provides methods (e.g. through Encoding objects)
that serialize Strings to a stream of bytes and vice versa.

> It seems to me that irrespective of any particular text encoding
> scheme you need clean support of a simple byte vector data
> structure completely unencumbered with any notion of text encoding
> or locale.

I have proposed that further below as Buffer or ByteString.

> Right now that is done by the String class, whose name I
> think certainly creates much confusion. If the class had been
> called Vector and then had methods like:
>
>   Vector#size      # size in bytes
>   Vector#str_size  # size in characters (encoding and locale
>                    # considered)

By providing str_size you are already mixing up the purpose
of your simple byte vector and character strings.

--
Stefan