On Saturday 17 June 2006 16:58, gwtmp01 / mac.com wrote:
> On Jun 17, 2006, at 9:50 AM, Stefan Lang wrote:
> > On Saturday 17 June 2006 13:08, Juergen Strobel wrote:
> >> 2. Strings should neither have an internal encoding tag, nor an
> >> external one via $KCODE. The internal encoding should be
> >> encapsulated by the string class completely, except for a few
> >> related classes which may opt to work with the gory details for
> >> performance reasons. The internal encoding has to be decided,
> >> probably between UTF-8, UTF-16, and UTF-32 by the String class
> >> implementor.
> >
> > Full ACK. Ruby programs shouldn't need to care about the
> > *internal* string encoding. External string data is treated as
> > a sequence of bytes and is converted to Ruby strings through
> > an encoding API.
>
> I don't claim to be a Unicode expert but shouldn't the goal be to
> have Ruby work with *any* text encoding on a per-string basis?  Why
> would you want to force all strings into Unicode for example in a
> context where you aren't using Unicode?  (The internal encoding has
> to be....).  And of course even in the Unicode world you have
> several different encodings (UTF-8, UTF-16, and so on).  Juergen,
> when you say 'internal encoding' are you talking about the text
> encoding of Ruby source code?

I'm not Juergen, but since you responded to my message...

First of all, Unicode is a character set, while UTF-8, UTF-16
etc. are encodings; that is, they specify how a Unicode
character is represented as a sequence of bytes.
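
To make that concrete with today's Ruby (Array#pack's "U"
directive produces UTF-8; the UTF-16 bytes are simply written
out by hand):

  # U+00E9, LATIN SMALL LETTER E WITH ACUTE: one Unicode
  # character, two different encoded byte sequences.
  utf8  = [0xE9].pack("U")   # => "\303\251"  (two bytes: C3 A9)
  utf16 = "\000\351"         # UTF-16BE by hand (two bytes: 00 E9)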

At least *I* am not talking about the encoding of Ruby source
code. The main point of the proposal is to use a single
universal character encoding for all Ruby character strings
(instances of the String class). Assuming there is an ideal
character set that is sufficient to represent any text in
this world, it could be used to construct a String class that
abstracts the underlying representation away completely.
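
A rough sketch of what I have in mind (String.decode,
String#encode and the Encoding constants are illustrative
names only, not a concrete API proposal):

  # Hypothetical API: conversion happens only at the boundaries.
  s = String.decode(raw_bytes, Encoding::ISO_8859_1) # bytes -> String
  s.length       # number of characters, whatever the internal
                 # representation happens to be
  s.reverse      # character operations never expose bytes
  s.encode(Encoding::UTF_8)   # String -> bytes, only for IO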

Consider the "float" data type you will find in most
programming languages: the programmer doesn't think in terms
of the bits that represent a floating point value, he just
uses the operators provided for floats. He can choose between
different serialization strategies if he needs to serialize
floats, but the *operators* on floats the language provides
don't care about those serialization formats; they all work
on the same internal representation. Conversion is done on
IO. Ideally, the same level of abstraction should exist for
character data.
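
In Ruby terms, with pack directives playing the role of the
serialization strategies:

  x = 1.5
  x + 2.25         # arithmetic never exposes the IEEE 754 bit layout
  [x].pack("G")    # serialize as a big-endian double for IO
  [x].pack("E")    # or as a little-endian double -- a different
                   # serialization format, same value, same operators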

If you have a universal character set (Unicode is an attempt
at this) and an encoding for it, the programming language can
abstract the underlying String representation away. For IO, it
provides methods (e.g. through Encoding objects) that
serialize Strings to a stream of bytes and back.
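
For example (purely illustrative -- an Encoding argument to
File.open doesn't exist today):

  text = nil
  File.open("in.txt", "r", Encoding::ISO_8859_1) do |f|
    text = f.gets    # already a String; internal encoding stays hidden
  end
  File.open("out.txt", "w", Encoding::UTF_16) do |f|
    f.write(text)    # serialized to UTF-16 only on the way out
  end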

> It seems to me that irrespective of any particular text encoding
> scheme you need clean support of a simple byte vector data
> structure completely unencumbered with any notion of text encoding
> or locale.

I have proposed that further below as Buffer or ByteString.

> Right now that is done by the String class, whose name I
> think certainly creates much confusion.  If the class had been
> called Vector and then had methods like:
>
> 	Vector#size		# size in bytes
> 	Vector#str_size 	# size in characters (encoding and locale
> considered)

By providing str_size you are already mixing up the purposes
of your simple byte vector and of character strings.
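
Keeping the two concepts separate would rather look like this
(ByteString and Encoding are hypothetical names, see above):

  buf = ByteString.new(raw_data)  # just bytes, no encoding, no locale
  buf.size                        # size in bytes -- the only size it has

  str = buf.decode(Encoding::UTF_8)  # explicit step from bytes to text
  str.length                         # size in characters -- the only
                                     # size a String needs to expose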

-- 
Stefan