--xo44VMWPx7vlQ2+2
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Jun 17, 2006 at 10:52:24PM +0900, Austin Ziegler wrote:
> On 6/17/06, Juergen Strobel <strobel / secure.at> wrote:
> >I emphatically agree. I'll even repeat and propose a new Plan for
> >Unicode Strings in Ruby 2.0 in 10 points:
> >
> >1. Strings should deal in characters (code points in Unicode) and not
> >in bytes, and the public interface should reflect this.
> 
> Agree, mostly. Strings should have a way to indicate the buffer size of
> the String.
> 
> >2. Strings should neither have an internal encoding tag, nor an
> >external one via $KCODE. The internal encoding should be encapsulated
> >by the string class completely, except for a few related classes which
> >may opt to work with the gory details for performance reasons.
> >The internal encoding has to be decided, probably between UTF-8,
> >UTF-16, and UTF-32 by the String class implementor.
> 
> Completely disagree. Matz has the right choice on this one. You can't
> think in just terms of a pure Ruby implementation -- you *must* think
> in terms of the Ruby/C interface for extensions as well.

I admit I don't know about Ruby's C extensions. Are they unable to
access String's methods? That is all that is needed to work with them.

And since this String class does not have a parametric encoding
attribute, it should be even easier to crunch in C.

> >3. Whenever Strings are read or written to/from an external source,
> >their data needs to be converted. The String class encapsulates the
> >encoding framework, likely with additional helper Modules or Classes
> >per external encoding. Some methods take an optional encoding
> >parameter, like #char(index, encoding=:utf8), or
> >#to_ary(encoding=:utf8), which can be used as helper Class or Module
> >selector.
> 
> Conversion should be possible at any time. An "external source" may be
> an extension that your Ruby program can't distinguish. Again, this point
> fails because your #2 is unacceptable.

Note that explicit conversion to characters, arrays, etc., is possible
for any supported character set and encoding. I have even given method
examples. "External" is to be seen in the context of the String class.
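
To make the method examples from point 3 concrete, here is a minimal
sketch. It assumes a Ruby whose built-in String offers #encode,
#force_encoding and #codepoints (an assumption, not part of my
proposal). UString and .from_bytes are hypothetical names, and storing
an array of code points is only one possible internal representation.

```ruby
# Hypothetical sketch: a string that keeps code points internally and
# converts only when an external representation is requested.
class UString
  # Build from external bytes in a named encoding (assumed helper).
  def self.from_bytes(bytes, encoding = "UTF-8")
    new(bytes.dup.force_encoding(encoding).encode("UTF-8").codepoints)
  end

  def initialize(codepoints)
    @codepoints = codepoints # internal representation, never exposed
  end

  # Number of characters (code points), never bytes.
  def size
    @codepoints.size
  end

  # Bytes of the character at index, in the requested external encoding.
  def char(index, encoding = "UTF-8")
    [@codepoints.fetch(index)].pack("U").encode(encoding).bytes
  end

  # Bytes of the whole string in the requested external encoding.
  def to_ary(encoding = "UTF-8")
    @codepoints.pack("U*").encode(encoding).bytes
  end
end

s = UString.from_bytes("J\xFCrgen", "ISO-8859-1")
p s.size                    # 6
p s.char(1, "UTF-8")        # [195, 188]
p s.char(1, "ISO-8859-1")   # [252]
```

The caller names an encoding only at the boundary; #size and indexing
never depend on it.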

> >4. IO instances are associated with a (modifiable) encoding. For
> >stdin, stdout this can be derived from the locale settings. String-IO
> >operations work as expected.
> 
> Agree, realising that the internal implementation of String must be
> completely different than you've suggested. It is also important to
> retain *raw* reading; a JPEG should not be interpreted as Unicode.
> 
> >5. Since the String class is quite smart already, it can implement
> >generally useful and hard (in the domain of Unicode) operations like
> >case folding, sorting, comparing etc.
> 
> Agreed, but this would be expected regardless of the actual encoding of
> a String.

I am unaware of Matz's exact plan. Any good English-language links?

I was under the impression that users of Matz's String instances need to
look at the encoding tag to implement e.g. #version_sort. If that is
not the case, our proposals are not that much different, only Matz's one
is even more complex to implement than mine.

> 
> >6. More exotic operations can easily be provided by additional
> >libraries because of Ruby's open classes. Those operations may be
> >coded depending on String's public interface for simplicity, or
> >work with the internal representation directly for performance.
> 
> Agreed.
> 
> >7. This approach leaves open the possibility of String subclasses
> >implementing different internal encodings for performance/space
> >tradeoff reasons which work transparently together (a bit like Fixnum
> >and Bignum).
> 
> Um. Disagree. Matz's proposed approach does this; yours does not. Yours,
> in fact, makes things *much* harder.

If Matz's approach requires looking at the encoding tag from the
outside, it is not as transparent as mine. If it doesn't, it just boils
down to a parametric class versus subclass hierarchy design decision,
and I don't see much difference and would be happy with either one.

> 
> >8. Because Strings are tightly integrated into the language with the
> >source reader and are used pervasively, much of this cannot be
> >provided by add-on libraries, even with open classes. Therefore the
> >need to have it in Ruby's canonical String class. This will break some
> >old uses of String, but now is the right time for that.
> 
> "Now" isn't; Ruby 2.0 is. Maybe Ruby 1.9.1.

My original title, somewhere snipped out, was "A Plan for Unicode
Strings in Ruby 2.0". I don't want to rush things or break 1.8 either.

> 
> >9. The String class does not worry over character representation
> >on-screen, the mapping to glyphs must be done by UI frameworks or the
> >terminal attached to stdout.
> 
> The String class doesn't worry about that now.

I was just playing safe here.

> >10. Be flexible. <placeholder for future idea>
> 
> And little is more flexible than Matz's m17n String.

I had flexibility with respect to Unicode Standards in mind, so as not
to fall into traps similar to Java's. A simple to use String class,
powerful enough to include every character of the world was my goal,
with the ability to convert to and from other external (from the
String class's point of view) representations.

The flexibility to have parametric String encodings inside the String
class was not what I was going for; rather, I would have that
inaccessible or at least unnecessary to access for the common String
user, and I provided a somewhat weaker but maybe still sufficient
technique via subclassing.
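
A rough sketch of what I mean by subclassing, under my own
assumptions: AbstractUString, CompactUString and WideUString are
hypothetical names, and the only contract between them is a
#codepoints method. It relies on String#unpack("U*") to decode UTF-8.

```ruby
# Hypothetical sketch: two internal representations behind one interface.
class AbstractUString
  include Comparable

  # Length in characters, whatever the subclass stores internally.
  def size
    codepoints.size
  end

  # Concatenation works across subclasses; the representation chosen for
  # the result is an implementation detail the caller never inspects.
  def +(other)
    WideUString.new(codepoints + other.codepoints)
  end

  # Comparison by code point sequence, independent of representation.
  def <=>(other)
    codepoints <=> other.codepoints
  end
end

# Space-efficient variant: UTF-8 bytes internally.
class CompactUString < AbstractUString
  def initialize(utf8_bytes)
    @utf8 = utf8_bytes.dup.force_encoding("UTF-8")
  end

  def codepoints
    @utf8.unpack("U*")
  end
end

# Fast-indexing variant: one integer per character (UTF-32-like).
class WideUString < AbstractUString
  def initialize(codepoints)
    @cps = codepoints.dup
  end

  def codepoints
    @cps.dup
  end
end

a = CompactUString.new("abc")
b = WideUString.new([0x65E5, 0x672C])
c = a + b
p c.size                                   # 5
p a == WideUString.new([97, 98, 99])       # true
```

The user of these classes never asks which representation he holds.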

> >This approach has several advantages and a few disadvantages, and I'll
> >try to bring in some new angles to this now too:
> >
> >*Advantages*
> >
> >-POL, Encapsulation-
> >
> >All Strings behave exactly the same everywhere, are predictable,
> >and do the hard work for their users.
> 
> Remember: POLS is not an acceptable reason for anything. Matz's m17n
> Strings would be predictable, too. a + b would be possible if and only
> if a and b are the same encoding or one of them is "raw" (which would
> mean that the other is treated as the defined encoding) *or* there is a
> built-in conversion for them.

Since I probably cannot control which Strings I get from libraries,
and don't want to worry about which ones I'll have to provide to them, this
is weaker than my approach in this respect, see my next point.

> 
> >-Cross Library Transparency-
> >No String user needs to worry which Strings to pass to a library, or
> >worry which Strings he will get from a library. With Web-facing
> >libraries like rails returning encoding-tagged Strings, you would be
> >likely to get Strings of all possible encodings otherwise, and is the
> >String user prepared to deal with this properly?  This is a *big* deal
> >IMNSHO.
> 
> This will be true with m17n strings. However, your proposal does *not*
> work for Ruby/C interfaced items. Sorry.

Please elaborate on this or provide pointers. I cannot believe C cannot
crunch at my Strings, which are less parametric than Matz's ones are.

> 
> >-Limited Conversions-
> >
> >Encoding conversions are limited to the time Strings are created or
> >written or explicitly transformed to an external representation.
> 
> This is a mistake. I may need to know the internal representation of a
> particular encoding of a String inside of a program. Trust me on this
> one: I *have* done some low-level encoding work. Additionally, even
> though I might have marked a network object as "UTF-8", I may not know
> whether it's *actually* UTF-8 or not until I get HTTP headers -- or
> worse, a <meta http-equiv> tag. Assuming UTF-8 reading in today's world
> is doomed to failure.

Read it as binary, and decide later. These problems should be locally
containable, and methods are still able to return Strings after
determining the encoding.
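
As an illustration of "read as binary, decide later", here is a small
sketch. It assumes a Ruby whose String has #b, #force_encoding,
#valid_encoding? and #encode (an assumption on my part); decode_later
is a hypothetical helper name.

```ruby
# Hypothetical helper: keep the bytes uninterpreted until the real
# encoding is known, then convert exactly once.
def decode_later(raw_bytes, declared_encoding)
  text = raw_bytes.dup.force_encoding(declared_encoding)
  unless text.valid_encoding?
    raise ArgumentError, "bytes are not valid #{declared_encoding}"
  end
  text.encode("UTF-8")
end

body = "Gr\xFC\xDFe".b   # bytes off the wire, no interpretation yet

# Later, an HTTP header or meta tag says charset=iso-8859-1:
text = decode_later(body, "ISO-8859-1")
p text.size              # 5

# A wrong guess is detected locally instead of leaking into the program:
begin
  decode_later(body, "UTF-8")
rescue ArgumentError => e
  puts e.message
end
```

The uncertainty stays inside the one method that knows about headers;
everything downstream only ever sees a character String.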

> >-Correct String Operations-
> >Even basic String operations are very hard in the world of Unicode. If
> >we leave the String users to look at the encoding tags and sort it out
> >themselves, they are bound to make mistakes because they don't care,
> >don't know, or have no time. And these mistakes may be _security_
> >_sensitive_, since most often credentials are represented as Strings
> >too. There already have been exploits related to Unicode.
> 
> This is a misunderstanding on your part. Nothing about Matz's m17n
> Strings suggests that String users would have to look at the encoding
> tags. Merely that they *could*. I suspect that there will be pragma-
> like behaviours to enforce a particular internal representation at all
> times.

Previously you stated users need to look at the encoding to determine
if simple operations like a + b work.

Can you point to more info? I am interested in how this pragma stuff
works, and whether not doing it "right" can break things.

> >*Disadvantages* (with mitigating reasoning of course)
> >- String users need to learn that #byte_length(encoding=:utf8) >=
> >#size, but that's not too hard, and applies everywhere. Users do not
> >need to learn about an encoding tag, which is surely worse to handle
> >for them.
> 
> True, but the encoding tag is not worse. Anyone who assumes that
> developers can ignore encoding at any time simply *doesn't* know about
> the level of problems that can be encountered.

For String concatenation, substring access, search, etc., I expect to be
able to ignore encoding totally. Only when interfacing with
non-String-class objects (I/O and/or explicit conversion) would I need
encoding info.

> >- Strings cannot be used as simple byte buffers any more. Either use
> >an array of bytes, or an optimized ByteBuffer class. If you need
> >regular expression support, Regexp can be extended for ByteBuffers or
> >even more.
> 
> I see no reason for this.

In my proposal, Unicode Strings cannot represent arbitrary binary data
in their internal representation, since not everything would be valid
characters. In fact, you cannot set the internal representation
directly.

The interface could accept a code point sequence of values
(0..255), but that would be wasteful compared to an array of bytes.
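
To put a number on "wasteful", a quick sketch using Array#pack, under
the assumption that the internal representation is UTF-8 or UTF-32
(the proposal leaves that choice to the implementor).

```ruby
values = (0..255).to_a

as_bytes      = values.pack("C*")  # plain byte buffer
as_codepoints = values.pack("U*")  # same values as code points, UTF-8

p as_bytes.bytesize                           # 256
p as_codepoints.bytesize                      # 384 (128*1 + 128*2)
p as_codepoints.encode("UTF-32LE").bytesize   # 1024 (256*4)

# The values survive the round trip, they just cost more space.
p as_codepoints.unpack("U*") == values        # true
```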

> >- Some String operations may perform worse than might be expected from
> >a naive user, in both the time and space domains. But we do this so the
> >String user doesn't need to himself, and are probably better at it
> >than the user too.
> 
> This is a wash.

Only trying to refute weak arguments in advance.

> >- For very simple uses of String, there might be unnecessary
> >conversions. If a String is just to be passed through somewhere,
> >without inspecting or modifying it at all, in- and outwards conversion
> >will still take place. You could and should use a ByteBuffer to avoid
> >this.
> 
> This is a wash.

Not a big problem either, but someone was bound to bring it up.

> >- This ties Ruby's String to Unicode. A safe choice IMHO, or would we
> >really consider something else? Note that we don't commit to a
> >particular encoding of Unicode strongly.
> 
> This is a wash. I think that it's better to leave the options open.
> After all, it *is* a hope of mine to have Ruby running on iSeries
> (AS/400) and *that* still uses EBCDIC.
> 
> >- More work and time to implement. Some could call it over-engineered.
> >But it will save a lot of time and troubles when shit hits the fan and
> >users really do get unexpected foreign characters in their Strings. I
> >could offer help implementing it, although I have never looked at
> >ruby's source, C-extensions, or even done a lot of ruby programming
> >yet.
> 
> I would call it the amount of work necessary. But the work needs to be
> done for a *variety* of encodings, and not just Unicode. *Especially*
> because of C extensions.
> 
> >Close to the start of this discussion Matz asked what the problem with
> >current strings really was for western users. Somewhere later he
> >concluded case folding. I think it is more than that: we are lazy and
> >expect character handling to be always as easy as with 7 bit ASCII, or
> >as close as possible. Fixed 8-bit codepages worked quite fine most of
> >the time in this regard, and breakage was limited to special
> >characters only.
> 
> >Now let's ask the question in reverse: are eastern programmers so used
> >to doing elaborate byte-stream to character handling by hand they
> >don't recognize how hard this is any more? Surely it is a target for
> >DRY if I ever saw one. Or are there actual problems not solvable this
> >way? I looked up the mentioned Han-Unification issue, and as far as I
> >understood this could be handled by future Unicode revisions
> >allocating more characters, outside of Ruby, but I don't see how it
> >requires our Strings to stay dumb byte buffers.
> 
> No one has ever suggested that Ruby Strings stay byte buffers. However,
> blindly choosing Unicode *adds* unnecessary complexity to the situation.
> 
> -austin
> --
> Austin Ziegler * halostatue / gmail.com * http://www.halostatue.ca/
>               * austin / halostatue.ca * http://www.halostatue.ca/feed/
>               * austin / zieglers.ca

The way I see it we have to choose a character set. I proposed
Unicode, because their official goal is to be the one unifying set,
and if they ain't yet, I hope they'll be sometime.

If that is not enough we will effectively create our own character
set, let's call it RubyCode, which will contain characters from the
union of Unicode and a few other sets. Each String will have a
particular encoding, which will determine which characters of RubyCode
are valid in this particular String instance. Hopefully many
characters will be valid in multiple encodings. But it doesn't sound
like a very clear design to me.

Jürgen

-- 
 The box said it requires Windows 95 or better so I installed Linux

--xo44VMWPx7vlQ2+2
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)

iQEVAwUBRJR4rPy64gyiEfXtAQKrBQf/Q44dP/39T8lsA8C0jXAdzRZ+AewPo9To
mhEl2ihU5LrnhTIjskL1WWrD+5lBOygmyLWXnMEPc2GjswHQRSdmNawrbkVCOn7Q
FEZfobsrt4BSM++eVKWJuwNEGOy2Z5HitmoI1cogwAFehsfCXYRrb2vc5vOYRGZx
ncGW5ZnJwVhX1DfraW/lYO/NMOlccybKSKIVPP2YOqMhCCVz8HgCJKrjFyioH7OG
fpcJd2WZVuh+lsJJLFR6y9PYaA09kVY+56z++Ld5xGnwl2pWufY1Zutp9wtS5VQ/
pGUyVJt9e8AYyDAvwgjv8hPjQph3MqxUzvARIRjVdZmTkg4dF7nGngS0
-----END PGP SIGNATURE-----

--xo44VMWPx7vlQ2+2--