On 2/7/07, David Flanagan <david / davidflanagan.com> wrote:
> Nikolai Weibull wrote:
> > On 2/7/07, Sam Roberts <sroberts / uniserve.com> wrote:
> >
> >> Is creating a temporary 1-byte String really that expensive? Some
> >> benchmarks showing an algorithm that uses a long binary string as a data
> >> structure performs much faster with String#ord(i) than String#[i].ord
> >> would probably convince everybody.
> >
> > May I beg for an /important/ algorithm?
>
> Geez!  What does it take to satisfy you guys!  :-)

A lot, which is why most of us are using Ruby in the first place!

> Seriously, though, I thought that my suggestion for an optional index
> argument to ord was a modest and sensible one.  Actually, I thought I
> was just pointing out an oversight and I'm surprised at the resistance
> it has faced.  Given the richness of the core Ruby API, I assume that
> decisions about adding methods are based on elegance, and that it is
> permitted or even desirable to have more than one way to do something.
>
> The objections so far have seemed to be "why would you want to do that?"
> and "isn't it fast enough as it is?".  I would have expected objections
> to be more like: "That isn't the Ruby way, you newbie!"  That is, I
> wouldn't have been surprised to be told simply "that doesn't feel
> right", but I am surprised by the request for benchmarks and algorithms.

I thought my initial response was along the lines of it not feeling right.
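
That said, the benchmark Sam asked for is easy enough to sketch.  The
proposed String#ord(i) does not exist, so as an assumption I let
String#getbyte(i), from later Rubies (1.9 and up), stand in for it.
The data and the helper name are made up, and the timings will vary
from machine to machine, so read nothing into one run:

```ruby
# Assumption: String#getbyte(i) stands in for the proposed String#ord(i).
# A 100,000-byte binary string serves as the "long binary string".
data = Array.new(100_000) { |i| i % 256 }.pack("C*")

# Hypothetical helper: seconds taken by the given block.
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Variant 1: build a temporary one-byte String, then ask for its ordinal.
sum_via_substring = 0
t_substring = elapsed do
  data.length.times { |i| sum_via_substring += data[i].ord }
end

# Variant 2: read the byte directly, with no temporary String.
sum_via_getbyte = 0
t_getbyte = elapsed do
  data.length.times { |i| sum_via_getbyte += data.getbyte(i) }
end

printf("s[i].ord: %.4fs  s.getbyte(i): %.4fs\n", t_substring, t_getbyte)
```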


> >> Btw, isn't ruby 1.9 going to have character set information associated
> >> with strings? Would #ord(idx) return the value of the byte at a
> >> particular byte offset idx, or a codepoint at a character idx?
> >
> > It's worse for other methods like #[], where one can wonder how
> > grapheme clusters are to be dealt with.  My idea was that you would
> > have encodings layered over other encodings for this kind of thing.
>
> I thought I knew a lot about Unicode, but I haven't heard about
> graphemes. Are those things that come up in Arabic?

No.  A grapheme is simply what most people would refer to as a character:

  Grapheme.  (1) A minimally distinctive unit of writing in the
  context of a particular writing system.  For example, <b> and <d> are
  distinct graphemes in English writing systems because there exist
  distinct words like big and dig.  Conversely </a/> and <a> [where
  /.../ denotes an italic form of the letter 'a'] are not distinct
  graphemes because no word is distinguished on the basis of these two
  different forms.  (2) What a user thinks of as a character.

  -- The Unicode Standard, Version 5.0

A grapheme cluster is a base character followed by any pseudo-characters,
such as accents and other types of modifiers, that attach to it:

  Grapheme Cluster.  A maximal character sequence consisting of a
  grapheme base followed by zero or more grapheme extenders or,
  alternatively, by the sequence <CR, LF>.  (See definition D60 in
  Section 3.6, Combination.)  A grapheme cluster represents a
  horizontally segmentable unit of text, consisting of some grapheme
  base (which may consist of a Korean syllable) together with any number
  of nonspacing marks applied to it.

  -- The Unicode Standard, Version 5.0

You often want to deal with grapheme clusters as a unit, but not many
programs currently do that.
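
To make the distinction concrete, here is a small sketch.  It assumes a
later Ruby than the one under discussion (2.0 or newer, I believe),
where string literals are UTF-8 and the regexp escape \X matches one
grapheme cluster.  The sample string is my own:

```ruby
# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "b" and "c".
s = "e\u0301bc"

codepoints = s.codepoints   # one Integer per Unicode codepoint
clusters   = s.scan(/\X/)   # one String per grapheme cluster

puts codepoints.length      # 4
puts clusters.length        # 3
puts clusters.first.length  # 2: the first cluster holds two codepoints
```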

> > Say that you have a string s = {abc}, where a, b, and c are Unicode
> > characters and the {...} syntax means the string of these characters
> > in some encoding, and that it is encoded using UTF-8, and that a and b
> > constitute a grapheme cluster.  Under certain conditions you may want
> > to work with each codepoint separately, under other conditions each
> > grapheme cluster.  Normally, s.encoding would be "utf-8", but if I
> > want to work with grapheme clusters I may set s.encoding =
> > "utf-8.graphemes", where the dot introduces another "encoding axis".
> > In the first case, s[0] would give you the string {a}.  In the second
> > case, s[0] would give you the string {ab}.  Sometimes you may want to
> > work with the individual bytes of s.  You could then set s.encoding =
> > 'ascii' or s.encoding = 'bytes' or something like that (ascii wouldn't
> > be great, as it is 7-bit and perhaps some implementation may depend on
> > this), then s[0] would give you the string {a_1}, where a_1 is the
> > first byte of the encoding of a in UTF-8.
>
> What's the thinking behind the encoding= method, do you know?

Well, I can't actually say that there's going to be an #encoding=
method, but my thought was that the whole encoding business would be
based on a pluggable system and encodings could be changed on the fly
for any given object where it is relevant, like String and IO.
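
Purely as an illustration of the "encoding axis" idea, and not of
anything that exists or is planned, one could imagine a wrapper along
these lines.  The class LayeredString and the axis names are
hypothetical, and the sketch leans on a later Ruby (UTF-8 literals, \X
in regexps):

```ruby
# Hypothetical sketch: LayeredString and its axis names are not part of Ruby.
class LayeredString
  # Each axis maps the underlying string to the units that #[] indexes.
  AXES = {
    "utf-8"           => ->(s) { s.chars },
    "utf-8.graphemes" => ->(s) { s.scan(/\X/) },
    "bytes"           => ->(s) { s.bytes.map { |b| b.chr } },
  }.freeze

  attr_reader :encoding

  def initialize(string, encoding = "utf-8")
    @string = string
    self.encoding = encoding
  end

  # Switch the axis on the fly; the underlying bytes are left untouched.
  def encoding=(name)
    raise ArgumentError, "unknown encoding axis: #{name}" unless AXES.key?(name)
    @encoding = name
  end

  def [](index)
    AXES.fetch(@encoding).call(@string)[index]
  end
end

# a = U+0434, b = U+0301 (combining acute), c = "c"; a and b form a cluster.
s = LayeredString.new("\u0434\u0301c")
first_codepoint = s[0]            # {a}
s.encoding = "utf-8.graphemes"
first_cluster = s[0]              # {ab}
s.encoding = "bytes"
first_byte = s[0]                 # {a_1}, the first UTF-8 byte of a
```

Recomputing the units on every #[] call is wasteful, of course; a real
design would have to do better than that.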

> The way I would have imagined it would be to have methods like as_euc,
> as_unicode, and as_bytes which return objects that provide a new "view"
> on the same underlying bytes.  (Allowing multiple views of the same
> bytes raises hairy issues of concurrent modifications, of course.)

The Ruby-on-Rails people have done something like that, sort of.  The
way to access the UTF-8 characters of a String is by invoking
String#chars.  That method will return a proxy that treats the
original String as a UTF-8-encoded sequence of Unicode characters.

I guess one could use a similar scheme for dealing with graphemes
versus grapheme clusters, i.e., have methods like String#bytes,
String#chars, and String#clusters (while still retaining #encoding, of
course).
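
A rough sketch of what a String#clusters proxy might look like follows.
ClusterView is a name I made up, this is not how the Rails proxy is
implemented, and it again assumes a later Ruby with UTF-8 literals and
\X in regexps:

```ruby
# Hypothetical proxy: ClusterView is an illustration, not an existing class.
class ClusterView
  include Enumerable

  def initialize(string)
    @string = string
  end

  # Yield each grapheme cluster of the underlying string.
  def each(&block)
    @string.scan(/\X/).each(&block)
  end

  def length
    count
  end

  # Reverse cluster by cluster, so accents stay on their base characters.
  def reverse
    to_a.reverse.join
  end
end

class String
  # Assumed method name, following the String#clusters suggestion.
  def clusters
    ClusterView.new(self)
  end
end

word    = "noe\u0308l"            # "noel" with a combining diaeresis on the e
naive   = word.reverse            # the diaeresis ends up on the "l"
by_unit = word.clusters.reverse   # the diaeresis stays on the "e"
```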

I personally don't like this set-up, as it only really makes sense for
a couple of encodings.

> Speaking of concurrent modifications, what happens if one thread changes
> the encoding of a string while another thread is iterating through it?

Well, what happens when you iterate through an array or hash that's
being modified by another thread?  Unless you employ thread-safe
data structures, it seems to be a free-for-all.
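
The usual way out is a lock shared by reader and writer.  A minimal
sketch with an array (the variable names are mine, and a string whose
encoding can change would presumably need the same treatment):

```ruby
lock  = Mutex.new
items = [1, 2, 3]

writer = Thread.new do
  lock.synchronize { items << 4 }   # mutate only while holding the lock
end

# The reader holds the same lock for the whole traversal, so it sees the
# array either entirely before or entirely after the writer's change.
doubled = lock.synchronize { items.map { |x| x * 2 } }
writer.join
```

Which of the two states the reader sees still depends on scheduling;
the lock only rules out seeing the array halfway through a change.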

> Anyway, I can see why encoding issues and multi-byte character issues
> argue strongly for representing characters as a kind of String.  Has
> anyone argued for creating a Character class that extends String?  This
> would be a natural place to put methods like ord, digit?, alpha?, etc.
> Also, I tend to think that characters should be immutable (I'm not sure
> why) and having a subclass would probably allow that.

I don't have an answer to this question, sorry.
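
Just to show the shape of the idea, such a class could be sketched as
below.  Everything here is hypothetical: the class name, the rule that
a Character is exactly one codepoint (rather than one grapheme
cluster), and the use of POSIX bracket classes for the predicates.  It
also assumes a Ruby new enough to have String#match?:

```ruby
# Hypothetical sketch of an immutable, single-character String subclass.
class Character < String
  def initialize(str)
    # Assumption: "one character" means one codepoint, not one cluster.
    raise ArgumentError, "expected exactly one character" unless str.length == 1
    super(str)
    freeze   # immutability, as suggested in the question
  end

  def digit?
    match?(/\A[[:digit:]]\z/)
  end

  def alpha?
    match?(/\A[[:alpha:]]\z/)
  end
end

seven = Character.new("7")
seven.digit?   # true
seven.alpha?   # false
seven.ord      # 55, inherited from String
```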

  nikolai