Nikolai Weibull wrote:
> On 2/7/07, Sam Roberts <sroberts / uniserve.com> wrote:
> 
>> Is creating a temporary 1-byte String really that expensive? Some
>> benchmarks showing an algorithm that uses a long binary string as a data
>> structure performs much faster with String#ord(i) than String#[i].ord
>> would probably convince everybody.
> 
> May I beg for an /important/ algorithm?

Geez!  What does it take to satisfy you guys!  :-)

Seriously, though, I thought that my suggestion for an optional index 
argument to ord was a modest and sensible one.  Actually, I thought I 
was just pointing out an oversight and I'm surprised at the resistance 
it has faced.  Given the richness of the core Ruby API, I assume that 
decisions about adding methods are based on elegance, and that it is 
permitted or even desirable to have more than one way to do something.

The objections so far have seemed to be "why would you want to do that?" 
and "isn't it fast enough as it is?".  I would have expected objections 
to be more like: "That isn't the Ruby way, you newbie!"  That is, I 
wouldn't have been surprised to be told simply "that doesn't feel 
right", but I am surprised by the request for benchmarks and algorithms.

Anyway, that is my meta-commentary on the topic.
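
On the substance, though: for what it's worth, a benchmark along the 
lines Sam suggested might look like the sketch below.  Since ord(i) 
doesn't exist, I'm using getbyte(i) as a stand-in; it measures the 
same thing, namely avoiding the temporary one-character String that 
s[i].ord creates in 1.9:

    require 'benchmark'

    s = "x" * 1_000_000
    n = s.length

    Benchmark.bm(14) do |bm|
      bm.report("s[i].ord") do
        sum = 0
        n.times { |i| sum += s[i].ord }      # builds a 1-char String each call
      end
      bm.report("s.getbyte(i)") do
        sum = 0
        n.times { |i| sum += s.getbyte(i) }  # no temporary String
      end
    end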

>> Btw, isn't ruby 1.9 going to have character set information associated
>> with strings? Would #ord(idx) return the value of the byte at a
>> particular byte offset idx, or a codepoint at a character idx?
> 
> It's worse for other methods like #[], where one can wonder how
> grapheme clusters are to be dealt with.  My idea was that you would
> have encodings layered over other encodings for this kind of thing.

I thought I knew a lot about Unicode, but I hadn't heard of 
graphemes. Are those things that come up in Arabic?

> Say that you have a string s = {abc}, where a, b, and c are Unicode
> characters and the {...} syntax means the string of these characters
> in some encoding, and that it is encoded using UTF-8, and that a and b
> constitute a grapheme cluster.  Under certain conditions you may want
> to work with each codepoint separately, under other conditions each
> grapheme cluster.  Normally, s.encoding would be "utf-8", but if I
> want to work with grapheme clusters I may set s.encoding =
> "utf-8.graphemes", where the dot introduces another "encoding axis".
> In the first case, s[0] would give you the string {a}.  In the second
> case, s[0] would give you the string {ab}.  Sometimes you may want to
> work with the individual bytes of s.  You could then set s.encoding =
> 'ascii' or s.encoding = 'bytes' or something like that (ascii wouldn't
> be great, as it is 7-bit and perhaps some implementation may depend on
> this), then s[0] would give you the string {a_1}, where a_1 is the
> first byte of the encoding of a in UTF-8.
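
If I follow you, the three views would behave something like this 
(hypothetical code, obviously, since none of this API exists; I'm 
using a combining accent as the two-codepoint grapheme cluster):

    # Hypothetical: illustrates the "encoding axis" idea described
    # above; the => comments show the results you describe.
    s = "e\u0301z"                  # a = "e", b = combining acute, c = "z"

    s.encoding = "utf-8"            # codepoint view
    s[0]                            # => "e"        (the string {a})

    s.encoding = "utf-8.graphemes"  # grapheme-cluster view
    s[0]                            # => "e\u0301"  (the string {ab})

    s.encoding = "bytes"            # raw byte view
    s[0]                            # => "\x65"     (a_1, the first byte of a)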

What's the thinking behind the encoding= method, do you know?

The way I would have imagined it would be to have methods like as_euc, 
as_unicode, and as_bytes, which return objects that provide a new "view" 
of the same underlying bytes, as in the sketch below.  (Allowing 
multiple views of the same bytes raises hairy issues of concurrent 
modification, of course.)
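
To make that concrete, a toy byte view might look like this (ByteView 
and as_bytes are names I'm making up for illustration; getbyte and 
bytesize are real 1.9 methods):

    # Toy sketch of the "view" idea: a byte-oriented view over the same
    # underlying String, without copying its bytes.
    class ByteView
      def initialize(str)
        @str = str                # shares the String; no copy is made
      end

      def [](i)
        @str.getbyte(i)           # byte at offset i, whatever the encoding
      end

      def length
        @str.bytesize
      end
    end

    class String
      def as_bytes
        ByteView.new(self)
      end
    end

    s = "h\u00e9llo"              # UTF-8 "héllo"
    s.length                      # => 5 (characters)
    s.as_bytes.length             # => 6 (bytes; the é takes two)
    s.as_bytes[1]                 # => 195 (first byte of é)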

Speaking of concurrent modifications, what happens if one thread changes 
the encoding of a string while another thread is iterating through it?

Anyway, I can see why encoding issues and multi-byte character issues 
argue strongly for representing characters as a kind of String.  Has 
anyone argued for creating a Character class that extends String?  This 
would be a natural place to put methods like ord, digit?, alpha?, etc.
Also, I tend to think that characters should be immutable (I'm not sure 
why) and having a subclass would probably allow that.
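
Something like this toy version is what I have in mind (Character is 
an invented class, not a real or proposed one; digit? and alpha? are 
just the methods mentioned above):

    # Toy sketch of a Character class extending String.
    class Character < String
      def initialize(str)
        super(str[0, 1])   # keep exactly one character
        freeze             # immutable, as suggested above
      end

      def digit?
        !!(self =~ /\A\d\z/)
      end

      def alpha?
        !!(self =~ /\A[[:alpha:]]\z/)
      end
    end

    c = Character.new("7")
    c.ord     # => 55 (inherited from String)
    c.digit?  # => true
    c.alpha?  # => false
    c << "x"  # raises an error: can't modify frozen string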

	David