On 2/7/07, Sam Roberts <sroberts / uniserve.com> wrote:

> Is creating a temporary 1-byte String really that expensive? Some
> benchmarks showing an algorithm that uses a long binary string as a data
> structure performs much faster with String#ord(i) than String#[i].ord
> would probably convince everybody.

May I beg for an /important/ algorithm?

> Btw, isn't ruby 1.9 going to have character set information associated
> with strings? Would #ord(idx) return the value of the byte at a
> particular byte offset idx, or a codepoint at a character idx?

It's worse for other methods like #[], where one can wonder how
grapheme clusters are to be dealt with.  My idea was that you would
have encodings layered over other encodings for this kind of thing.

Say that you have a string s = {abc}, where a, b, and c are Unicode
characters and the {...} syntax means the string of these characters
in some encoding.  Say also that s is encoded using UTF-8 and that a
and b constitute a grapheme cluster.  Under certain conditions you may
want to work with each codepoint separately, under other conditions
with each grapheme cluster.  Normally, s.encoding would be "utf-8",
but if I want to work with grapheme clusters I may set s.encoding =
"utf-8.graphemes", where the dot introduces another "encoding axis".
In the first case, s[0] would give you the string {a}.  In the second
case, s[0] would give you the string {ab}.  Sometimes you may want to
work with the individual bytes of s.  You could then set s.encoding =
"ascii" or s.encoding = "bytes" or something like that.  (ascii
wouldn't be great, as it is 7-bit and some implementations may depend
on this.)  Then s[0] would give you the string {a_1}, where a_1 is
the first byte of the encoding of a in UTF-8.
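
To make the three views concrete, here is a rough sketch.  It does not
use the s.encoding = "utf-8.graphemes" assignment described above,
which is only a proposal; instead it assumes a much later Ruby (2.5 or
newer) that has String#each_grapheme_cluster and String#b.  The choice
of characters is also just an assumption for illustration: a is
U+00E4, b is the combining mark U+0323, and c is "x".

```ruby
# s = {abc}: a = U+00E4 (two bytes in UTF-8), b = U+0323 (combining
# dot below), c = "x".  a and b together form one grapheme cluster.
s = "\u00E4\u0323x"

# "utf-8" view: indexing yields single codepoints, so s[0] is {a}.
by_codepoint = s[0]

# "utf-8.graphemes" view: the first element is the cluster {ab}.
by_grapheme = s.each_grapheme_cluster.first

# "bytes" view: String#b returns a copy tagged as binary, so indexing
# yields one byte, {a_1}, the first byte of a's UTF-8 encoding.
by_byte = s.b[0]

puts by_codepoint.codepoints.inspect   # codepoints of {a}
puts by_grapheme.codepoints.inspect    # codepoints of {ab}
puts by_byte.bytes.inspect             # the single byte a_1
```

The same bytes sit underneath all three results; only the rule for
cutting them into indexable units differs.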

Wow, that wasn't a very good explanation, but perhaps you'll
understand what I'm getting at.  It's about treating a String as a
sequence of bytes with some encoding layered over it to decide how to
retrieve characters from it, i.e., mostly how indexing into the String
works.

It all makes a lot of sense if you implement the encoding handling
using a virtual method table which one could then easily change for a
given String instance whenever needed.
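
A minimal sketch of what such a swappable table might look like,
written in plain Ruby rather than inside the interpreter.  Everything
here is hypothetical: LayeredString, ENCODING_TABLES and the encoding
names are made up for illustration, and the grapheme entry assumes
Ruby 2.5 or newer for String#each_grapheme_cluster.

```ruby
# Hypothetical method tables: one entry per "encoding", each holding
# the routine that decides how indexing into the raw bytes works.
ENCODING_TABLES = {
  "bytes" => {
    index: ->(raw, i) { raw[i] }
  },
  "utf-8" => {
    index: ->(raw, i) { raw.dup.force_encoding("UTF-8")[i] }
  },
  "utf-8.graphemes" => {
    index: ->(raw, i) {
      raw.dup.force_encoding("UTF-8").each_grapheme_cluster.to_a[i]
    }
  }
}.freeze

# Hypothetical wrapper: a sequence of bytes plus a per-instance table.
class LayeredString
  attr_reader :encoding

  def initialize(str, encoding = "utf-8")
    @raw = str.b               # keep only the bytes
    self.encoding = encoding
  end

  # Changing the encoding swaps the table; the bytes are untouched.
  def encoding=(name)
    @table = ENCODING_TABLES.fetch(name)
    @encoding = name
  end

  def [](i)
    @table[:index].call(@raw, i)
  end
end

s = LayeredString.new("\u00E4\u0323x")
codepoint = s[0]               # {a}
s.encoding = "utf-8.graphemes"
cluster = s[0]                 # {ab}
s.encoding = "bytes"
byte = s[0]                    # {a_1}
```

A real implementation would put more than indexing in the table
(length, iteration, comparison and so on), but the idea of swapping it
per instance stays the same.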

  nikolai