On 2/6/07, David Flanagan <david / davidflanagan.com> wrote:
> Nikolai Weibull wrote:

> > Also, how often is it actually necessary to convert strings to their
> > ordinal value in their encoding table?
>
> If you're working on binary data and want to read the raw byte string
> instead of unpacking it into an array of Fixnums?  I don't know how
> common this is in practice.  I was using a string as a compact sequence
> of bytes to represent a Sudoku grid, which is what made me bring this up.

True.  But binary data is evil!  ;-)
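
For the record, here is a minimal sketch of the use case, assuming a Ruby
that has String#getbyte (1.9 or later). The grid contents and variable
names are made up for illustration:

```ruby
# A string used as a compact byte sequence, e.g. one byte per Sudoku cell.
# The cell values below are invented for this example.
grid = [5, 3, 0, 0, 7, 0, 0, 0, 0].pack("C*")

# Read a single cell without unpacking the whole string into an array.
first_cell = grid.getbyte(0)

# Unpack everything only when an array of integers is really wanted.
all_cells = grid.unpack("C*")
```

Here getbyte reads one cell straight from the string, while unpack builds
the full array of integers that the quoted text wanted to avoid.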

> You say that characters-as-strings makes perfect sense:
>
> > Perhaps, but this is a tradeoff of keeping "characters" and "strings"
> > in the same class.  As already mentioned,  "characters" will currently
> > be represented by one-character-long Strings in 1.9/2.0.  To me, this
> > makes perfect sense, considering that one of the main design goals for
> > Strings in 1.9/2.0 is that they should be able to handle most any
> > encoding scheme (as I've understood it).
> >
>
> But then you muse about a new type of Fixnum to represent characters!

No, what I go on to say is that perhaps we need a new class for
representing the /codepoint/, not the character.
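
To make the distinction concrete, here is a rough sketch of what I mean.
The Codepoint class and its methods are hypothetical; nothing like it
exists in core Ruby. It wraps the integer instead of subclassing Fixnum,
and it assumes UTF-8 and the regexp engine's POSIX bracket classes:

```ruby
# Hypothetical Codepoint class, for illustration only.
class Codepoint
  attr_reader :value

  def initialize(value)
    @value = value
  end

  # Turn the codepoint back into a one-character UTF-8 string.
  def to_s
    [value].pack("U")
  end

  # Property queries are delegated to the regexp engine.
  def alpha?
    !!(to_s =~ /\A[[:alpha:]]\z/)
  end

  def digit?
    !!(to_s =~ /\A[[:digit:]]\z/)
  end
end

cp = Codepoint.new("a".ord)
cp.alpha?  # => true
cp.digit?  # => false
```

A real design would have to ask the string's encoding for these
properties rather than assume Unicode.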

> > Anyway, while we're on the topic, what exactly should String#ord
> > return?  I'd argue that a subclass of Fixnum would make sense, which
> > would have methods like #alpha?, #digit?, and so on, according to what
> > information is provided by the encoding scheme.  This can easily get a
> > bit too Unicode-centric, but I prefer writing
>
> I agree with the need for methods like this, but if that's going to
> happen, I'd say the class should just be called a Character, and there
> should be a way to get Character objects directly from strings without
> having to stick the ord method in the middle.  Personally, I'd suggest
> that String.[x] with one argument should return a Character object, and
> String.[x,1] should return a String of length one.
>
> My own musings along these lines make characters a subclass of Symbol
> rather than of Fixnum.  So ?A would be an object much like :A, but would
> have additional character-specific methods, such as #encoding, #alpha?, etc.
>
> >  "a".ord.alpha?
> >
> > to
> >
> >  Codepoint.alpha?("a".ord)
> >
> > or something similar.  I guess a good name for this subclass would be
> > Codepoint, but then perhaps #ord isn't a very good name and #codepoint
> > would make more sense.
> >
> > Finally, perhaps the type of methods I've described above, i.e.,
> > #alpha?, #digit?, ..., should be methods of String for strings of
> > length one character, like #ord.
> >
> > Let's try it out:
> >
> >  "a".alpha?
> >
> > yes, yes I like that.  Still, String may be getting a bit overloaded by
> > then.
>
> I think it is asking too much to have the String class represent byte
> strings, multi-byte character strings, and individual characters.

Maybe so.  It's not a matter to be taken lightly, considering that one
of the main complaints I've seen about Ruby 1.8 (and below) is the
poor support for Unicode (thank you very much, Tim Bray ;-).  Now also
consider the fact that Ruby is a language born and bred in Japan and
you'll have to throw a couple more encodings into the mix.  You end
up with a lot of cases to cover, and different encodings can represent
different kinds of text that have different kinds of attributes.  It's
not going to be easy, and I fear that things such as character
properties are going to be left out.
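
As a sketch of how the "a".alpha? idea quoted above could behave, and of
how the answer depends on the encoding: the alpha? method below is my own
illustration, not something 1.9/2.0 is known to provide, and the example
assumes a Ruby with String#b (2.0 or later):

```ruby
# Illustration only: String does not ship with alpha?; reopening the
# class here is an assumption of this sketch.
class String
  # True only for a one-character string whose character is alphabetic
  # according to the string's own encoding.
  def alpha?
    length == 1 && !!(self =~ /\A[[:alpha:]]\z/)
  end
end

"a".alpha?         # => true
"1".alpha?         # => false
"\u00E9".alpha?    # => true, e-acute is a letter in UTF-8
"\u00E9".b.alpha?  # => false, as binary it is two bytes, not one character
```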

> Let me also respond to a couple of things from other messages:
>
> > Like the fact that #ordAt isn't a very Rubyish name.
>
> My bad.  That was a typo based on my background in Java and JavaScript.
>   I don't actually like the idea of a separate method, but if one were
> needed, ord_at would obviously be a better name than ordAt.

I know it was only a "pseudo-named" method.  But it's hard to come up
with a good name for the method you want, and I did want to make a
point about how well-named methods are in Ruby.  In my experience,
being able to keep method names short and simple is often the result
of a well-thought-out API, and it also results in easy-to-follow code
when those methods are used.

> David Black wrote:
>
> > It's not going to be backward compatible in any case, since [] will
> > have changed.  I think the reasoning is that people use [].chr more
> > than they're likely to use [].ord, so offloading the less simple
> > behavior onto the ord case will save method calls in the long run.
>
> I would have thought that people would use s[x,1] instead of s[x].ord,
> avoiding the extra method call.

Why am I not following this?  My understanding is that s[x,1] and s[x]
will give the same result in 1.9/2.0, i.e., a String containing the
character at offset x in s.
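
A small sketch of what I mean, assuming 1.9 or later (in 1.8, s[x]
returned a Fixnum instead). The variable names are only for illustration:

```ruby
s = "hello"
x = 1

single = s[x]      # a one-character String in 1.9 and later
slice  = s[x, 1]   # also a one-character String
code   = s[x].ord  # the ordinal value, when that is what you want

single == slice    # => true
```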

But I don't follow what David Black is saying, which is also why I
didn't respond to his message.

  nikolai