* Glenn Parker (Mar 20, 2005 17:10):
> > > > Well, the problem is that a String is represented by one long
> > > > sequence of bytes in memory.

> > > AFAIK, that is not actually a requirement for a Ruby
> > > implementation.

> > What does that mean?  That is how it's implemented in Ruby.  There's
> > no definition of how Ruby should be implemented.

> Sure there is: a Ruby implementation should implement the Ruby
> language as currently defined.  What I'm saying is that there is
> nothing in the definition of the Ruby language that forces an
> implementor to use contiguous blocks of memory for strings or arrays.

Well, yes there is.  You are forgetting the C interface:
RSTRING(str)->ptr is used in a lot of modules.

> Managing large strings as linked chunks is something text editors
> already do for performance.

It's not a very good representation for a text editor, though.  The
lookup method becomes O(n), which is just terrible.  Read [1] for an
overview of representing sequences.  I have already mentioned Ropes a
couple of times as a possible solution to the problem of dealing with
large strings.

> Assertions about how strings should or should not be used are based,
> in part, on assumptions that should be questioned.

Yes, but the whole point is that they haven't and it seems that people
don't realize how their implementation limits their use.  A String is
simply not well-suited for sequences longer than some certain limit
(depending on system and so on).  There are ways to alleviate these
issues.  For me, the most obviously good solution would be to add
another sequence data structure that would fit use-cases where long
sequences are expected.

> > The implementation is the standard.

> I would rather say: The most widely used implementation is the de
> facto standard.  :)

Which, in this case, is like saying what I already said.

> > That isn't to say that they are the final word of how a String
> > should be represented, but if it will, in a future release, be
> > represented in some other way, perhaps having destructive versus
> > non-destructive methods will make no sense,

> I think what I'm suggesting would actually make destructive methods
> more appealing, without reducing the object creation overhead for
> non-destructive methods.

I don't see how.  One of the gains of breaking the string up is that
parts of it can be reused in strings based on the original, e.g.,
"abc" + "def" can be implemented as an O(1)-operation where we simply
create a string that points to the two substrings.  slice can be
implemented by pointing into the string we are taking a slice of and
slice! can be implemented by breaking the string up and pointing to the
two parts of the string that remain after the deletion.  Implementing
slice and slice! in this manner means that they will be equally
efficient, both time- and space-wise.

See [1] and [2] for further discussions on the ideas I have presented
here.

A representation of a sequence as a sequence of continuous bytes in
memory is very efficient for lookups and it's space optimal.  Most
operations will perform very efficiently for shorter sequences as well,
but degrades horribly for long sequences.  I am not advocating that the
"sequence of continuous bytes" implementation for String be replaced, it
fits very well with many use-cases.  What I am advocating is that we
should realize that it doesn't fit every case and that a second sequence
data structure should be added for dealing with the cases where a String
isn't optimal.  I feel that most of the destructive methods on Strings
are optimizing the wrong problem.  It's not excessive copying that
should be avoided, but excessive computation.

My solution will only solve the destructive vs. non-destructive problem
for those methods that deal with substring of the original string, e.g.,
concat, chomp, delete, gsub, lstrip, rstrip, slice, strip, sub.  For
capitalize, downcase, swapcase, tr, tr_s, and upcase, one would still
have to create a new copy.  I wonder, though, how often the difference
between capitalize and capitalize! in execution speed actually makes a
difference,
        nikolai

[1] http://citeseer.ist.psu.edu/crowley98data.html
[2] http://rubyurl.com/rubyurl/show/2FRbO (PDF)

-- 
::: name: Nikolai Weibull    :: aliases: pcp / lone-star / aka :::
::: born: Chicago, IL USA    :: loc atm: Gothenburg, Sweden    :::
::: page: minimalistic.org   :: fun atm: gf,lps,ruby,lisp,war3 :::
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}