Issue #13110 has been updated by Shugo Maeda.


Martin Dürst wrote:
> Shugo Maeda wrote:
> > Let me clarify my intention.
> > 
> > I'd like to handle not only singlebyte characters but multibyte
> > characters efficiently by byte-based operations.
> 
> What about using UTF-32? It will use some additional memory, but give you the speed you want.

UTF-32 is not useful here because Ruby treats it as a dummy encoding.

> > Once a string is scanned, we have a byte offset, so we don't need
> > to scan the string from the beginning, but we are forced to do so
> > by the current API.
> 
> One way to improve this is to somehow cache the last used character and byte index for a string. I think Perl does something like this.
> 
> This could be expanded to a string with several character index/byte index pairs cached, which could be searched by binary search. All this could (should!) be totally opaque to the Ruby programmer (except for the speedup).
> 
> Another way would be to return an Index object that keeps the character and byte indices opaque, but can be used in a general way where speedups are needed.

These approaches seem worth considering.
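
For illustration only, the cached index-pair idea quoted above could be sketched roughly as follows. `OffsetCache`, its `step` parameter, and `byte_offset` are hypothetical names invented for this sketch, not part of Ruby or of the attached patch. It assumes a non-negative character index and a string that is not modified after the cache is built.

```ruby
# Hypothetical helper: keeps [character index, byte offset] checkpoints so
# that converting a character index to a byte offset does not have to
# rescan the string from the beginning each time.
class OffsetCache
  def initialize(str, step = 64)
    @str = str
    @checkpoints = [[0, 0]] # sorted pairs of [char_index, byte_offset]
    char_i = 0
    byte_i = 0
    str.each_char do |c|
      char_i += 1
      byte_i += c.bytesize
      @checkpoints << [char_i, byte_i] if (char_i % step).zero?
    end
  end

  # Returns the byte offset of the character at char_index (must be >= 0).
  def byte_offset(char_index)
    # Binary search for the first checkpoint past the target, then step
    # back one entry to get the nearest checkpoint at or before it.
    i = @checkpoints.bsearch_index { |ci, _| ci > char_index } || @checkpoints.size
    ci, bi = @checkpoints[i - 1]
    # Scan forward from the checkpoint; this visits fewer than `step` characters.
    @str.byteslice(bi, @str.bytesize).each_char do |c|
      break if ci == char_index
      ci += 1
      bi += c.bytesize
    end
    bi
  end
end

s = "あいうabc" * 50
cache = OffsetCache.new(s, 16)
p cache.byte_offset(17) == s[0, 17].bytesize
```

A real implementation would also have to invalidate or update the checkpoints whenever the string is mutated, which this sketch does not attempt.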

> > In the following example, the byteindex version is much faster than
> > the index version.
> 
> Of course it is. (Usually programs in C are faster than programs in Ruby, and this is just moving closer to C, and thus getting faster.)

I don't think it's a language issue but a data structure issue.

> But what I'm wondering is that using a single string for the data in an editor buffer may still be quite inefficient. Adding or deleting a character in the middle of the buffer will be slow, even if you know the exact position in bytes. Changing the representation e.g. to an array of lines will make the efficiency mostly go away. (After all, editors need only be as fast as humans can type :-).

I use a technique called buffer gap described in "The Craft of Text Editing" to improve performance.

  https://www.finseth.com/craft/

See Chapter 6 of the book for details.
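
As a rough illustration of the buffer gap idea (a minimal sketch, not the actual editor code; the `GapBuffer` class and its method names are made up for this example), the text is kept in one binary string with an unused gap that is moved to the edit position, so that repeated insertions and deletions at the same place only touch the gap. All positions are byte offsets and are assumed to fall on character boundaries.

```ruby
# Minimal gap-buffer sketch. Positions are byte offsets into the logical
# text (the buffer contents with the gap removed).
class GapBuffer
  def initialize(text = "", gap_size = 16)
    @buf = text.b + ("\0" * gap_size).b # text followed by the gap
    @gap_start = text.bytesize
    @gap_end = @buf.bytesize
  end

  # Insert str at byte position pos.
  def insert(pos, str)
    bytes = str.b
    move_gap(pos)
    grow_gap(bytes.bytesize) if bytes.bytesize > @gap_end - @gap_start
    @buf[@gap_start, bytes.bytesize] = bytes
    @gap_start += bytes.bytesize
  end

  # Delete len bytes starting at byte position pos by widening the gap.
  def delete(pos, len)
    move_gap(pos)
    @gap_end += len
  end

  def to_s
    head = @buf.byteslice(0, @gap_start)
    tail = @buf.byteslice(@gap_end, @buf.bytesize - @gap_end)
    (head + tail).force_encoding(Encoding::UTF_8)
  end

  private

  # Move the gap so that it starts at byte position pos.
  def move_gap(pos)
    if pos < @gap_start
      len = @gap_start - pos
      @buf[@gap_end - len, len] = @buf[pos, len]
      @gap_start -= len
      @gap_end -= len
    elsif pos > @gap_start
      len = pos - @gap_start
      @buf[@gap_start, len] = @buf[@gap_end, len]
      @gap_start += len
      @gap_end += len
    end
  end

  # Enlarge the gap so that it can hold at least min_size more bytes.
  def grow_gap(min_size)
    extra = min_size * 2
    @buf[@gap_end, 0] = ("\0" * extra).b
    @gap_end += extra
  end
end

gb = GapBuffer.new("あいう", 4)
gb.insert(3, "えお") # byte offset 3 is just after "あ"
gb.delete(0, 3)     # remove the leading "あ"
puts gb.to_s
```

Because every edit needs a byte position, a buffer like this benefits from string searches that report byte offsets directly.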

> More generally, what I'm afraid of is that with this, we start to more and more expose String internals. That can easily lead to problems.
> 
> Some people may copy a Ruby snippet using byteindex, then add 1 to that index because they think that's how to get to the next character. Others may start to use byteindex everywhere, even if it's absolutely not necessary. Others may demand byte- versions of more and more operations on strings. We have seen all of this in other contexts.

Doesn't this concern also apply to `byteslice`?
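
To make the quoted concern concrete, here is a small sketch using only the existing `byteslice`: adding 1 to a byte offset lands in the middle of a multibyte character. `next_char_offset` is a hypothetical helper written for this example, and it assumes `pos` is already on a character boundary of a UTF-8 string.

```ruby
s = "あい" # each character is 3 bytes in UTF-8

whole  = s.byteslice(0, 3) # exactly one character
broken = s.byteslice(1, 3) # starts inside "あ"

p whole.valid_encoding?  # true
p broken.valid_encoding? # false

# Hypothetical helper: advance by the byte size of the character at pos
# instead of adding 1. A UTF-8 character is at most 4 bytes long.
def next_char_offset(str, pos)
  pos + str.byteslice(pos, 4)[0].bytesize
end

p next_char_offset(s, 0) # byte offset of "い"
```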


----------------------------------------
Feature #13110: Byte-based operations for String
https://bugs.ruby-lang.org/issues/13110#change-62435

* Author: Shugo Maeda
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
How about adding byte-based operations for String?

```ruby
s = "あああいいいあああ"
p s.byteindex(/ああ/, 4) #=> 18
x, y = Regexp.last_match.byteoffset(0) #=> [18, 24]
s.bytesplice(x...y, "おおお")
p s #=> "あああいいいおおおあ"
```



---Files--------------------------------
byteindex.diff (2.83 KB)

