Issue #13110 has been updated by Shugo Maeda.


Eric Wong wrote:
>  For reading and parsing operations, I'm not sure they're needed
>  because IO#read/read_nonblock/etc all return binary strings when
>  passed explicit length arg; and //n exists for Regexp.  (And any
>  socket server reading without a length arg would be dangerous)

Let me clarify my intention.

I'd like to handle not only single-byte characters but also multibyte
characters efficiently using byte-based operations.

Once a string has been scanned, we have a byte offset, so we shouldn't
need to scan the string from the beginning again, but the current
character-based API forces us to do so.
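
To illustrate the cost, here is a minimal sketch that uses only existing String methods. `char_to_byte_offset` is a hypothetical helper name introduced for this example, and the string is the sample from the issue description written with `\u` escapes. Converting a character offset to a byte offset has to walk the string from the start, whereas a known byte offset can be used directly.

```ruby
# The sample string from the issue (3 x U+3042, 3 x U+3044, 3 x U+3042);
# every character takes 3 bytes in UTF-8.
s = "\u3042\u3042\u3042\u3044\u3044\u3044\u3042\u3042\u3042"

# Hypothetical helper: character offset -> byte offset.
# In a multibyte string this has to walk the prefix character by character.
def char_to_byte_offset(str, char_pos)
  str[0, char_pos].bytesize
end

char_pos = s.index("\u3044")                 # character-based search
byte_pos = char_to_byte_offset(s, char_pos)  # needs a scan of the prefix

# With a byte offset in hand, slicing needs no character counting.
rest = s.byteslice(byte_pos, s.bytesize - byte_pos)

# A byte-based search can be approximated today on a binary copy of the string.
next_a = s.b.index("\u3042".b, byte_pos)

p char_pos  # 3
p byte_pos  # 9
p next_a    # 18
```

The binary copy made by `String#b` is itself a cost and loses the encoding, so this is only an approximation of what a byte-based API could offer.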

In the following example, the byteindex version is much faster than
the index version (about 0.004 s versus about 1.08 s of real time).

```
lexington:ruby$ cat bench.rb 
require "benchmark"

s = File.read("README.ja.md") * 10

Benchmark.bmbm do |x|
  x.report("index") do
    pos = 0
    n = 0
    loop {
      break unless s.index(/\p{Han}/, pos)
      n += 1
      _, pos = Regexp.last_match.offset(0)
    }
  end
  x.report("byteindex") do
    pos = 0
    n = 0
    loop {
      break unless s.byteindex(/\p{Han}/, pos)
      n += 1
      _, pos = Regexp.last_match.byteoffset(0)
    }
  end
end
lexington:ruby$ ./ruby bench.rb 
Rehearsal ---------------------------------------------
index       1.060000   0.010000   1.070000 (  1.116932)
byteindex   0.000000   0.010000   0.010000 (  0.004501)
------------------------------------ total: 1.080000sec

                user     system      total        real
index       1.050000   0.000000   1.050000 (  1.080099)
byteindex   0.000000   0.000000   0.000000 (  0.003814)
```


----------------------------------------
Bug #13110: Byte-based operations for String
https://bugs.ruby-lang.org/issues/13110#change-62409

* Author: Shugo Maeda
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: 
* Backport: 2.2: UNKNOWN, 2.3: UNKNOWN, 2.4: UNKNOWN
----------------------------------------
How about adding byte-based operations to String?

```
s = "あああいいいあああ"
p s.byteindex(/ああ/, 4) #=> 18
x, y = Regexp.last_match.byteoffset(0) #=> [18, 24]
s.bytesplice(x...y, "おおお")
p s #=> "あああいいいおおおあ"
```
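
For comparison, the effect of the `bytesplice` call above can be roughly emulated with the existing `String#byteslice`. This is only a sketch: `bytesplice_sketch` is a hypothetical helper, not the code in the attached patch, and unlike the example above, which modifies `s` in place, it returns a new string.

```ruby
# Hypothetical helper, not the proposed implementation: returns a copy of
# str with the bytes in [start, stop) replaced by replacement.
def bytesplice_sketch(str, start, stop, replacement)
  head = str.byteslice(0, start)
  tail = str.byteslice(stop, str.bytesize - stop)
  head + replacement + tail
end

# Same sample string as above, written with \u escapes.
s = "\u3042\u3042\u3042\u3044\u3044\u3044\u3042\u3042\u3042"

# Replace bytes 18...24 (the match reported by byteoffset(0) above)
# with three U+304A characters.
result = bytesplice_sketch(s, 18, 24, "\u304A\u304A\u304A")
p result
```

Both offsets must fall on character boundaries; otherwise the pieces returned by `byteslice` are not valid UTF-8.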



---Files--------------------------------
byteindex.diff (2.83 KB)

