On Dec 5, 6:15 pm, Daniel DeLorme <dan... / dan42.com> wrote:
> marc wrote:
> > Daniel DeLorme said...
> >> MonkeeSage wrote:
> >>> Everything in ruby is a bytestring.
> >> YES! And that's exactly how it should be. Who is it that spread the
> >> flawed idea that strings are fundamentally made of characters?
>
> > Are you being ironic?
>
> Not at all. By "fundamentally" I mean the fundamental, lowest level of
> representation. If strings were fundamentally made of characters then we
> wouldn't be able to access individual bytes because that's a lower level
> than the fundamental level, which is by definition impossible.
>
> If you are using UCS2 it makes sense to consider strings as arrays of
> characters because that's what they are. But UTF8 strings do not follow
> the characteristics of arrays at all. Each access into the "array" is
> O(n) rather than O(1). So IMHO treating it as an array of characters is
> a *very* leaky abstraction.
>
> I agree that 99.9% of the time you want to deal with characters, and I
> believe that in 99% of those cases you would be better served with regex
> than this pretend "array" disguise.
>
> Daniel

Here is a micro-benchmark of three common string operations (split,
index, length), using bytestrings with unicode regexps versus native
utf-8 strings in 1.9.0 (release).


$ ruby19 -v
ruby 1.9.0 (2007-10-15 patchlevel 0) [i686-linux]

$ echo && cat bench.rb
#!/usr/bin/ruby19
# -*- coding: ascii -*-

require "benchmark"
require "test/unit/assertions"
include Test::Unit::Assertions

$KCODE = "u"

$target = "!日本語!" * 100
$unichr = "本".force_encoding('utf-8')
$regchr = /[本]/u

def uni_split
  $target.split($unichr)
end
def reg_split
  $target.split($regchr)
end

def uni_index
  $target.index($unichr)
end
def reg_index
  $target =~ $regchr
end

def uni_chars
  $target.length
end
def reg_chars
  $target.unpack("U*").length
  # this is *a lot* slower
  # $target.scan(/./u).length
end
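
# sanity checks: the bytestring+regexp path and the native utf-8 path
# should produce equivalent results before we time them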

$target.force_encoding("ascii")
a = reg_split
$target.force_encoding("utf-8")
b = uni_split
assert_equal(a.length, b.length)
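
# the regexp match position is a byte offset (4 here: "!" is one byte,
# "日" is three), while String#index on a utf-8 string returns a
# character offset (2), hence the -2 adjustment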

$target.force_encoding("ascii")
a = reg_index
$target.force_encoding("utf-8")
b = uni_index
assert_equal(a-2, b)

$target.force_encoding("ascii")
a = reg_chars
$target.force_encoding("utf-8")
b = uni_chars
assert_equal(a, b)

n = 10_000
Benchmark.bm(12) { | x |
  $target.force_encoding("ascii")
  x.report("reg_split") { n.times { reg_split } }
  $target.force_encoding("utf-8")
  x.report("uni_split") { n.times { uni_split } }
  puts
  $target.force_encoding("ascii")
  x.report("reg_index") { n.times { reg_index } }
  $target.force_encoding("utf-8")
  x.report("uni_index") { n.times { uni_index } }
  puts
  $target.force_encoding("ascii")
  x.report("reg_chars") { n.times { reg_chars } }
  $target.force_encoding("utf-8")
  x.report("uni_chars") { n.times { uni_chars } }
}

====

With caches initialized, and after 5 prior runs, I got these numbers:

$ ruby19 bench.rb
                  user     system      total        real
reg_split     2.550000   0.010000   2.560000 (  2.799292)
uni_split     1.820000   0.020000   1.840000 (  2.026265)

reg_index     0.040000   0.000000   0.040000 (  0.097672)
uni_index     0.150000   0.000000   0.150000 (  0.202700)

reg_chars     0.790000   0.010000   0.800000 (  0.919995)
uni_chars     0.130000   0.000000   0.130000 (  0.193307)

====

So String#=~ with a bytestring and a unicode regexp is roughly twice
as fast as String#index on a native utf-8 string (going by the real
times). In the other two cases, the opposite is true.

P.S. In case there is any confusion, bytestrings aren't going away;
as you can see above, you can specify a magic encoding comment to
ensure that you have bytestrings by default. You can also explicitly
force a utf-8 string back to ascii, and you can get a byte enumerator
(or an array, by calling to_a on the enumerator) from String#bytes and
an iterator from #each_byte, regardless of the encoding.
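
For what it's worth, here is a minimal sketch of those byte-level
calls (this one assumes a utf-8 magic comment, so the literal starts
out as a native utf-8 string; "ASCII-8BIT" is 1.9's name for the plain
binary/bytestring encoding):

# -*- coding: utf-8 -*-
s = "日本"                         # two characters, six bytes in utf-8
p s.length                         # => 2  (counted in characters)
p s.bytes.to_a                     # => [230, 151, 165, 230, 156, 172]
s.each_byte { |b| print b, " " }   # iterates the same six bytes
puts
s.force_encoding("ASCII-8BIT")     # reinterpret the bytes as a bytestring
p s.length                         # => 6  (now counted in bytes)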

Regards,
Jordan