Bill's RegExp tokenizer is pretty cool. One thing I would like to point
out regarding this discussion is that using the built-in classes is
generally much faster than rolling your own. This seems fairly obvious,
but let me give an example relating to this discussion: I was looking at
the Count Lines/Words/Chars Ruby benchmark at
http://www.bagley.org/~doug/shootout/ and thought I could do some
optimization. You see, this word counter does multiple passes through a
string using built-in methods like String#count, tr and squeeze. I
thought that if I was clever and used String#each_byte to just iterate
through the String once and do my counting, it would be faster. I was
wrong. It ended up being quite a bit slower (and didn't work correctly
anyhow). The built-in methods are usually faster because they are
implemented in C, while a by-hand loop runs a Ruby block for every
byte. So in general I would recommend that if you are concerned with
performance, you try regular expressions and other built-in operations
before writing your own parser. Here is an example illustrating this:

def count_chars_built_in(str)
  result = Array.new(256, 0)
  256.times do |b|
    # For each possible byte value we make one "pass" through the String
    result[b] = str.count(b.chr)
  end
  result
end

def count_chars_by_hand(str)
  result = Array.new(256, 0)
  # Here we only make one pass through the String
  str.each_byte do |b|
    result[b] += 1
  end
  result
end

if __FILE__ == $0
  if ARGV.length < 2
    puts "Usage: #$0 <input file name> <iterations>"
    exit(1)
  end

  # Read in a file
  str = ''
  IO.foreach(ARGV[0]) do |line|
    str << line
  end
  iterations = ARGV[1].to_i

  # Time the built-in version (subtracting two Times gives seconds)
  r1 = nil
  t1 = Time.now
  iterations.times do
    r1 = count_chars_built_in(str)
  end
  t2 = Time.now
  puts "Using the built-in method took #{t2 - t1} seconds"
  puts r1.inspect

  # Time the by-hand version the same way
  r2 = nil
  t1 = Time.now
  iterations.times do
    r2 = count_chars_by_hand(str)
  end
  t2 = Time.now
  puts "Using the by-hand method took #{t2 - t1} seconds"
  puts r2.inspect

  puts "Results are the same" if r1 == r2
end

On my test machine the built-in method was faster than the by-hand one
in most cases. The difference varied quite a bit with the test data
(and for really small data sets the by-hand one was faster). Keep in
mind that the built-in version makes 256 passes through the String
versus only 1 for the by-hand method, and it still usually wins.
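
If you want to see how the gap changes with the size of the input, a
rough sketch along these lines might help. It uses the standard
Benchmark module instead of Time.now arithmetic. The compare_counters
helper and the 'ruby ' test data are made up for illustration, and it
assumes a reasonably modern Ruby (for String#b and the hash syntax), so
treat it as a starting point rather than a proper benchmark.

```ruby
require 'benchmark'

# Same two counters as in the script above, repeated so this sketch runs on its own.
def count_chars_built_in(str)
  result = Array.new(256, 0)
  256.times { |b| result[b] = str.count(b.chr) }
  result
end

def count_chars_by_hand(str)
  result = Array.new(256, 0)
  str.each_byte { |b| result[b] += 1 }
  result
end

# Hypothetical helper (not part of the original script): times both
# counters on the same String and returns the elapsed seconds for each.
def compare_counters(str, iterations)
  built_in = Benchmark.realtime { iterations.times { count_chars_built_in(str) } }
  by_hand  = Benchmark.realtime { iterations.times { count_chars_by_hand(str) } }
  { built_in: built_in, by_hand: by_hand }
end

if __FILE__ == $0
  # Made-up ASCII test data of increasing size; real files will behave differently.
  [10, 1_000, 100_000].each do |size|
    str = ('ruby ' * (size / 5)).b  # binary copy, so the String is treated as raw bytes
    times = compare_counters(str, 5)
    printf("%8d bytes: built-in %.4fs, by-hand %.4fs\n",
           size, times[:built_in], times[:by_hand])
  end
end
```

The numbers will depend heavily on the machine, the Ruby version and
the data, so I would only read them as a trend across sizes.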
In an earlier version of this I was using a Hash instead of an Array,
and the by-hand method was 3 times slower (for one particular set of
data). That was mostly due to hash lookups and sets, which is why I
switched to Arrays to minimize "coloring" from other time consumers.
Either way, this is just something to take note of. I think I will
continue to do research into Ruby performance and maybe post a report
one of these days. I've found books on Java performance to be quite
enlightening.
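
For reference, the Hash-based variant mentioned above might have looked
roughly like the sketch below. This is a reconstruction for
illustration, not the code that was actually timed, and the method
names count_chars_with_hash and hash_counts_to_array are made up. It
shows where the extra cost comes from: every byte needs one hash lookup
and one hash set instead of a plain Array index.

```ruby
# Reconstruction for illustration only; the earlier Hash-based code was not posted.
def count_chars_with_hash(str)
  counts = Hash.new(0)   # missing keys read as 0
  str.each_byte do |b|
    counts[b] += 1       # one hash lookup and one hash set per byte
  end
  counts
end

# Convert to the 256-slot Array layout so the result can be compared
# with the Array-based counters.
def hash_counts_to_array(counts)
  result = Array.new(256, 0)
  counts.each { |byte, n| result[byte] = n }
  result
end

if __FILE__ == $0
  counts = count_chars_with_hash("hello")
  puts counts.inspect
  puts hash_counts_to_array(counts).inspect
end
```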
Ryan Leavengood