Bill's RegExp tokenizer is pretty cool.  One thing I would like to point
out regarding this discussion is that using the built-in classes is
generally much faster than rolling your own.  This seems fairly obvious,
but let me give an example relating to this discussion: I was looking at
the Count Lines/Words/Chars Ruby benchmark at
http://www.bagley.org/~doug/shootout/ and thought I could do some
optimization.  You see, this word counter does multiple passes through a
string using built-in methods like String#count, tr and squeeze.  I
thought that if I was clever and used String#each_byte to iterate
through the String just once and do my counting, it would be faster.  I
was wrong.  It ended up being quite a bit slower (and didn't work
correctly anyhow).  The built-in methods are usually faster because
they are coded in C, while a hand-written loop runs in the interpreter
for every byte.  So in general I would recommend that if you are
concerned with performance you should try using regular expressions and
other built-in operations instead of making your own parser.  Here is an
example illustrating this:

def count_chars_built_in(str)
  result = Array.new(256, 0)
  256.times do |b|
    # For each possible byte value (0-255) we make a "pass" through the String
    result[b] = str.count(b.chr)
  end
  result
end

def count_chars_by_hand(str)
  result = Array.new(256, 0)
  # Here we only make one pass through the String
  str.each_byte do |b|
    result[b] += 1
  end
  result
end

if __FILE__ == $0
  if ARGV.length < 2
    puts "Usage: #$0 <input file name> <iterations>"
    exit(1)
  end

  # Read the whole file into one String
  str = File.read(ARGV[0])

  iterations = ARGV[1].to_i

  r1 = nil
  t1 = Time.now
  iterations.times do
    r1 = count_chars_built_in(str)
  end
  t2 = Time.now
  # Time#- returns seconds (as a Float), not milliseconds
  puts "Using the built-in method took #{t2 - t1} seconds"
  puts r1.inspect

  r2 = nil
  t1 = Time.now
  iterations.times do
    r2 = count_chars_by_hand(str)
  end
  t2 = Time.now
  puts "Using the by-hand method took #{t2 - t1} seconds"
  puts r2.inspect
  puts "Results are the same" if r1 == r2
end
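
The same comparison can also be timed with the standard Benchmark
module instead of subtracting Time.now values by hand.  This is only a
rough sketch: the sample string and the iteration count are made up for
illustration, and I call String#b first (an assumption about running on
a newer Ruby) so that byte values above 127 compare cleanly.

```ruby
require 'benchmark'

# One String#count call per possible byte value (256 passes, each in C).
def count_chars_built_in(str)
  bin = str.b  # binary copy, so high byte values compare cleanly
  Array.new(256) { |b| bin.count(b.chr) }
end

# A single pass over the bytes, with the loop body run by the interpreter.
def count_chars_by_hand(str)
  result = Array.new(256, 0)
  str.each_byte { |b| result[b] += 1 }
  result
end

# Hypothetical sample input; substitute the contents of a real file.
sample = "the quick brown fox jumps over the lazy dog\n" * 500

Benchmark.bm(10) do |bm|
  bm.report('built-in:') { 10.times { count_chars_built_in(sample) } }
  bm.report('by-hand:')  { 10.times { count_chars_by_hand(sample) } }
end
```

Benchmark.bm prints user, system and real time for each labelled
block, which makes it easier to compare runs across data sets.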

On my test machine the built-in method was faster than the by-hand one
in most cases, even though it makes 256 passes through the String
versus only 1 in the by-hand method.  The difference varied quite a bit
based on the test data (and for really small data sets the by-hand one
was faster).  In an earlier version of this I was using a Hash instead
of an Array and the by-hand method was 3 times slower (for one
particular set of data), but that was mostly due to hash lookups and
sets, which is why I switched to Arrays: to minimize "coloring" from
other time consumers.  Either way, this is just something to take note
of.  I think I will continue to do research into Ruby performance and
maybe post a report one of these days.  I've found books on Java
performance to be quite enlightening.
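
For reference, the Hash-based variant mentioned above could be sketched
roughly like this.  It is a reconstruction for illustration, not the
original code, and the method name count_chars_with_hash is made up.

```ruby
# Hypothetical reconstruction of a Hash-based single-pass counter.
# Every byte costs a Hash lookup plus a Hash store, which is the extra
# overhead that the Array version avoids.
def count_chars_with_hash(str)
  counts = Hash.new(0)  # missing keys default to 0
  str.each_byte { |b| counts[b] += 1 }
  counts
end

counts = count_chars_with_hash("hello")
puts counts[108]  # count for byte 108 ("l"); prints 2
```

Unlike the Array version, the Hash only holds entries for bytes that
actually occur, so the two results cannot be compared directly with ==.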

Ryan Leavengood