thanks Sean and Park. it's actually been interesting comparing these two
very different pieces of code that do the same thing. here are my
results thus far with my modifications to both:

# park, note removal of the for j loop.
# really helped the speed!
def parks_tokenizer(string)
  s = ['<','[','{','"','*']
  e = ['>',']','}','"','*']
  items = []
  i = 0
  while i < string.length
    if not s.include?(string[i,1])
      # plain text: scan ahead to the next opening delimiter
      # and split that run on whitespace
      j = i+1
      j += 1 while j < string.length && !s.include?(string[j,1])
      items.concat string[i..j-1].strip.split(' ')
      i = j
    else
      j = s.index(string[i,1])
      if s[j] == '"' || s[j] == '*'
        # symmetric delimiter (" or *): just find the matching close
        k = string.index(e[j],i+1)
      else
        # brackets may be doubled ({{g}}, [[h]]): skip the run of
        # openers, find a closer, then take any trailing closers too
        k = i
        k += 1 while string[k,1] == s[j]
        k = string.index(e[j],k)
        k += 1 while k+1 < string.length && string[k+1,1] == e[j]
      end
      items << string[i..k]
      i = k+1
    end
  end
  return items
end
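
for a quick sanity check i've been feeding it the test string from
park's mail (quoted below):

str = 'a<b>[c]{d}"e"f {{g}} [[h]]*i**j*"k"l'
puts parks_tokenizer(str).join(' ')
# prints (here at least): a <b> [c] {d} "e" f {{g}} [[h]] *i* *j* "k" l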

# sean, i got rid of the perlish notation
# and made the second part more like the first
def seans_tokenizer(string)
  tokens = {'['=>']', '<'=>'>', '"'=>'"', '{'=>'}', '*'=>'*', "'"=>"'"}
  items = []
  while string.size > 0
    if tokens.keys.include?(string[0,1])
      # delimited token: grab through the matching close
      end_index = string.index(tokens[string[0,1]], 1)
      item = string[0..end_index]
      items << item
      string = string[end_index+1..-1]
      # keep extending while openers outnumber closers (handles
      # doubled delimiters like {{g}}); item is already in items,
      # so << extends it in place
      while item.count(item[0,1]) > item.count(tokens[item[0,1]])
        end_index = string.index(tokens[item[0,1]])
        item << string[0..end_index]
        string = string[end_index+1..-1]
      end
    else
      # plain text: scan to the next delimiter, whitespace or end of string
      end_index = string.index(/[[{<*"'\s]|\z/, 1)
      item = string[0..end_index-1].strip
      items << item if not item.empty?
      string = string[end_index..-1]
    end
  end
  items
end
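
and here's the little check i've been using to convince myself they
really do give identical output on that string (rough sketch, same str
as above):

str = 'a<b>[c]{d}"e"f {{g}} [[h]]*i**j*"k"l'
p parks_tokenizer(str) == seans_tokenizer(str)    # => true here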


looping over 100 iterations of each results in park's version taking
~2.8 seconds and sean's ~2.3, but i think park's might have a little
more room for improvement. oddly, the more i work with them, the more i
am beginning to see that they are, in effect, the same. i'll let you know
how that progresses.
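
fwiw the timing loop is nothing fancy, roughly the sketch below. the
'sample' string here is just a stand-in; the input i actually fed them
was a longer chunk of text, so don't expect the same numbers.

sample = 'a<b>[c]{d}"e"f {{g}} [[h]]*i**j*"k"l'   # stand-in input

t = Time.now
100.times { parks_tokenizer(sample) }
puts "park: #{Time.now - t}s"

t = Time.now
100.times { seans_tokenizer(sample) }
puts "sean: #{Time.now - t}s"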

by the way, one of the reasons i brought this up (and thank god i did, as
these pieces of code are so much better than mine) was to perhaps talk
about Regular Expressions and string parsing in general. Seems to me
that parsing text is like THE fundamental programming task. why haven't
any really awesome technologies come about to deal with this? in my
personal opinion Regexps are powerful but limited, as indicated by my
parsing problem. i remember hearing that a language called Snobol had
great string processing capabilities. does anyone know about that?
finally, a Steven J. Hunter sent me this Icon version:

l_ans := []
str_in ? until pos(0) do               # Written by Steven J. Hunter
 if close_delim_cs := \open2close_t[open_delim := move(1)]
   then put(l_ans, open_delim||tab(1+bal(close_delim_cs, '<[{','}]>')))
   else tab(many(' ')) | put(l_ans, tab(upto(start_of_nxt_token_cs)|0))

a real mouthful, but quite compact. i haven't fully digested this yet.

thanks for participating! this has turned out to be much more
interesting and fruitful than i expected.

~transami (tom)



On Thu, 2002-07-04 at 09:15, Sean Russell wrote:
> Park Heesob wrote:
> 
> 
> >> did you take a look at sean's version, by the way?
> >> a tad more elegent although he does use regexps.
> >>
> > Sean's version fails at
> > str = 'a<b>[c]{d}"e"f {{g}} [[h]]*i**j*"k"l'
> 
> Adding two characters to the regexp fixes that.  The regexp should be
> 
>    string =~ /(.*?)(?=[<[{"*']|$)/
> 
> -- 
>  |..  "They that can give up essential liberty to obtain a little
> <|>    temporary safety deserve neither liberty nor safety."
> /|\   -- Benjamin Franklin
> /|    
>  |         
> 
>