"Hal Fulton" <hal9000 / hypermetrics.com> wrote: > I've been looking at lex.c and parse.y and parse.c, ... Pending a correction, lex.c is an unused remnant. parse.c is ignorable (generated by Yacc from parse.y). The real ruby lexer is in parse.y (function yylex). > > How might one simply break Ruby code into tokens? > > > Hal > While writing IRB, Keiju ISHITSUKA seems to have taken the trouble to expose his lexer to other callers. Thank you. ruby-lex is a ruby emulation of the interpreter's lexer. (May have slight differences.) As part of IRB, it's standard distribution. I haven't seen examples -- this offering tokenizes itself but you can change to a script-file target. #------------------------------------ require 'irb\ruby-lex' include RubyToken #File.open('testfile.rb') do |infile| # see: lex.set_input tree = [] ikeys = [:name, :op, :value, :node] lex = RubyLex.new DATA.rewind lex.set_input(DATA) # (DATA) or (infile) line = lex.get_readed # read (past tense;) while tk = lex.token tkc = tk.class.to_s.sub(/\ARubyToken::/, '') tkih = { :tk => tkc, :line => tk.line_no, :seek => tk.seek, :char_no => tk.char_no } # some tokens have extra attributes. ikeys.each do |tkk| tkih[tkk.to_sym] = tk.respond_to?(tkk) && tk.send(tkk) end tree << tkih if tkc === 'TkNL' # puts line unless line == /\A\s*\Z/ # line sep line = lex.get_readed # next line # Note: read line left here otherwise # position of NL is mis-reported [BUG?]. end end tree.each do |tkh| printf("line %-3d @%3d: %-12s", tkh[:line], tkh[:char_no], tkh[:tk]) printf(" [%s]", tkh[:name]) if tkh[:name] tkh.each do |k, v| next unless (ikeys - [:name]).include?(k) printf(" %s(%s)", k, v) if v end puts puts if tkh[:tk] == 'TkNL' end #end # File.open __END__ #------------------------------------ There may be other methods of interest in: lib\ruby\1.8\irb\slex.rb lib\ruby\1.8\irb\ruby-lex.rb lib\ruby\1.8\irb\ruby-token.rb daz