"Hal Fulton" <hal9000 / hypermetrics.com> wrote:

> I've been looking at lex.c and parse.y and parse.c, ...

Unless someone corrects me, lex.c is an unused remnant.
parse.c can be ignored (Yacc generates it from parse.y).
The real Ruby lexer is in parse.y (function yylex).

>
> How might one simply break Ruby code into tokens?
>
>
> Hal
>

While writing IRB, Keiju ISHITSUKA seems to have taken
the trouble to expose his lexer to other callers.
Thank you.

ruby-lex is a Ruby emulation of the interpreter's lexer.
(It may differ slightly from the real one.)
As part of IRB, it ships with the standard distribution.

I haven't seen any examples, so here is one. As written it
tokenizes itself, but you can point it at a script file instead.


#------------------------------------
require 'irb/ruby-lex'

include RubyToken

#File.open('testfile.rb') do |infile|  # see: lex.set_input

  tree = []
  ikeys = [:name, :op, :value, :node]

  lex = RubyLex.new
  DATA.rewind
  lex.set_input(DATA)   # (DATA) or (infile)

  line = lex.get_readed   #  text read so far ("readed" = read, past tense)
  while tk = lex.token

    tkc = tk.class.to_s.sub(/\ARubyToken::/, '')

    tkih = { :tk       => tkc,
             :line     => tk.line_no,
             :seek     => tk.seek,
             :char_no  => tk.char_no }

    # some tokens have extra attributes.
    ikeys.each do |tkk|
      tkih[tkk.to_sym] = tk.respond_to?(tkk) && tk.send(tkk)
    end

    tree << tkih

    if tkc == 'TkNL'
#      puts line unless line =~ /\A\s*\Z/  # line sep
      line = lex.get_readed  #  next line
      # Note:  keep this get_readed call even when line is unused;
      #        without it the position of NL is mis-reported [BUG?].
    end
  end

  tree.each do |tkh|
    printf("line %-3d @%3d:  %-12s", tkh[:line], tkh[:char_no], tkh[:tk])
    printf(" [%s]", tkh[:name]) if tkh[:name]

    tkh.each do |k, v|
      next unless (ikeys - [:name]).include?(k)
      printf("  %s(%s)", k, v) if v
    end
    puts
    puts if tkh[:tk] == 'TkNL'
  end

#end  # File.open
__END__
#------------------------------------


There may be other methods of interest in:

lib\ruby\1.8\irb\slex.rb
lib\ruby\1.8\irb\ruby-lex.rb
lib\ruby\1.8\irb\ruby-token.rb
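
For comparison, here is a minimal sketch of a different technique, not
ruby-lex: on later Ruby versions (1.9 and up) the standard library
includes Ripper, and Ripper.lex returns each token with its line and
column. The helper name tokenize and the hash keys are made up for
illustration; they only mirror the fields printed by the script above.

```ruby
require 'ripper'

# Hypothetical helper (not part of ruby-lex or IRB): returns one hash
# per token, using Ripper.lex from the standard library.
def tokenize(source)
  # Each Ripper.lex entry starts with [[line, column], type, text];
  # newer Rubies append a lexer state, which is ignored here.
  Ripper.lex(source).map do |(line, col), type, text|
    { :line => line, :char_no => col, :tk => type, :text => text }
  end
end

tokenize("x = 1 + 2\n").each do |t|
  printf("line %-3d @%3d:  %-12s [%s]\n",
         t[:line], t[:char_no], t[:tk], t[:text].inspect)
end
```

To tokenize a script file, pass File.read('testfile.rb') instead of the
literal string. Ripper's token names (:on_ident, :on_op, ...) differ from
the Tk* classes that ruby-lex reports.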


daz