Here is my attempt.
I made the huge assumption that most code adheres to some indentation
style. Based on that, all I really had to worry about was which lines
would usually not be indented (those that start a new scope).
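
To make the indentation idea concrete, here is a minimal sketch of a first pass that looks only at leading whitespace. The helper name first_pass and the :unknown label are made up for this illustration and do not appear in rubyish.rb; flush-left lines are left undecided because they could be either prose or a line that opens a new scope.

```ruby
# Hypothetical helper (not part of rubyish.rb): guess a line's type from
# its leading whitespace alone.
def first_pass(line)
  if line.strip.empty?
    :blank      # empty or whitespace-only line
  elsif line =~ /\A[ \t]+/
    :code       # indented, so probably inside a block
  else
    :unknown    # flush left: prose, or a line that opens a new scope
  end
end

sample = [
  "Which in 1.9 looks much better:\n",
  "\"Hello world!\\n\".each_char.inject($stdout) { |res, c|\n",
  "  res << c\n",
  "}\n",
  "\n"
]
sample.each { |l| puts "#{first_pass(l).to_s.ljust(8)} #{l}" }
```

Only the indented line is settled here; the flush-left ones need the content-based checks that come later.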
I also changed the output prefixes, since lines starting with # might
actually be comments in the code.
After dealing with this in several submissions, I decided to allow text
quoted from a previous email (usually prefixed with "> ") to be removed,
so that only the submitter's comments are included.
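
As a rough sketch of that filtering step (the helper name strip_quoted is made up for illustration), the prefix can be escaped before it goes into a regex, so that a prefix such as "." or "|" is matched literally rather than as a pattern:

```ruby
# Hypothetical helper: drop every line that starts with the quote prefix.
def strip_quoted(lines, prefix = "> ")
  # Regexp.escape keeps characters such as "." or "|" from being read
  # as regex syntax.
  quoted = /\A#{Regexp.escape(prefix)}/
  lines.reject { |line| line =~ quoted }
end

mail = [
  "> Has anyone tried inject?\n",
  "I may have missed it, but I didn't see this solution:\n",
  "  res << b.chr\n"
]
puts strip_quoted(mail)
```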
With the provided example...
172 $ ruby rubyish.rb < test.eml
[TEXT] I may have missed it, but I didn't see this solution:
[CODE] require 'enumerator'
[CODE] "Hello world!\n".enum_for(:each_byte).inject($stdout) { |res, b|
[CODE]   res << b.chr
[CODE] }
[TEXT] Which in 1.9 looks much better:
[CODE] "Hello world!\n".each_char.inject($stdout) { |res, c|
[CODE]   res << c
[CODE] }
[TEXT] Stefano
rubyish.rb:
#!/usr/bin/env ruby
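# %w builds an array of words; a plain %{} string would make
# keywords.include?(w) a substring test instead of a word lookup.
KEYWORDS = %w{
  begin do end while for in if else break redo next loop retry ensure
  rescue case when exit unless and or not class module new def raise
  puts print p
}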
THRESHOLD = 0.60
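def load_keywords(file)
  return KEYWORDS unless file
  keywords = []
  begin
    # File.foreach closes the file once every line has been read
    File.foreach(file) do |line|
      # split with no argument ignores leading/trailing whitespace
      keywords << line.split
    end
  rescue
    # Fall back to the built-in list if the file cannot be read
    keywords = KEYWORDS
  end
  keywords.flatten
end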
# Determine if the line looks as though it begins/ends a block of code
def block_statement?(text)
  # Assignment is assumed to be code
  # This can lead to several false positives...keep?
  return true if text =~ /[^=]+=[^=]/
  case text
  when /^(def|class|module)/
    true
  when /^(begin|do|while|ARG.[\[\.]|for|case|if|loop|puts|print|p[( ])/
    true
  when /(\{|do) (\|.*?\|)?\s*$/
    true
  when /^(end|\})/
    true
  else
    false
  end
end
# Assume that symbols are language items, and that words matching
# a set of keywords are language items. Everything else is text.
# Whitespace is not considered.
# Returns [# language tokens, # total tokens]
def separate(text, keywords)
  words = text.scan(/\w+/)
  s = words.join
  symbols = (text.split(//) - s.split(//)).delete_if { |x| x =~ /\s+/ }
  special = symbols.size
  total = words.size + special
  words.each { |w| special += 1 if keywords.include?(w) }
  [special, total]
end
def usage
  $stderr.puts "USAGE: #{File.basename($0)} [options] [-f <file>]"
  $stderr.puts "-"*40
  $stderr.puts " -c <prefix>"
  $stderr.puts " When text is copied from a previous email and should be"
  $stderr.puts " removed, prefix is the start of the copied lines (ex. -c '> ')"
  $stderr.puts " -s [name]"
  $stderr.puts " Split comment/code pairs into separate files"
  $stderr.puts " If provided, files are named name.N (Default: submission)"
  $stderr.puts " -t <value>"
  $stderr.puts " Threshold for assuming a line is code vs. comment."
  $stderr.puts " 0.0 - 1.0 : percent of symbols that need to be code-like (Default: 0.60)"
  $stderr.puts " -k <keyword file>"
  $stderr.puts " File containing the whitespace-separated list of keywords"
  $stderr.puts " to use in code-like matching"
  $stderr.puts " -p [on|off]"
  $stderr.puts " If on, prefix code output with [CODE] and comment"
  $stderr.puts " output with [TEXT]. (Default: on)"
  exit 42
end
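def parse_options
  opts = {:prefix => true}
  # Note: joining ARGV and splitting on '-' means option values
  # themselves must not contain a '-'
  ARGV.join.split('-').delete_if { |x| x == "" }.each do |arg|
    case arg
    when /^c(.+)/
      opts[:copied] = $1.strip
    when /^s(.*)/
      # The name is optional; fall back to the documented default
      name = $1.strip
      opts[:outfile] = name.empty? ? "submission" : name
    when /^t(.+)/
      # (.+) captures the whole value; (.) would only see the leading "0"
      t = $1.strip.to_f
      raise "Invalid Threshold #{$1}" if t < 0.0 || t > 1.0
      opts[:threshold] = t
    when /^k(.+)/
      opts[:keyfile] = $1.strip
    when /^f(.+)/
      opts[:input] = $1.strip
    when /^p(.+)/
      opts[:prefix] = $1.strip == "on"
    else
      usage
    end
  end
  opts
end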
# Begin execution
opts = parse_options
input = opts[:input] ? File.open(opts[:input]) : $stdin
keywords = load_keywords(opts[:keyfile])
threshold = opts[:threshold] || THRESHOLD
# Initial classification
classified = []
input.each do |line|
  # ignore all lines copied from previous emails
  # (the prefix is escaped so it is matched literally, not as a regex)
  next if opts[:copied] && line =~ /^#{Regexp.escape(opts[:copied])}/
  case line
  when /^\s*$/
    classified << [:blank, line]
  when /^\s*#/, /^\s+/, /^require/
    classified << [:code, line]
  else
    classified << [:text, line]
  end
end
# Make educated guesses based on content of remaining lines
estimated = []
classified.each do |type,line|
  case type
  when :code, :blank
    estimated << [type, line]
  when :text
    if block_statement?(line)
      estimated << [:code, line]
    else
      # Compare words to 'guessed' language characters
      special,total = separate(line, keywords)
      if special.to_f/total.to_f >= threshold
        estimated << [:code,line]
      else
        estimated << [:text,line]
      end
    end
  end
end
# Assume that one line of code with two non-code lines on either side
# is just an example and not part of the actual submission
size = estimated.size
(0...size).each do |i|
  next if i < 2
  next if i > size - 3
  # Each entry is a [type, line] pair; only the neighbours' types matter
  (a,_),(b,_),(type,line),(c,_),(d,_) = estimated.slice((i-2)..(i+2))
  next unless type == :code
  next if [a,b,c,d].include? :code
  estimated[i] = [:text, line]
end
# Output modified submission
n, last, out = 0, nil, $stdout
if opts[:outfile]
  file = "%s.%d" % [opts[:outfile], n]
  out = File.open(file, 'w')
end
estimated.each do |type,line|
  case type
  when :blank
    out.puts
    next
  when :text
    # A comment following code starts the next comment/code pair
    if last == :code && out != $stdout
      out.close
      file = "%s.%d" % [opts[:outfile], n += 1]
      out = File.open(file, 'w')
    end
    prefix = opts[:prefix] ? " [TEXT] " : ""
    out.puts prefix + line
  when :code
    prefix = opts[:prefix] ? " [CODE] " : ""
    out.puts prefix + line
  end
  last = type
end
out.close unless out == $stdout
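
For anyone wondering how the threshold plays out on a flush-left line, here is a small standalone sketch of the scoring. It is not the script's own code: the name code_ratio and the shortened keyword list are made up for this example, and it counts symbols with a single scan instead of the array subtraction used in separate.

```ruby
# Hypothetical, shortened keyword list for illustration only.
SKETCH_KEYWORDS = %w{do end if else in puts}

# Share of tokens that look like code: every non-word, non-space
# character counts, plus every word that is a keyword.
def code_ratio(text, keywords = SKETCH_KEYWORDS)
  words   = text.scan(/\w+/)
  symbols = text.scan(/[^\w\s]/)
  hits    = symbols.size + words.count { |w| keywords.include?(w) }
  total   = symbols.size + words.size
  total.zero? ? 0.0 : hits.to_f / total
end

[
  "Which in 1.9 looks much better:",
  "x.each { |i| puts i }"
].each do |line|
  puts "%.2f  %s" % [code_ratio(line), line]
end
```

The first line scores 3 of 9 tokens (two symbols plus the keyword "in"), well under the default 0.60, so it stays text. The second scores 6 of 10 (five symbols plus "puts") and just reaches the threshold.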