Here is my solution.
I found after some experimenting that it is surprisingly easy to get quite
good results. My solution outputs a stand alone ruby program that prints
the input. Here is a result:
$ ruby tumbledryer.rb tsql
r = {
"PKi" => "PRIMARY KEY (`id`)",
"d" => "`description`",
"NNd" => "NOT NULL default",
"v5NNd" => "varchar(50) NOT NULL default '',",
"ii1NNa" => "`id` int(11) NOT NULL auto_increment,",
"CT" => "CREATE TABLE",
"TMA" => "TYPE=MyISAM AUTO_INCREMENT=",
"i1NNd0" => "int(11) NOT NULL default '0',",
}
puts <<'EOT'.gsub(/\#(\w*)\#/){$1==""?"#":(r[$1]||"##$1#")}.chop
#CT# `authors` (
#ii1NNa#
`firstname` #v5NNd#
`name` #v5NNd#
`nickname` #v5NNd#
`contact` #v5NNd#
`password` #v5NNd#
#d# text NOT NULL,
#PKi#
) #TMA#3 ;
#CT# `categories` (
#ii1NNa#
`name` varchar(20) #NNd# '',
#d# varchar(70) #NNd# '',
#PKi#
) #TMA#3 ;
#CT# `categories_documents` (
`category_id` #i1NNd0#
`document_id` #i1NNd0#
) TYPE=MyISAM ;
#CT# `documents` (
#ii1NNa#
`title` #v5NNd#
#d# text NOT NULL,
`author_id` #i1NNd0#
`date` date #NNd# '0000-00-00',
`filename` #v5NNd#
#PKi#,
KEY `document` (`title`)
) #TMA#14 ;
EOT
It works as follows: First you need a good way to get all replacement
candidates, this is what my get_chunks method does:
# returns all substrings of line, that might be replacement candidates
# if candidates appear multiple times, they are returned multiple times
def get_chunks(line)
# split the line
parts = line.strip.scan /\w+|\W/
# build all combinations
(1..(parts.size)).collect do |len|
(0..(parts.size - len)).collect do |idx|
parts[idx, len].join ""
end
end.flatten.reject do |el|
el.length < @min_len || # to short
# reject (nonalpha separated by) whitespace at beginning and end
(el =~ /(^\W?\s)|(\s\W?$)/)
end
end
Then the rest is easy ;-).
Build a frequency hash of all those chunks from the entire text.
Then discard all chunks that don't appear often enough.
Then sort the remaining chunks by their length (longest first).
Then for each chunk (in sort order), scan the text and check if it appears
often enough, if yes replace it, if no discard it.
That's all.
My TumbleDRYer has three parameters:
min_len: is the minimum length of replacement chunks (>= 2)
min_freq: determines how often a chunk has to appear to be replaced
escape_char: may not be alphanumeric and not "\0", "\\", "\n", ...
The default values are 8, 3 and "#", they can be changed on the command
line:
$ ruby tumbledryer.rb '7-2-$' tsql
tumpledryer.rb reads ARGF, so this is also possible:
$ cat tsql | ruby tumbledryer.rb
The complete code is attached and below.
Thanks for this very interesting quiz.
Dominik
tumbledryer.rb:
class TumbleDRYer
# min_len is the minimum length of replacement chunks (>= 2)
# min_freq determines how often a chunk has to appear to be replaced
# escape_char may not be alphanumeric and not "\0", "\\", "\n", ...
def initialize(min_len = 8, min_freq = 3, escape_char = "#")
@min_len = [min_len.to_i, 2].max
@min_freq = min_freq.to_i
@escape_char = escape_char.to_s[0, 1]
@escape_char = "#" if @escape_char =~ /\w|\\|\0|\n/
end
# returns all substrings of line, that might be replacement candidates
# if candidates appear multiple times, they are returned multiple times
def get_chunks(line)
# split the line
parts = line.strip.scan /\w+|\W/
# build all combinations
(1..(parts.size)).collect do |len|
(0..(parts.size - len)).collect do |idx|
parts[idx, len].join ""
end
end.flatten.reject do |el|
el.length < @min_len || # to short
# reject (nonalpha separated by) whitespace at beginning and end
(el =~ /(^\W?\s)|(\s\W?$)/)
end
end
# get all replacement candidates (long enough chunks, that appear often
# enough) for text
def get_dry_candidates(text)
freq = Hash.new { |h, k| h[k] = 0 }
text.each do |line|
get_chunks(line).each do |chunk|
freq[chunk] += 1
end
end
freq.keys.reject { |chunk| freq[chunk] < @min_freq }
end
# generates a short alphanumeric name for str
def short_name(str)
res = str.scan(/\w+/).map { |w| w[0,1] }.join("")
res.empty? ? "a" : res
end
# returns a ruby program that prints text
# the representation of text in that program will be dried
# text may not contain "\0"
def dry(text)
text = text.dup
repl = get_dry_candidates(text).sort_by { |c| -c.size }
snames, used_repl = {}, {}
# do replacements
repl.each do |r|
if text.scan(r).size >= @min_freq # then use it
sn = short_name(r)
sn.succ! while snames.has_key? sn
# extra escape the short names, to avoid resubstitution by a
# shorter replacement ;-)
snames[sn] = sn.split(//).join("\0")
text.gsub!(r, "\0\0#{snames[sn]}\0\0")
used_repl[sn] = r
end
end
# escape the escape char
text.gsub!(@escape_char, @escape_char * 2)
# undo the extra short name escaping
snames.each do |sn, sn_esc|
text.gsub!("\0\0#{sn_esc}\0\0", @escape_char + sn + @escape_char)
end
# build result
res = ["r = {"]
used_repl.each do |s, r|
res << " #{s.inspect} => #{r.inspect},"
end
res << "}" << ""
# the "unDRYer"
esc = "\\#{@escape_char}"
res << "puts <<'EOT'.gsub(/#{esc}(\\w*)#{esc}/)" +
%Q!{$1==""?"#{@escape_char}":(r[$1]||! +
%Q!"#{@escape_char}\#$1#{@escape_char}")}.chop!
res << text << "EOT"
res.join "\n"
end
end
if $0 == __FILE__
args = []
if ARGV.first =~ /^(\d+)-(\d+)-(\W)$/
args = [$1.to_i, $2.to_i, $3]
ARGV.shift
end
puts TumbleDRYer.new(*args).dry(ARGF.read)
end