Probably the first question to ask when looking at this quiz is, what 
makes this
report so unfriendly, not just to Excel, but to work with in general?

1. There's a lot of junk we don't need.  Headers, footers, dashed 
lines.  All
that needs to go, so we can focus on the data.

2. The data itself isn't pure.  Have a look at these two fragments:

	50,589     358 %%%%  56,652     430 %%%%
	924,553 704,028   31  1,612K  1,226K

Those aren't numbers in there.  Yuck.

3. This report is ALMOST a fixed width file, which would help a lot, 
but it
fails in a few areas:

	TEE_X_101 26-Mar-2002 15:26                        1: GE
	
	
	                                         ---------------
	                                         Current  LastYr
	Part Code       Description               Period  Period
	--------------- ------------------------ ------- -------
	  Salesperson  00 NOBODY
	    Customer  1036 COMPANY 501
	      SA Sort Code  1.43 WATER DOLLS
	78-143FS        17/8# SS MODEL                10       0
	                                         ------- -------
	      SA Sort Code subtotals                  10       0
	
	      SA Sort Code  3.3 REMOTE CONTROL CARS
	74270           Model 35357-DBL                0       0
	921137-73       LARGE 19 X 18 X 14            30       0
	                                         ------- -------
	      SA Sort Code subtotals                  30       0
	                                         ------- -------
	    Customer subtotals                        40       0
	
	    Customer  14457 COMPANY 518
	      SA Sort Code  11.5 KITCHEN SETS
	943437          19/8# SS MODEL                 0       0
	                                         ------- -------
	      SA Sort Code subtotals                   0       0
	                                         ------- -------
	    Customer subtotals                         0       0
	
	    Customer  1824 COMPANY 529
	      SA Sort Code  19.4 SLIDES
	8394            2.5 OZ                         0      20
	8341            .21 SIZE PLASTIC, NO BATT      0      10
	18363           .29 SIZE PLASTIC, NO BATT      0      24
	                                         ------- -------
	      SA Sort Code subtotals                   0      54
	                                         ------- -------
	    Customer subtotals                         0      54

Use the dashed lines as a guide and you can see how the sub-headers
(Salesperson, Customer, and SA Sort Code) don't fit into columns.  
Subtotal
lines have the same problem.  Also notice that the Description content 
is
allowed to overflow it's column by one character.  That's a mess.

4. This one is less obvious, but equally real.  The data isn't very 
useful in
this format.  Christian Neukirchen did a nice job of recognizing this 
and taking
steps to correct it.

Now that we've identified the issues, let's work through them.  When I 
began
playing with a foreign report like this, I do just that:  Play around a 
bit.  My
first script may not end up being the one I keep, but I need to get 
familiar
with the data.  I'll show you what works for me, step by step.

Whenever I think, "I need to translate this into..." I mentally switch 
that to:

	while line = ARGF.gets  # traditional Unix filter
	
		print line
	end

It may not be much, but it's a start.  Now, let's tackle problem number 
1 and
jettison the junk.  The majority of it is those silly page headers.  
Those
should be easy enough to drop.  We stop printing when we enter a header 
and
start up again when we leave.

I think in patterns, so I would then start looking for a way to spot 
that I'm
entering a header.  I'm leery to anchor against arbitrary text, if I 
don't have
too, so I would like to find something better.  If you're familiar with 
these
page print style reports, you might have a good guess of what comes 
before that
header.  Even if you're not, it pays to look at a foreign document with 
the
"invisible characters" turned on, if your editor can do that.  Here's 
what a
header really looks like:

	      SA Sort Code subtotals                  20       0
	
	\f  <-- Look what's hiding here!  A form-feed character.
	TEE_X_101 26-Mar-2002 15:26                        1: GE
	
	
	                                         ---------------
	                                         Current  LastYr
	Part Code       Description               Period  Period
	--------------- ------------------------ ------- -------
	      SA Sort Code  5.207 BUILDING BLOCKS

That makes spotting the headers easy.  The very first one doesn't have 
the
form-feed, but that's not hard to work around.

Once we're in a header, the last line is just a solid line of dashes 
and spaces.
  Just be careful to distinguish it from the other dashed line in the 
header and
you can build a solution:

	header = true  # start in header
	while line = ARGF.gets
		if header  # we're inside the header
			header = false if line =~ /^-[- ]+-$/  # watch for the end
		else  # we're not in the header
			if line =~ /\f/  # watch for header beginning
				header = true
				next
			end
			
			print line
		end
	end

If you run that on the report, you'll see that we did indeed remove all 
the page
headers.  I showed (and even submitted) that version because it's easy 
to break
down and understand, but if all that code bothers you, we can shorten 
it up:

	while line = ARGF.gets
		# the next line skips all headers
		next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
			
		print line
	end

That does exactly the same thing, making use of Ruby's versatile 
Ranges.  The
range acts as a toggle here.  When the first condition becomes true, 
the Range
will evaluate to true (skipping lines with next()) until the second 
condition
becomes true.  Then the process begins again.  My first condition 
searches for
line one of the report or a form-feed character.  The second condition 
looks for
the solid line of dashes.  Again, the code works the same, so you can 
use
whatever you are comfortable with.

We're not done dropping junk yet!  Let's toss out dashed lines and the 
report
footer.  Both are trivial:

	while line = ARGF.gets
		next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
		next if line =~ /--$/  # skip dashed lines
			
		print line
	
		break if line =~ /^Report Totals/  # stop with "Report Totals"
	end

What's the report look like now?

	  Salesperson  00 NOBODY
	    Customer  1036 COMPANY 501
	      SA Sort Code  1.43 WATER DOLLS
	78-143FS        17/8# SS MODEL                10       0
	      SA Sort Code subtotals                  10       0
	
	      SA Sort Code  3.3 REMOTE CONTROL CARS
	74270           Model 35357-DBL                0       0
	921137-73       LARGE 19 X 18 X 14            30       0
	      SA Sort Code subtotals                  30       0
	    Customer subtotals                        40       0
	
	    Customer  14457 COMPANY 518
	      SA Sort Code  11.5 KITCHEN SETS
	943437          19/8# SS MODEL                 0       0
	      SA Sort Code subtotals                   0       0
	    Customer subtotals                         0       0

We're making progress.  It's almost readable.

There are four types of lines left in this report that we need to deal 
with.
Yes, I'm sure it's four and not three.  Sub-headers, product sales, 
subtotals
and blank lines.  At this point, I would try to build selectors to 
handle each
of them:

	while line = ARGF.gets
		next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
		next if line =~ /--$/
		
		# handle blank lines...
		if line !~ /\S/
			print "  BLANK:  ", line
		
		# handle subtotals
		elsif line =~ /^.+?totals(?:\s+(?:-?[\d,]+K?|%+)){12}\s*$/i
			print "  TOTAL:  ", line
		
		# hande product sales
		elsif line =~ /^\S/
			print "PRODUCT:  ", line
		
		# handle subheaders
		else
			print " HEADER:  ", line
		end
	
		break if line =~ /^Report Totals/
	end

Uh oh, scary Regexp in there.  Let's break it down:

	^ # the beginning of the line
	    .+?    # skip some of the beginning looking for...
	    totals # the word totals (to match subtotals or Report Totals with 
/i)
	    (?: # grouping
	        \s+ # some whitespace
	        (?: # start group
	            -?     # an optional minus
	            [\d,]+ # a comma separated digit sequence
	            K?     # an optional trailing K
	            |      # or ...
	            %+     # a run of % signs
	        ) # end group
	    ){12} # end group -- find exactly 12 of those
	    \s*   # optional trailing whitespace
	$ # the end of the line

In short, it's just looking for the word "totals" followed by 12 things 
that
could be numbers in this report.  You really don't even need an 
expression that
complex, in this case, but since the headers can hold arbitrary text 
(company
names) I want to make sure I'm matching what I'm looking for.  Ruby 
will even
let you drop that commented Regexp in your code with /x if you like.

Notice that I'm being careful to use generic patterns.  For example, I 
never
match Salesperson, Company or SA Sort Code.  What if the company 
happens to have
another report that includes Brokers?  (Trick question.  I know it to 
be true in
this case!)  These patterns will treat that like any other sub-header 
and just
work as expected.

So does that successfully locate everything?

	 HEADER:    Salesperson  00 NOBODY
	 HEADER:      Customer  1036 COMPANY 501
	 HEADER:        SA Sort Code  1.43 WATER DOLLS
	PRODUCT:  78-143FS        17/8# SS MODEL                10       0
	  TOTAL:        SA Sort Code subtotals                  10       0
	  BLANK:
	 HEADER:        SA Sort Code  3.3 REMOTE CONTROL CARS
	PRODUCT:  74270           Model 35357-DBL                0       0
	PRODUCT:  921137-73       LARGE 19 X 18 X 14            30       0
	  TOTAL:        SA Sort Code subtotals                  30       0
	  TOTAL:      Customer subtotals                        40       0
	  BLANK:
	 HEADER:      Customer  14457 COMPANY 518
	 HEADER:        SA Sort Code  11.5 KITCHEN SETS
	PRODUCT:  943437          19/8# SS MODEL                 0       0
	  TOTAL:        SA Sort Code subtotals                   0       0
	  TOTAL:      Customer subtotals                         0       0

Sure does.  Now it's easy to just transform that data into CSV, which 
will fix
our not-quite-fixed-width-columns problem:

	require "csv"
	
	while line = ARGF.gets
		next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
		next if line =~ /--$/
		
		if line !~ /\S/
			puts CSV.generate_line([""])
	
		elsif line =~ /^(.+?totals)((?:\s+(?:-?[\d,]+K?|%+)){12})\s*$/i
			puts CSV.generate_line(["", $1.lstrip, *$2.split(" ")])
	
		elsif line =~ /^(\S+)\s+(.+?)((?:\s+(?:-?[\d,]+K?|%+)){12})\s*$/
			puts CSV.generate_line([$1, $2, *$3.split(" ")])
	
		else
			puts CSV.generate_line(["", line.strip])
		end
	
		break if line =~ /^Report Totals/
	end

I snuck another nasty Regexp in there didn't I?  Actually, if you look 
closely,
you'll see that it's nearly identical to the other one, save that it 
also
catches product descriptions.  It's just there to help me locate all 
the parts
of the line.  After capturing the parts, I use split() to break them 
up, strip()
to clean them, and let Ruby's CSV handle the rest.

If you open the CSV output from this in Excel, you'll see that I've 
just shifted
sub-headers and subtotals into the second column.  That pretty much 
maintains
the familiar report format, while eliminating the column breakup issues.

That's really a lot of the clean up work done already.  The nagging 
problem is
that Excel doesn't consider 1,612K or %%%% numbers.  We can fix the 
first one
easy enough, but what the heck is causing the %%%%s?

More detective work is needed.

Those %%%% entries only happen in the small "Pct Var" column, which is 
four
characters wide.  "Var" huh?  Variance maybe?  If you look around a 
bit, the
formula is easy to calculate from a row like this:

	                                         Current  LastYr  Pct
	Part Code       Description               Period  Period  Var
	--------------- ------------------------ ------- ------- ----
	
	613433          .22 SIZE PLASTIC, NO BATT     65       1 6400

That looks like (65 - 1) * 100.  That doesn't make much sense though, 
because
we're probably talking about a percentage.  More likely is (65 - 1) / 1 
* 100.
You can test that on other rows to prove that it is indeed correct.

Okay, so why are we getting %%%%s?

	Current  LastYr  Pct
	    YTD     YTD  Var
	------- ------- ----
	
	    125       1 %%%%

(125 - 1) / 1 * 100 = 12400.  That's five characters and we're only 
allowed
four.  It's a column overflow problem.  Our CSV approach has no such 
limitation,
so we can recalculate those numbers and add them to the output.

Here's a routine to clean up the numbers:

	def clean( numbers )
		# turn them back into numbers...
		numbers.map! do |n|
			n.gsub!(",", "")  # drop commas
			if n.sub!(/K$/, "")  # when there's a K...
				n.to_i * 1000    # multiple by 1,000
			elsif n !~ /%/
				n.to_i           # numify
			else
				n
			end
		end
		
		# recalculate %%%% columns...
		numbers.each_with_index do |n, i|
			if n.to_s =~ /%/
				numbers[i] = ( (numbers[i - 2] - numbers[i - 1]) /
							   numbers[i - 1].to_f * 100 ).to_i
			end
		end
		
		numbers  # return results
	end

To use that, we just call it twice:

	while line = ARGF.gets
		next if ($. == 1 or line =~ /\f/) .. line =~ /^-[- ]+-$/
		next if line =~ /--$/
		
		if line !~ /\S/
			puts CSV.generate_line([""])
	
		elsif line =~ /^(.+?totals)((?:\s+(?:-?[\d,]+K?|%+)){12})\s*$/i
			# first call...
			puts CSV.generate_line(["", $1.lstrip, *clean($2.split(" "))])
	
		elsif line =~ /^(\S+)\s+(.+?)((?:\s+(?:-?[\d,]+K?|%+)){12})\s*$/
			# second call...
			puts CSV.generate_line([$1, $2, *clean($3.split(" "))])
	
		else
			puts CSV.generate_line(["", line.strip])
		end
	
		break if line =~ /^Report Totals/
	end

The rest of my solution (not shown) was boring header parsing, to print 
out the
column labels and the report "Period".  You can follow the link in the 
sidebar
if you want to examine that.

At this point, we've solved problems 1, 2 and 3 from the beginning of 
this
summary.  I want to talk a little about problem 4, then I promise to 
end this
ridiculously long message.

Christian Neukirchen's short solution (not shown) does away with 
sub-header and
subtotal lines.  That filter tacks on the Salesperson, Customer and SA 
Sort Code
to each product entry.  This makes all the needed information available 
on every
line.  This is a big win for using Excel's filters, making the report a 
lot
easier to examine and change.

When I did this project for work, I did the same thing.  However, I put 
the
metadata at the back of the line and left in the sub-headers, subtotals 
and
blanks.  I also added a column to identify each type of line:  Header, 
total,
data or blank.  This gave me all the same benefits as Christian's code 
has,
while keeping the familiar format.  I could filter down to the "data" 
lines to
have nearly identical results.  This feature turned out to be a big hit,
drastically increasing company productivity.

Sorry for sending in a quiz that "looked too much like real work".  I 
promise,
we're back to fun and games tomorrow...