Ron Jeffries <ronjeffries / REMOVEacm.org> wrote in message news:<A89DE76B766253AB.34293E197BFFC2CC.809AF8B964D7ECD2 / lp.airnews.net>...
> On 11 Dec 2001 06:55:24 -0800, rbinder / rbsc.com (Bob Binder) wrote:
> 
> >There is no evidence that this has (or would) have happened for an
> >actual implementation. Measuring coverage requires using a coverage
> >analyzer or equivalent instrumentation. Simply calling every method in
> >a (sub)class interface does not guarantee coverage of the statements
> >in the method implementations. Excluding toy problems, it is
> >completely impractical to attempt to assess coverage without using an
> >automated coverage analyzer.
> 
> Bob, thanks for a comprehensive report on coverage. I am a bit
> surprised, however, because you seem to  me to be (sort of) saying
> that theory says you can't accomplish what Beck actually accomplished
> in practice.

Unless you're referring to my saying that the example tests didn't get
statement coverage, I have no idea what you're talking about.  I
amended that remark in a later post (any one of the tests will get
statement coverage because the example code has no selection or
iteration.)

> 
> What seems to me to happen when one goes strictly test-first is that
> one gets very high coverage, of all branches. 

This can only happen in 3 cases: (1) you're extraordinarily lucky, (2)
you construct your tests from the code, or (3) you're dealing with
very simple code. With zero predicates in the scope of test and
no peeking at the code, your chance of branch coverage is very good.
With 3, I'd say about 50%, with 4 or more, about 20%.
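
A minimal sketch of how this happens, in Python. The function
shipping_cost, its three predicates, and the probe helper are all
hypothetical (nothing here comes from the example under discussion);
probe is hand-rolled instrumentation that records which way each
predicate went. The three tests are the sort one might write from the
interface description alone:

```python
taken = set()  # (predicate number, outcome) pairs seen while the tests run


def probe(n, outcome):
    """Record which way predicate n went, then pass the outcome through."""
    taken.add((n, bool(outcome)))
    return outcome


def shipping_cost(weight_kg, express, member):
    """Hypothetical function with three predicates (an assumption for illustration)."""
    cost = 5.0
    if probe(1, weight_kg > 20):
        cost += 10.0
    if probe(2, express):
        cost *= 2
    if probe(3, member and cost > 25):
        cost -= 5.0  # member discount on large orders
    return cost


# Tests written without looking at the code: light, heavy, express member.
assert shipping_cost(1, False, False) == 5.0
assert shipping_cost(30, False, False) == 15.0
assert shipping_cost(1, True, True) == 10.0

# Compare the outcomes seen against the six possible branch outcomes.
all_branches = {(n, o) for n in (1, 2, 3) for o in (True, False)}
missed = sorted(all_branches - taken)
print("branch outcomes never taken:", missed)
```

Every test passes and every predicate is evaluated, yet the true
outcome of the third predicate never occurs, so the discount statement
is never executed. Without the probe, nothing in the passing tests
would show that.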

There are many published experiments and experience reports in
peer-reviewed literature showing that "black-box" only testing rarely
achieves better than two-thirds statement coverage.  For example, awk
(Kernighan) and Knuth's TEX each have an extensive collection of
black-box tests and downloadable source, and have been in "production"
for years. The following experiment was done: the source was built with
an instrumenter and the test suites were run. Here's the coverage
obtained for these "excruciating" and "fiendishly clever" test
suites, which were the result of many years of development.

       Statement  Branch  Dataflow(1)  Dataflow(2)
TEX       85        72        53           48
AWK       70        59        48           55

The authors reviewed the uncovered blocks and determined that the
uncovered code was not limited to exception handling.  See Joseph R.
Horgan and Aditya P. Mathur, "Software testing and reliability." in
Michael R. Lyu (ed). _Handbook of software reliability engineering._
Los Alamitos, Calif.: IEEE Computer Society Press, 1996. 531-566. BTW,
this study has a very nice analysis of the relationship between unit
test effectiveness and system reliability -- the basic result is that
both effective unit and system scope tests under the operational
profile are necessary to achieve high reliability.

Keith Ray, in another post in this thread said:

"Someone on the XP mailing list said that their first measurement of
unit test coverage, after doing XP for a while, was 68% (If I recall
correctly.) They used the measurements to improve their unit testing,
and I think it got up to over 80% or maybe higher, and it remained high as they
continued their project."

So, a *measured* data point for the XP test strategy hits exactly the
same limit seen for years in all other kinds of testing, which is
improved with feedback. This data point does not support your
assertions.

It bears repeating: testing can't find bugs in code that isn't tested.
This is why we do coverage analysis. Coverage analysis checks that we
haven't missed some code -- it should never be a primary test design
strategy.

> what 
> There are exceptions ...
> and exceptions are one of them. Often in Java you can't even get a
> statement to compile unless you embed it in a try block. This forces
> the programmer to code both sides of the block while he only has a
> test for one side. 
> 

Skipping tests of exception handling because it is too difficult is
like arguing that testing automotive safety equipment (air bags, seat
belts, ...) should be skipped because you have to crash a perfectly
good car, and accidents don't happen all that often anyway, and trust
me, we're certain those belts will hold. In point of fact, if you
design exception handling for testability and use some straightforward
test driver patterns to deal with this problem, it isn't any harder to
test exception code than normal case code.
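
One such pattern, sketched in Python under stated assumptions: the
class ConfigLoader and the two reader functions are hypothetical, and
the design choice being illustrated is that the collaborator that can
fail is passed in, so a test driver can substitute one that always
fails.

```python
class ConfigLoader:
    """Hypothetical class; the reader is injected so a test can make it fail."""

    def __init__(self, read_text):
        self._read_text = read_text  # any callable taking a path, returning str

    def load(self, path, default):
        try:
            return self._read_text(path).strip()
        except OSError:
            # Exception path: fall back to the default instead of crashing.
            return default


def working_reader(path):
    """Test stub for the normal case."""
    return " value \n"


def failing_reader(path):
    """Test stub that simulates the failure we cannot easily cause for real."""
    raise OSError("simulated disk failure")


# The driver exercises both sides of the try block with equal effort.
normal = ConfigLoader(working_reader).load("app.cfg", "fallback")
recovered = ConfigLoader(failing_reader).load("app.cfg", "fallback")
print(normal, recovered)
```

The test for the handler is one stub and one assertion, the same cost
as the normal-case test. Production code would pass in a real file
reader in place of the stubs.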


> Overall, however, the practice gets very high coverage, 

Coverage of what? How do you *know* this to be the case?  If you (or
anyone else) haven't run a coverage analyzer, you're just guessing
about code coverage (toy problems excluded).  Why are you able to do
what Knuth and Kernighan (and many others) routinely fail to do?

There are several freeware coverage analyzers. You can download
coverage analyzers with a 30 day try-before-you-buy license from many
leading vendors.  Why don't you all get some *real* data to support
your claims? This wouldn't take more than a few hours: download and
install a tool, instrument your app, run your test suite, and see what
you get.  Who knows -- maybe you'll be able to prove you really can
walk on water.
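
To show roughly what "instrument, run, and see what you get" amounts
to, here is a toy statement-coverage measurement in Python. It is a
sketch, not a substitute for a real analyzer: it relies on the standard
library's sys.settrace and dis.findlinestarts, handles a single
function only, and the names executable_lines, executed_lines,
statement_coverage, and classify are hypothetical.

```python
import dis
import sys


def executable_lines(func):
    """Line numbers of the statements in func's body, per the bytecode line table."""
    code = func.__code__
    lines = {line for _, line in dis.findlinestarts(code) if line is not None}
    lines.discard(code.co_firstlineno)  # the 'def' line is not a body statement
    return lines


def executed_lines(func, inputs):
    """Run func on each argument tuple and record which of its lines executed."""
    code = func.__code__
    seen = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace any other function
        if event == "line":
            seen.add(frame.f_lineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        for args in inputs:
            func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return seen


def statement_coverage(func, inputs):
    """Percentage of func's executable lines reached by the given inputs."""
    wanted = executable_lines(func)
    return 100.0 * len(wanted & executed_lines(func, inputs)) / len(wanted)


def classify(n):
    if n < 0:
        return "negative"
    return "non-negative"


before = statement_coverage(classify, [(5,)])
after = statement_coverage(classify, [(5,), (-1,)])
print(f"before: {before:.0f}%  after: {after:.0f}%")
```

On this toy function the first suite never reaches the
return "negative" line, and adding one input closes the gap. The
measurement, not anyone's impression of the suite, is what shows it.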


> which (while I
> agree it is at best the beginning, not the end of good testing) is
> very much higher than programmers "generally" accomplish.
> 
> What you didn't much address is whether this little example needs more
> tests or not.  In my view, it does not (after the one for not a
> triangle, which has been provided). It certainly seems not to require
> 22.

I addressed this question elsewhere in this thread. To repeat: (1)
Using code knowledge to exclude tests for "impossible" bugs is
defensible, and with the simple example in question, makes sense. 
Test design is always about finding criteria to reduce the
astronomical combinatorics of software. But code-based test design is
problematic for many reasons. (2) The example implementation is either
(a) a fragment (it relies on other code to reject and report invalid
input) and the XP-is-smaller-hence-better conclusion is based on an
apples-to-oranges comparison, or (b) the example code is buggy and
the test suite is insufficient to reveal these bugs (it accepts values
which aren't triangles and reports that they are, and will (probably)
crash for the additional tests I proposed.) Either way, no one has
shown that a smaller and equally effective test suite is a necessary
result of the XP test approach or that the XP test approach is
necessarily superior to other test approaches in finding bugs, other
things being equal.
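
Case (b) can be sketched in Python. This is a hypothetical stand-in,
not the example code from this thread: classify_naive is assumed to
trust its input, and is_triangle is my statement of the specification
(positive sides that satisfy the strict triangle inequality).

```python
def classify_naive(a, b, c):
    """Hypothetical classifier that trusts its input (not the thread's code)."""
    if a == b == c:
        return "equilateral"
    if a == b or b == c or a == c:
        return "isosceles"
    return "scalene"


def is_triangle(a, b, c):
    """Specification check: positive sides, strict triangle inequality."""
    return min(a, b, c) > 0 and a + b > c and a + c > b and b + c > a


# One test per return statement: every statement and both outcomes of
# each 'if' are exercised, and all three tests pass.
assert classify_naive(2, 2, 2) == "equilateral"
assert classify_naive(2, 2, 3) == "isosceles"
assert classify_naive(3, 4, 5) == "scalene"

# Inputs chosen from the specification: none of these is a triangle.
not_triangles = [(1, 2, 3), (0, 0, 0), (1, 1, 5), (-3, -4, -5)]
assert not any(is_triangle(*t) for t in not_triangles)

# The classifier labels every one of them as if it were a triangle.
labels = [classify_naive(*t) for t in not_triangles]
print(list(zip(not_triangles, labels)))
```

The three-test suite reaches full statement and decision coverage of
classify_naive and still says nothing about the missing input check.
Coverage shows which code was run; it cannot show code that ought to
exist and doesn't.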

The 65 tests listed in my book for the Triangle class do not make any
assumptions about the implementation of the method which does the
classification (other than the class interface), and that
implementation uses line segment objects (each a pair of pixel
addresses) not ints, and lives at the bottom of a five-level
hierarchy.  We test because we don't know where the bugs are -- if we
knew, we wouldn't have to test.  Every test you *don't* do is a bet that
there is no bug for that input and sequence, and vice versa.

>
> With all due respect, your theoretical answer seems to me to try to
> sweep aside an interesting practical result, of which this toy is an
> exanple: the test-first practice seems to produce code which needs far
> fewer tests than black-box theory would suggest. 

My observations are based on the problem as stated. They are not
"theoretical" in the speculative sense you seem to be implying. What
"black-box theory" are you talking about? Can you do better than a
camp-fire bogey man?

Here are some speculative conclusions drawn from facts: I can see that
the XP process provides a clear incentive to write simpler,
easier-to-test code (I agree this is a good thing and that XP has a
unique and effective process to achieve this.) Compared with more
complex or poorly designed code, such code would require fewer tests
to achieve code coverage X, and would tend to be less buggy. However,
there can be no appreciable difference in the number of
specification-based tests for a given specification and test strategy,
regardless of a good or bad implementation. So, XP could result in
smaller test suites of equivalent effectiveness for a given
specification and code coverage criterion, but only because XP might
produce smaller and simpler code in the first place. In other words,
the XP approach to testing is effective not because it is good
testing, but because it is tightly coupled with good programming.
However, as the XP approach to test design is ad hoc and code coverage
is not required, it seems to me that what XP gains from simplicity
would tend to be offset by the opportunities it misses.


>It might just be an
> artifact of the example, except that many of us who are using the
> practice observe that it works well all around. 

I have yet to see anything like hard evidence and analysis that would
withstand peer-review and which would (a) substantiate the claims made
by many XP practitioners, and then (b) support a comparison of these
claims to credible baselines established for other approaches.

Every software methodology I've seen in the last 25 years has made
similar claims. My sense is the claims are mostly accurate when
discounted for the enthusiasm(?) of their advocates, but that the
specifics of the methodologies do not explain their success. For me,
that they are applied by charismatic, clever people who are already
highly skilled in their technology of choice and application space
explains most of the success. I'd be willing to bet money that a
properly constructed statistical model would show that methodological
variables are mostly noise for predicting quality and cost. For
example, the Clean-Room folks *prohibit* unit testing, and have some
published (albeit controversial) studies that show this approach to be
very effective.

> 
> My belief is that test-first is in some sense a different kind of
> testing and programming from the joe codes it and jack tests it kind
> that we have used in the past.

Strange, I thought that pair programming was exactly "joe codes it and
jack tests it".  If you're not talking about unit-scope testing or
small systems, are you suggesting that established practice of a
separate test group to do integration and system testing of large
systems should be replaced by tiny tests written by the merry coders
as they develop the app? That apps should be released when the merry
coders can't think of any more tests to not do?

As I have said many times in the past, I strongly endorse and applaud
the emphasis that XP places on testing. I have no real argument with
XP testing practices, as far as they go. I do object strongly to XP's
self-characterization of its testing slogans as necessarily sufficient
testing. I find it very disappointing that the same people who are
strong advocates of the idea of testing insist on trivializing and
ignoring established testing practice.
  

------------------------------------------------------------------------
Bob Binder           http://www.rbsc.com      RBSC Corporation
rbinder@ r b s c.com Advanced Test Automation 3 First National Plz 
312 214-3280  tel    Process Improvement      Suite 1400
312 214-3110  fax    Test Outsourcing         Chicago, IL 60602