Ron Jeffries <ronjeffries / REMOVEacm.org> wrote in message
news:<A89DE76B766253AB.34293E197BFFC2CC.809AF8B964D7ECD2 / lp.airnews.net>...
> On 11 Dec 2001 06:55:24 -0800, rbinder / rbsc.com (Bob Binder) wrote:
>
> >There is no evidence that this has (or would) have happened for an
> >actual implementation. Measuring coverage requires using a coverage
> >analyzer or equivalent instrumentation. Simply calling every method in
> >a (sub)class interface does not guarantee coverage of the statements
> >in the method implementations. Excluding toy problems, it is
> >completely impractical to attempt to assess coverage without using an
> >automated coverage analyzer.
>
> Bob, thanks for a comprehensive report on coverage. I am a bit
> surprised, however, because you seem to me to be (sort of) saying
> that theory says you can't accomplish what Beck actually accomplished
> in practice.

Unless you're referring to my saying that the example tests didn't get
statement coverage, I have no idea what you're talking about. I amended
that remark in a later post: any one of the tests will get statement
coverage, because the example code has no selection or iteration.

>
> What seems to me to happen when one goes strictly test-first is that
> one gets very high coverage, of all branches.

This can only happen in 3 cases: (1) you're extraordinarily lucky, (2)
you construct your tests from the code, or (3) you're dealing with very
simple code. With zero predicates in the scope of the test and no
peeking at the code, your chance of achieving branch coverage is very
good. With 3 predicates, I'd say about 50%; with 4 or more, about 20%.

There are many published experiments and experience reports in
peer-reviewed literature showing that "black-box"-only testing rarely
achieves better than two-thirds statement coverage. For example, awk
(Kernighan) and Knuth's TeX both have an extensive collection of
black-box tests and downloadable source, and both have been in
"production" for years.
The following experiment was done: the source was built with an
instrumenter and the test suites were run. Here's the coverage
(percent) obtained for these "excruciating" and "fiendishly clever"
test suites, which were the result of many years of development.

         Statement   Branch   Dataflow(1)   Dataflow(2)
   TeX       85        72         53            48
   AWK       70        59         48            55

The authors reviewed the uncovered blocks and determined that the
uncovered code was not limited to exception handling. See

   Joseph R. Horgan and Aditya P. Mathur, "Software testing and
   reliability." in Michael R. Lyu (ed). _Handbook of software
   reliability engineering._ Los Alamitos, Calif.: IEEE Computer
   Society Press, 1996. 531-566.

BTW, this study has a very nice analysis of the relationship between
unit test effectiveness and system reliability -- the basic result is
that both effective unit tests and system-scope tests under the
operational profile are necessary to achieve high reliability.

Keith Ray, in another post in this thread, said:

   "Someone on the XP mailing list said that their first measurement
   of unit test coverage, after doing XP for a while, was 68% (If I
   recall correctly.) They used the measurements to improve their unit
   testing, and I think it got up to over 80% or maybe higher, and it
   remained high as they continued their project."

So, a *measured* data point for the XP test strategy hits exactly the
same limit seen for years in all other kinds of testing, and it
improved with feedback. This data point does not support your
assertions.

It bears repeating: testing can't find bugs in code that isn't tested.
This is why we do coverage analysis. Coverage analysis checks that we
haven't missed some code -- it should never be a primary test design
strategy.

> what
> There are exceptions ...
> and exceptions are one of them. Often in Java you can't even get a
> statement to compile unless you embed it in a try block. This forces
> the programmer to code both sides of the block while he only has a
> test for one side.
>

Skipping tests of exception handling because it is too difficult is
like arguing that testing automotive safety equipment (air bags, seat
belts, ...) should be skipped because you have to crash a perfectly
good car, and accidents don't happen all that often anyway, and trust
me, we're certain those belts will hold.

In point of fact, if you design exception handling for testability and
use some straightforward test driver patterns to deal with this
problem, it isn't any harder to test exception code than normal case
code.

> Overall, however, the practice gets very high coverage,

Coverage of what? How do you *know* this to be the case? If you (or
anyone else) haven't run a coverage analyzer, you're just guessing
about code coverage (toy problems excluded). Why are you able to do
what Knuth and Kernighan (and many others) routinely fail to do?

There are several freeware coverage analyzers. You can download
coverage analyzers with a 30 day try-before-you-buy license from many
leading vendors. Why don't you all get some *real* data to support your
claims? This wouldn't take more than a few hours: download and install
a tool, instrument your app, run your test suite, and see what you get.
Who knows -- maybe you'll be able to prove you really can walk on
water.

> which (while I
> agree it is at best the beginning, not the end of good testing) is
> very much higher than programmers "generally" accomplish.
>
> What you didn't much address is whether this little example needs more
> tests or not. In my view, it does not (after the one for not a
> triangle, which has been provided). It certainly seems not to require
> 22.

I addressed this question elsewhere in this thread. To repeat:

(1) Using code knowledge to exclude tests for "impossible" bugs is
defensible, and with the simple example in question, makes sense. Test
design is always about finding criteria to reduce the astronomical
combinatorics of software. But code-based test design is problematic
for many reasons.
(2) The example implementation is either (a) a fragment (it relies on
other code to reject and report invalid input) and the
XP-is-smaller-hence-better conclusion is based on an apples-to-oranges
comparison, or (b) the example code is buggy and the test suite is
insufficient to reveal these bugs (it accepts values which aren't
triangles and reports that they are, and will (probably) crash for the
additional tests I proposed.)

Either way, no one has shown that a smaller and equally effective test
suite is a necessary result of the XP test approach or that the XP test
approach is necessarily superior to other test approaches in finding
bugs, other things being equal.

The 65 tests listed in my book for the Triangle class do not make any
assumptions about the implementation of the method which does the
classification (other than the class interface), and that
implementation uses line segment objects (each a pair of pixel
addresses) not ints, and lives at the bottom of a five-level hierarchy.

We test because we don't know where the bugs are -- if we knew, we
wouldn't have to test. Every test you *don't* do is a bet that there is
no bug for that input and sequence, and vice versa.

>
> With all due respect, your theoretical answer seems to me to try to
> sweep aside an interesting practical result, of which this toy is an
> example: the test-first practice seems to produce code which needs far
> fewer tests than black-box theory would suggest.

My observations are based on the problem as stated. They are not
"theoretical" in the speculative sense you seem to be implying. What
"black-box theory" are you talking about? Can you do better than a
camp-fire bogey man?

Here are some speculative conclusions drawn from facts:

I can see that the XP process provides a clear incentive to write
simpler, easier-to-test code (I agree this is a good thing and that XP
has a unique and effective process to achieve this.)
Compared with more complex or poorly designed code, such code would
require fewer tests to achieve code coverage X, and would tend to be
less buggy. However, there can be no appreciable difference in the
number of specification-based tests for a given specification and test
strategy, regardless of a good or bad implementation.

So, XP could result in smaller test suites of equivalent effectiveness
for a given specification and code coverage criterion, but only because
XP might produce smaller and simpler code in the first place. In other
words, the XP approach to testing is effective not because it is good
testing, but because it is tightly coupled with good programming.
However, as the XP approach to test design is ad hoc and code coverage
is not required, it seems to me that what XP gains from simplicity
would tend to be offset by the opportunities it misses.

> It might just be an
> artifact of the example, except that many of us who are using the
> practice observe that it works well all around.

I have yet to see anything like hard evidence and analysis that would
withstand peer-review and which would (a) substantiate the claims made
by many XP practitioners, and then (b) support a comparison of these
claims to credible baselines established for other approaches.

Every software methodology I've seen in the last 25 years has made
similar claims. My sense is the claims are mostly accurate when
discounted for the enthusiasm(?) of their advocates, but that the
specifics of the methodologies do not explain their success. For me,
that they are applied by charismatic, clever people who are already
highly skilled in their technology of choice and application space
explains most of the success. I'd be willing to bet money that a
properly constructed statistical model would show that methodological
variables are mostly noise for predicting quality and cost.
For example, the Clean-Room folks *prohibit* unit testing, and have
some published (albeit controversial) studies that show this approach
to be very effective.

>
> My belief is that test-first is in some sense a different kind of
> testing and programming from the joe codes it and jack tests it kind
> that we have used in the past.

Strange, I thought that pair programming was exactly "joe codes it and
jack tests it".

If you're not talking about unit-scope testing or small systems, are
you suggesting that the established practice of a separate test group
to do integration and system testing of large systems should be
replaced by tiny tests written by the merry coders as they develop the
app? That apps should be released when the merry coders can't think of
any more tests to not do?

As I have said many times in the past, I strongly endorse and applaud
the emphasis that XP places on testing. I have no real argument with XP
testing practices, as far as they go. I do object strongly to XP's
self-characterization of its testing slogans as necessarily sufficient
testing. I find it very disappointing that the same people who are
strong advocates of the idea of testing insist on trivializing and
ignoring established testing practice.

------------------------------------------------------------------------
Bob Binder             http://www.rbsc.com        RBSC Corporation
rbinder@ r b s c.com   Advanced Test Automation   3 First National Plz
312 214-3280 tel       Process Improvement        Suite 1400
312 214-3110 fax       Test Outsourcing           Chicago, IL 60602
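
A small illustration of the point quoted at the top of this post, that
calling a method does not guarantee coverage of the statements in its
implementation. This is only a sketch under stated assumptions:
`classify` is a hypothetical triangle classifier (not the example code
discussed in this thread), and `lines_executed` is a toy line tracer
built on Python's `sys.settrace`, standing in for a real coverage
analyzer.

```python
import sys


def classify(a, b, c):
    # Hypothetical triangle classifier, for illustration only.
    if a <= 0 or b <= 0 or c <= 0:
        return "invalid"
    if a + b <= c or a + c <= b or b + c <= a:
        return "not a triangle"
    if a == b == c:
        return "equilateral"
    if a == b or b == c or a == c:
        return "isosceles"
    return "scalene"


def lines_executed(func, calls):
    # Toy stand-in for a coverage analyzer: returns the set of source
    # line offsets (relative to the def line) that run inside func
    # while it is called once per argument tuple in calls.
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace any other function
        if event == "line":
            hit.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        for args in calls:
            func(*args)
    finally:
        sys.settrace(previous)
    return hit


if __name__ == "__main__":
    # One call exercises the method but skips every early return.
    one_call = lines_executed(classify, [(3, 4, 5)])
    # One input per outcome reaches all of the statements.
    per_outcome = lines_executed(
        classify,
        [(0, 1, 1), (1, 1, 5), (2, 2, 2), (2, 2, 3), (3, 4, 5)],
    )
    print("lines hit by one call:   ", len(one_call))
    print("lines hit by five calls: ", len(per_outcome))
    print("offsets missed by one call:", sorted(per_outcome - one_call))
```

The single call does "call every method", yet the four early-return
statements never run; only a measurement shows that. A real project
would use a proper coverage analyzer rather than a hand-rolled tracer
like this one.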