All Bugs Great and Small
(Acknowledgment: this article came out of a discussion with Manuel Oriol, Carlo Furia and Yi Wei. The material is largely theirs but the opinions are mine.)
A paper on automatic testing, submitted some time ago, received the following referee comment:
The case study seems unrealistic and biased toward the proposed technique. 736 unique faults found in 92 classes means at least 8 unique faults per class at the same time. I have never seen in all my life a published library with so many faults …
This would be a good start for a discussion of what is wrong with refereeing in computer science today (on the negativism of our field see [1]); we have a referee who mistakes experience for expertise, prejudice for truth, and refuses to accept carefully documented evidence because “in all his life”, presumably a rich and rewarding life, he has never seen anything of the sort. That is not the focus of the present article, however; arrogant referees eventually retire and good papers eventually get published. The technical problems are what matters. The technical point here is about testing.
Specifically, what bugs are worth finding, and are high bug rates extraordinary?
The paper under review was a step in the work around the automatic testing tool AutoTest (see [2] for a slightly older overall description and [3] for the precise documentation). AutoTest applies a fully automatic strategy, exercising classes and their routines without the need to provide test cases or test oracles. What makes such automation possible is the combination of random generation of tests and reliance on contracts to determine the success of tests.
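To make the role of contracts as an oracle concrete, here is a minimal sketch in Eiffel of a contract-equipped class (an invented example, not part of AutoTest or of EiffelBase): any randomly generated call whose arguments satisfy the precondition but whose execution violates a postcondition or the class invariant reveals, by definition, a fault in the routine, with no need for anyone to write down expected outputs.

class
    ACCOUNT

feature -- Access

    balance: INTEGER
            -- Current balance.

feature -- Element change

    deposit (amount: INTEGER)
            -- Add `amount' to the balance.
        require
            amount_non_negative: amount >= 0
        do
            balance := balance + amount
        ensure
            added: balance = old balance + amount
        end

    withdraw (amount: INTEGER)
            -- Remove `amount' from the balance.
        require
            amount_non_negative: amount >= 0
            amount_available: amount <= balance
        do
            balance := balance - amount
        ensure
            removed: balance = old balance - amount
        end

invariant
    balance_non_negative: balance >= 0

end

With assertion monitoring turned on, a violated ensure or invariant clause during a random session is the kind of event reported as a fault; a violated require in a call generated directly by the tool simply means that particular test case was not applicable.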
For several years we have regularly subjected libraries, in particular the EiffelBase data structure library, to long AutoTest sessions, and we keep finding bugs (the better term is faults). The fault counts are significant; it is they that caught the referee’s eye. In fact we have had such comments before: “I don’t believe your fault counts for production software; your software must be terrible!”
Well, maybe.
My guess is that in fact EiffelBase has no more bugs, and possibly far fewer, than other “production” code. The difference is that the AutoTest framework performs far more exhaustive testing than is usually practiced.
This is only a conjecture; unlike the referee I do not claim any special powers that make my guesses self-evident. Until we get test harnesses comparable to AutoTest for environments other than Eiffel and, just as importantly, libraries that are fully equipped with contracts, enabling the detection of bugs that otherwise might not come to light, we will not know whether the explanation is the badness of EiffelBase or the goodness of AutoTest.
What the concrete, incontrovertible evidence does demonstrate is that systematic random testing finds faults that human testers typically miss. In a 2008 paper [4] with Ilinca Ciupa, Manuel Oriol and Alexander Pretschner, we ran AutoTest on a number of classes and compared its results with those of human testers (as well as with actual bug reports from the field, since this was released software). The two categories turned out to be complementary: human testers find faults that are still beyond the reach of automated tools, but they almost never find certain faults that AutoTest, with its stubborn dedication to leaving no stone unturned, routinely uncovers. We keep being surprised by bugs that AutoTest detects and that no one had ever thought of testing for.
A typical set of cases that human programmers seldom test, but which frequently leads to the uncovering of bugs, involves boundary values. AutoTest, in its “random-plus” strategy, always exercises special values of every type, such as MAXINT, the maximum representable integer. Programmers don’t. They should — all testing textbooks tell them so — but they just don’t, and perhaps they can’t, as the task is often too tedious for a manual process. It is remarkable how many routines using integers go berserk when you feed them MAXINT or its negative counterpart. Some of the fault counts that seem so outrageous to our referee come directly from trying such values.
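As a minimal illustration (an invented routine, not taken from EiffelBase, and assuming the usual wrap-around behavior of machine integers), consider a midpoint computation equipped with the postcondition any caller would expect; the defect is the same overflow that, as a comment below recalls, lurks in most published implementations of binary search:

class
    INTERVAL_UTILITIES

feature -- Basic operations

    midpoint (low, high: INTEGER): INTEGER
            -- Midpoint of the integer interval [low, high].
        require
            ordered: low <= high
        do
            -- Looks innocuous, but `low + high' overflows when the sum
            -- exceeds the largest representable INTEGER; the wrapped-around
            -- result then violates the postcondition.
            Result := (low + high) // 2
        ensure
            not_below: Result >= low
            not_above: Result <= high
        end

end

With everyday argument values the routine behaves; feed it two values close to MAXINT, as a random-plus strategy systematically will, and the postcondition fails, flagging a fault that hand-picked “reasonable” test data would never expose.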
Some would say the cases are so extreme as to be insignificant. Wrong. Many documented software failures and catastrophes are due to untested extreme values. Perhaps the saddest is the case of the Patriot anti-missile system, which at the beginning of the first Gulf war was failing to catch Scud missiles, resulting in one case in the death of twenty-eight American soldiers in an army barracks. The failure was traced to a software error [5]. To predict the position of the incoming missile, the computation multiplied time by velocity. Time was counted in multiples of the time unit, a tenth of a second; but 1/10 has no finite binary representation, so the conversion factor, stored in a 24-bit register, was truncated, hence approximated. After enough time, long enough to elapse on the battlefield but longer than what the tests had exercised, the accumulated error became so large as to cause a significant — and in the event catastrophic — deviation. The unique power of automatic testing is that, unlike human testers, it is not encumbered by a priori notions of a situation being extreme or unlikely. It tries all the possibilities it can.
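In round numbers, following the analysis in [5]: 1/10 truncated to 24 binary digits is off by about 0.000000095; after roughly 100 hours of continuous operation, that is 3,600,000 tenths of a second, the clock has drifted by about 0.34 seconds; and at a Scud’s speed of roughly 1,700 meters per second, 0.34 seconds translates into a predicted position several hundred meters away from the actual one, outside the window in which the radar looks to confirm the incoming missile.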
The following example, less portentous in its consequences but just as instructive, is directly related to AutoTest. For his work on model-based contracts [6], performed as part of his PhD completed in 2008 at ETH, Bernd Schoeller developed classes representing the mathematical notion of set. There were two implementations; it turned out that one of them, say SET1, uses data structures that make the subset operation ⊆ easy to program efficiently; in the corresponding class, the superset operation a ⊇ b is then simply implemented as b ⊆ a. In the other implementation, say SET2, it is the other way around: ⊇ is implemented directly, and a ⊆ b is implemented as b ⊇ a. This all uses a nice object-oriented structure, with a general class SET defining the abstract notion and the two implementations inheriting from it.
Now you may see (if you have developed a nose for automated testing) where this is heading: AutoTest knows about polymorphism and dynamic binding, and tries all the type combinations that make sense. One of the generated test cases has two variables s1 and s2 of type SET, and tries out s2 ⊆ s1; in one of the combinations that AutoTest exercises, s1 is dynamically and polymorphically of type SET1 and s2 of type SET2. The version of ⊆ that gets called is then SET2’s, which actually evaluates s1 ⊇ s2; but this invokes the SET1 version of ⊇, which delegates back to SET2’s ⊆, and so on. The process would go on forever, were it not for a timeout in AutoTest that uncovers the fault. Bernd Schoeller had tried AutoTest on these classes not in the particular expectation of finding bugs, but more as a favor to the then incipient development of AutoTest, to see how well the tool could handle model-based contracts. The uncovering of the fault, a testament to the power of relentless, systematic automatic testing, surprised us all.
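Here is a sketch in Eiffel of the structure just described (simplified reconstructions, not Bernd Schoeller’s actual classes; named features stand in for the ⊆ and ⊇ operators, and the representation-specific algorithms are stubbed out):

deferred class
    SET [G]

feature -- Comparison

    is_subset (other: SET [G]): BOOLEAN
            -- Is every element of current set also in `other'? (The ⊆ of the text.)
        deferred
        end

    is_superset (other: SET [G]): BOOLEAN
            -- Does current set contain every element of `other'? (The ⊇ of the text.)
        deferred
        end

end

class
    SET1 [G]

inherit
    SET [G]

feature -- Comparison

    is_subset (other: SET [G]): BOOLEAN
            -- Direct, representation-specific algorithm.
        do
            Result := True  -- Placeholder for the real traversal of SET1's data structure.
        end

    is_superset (other: SET [G]): BOOLEAN
            -- a ⊇ b implemented as b ⊆ a.
        do
            Result := other.is_subset (Current)
        end

end

class
    SET2 [G]

inherit
    SET [G]

feature -- Comparison

    is_superset (other: SET [G]): BOOLEAN
            -- Direct, representation-specific algorithm.
        do
            Result := True  -- Placeholder for the real traversal of SET2's data structure.
        end

    is_subset (other: SET [G]): BOOLEAN
            -- a ⊆ b implemented as b ⊇ a.
        do
            Result := other.is_superset (Current)
        end

end

With s1 attached to a SET1 object and s2 to a SET2 object, the call s2.is_subset (s1) executes SET2’s is_subset, which calls SET1’s is_superset, which calls SET2’s is_subset again, and so on. Each class is perfectly sensible on its own; only the polymorphic combination, which a human tester has little reason to try, sends the two delegations chasing each other.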
In this case no contract was violated; the problem was infinite recursion, due to a use of O-O techniques that, for all its elegance, overlooked a pitfall. In most cases, however, AutoTest finds the faults through violated postconditions or class invariants. This is one more reason to be cautious about sweeping generalizations of the kind “I do not believe these bug rates; no serious software that I have seen shows anything of the sort!”. Contracts express semantic properties of the software, which the designer has taken care of stating explicitly. In run-of-the-mill code that does not benefit from such care, lots of things can go wrong yet remain undetected during testing, only to cause havoc much later during some actual execution.
When you find such a fault, it is irrelevant that the case is extreme, or special, or rare, or trivial. When a failure happens it no longer matters that the fault was supposed to be rare; and you will only know how harmful it is when you deal with the consequences. Testing, single-mindedly devoted to the uncovering of faults [7], knows no such distinctions: it hunts all bugs great and small.
References
[1] The nastiness problem in computer science, article on the CACM blog, 22 August 2011, available here.
[2] Bertrand Meyer, Ilinca Ciupa, Andreas Leitner, Arno Fiva, Yi Wei and Emmanuel Stapf: Programs that Test Themselves, in IEEE Computer, vol. 42, no. 9, pages 46-55, September 2009, also available here.
[3] Online AutoTest documentation, available here at docs.eiffel.com.
[4] Ilinca Ciupa, Bertrand Meyer, Manuel Oriol and Alexander Pretschner: Finding Faults: Manual Testing vs. Random+ Testing vs. User Reports, in ISSRE ’08, Proceedings of the 19th IEEE International Symposium on Software Reliability Engineering, Redmond, November 2008, available here.
[5] US General Accounting Office: Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia, report GAO/IMTEC-92-26, February 4, 1992, available here.
[6] Bernd Schoeller, Tobias Widmer and Bertrand Meyer: Making Specifications Complete Through Models, in Architecting Systems with Trustworthy Components, eds. Ralf Reussner, Judith Stafford and Clemens Szyperski, Lecture Notes in Computer Science, Springer-Verlag, 2006, available here.
[7] Bertrand Meyer: Seven Principles of Software Testing, in IEEE Computer, vol. 41, no. 10, pages 99-101, August 2008, available here.
[…] Arrogant referees eventually retire and good papers eventually get published […]
This reminds me of the post by Joshua Bloch about the fact that most implementations of binary search are buggy, because they fail to properly address overflow when computing “mid = (low + high) / 2”. http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html
That’s exactly right. Here’s another, recent example: http://googleonlinesecurity.blogspot.com/2011/08/fuzzing-at-scale.html
I am including here, with his permission, a comment received separately from Capers Jones.
———————————
Bertrand,
If testing is the only defect removal activity, without prior inspections or static analysis, you will find many bugs but quality will be poor. Testing alone seldom tops 85% in cumulative defect removal efficiency, so 15% of your bugs will be delivered to clients.
If you do inspections and static analysis prior to testing, you will find very few bugs during testing because most were already eliminated. Your cumulative defect removal efficiency will be above 95% and possibly approach 99%.
I sent some spreadsheets that show the differences between high quality and low quality. If others want to see them, send an email to capers.jones3@gmail.com with a valid return address.
Most forms of testing find only about 35% of bugs, or one out of three. You need 8 to 12 test stages at that low level of efficiency. Inspections and static analysis, on the other hand, run from 65% to more than 85% in defect removal efficiency.
Regards,
Capers Jones
———————