The ABC of software engineering
Lack of a precise context can render discussions of software engineering and particularly of software quality meaningless. Take for example the (usually absurd) statement “We cannot expect that programmers will equip their programs with contracts”. Whom do you mean? A physicist who writes 50 lines of Matlab code to produce a graph illustrating his latest experiment? A member of the maintenance team for Microsoft Word? A programmer on the team for a flight control system? These are completely different constituencies, and the answer is also different. In the last case, the answer is probably that we do not care what the programmers like and do not like. When you buy an electrical device that malfunctions, would you accept from the manufacturer the excuse that differential equations are, really, you see, too hard for our electrical engineers?
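For readers unfamiliar with the term, a contract is simply a stated obligation (precondition) and guarantee (postcondition) attached to a routine. A minimal sketch in Python, using plain assertions; the routine and tolerance below are hypothetical, chosen only for illustration:

    import math

    def safe_sqrt(x: float, epsilon: float = 1e-9) -> float:
        # Precondition: what the caller must guarantee.
        assert x >= 0.0, "precondition violated: x must be non-negative"
        result = math.sqrt(x)
        # Postcondition: what the routine guarantees in return.
        assert abs(result * result - x) <= epsilon * max(1.0, x), \
            "postcondition violated: result is not a square root of x"
        return result

Whether writing such clauses is too much to ask is exactly the kind of question whose answer depends on which of the constituencies above we have in mind.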
In discussing the evolution of software methods and tools we must first specify what and whom we are talking about. The following ABC characterization is sufficient for most cases.
C is for Casual. Programs in that category do all kinds of useful things, and like anything else they should work properly, but if they are not ideal in software engineering terms of reliability, reusability, extendibility and so on — if sometimes they crash, sometimes produce not-quite-right results, cannot be easily understood or maintained by anyone other than their original developers, target just one platform, run too slowly, eat up too much memory, are not easy to change, include duplicated code — it is not the end of the world. I do not have any scientific figures, but I suspect that most of the world’s software is actually in that category, from JavaScript or Python code that runs web sites to spreadsheet macros. Obviously it has to be good enough to serve its needs, but “good enough” is good enough.
B is for Business. Programs in that category run key processes in the organization. While often far from impeccable, they must satisfy strict quality constraints; if they do not, the organization will suffer significantly.
A is for Acute. This is life-critical software: if it does not work — more precisely, if it does not work exactly right — someone will get killed, someone will lose huge amounts of money, or something else will go terribly wrong. We are talking transportation systems, software embedded in critical devices, make-or-break processes of an organization.
Even in a professional setting, and even within a single company, the three categories usually coexist. Take for example a large engineering or scientific organization. Some programs are developed to support experiments or provide an answer to a specific technical question. Some programs run the organization, both on the information systems side (enterprise management) and on the technical side (large scientific simulations, experiment set-up). And some programs play a critical role in making strategy decisions, or run the organization’s products.
The ABC classification is independent of the traditional division between enterprise and technical computing. Organizations often handle these two categories separately, whereas in fact they raise issues of similar difficulty and are subject to solutions of a similar nature. It is more important to assess the criticality of each software project along the ABC scale.
It is surprising that so few organizations make that scale explicit. It is partly a consequence of that neglect that many software quality initiatives and company-wide software engineering policies are ineffective: they lump everything together, and since they tend to be driven by A-grade applications, for which the risk of bad quality is highest, they create a burden that can be too high for C- and even B-grade developments. People resent the constraints where they are not justified, and as a consequence ignore them where they would be critical. Whether your goal for the most demanding projects is to achieve CMMI qualification or to establish an effective agile process, you cannot impose the same rules on everyone. Sometimes the stakes are high; and sometimes a program is just a program.
The first step in establishing a successful software policy is to separate levels of criticality, and require every development to position itself along the resulting scale. The same observation qualifies just about any discussion of software methodology. Acute, Business or Casual: you must know your ABC.
Alistair Cockburn pointed to his scale: http://en.wikipedia.org/wiki/Cockburn_Scale, which is more sophisticated than the “ABC” classification but also integrates size.
The Cockburn scale makes, in my opinion, the same mistake as the SIL (Safety Integrity Level) concept. Both are system-level properties, and one cannot deduce that these properties will be achieved by following a certain process, even if a more formal and rigorous process is likely to lead to higher-quality systems. Software is a component in a system, and only together with other components (e.g. the hardware) can it reach a certain safety level. The issue is that we have no real criterion for qualifying a component independently of the process followed. Therefore we came up with the Assured Reliability and Resilience Levels, or ARRL for short. They are defined as follows:
– ARRL0: Nothing is guaranteed (“use as is”).
– ARRL1: The functionality is guaranteed as far as it was tested. This leaves the untested cases as a potential domain of errors.
– ARRL2: The functionality is guaranteed in all cases if no fault occurs. This requires formal evidence covering all system states.
– ARRL3: The functionality is fail-safe (errors are not propagated) or switches to a reduced operational mode upon a fault. The fault behavior is predictable, as is the next state after the fault. This requires fault detection mechanisms as well as monitoring so that errors are contained and the system can be brought into a controlled state again (a minimal sketch of such a mechanism appears at the end of this comment).
– ARRL4: If a major fault occurs, the functionality is maintained and the system is degraded to the ARRL3 level. Transient faults are masked out. This requires redundancy, e.g. TMR (Triple Modular Redundancy).
– ARRL5: To cope with residual common mode failures, the TMR is implemented using heterogeneous redundancy.
While awaiting a publication on this, see a presentation at http://www.altreonic.com/content/cross-domain-systems-and-safety-engineering-it-feasible
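To make the ARRL3/ARRL4 descriptions above more concrete, here is a minimal, hypothetical sketch in Python; the names, tolerance and fallback value are invented for illustration. Three redundant readings are voted on, masking a single outlier as TMR does; if no majority agreement exists, the fault is detected and the system switches to a predictable reduced mode rather than propagating the error.

    from statistics import median

    AGREEMENT_TOLERANCE = 0.05   # hypothetical: max spread for two channels to agree
    SAFE_DEFAULT = 0.0           # hypothetical reduced-mode output

    def voted_reading(a: float, b: float, c: float) -> tuple:
        """Return (value, mode): 'nominal' if a majority of channels agree,
        'degraded' otherwise."""
        readings = sorted([a, b, c])
        # At least two channels agree within tolerance: mask the outlier
        # by taking the median (simple majority voting, as in TMR).
        if (readings[1] - readings[0] <= AGREEMENT_TOLERANCE or
                readings[2] - readings[1] <= AGREEMENT_TOLERANCE):
            return median(readings), "nominal"
        # No majority: the fault is detected and contained; switch to a
        # predictable reduced operational mode (ARRL3 behavior).
        return SAFE_DEFAULT, "degraded"

    # One channel delivering a wildly wrong transient value is masked:
    print(voted_reading(1.00, 1.02, 7.5))   # -> (1.02, 'nominal')
    # No two channels agree: fall back to the reduced mode:
    print(voted_reading(1.0, 3.0, 7.5))     # -> (0.0, 'degraded')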
I like the ABC characterization.
The danger I see is underestimating the criticality of software. Users will use software in unforeseen contexts, and the behavior of some innocuous piece of software might become critical due to the context in which it is used. For example, let’s say a friend developed a piece of software to algorithmically generate “elevator music” on my phone. If I use that piece of software while driving a car, connecting my phone to the car stereo, I make it acute. There are many kinds of defects this software might have that could lead to accidents. For example, it could suddenly blare some extremely loud noise over the car’s stereo, causing shock and complete distraction from traffic.
Software gets used in ways and in contexts that were not anticipated when it was designed and written. I would therefore suspect that there is a general underestimation of the criticality of software.