Dwelling on the point
Once again, and we are not learning!
La Repubblica of last Thursday [1] and other Italian newspapers have reported on a “computer” error that temporarily brought thousands of accounts at the national postal service bank into the red. It is a software error, due to a misplacement of the decimal points in some transactions.
As usual the technical details are hazy; La Repubblica writes that:
“Because of a software change that did not succeed, the computer system did not always read the decimal point during transactions”.
As a result, it could for example happen that a 15.00-euro withdrawal was understood as 1500 euros.
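The arithmetic of that example is easy to reproduce. The sketch below is purely hypothetical: the actual cause was not published, and `parse_amount_buggy` and `parse_amount` are names invented here for illustration. It only shows that a parser which discards the decimal separator turns cents into whole euros, inflating the amount by a factor of 100.

```python
from decimal import Decimal


def parse_amount_buggy(text: str) -> Decimal:
    # Hypothetical faulty parser (an assumption, not the real code):
    # it throws away the decimal separator, so "15.00" becomes "1500".
    return Decimal(text.replace(".", ""))


def parse_amount(text: str) -> Decimal:
    # Decimal understands the separator, so the value is preserved.
    return Decimal(text)


withdrawal = "15.00"
print(parse_amount(withdrawal))        # 15.00
print(parse_amount_buggy(withdrawal))  # 1500
```

Whether the real system did anything like this is exactly what a public report would tell us.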
I have no idea what “reading the decimal point” means. (There is no mention of OCR, and the affected transactions seem purely electronic.) Only some of the 12 million checking or “Postamat” accounts were affected; the article cites a number of customers who could not withdraw money from ATMs because the system wrongly treated their accounts as overdrawn. It says that this was the only damage and that the postal service will send a letter of apology. The account leaves many questions unanswered, for example whether the error could actually have favored some customers, by allowing them to withdraw money they did not have, and if so what will happen.
The most important unanswered question is the usual one: what was the software error? As usual, we will probably never know. The news items will soon be forgotten, the postal service will somehow fix its code, life will go on. Nothing will be learned; the next time around similar causes will produce similar effects.
I criticized this lackadaisical attitude in an earlier column [2] and have to hammer its conclusion again: any organization using public money should be required, when it encounters a significant software malfunction, to let experts investigate the incident in depth and report the results publicly. As long as we keep forgetting our errors we will keep repeating them. Where would airline safety be in the absence of thorough post-accident reports? That a software error did not kill anyone is not a reason to ignore it. Whether it is the Italian post messing up, a US agency’s space vehicle crashing on the moon or any other software fault causing systems to fail, it is not enough to fix the symptoms: we must have a professional report and draw the lessons for the future.
References
[1] Luisa Grion: Poste in tilt per una virgola — conti gonfiati, stop ai prelievi. In La Repubblica, 26 November 2009, page 18 of the print version. (At the time of writing it does not appear at repubblica.it, but see the TV segment also titled “Poste in tilt per una virgola” on Primocanale Web TV here, and other press articles e.g. in Il Tempo here.)
[2] On this blog: The one sure way to advance software engineering (post of 21 August 2009).
It seems that even many software experts do not fully realize yet that software errors may be much more dangerous than just, say, a crashing program window.
Hi Bertrand,
I believe that our industry lacks a true professional attitude. To be honest, I don’t know very well how software development is practiced in other countries; I know to some extent the situation in Italy, where I live and work. Here most software development initiatives, even in medium and large organizations, are still amateurish. Software is hand-crafted. And this is primarily due to a cultural gap. The practice of Software Engineering (SE) encompasses a wide range of skills, not all of which are (or must be) very formal. But the “engineering attitude” of SE means that we should also base our work on formal models and methods, on a well-established body of knowledge, and on a robust software development environment. Together these elements help our development practice reach the right level of maturity, allowing us to create reliable software systems. If we do not invest at least some part of our time in acquiring the right culture, we will not be able to produce software systems that are both of acceptable quality and still economically viable.
I remember a personal anecdote told by Steve Maguire that tremendously affected my perspective on what should be considered “good” software development. He talked about a software bug reported by the testing team on a database management product for the Apple II. At that time, Steve was one of the programmers assigned to that product. After several debugging and testing sessions, the bug seemed to have disappeared, and Steve’s first reaction was to ignore it: perhaps someone else had fixed it in the meantime, or maybe the bug report was simply wrong. After further work, however, Steve realized that the bug had been there all along, simply hidden by some other behavior, and that the right thing to do in that case was not to ignore the bug at all but to examine the program further in order to better understand the bug’s nature. Steve’s final remark was that, as software professionals, we have to perform three separate steps in order to better tackle software bugs:
1. We must carefully investigate the root cause of *every* bug, understanding why it was introduced into the program in the first place.
2. We then have to fix it, running regression test suites (or any other verification strategy) to ensure that the fix has not introduced new, unexpected side effects into the program.
3. Finally, and no less important, we have to ask ourselves how we can *prevent* the bug from being reintroduced into our code. His best suggestion is to translate this understanding into good assertions that will catch the bug automatically if it is ever inadvertently reintroduced.
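As a minimal sketch of the third step, suppose the root cause of a bug turned out to be a dropped decimal point in an amount parser. This is only an assumption for the sake of the example, and the names `check_round_trip` and `parse_amount` are hypothetical, as is the convention that amounts always arrive with exactly two decimals. The understanding gained from the investigation could then be turned into an assertion like this one:

```python
from decimal import Decimal


def check_round_trip(text: str, amount: Decimal) -> None:
    # Assertion derived from the (assumed) root cause: writing the parsed
    # amount back with two decimals must reproduce the input exactly,
    # so a dropped decimal point cannot go unnoticed.
    assert f"{amount:.2f}" == text, f"{text!r} was parsed as {amount}"


def parse_amount(text: str) -> Decimal:
    # Hypothetical parser for amounts written with two decimals, e.g. "15.00".
    amount = Decimal(text)
    check_round_trip(text, amount)
    return amount


print(parse_amount("15.00"))  # 15.00
```

If a later change reintroduced the fault, for instance by producing 1500 for the input "15.00", the assertion would fail immediately instead of letting the wrong amount reach an account. Note that Python skips `assert` statements when run with the `-O` option, so a check that must also hold in production would need an explicit exception instead.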
If we systematically reasoned in that way, I believe there is a good chance that our software products would be better crafted than they are now. Moreover, this attitude is also cost-effective in the long term because it reduces the effort required in maintenance. OK, this is of course not a full-fledged methodology; it is only a good practice concerning bug removal. But in my experience it seems that this practice is not followed.
I don’t know whether the development team working for Poste Italiane tackled the issues reported in the article as Steve suggests; I only hope that our industry will eventually learn from its mistakes, starting by closing the amateur cultural gap in software as soon as possible.
Andrea
[…] way to advance software engineering: this blog, see here. [2] Dwelling on the point: this blog, see here. […]
[…] way to advance software engineering: this blog, see here. [2] Dwelling on the point: this blog, see here. [3] Dines Bjørner: The TSE Trading Rules, version 2, unpublished report, 22 February 2010, […]