Error Archives - Bertrand Meyer's technology+ blog

Once again, and we are not learning!

La Repubblica of last Thursday [1] and other Italian newspapers have reported on a “computer” error that temporarily brought thousands of accounts at the national postal service bank into the red. It is a software error, due to a misplacement of the decimal points in some transactions.

As usual the technical details are hazy; La Repubblica writes that:

“Because of a software change that did not succeed, the computer system did not always read the decimal point during transactions”.

As a result, it could for example happen that a 15.00-euro withdrawal was understood as 1500 euros.
I have no idea what “reading the decimal point ” means. (There is no mention of OCR, and the affected transactions seem purely electronic.) Only some of the 12 million checking or “Postamat” accounts were affected; the article cites a number of customers who could not withdraw money from ATMs because the system wrongly treated their accounts as over-drawn. It says that this was the only damage and that the postal service will send a letter of apology. The account leaves many questions unanswered, for example whether the error could actually have favored some customers, by allowing them to withdraw money they did not have, and if so what will happen.

The most important unanswered question is the usual one: what was the software error? As usual, we will probably never know. The news items will soon be forgotten, the postal service will somehow fix its code, life will go on. Nothing will be learned; the next time around similar causes will produce similar effects.

I criticized this lackadaisical attitude in an earlier column [2] and have to hammer its conclusion again: any organization using public money should be required, when it encounters a significant software malfunction, to let experts investigate the incident in depth and report the results publicly. As long as we keep forgetting our errors we will keep repeating them. Where would airline safety be in the absence of thorough post-accident reports? That a software error did not kill anyone is not a reason to ignore it. Whether it is the Italian post messing up, a US agency’s space vehicle crashing on the moon or any other software fault causing systems to fail, it is not enough to fix the symptoms: we must have a professional report and draw the lessons for the future.

Reference

[1] Luisa Grion: Poste in tilt per una virgola — conti gonfiati, stop ai prelievi. In La Repubblica, 26 November 2009, page 18 of the print version. (At the time of writing it does not appear at repubblica.it, but see the TV segment also titled “Poste in tilt per una virgola” on Primocanale Web TV here, and other press articles e.g. in Il Tempo here.)

[2] On this blog: The one sure way to advance software engineering (post of 21 August 2009).

Airplanes today are incomparably safer than 20, 30, 50 years ago: 0.05 deaths per billion kilometers. That’s not by accident.

Rather, it’s by accidents. What has turned air travel from a game of chance into one of the safest modes of traveling is the relentless study of crashes and other mishaps. In the US the National Transportation Safety Board has investigated more than 110,000 accidents since it began its operations in 1967. Any accident must, by law, be investigated thoroughly; airplanes themselves carry the famous “black boxes” whose only purpose is to provide evidence in the case of a catastrophe. It is through this systematic and obligatory process of dissecting unsafe flights that the industry has made almost all flights safe.

Now consider software. No week passes without the announcement of some debacle due to “computers” — meaning, in most cases, bad software. The indispensable Risks forum [1] and many pages around the Web collect software errors; several books have been devoted to the topic.

A few accidents have been investigated thoroughly; two examples are Nancy Leveson’s milestone study of the Therac-25 patient-killing medical device [2], and Gilles Kahn’s analysis of the Ariane 5 crash (which Jean-Marc Jézéquel and I used as a basis for our 1997 article [3]). Both studies improved our understanding of software engineering. But these are exceptions. Most of what we have elsewhere is made of hearsay and partial information, and plain urban legends (like the endlessly repeated story about the Venus probe that supposedly failed because a period was typed instead of a comma — most likely a canard).

Software disasters continue; they attract attention when they arise, and inevitably some kind of announcement is made that the problem is being corrected, or that a committee will study the causes; almost as inevitably, that is the last we hear of it. In the latest issue of Risks alone, you can find several examples (such as [4]). In the past months, breakdowns at Skype, Google and Twitter made headlines; we all learned about the failures, but have you seen precise analyses of what actually happened?

As another typical example, we heard a few months ago from the French press that an “IT error” (une erreur informatique) led to overestimating the pensions of about a million people; since (strangely!) no one was suggesting that they would be asked to pay the money back, the cost to taxpayers will be over 300 million euros. I looked in vain for any follow-up story: what happened? What was the actual error? Were the tools at fault? The quality assurance procedures? The programmers’ qualifications? Or was it a matter of bad deployment? Of erroneous data, and if so, what was the process for validating inputs? And so on. Most likely we will never know.

But we should know. Especially with public money, any such incident should have a post-mortem, with experts called in (surely at a fraction of the cost of the failure) to analyze what happened and produce a public report.

At least this was a public project, for which some disclosure was inevitable. The software engineering community buzzes with unconfirmed reports of huge software-induced errors, that go unreported because they happen in private companies eager to avoid bad publicity. It’s as if we had allowed aircraft manufacturers, decade after decade, to keep mum about accidents. Where then would air travel safety be today?

Progress in software engineering will come from many sources. Research is critical, including on topics which today appear exotic. But if anyone is looking for one practical, low-tech idea that has an iron-clad guarantee of improving software engineering, here it is: pass a law that requires extensive professional analysis of any large software failure.

The details are not so hard to refine. The initiative would probably have to start at the national level; any industrialized country could be the pioneer. (Or what about Europe as whole?) The law would have to define what constitutes a “large” failure; for example it could be any failure that may be software-related and has resulted in loss either of human life or of property beyond a certain threshold, say $50 million. In the latter case, to avoid accusations of government meddling in private matters, the law could initially be limited to cases involving public money; when it has shown its value, it could then be extended to private failures as well. Even with some limitations, such a law would have a tremendous effect. Only with a thorough investigation of software projects gone wrong can we help the majority of projects to go right.

We can no longer afford to let the IT industry get away with covering up its failures. Lobbying for a Software Incident Full Disclosure Law is the single most important step we can take today to make the world’s software better.

Note (2011)

Later articles have come back to the theme discussed here, and there will probably be more in the future as it remains ever current. They can be found by selecting the tag “Advance”.

References

[1] Peter G. Neumann, moderator: The Risks Digest Forum on Risks to the Public in Computers and Related Systems, available online (going back to 1985!).

[2] Nancy Leveson: Medical Devices: The Therac-25, extract from her book Safeware: System Safety and Computers, Addison-Wesley, 1995, available online.

[3] Jean-Marc Jézéquel and Bertrand Meyer: Design by Contract: The Lessons of Ariane, in Computer (IEEE), vol. 30, no. 1, January 1997, pages 129-130, also available here.

[4] Monty Solomon: Computer Error Caused Rent Troubles for Public Housing Tenants, in Risks 25.76, 15 August 2009, see here.

[5] Une erreur informatique à 300 millions d’euros, in Le Point, 12 May 2009, available here.