Archive for July 2010

The rise of empirical software engineering (II): what we are still missing


(This article was initially published in the CACM blog.)

The previous post under the heading of empirical software engineering hailed the remarkable recent progress of this field, made possible in particular by the availability of large-scale open-source repositories and by the opening up of some commercial code bases.

Has the empirical side of software engineering become a full member of empirical sciences? One component of the experimental method is still not quite there: reproducibility. It is essential to the soundness of natural sciences; when you publish a result there, the expectation is that others will be able to replicate it. Perhaps such duplication does not happen as often as physicists and biologists would have us believe, but it does happen, and the mere possibility that someone could check your results (and make a name for himself, especially if you are famous, by disproving them) keeps experimenters on their toes.

If we had the same norms in empirical software engineering, empirical papers would all contain a clause such as

Hampi’s source code and documentation, experimental data, and additional results are available at

This example is, in fact, a real quote, from a paper [1] at the 2009 ISSTA conference. It shows exactly what we expect for an experimental software engineering publication: below are my results; if you want to rerun the experiments, here is the URL where you will find the code (source and binary) and the data.

Unfortunately, such professionalism is the exception rather than the rule. I performed a quick check (entirely informal, as this is a blog post, not an empirical research paper!) in the ISSTA ’09 proceedings. ISSTA, an ACM conference, is a good sample point, since it covers testing (plus other approaches to program analysis) and almost every paper has an “experiment” section. I found only a very small number that, like the one cited above, give explicit reproducibility information. (Disclosure: one of those papers is ours [2].)

I believe that the situation will change dramatically and that in a few years it will be impossible to submit an empirical paper without including such information. Computer science, or at least some areas of software engineering, should actually consider themselves privileged when it comes to allowing reproducibility: all that we have to do to reproduce a result, in testing for example, is to run a program. That is easier than for a zoologist — wishing to reproduce a colleague’s experiment precisely — to gather in his lab the appropriate number of flies, chimpanzees or killer whales.

In some types of empirical software research, such as the assessment of process models or design techniques, reproducing an experiment’s setup is harder than when all you have to do is to rerun a program. But regardless of the area we must develop a true culture of reproducibility. It is not yet there. I have personally come to take experimental results with a grain of salt; not that I particularly suspect foul play, but I simply know how easy it is, in the absence of external validation, to make a mistake in the experiments and, unwittingly, publish a paper with wrong results.

Developing a culture of reproducibility also has an effect on the refereeing process. In submitting papers with precise instructions to reproduce our results, we have sometimes remarked that referees never contact us. I hope this means they always succeed; I suspect, however, that in many cases they just do not try. If you think further about the implications, providing reproducibility instructions for a submitted paper is scary: after all, a program may fail to run for marginal reasons, such as the wrong hardware configuration or a misunderstanding of the instructions. You do not want to perform all the extra work (of making your results reproducible) just to have the paper summarily rejected because the referee is running Windows 95. Ideally, then, referees should have the possibility to ask technical questions, but anonymously, since this is the way most refereeing works. Conferences and journals generally do not support such a process.

These obstacles are implementation issues, however, and will go away. What matters for the growth of the discipline is that it needs, like experimental sciences before it, to embrace a true culture of reproducibility.


[1] Adam Kiezun, Vijay Ganesh, Philip J. Guo, Pieter Hooimeijer, Michael D. Ernst: HAMPI: A Solver for String Constraints, Proceedings of the 2009 ACM/SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’09), July 19-23, 2009, Chicago.

[2] Nadia Polikarpova, Ilinca Ciupa and Bertrand Meyer: A Comparative Study of Programmer-Written and Automatically Inferred Contracts, Proceedings of the 2009 ACM/SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’09), July 19-23, 2009, Chicago.


The rise of empirical software engineering (I): the good news


In the next few days I will post a few comments about a topic of particular relevance to the future of our field: empirical software engineering. I am starting by reposting two entries originally posted in the CACM blog. Here is the first. Let me use this opportunity to mention the LASER summer school [1] on this very topic; it is still possible to register.

Empirical software engineering papers, at places like ICSE (the International Conference on Software Engineering), used to be terrible.

There were exceptions, of course, most famously papers by Basili, Zelkowitz, Rombach, Tichy, Berry, Humphrey, Gilb, Boehm, Lehman, Belady and a few others, who kept hectoring the community about the need to base our opinions and practices on evidence rather than belief. But outside of these cases the typical ICSE empirical paper (I sat through a number of them) was depressing: we made these measurements in our company, found these results, just believe us. A question here in the back? Can you reproduce our results? Access our code? We’d love you to, but unfortunately we work for a company (the Call for Papers said industry contributions were welcome, didn’t it?) and we can’t give you the details. So sorry. But trust us, we checked our results.

Actually, there was another kind of empirical paper, which did not suffer from such secrecy: the university study. Hi, I am Professor Bright, the well-known author of the Bright method of software development. Everyone knows it’s the best, but we wanted to assess it scientifically through a rigorous empirical study. I gave the same programming problem to two groups of third-year undergraduates; one group was told to use the Bright method, the other not. Guess what? The Bright group performed 67.94% better! I see the session chair wanting to move to the next speaker; see the details in the paper.

For years, this was most of what we had: unverifiable industry reports and unconvincing student experiments.

And suddenly the scene has changed. Empirical software engineering studies are in full bloom; the papers are flowing, and many are good!

What triggered this radical change is the availability of open-source repositories. Projects such as Linux, Eclipse, Apache, EiffelStudio and many others have records going back 10, 15, sometimes 20 years. These records contain the true history of the project: commits (into the configuration management system), bug reports, bug fixes, test runs and their results, developers involved, and many more elements of project data. All of a sudden empirical research has what any empirical science needs: a large corpus of objects to analyze.

Open-source projects have given the decisive jolt, but now we can rely on industrial data as well: Microsoft and other companies have started making their own records selectively available to researchers. In the work of authors such as Zeller from Saarbrücken, Gall from the University of Zurich or Nagappan from Microsoft, systematic statistical techniques yield answers, sometimes surprising, to questions on which we could only speculate. Do novices or experts cause more bugs? Does test coverage correlate with software quality, and if so, positively or negatively? Little by little, we are learning about the true properties of software products and processes, based not on fantasies but on quantitative analysis of meaningful samples.

The trend is unmistakable, and irreversible.

Not all is right yet; in the second installment of this post I will describe some of what still needs to be improved for empirical software engineering to achieve full scientific rigor.


[1] LASER summer school 2010, at
