Archive for the ‘Research evaluation’ Category.

Informatics education in Europe: Just the facts

 

In 2005 a number of us started Informatics Europe [1], the association of university departments and industrial research labs in computer science in Europe. The association has now grown to 80 members across the entire continent; it organizes the annual European Computer Science Summit and has published a number of influential reports. The last one just came out: Informatics Education in Europe: Institutions, Degrees, Students, Positions, Salaries — Key Data 2008-2012 [2]. The principal author is Cristina Pereira, who collected and organized the relevant data over more than a year; I helped with the preparation of the final text.

At the beginning of Informatics Europe we considered with particular attention  the model of the Computing Research Association [3], which played a crucial role in giving computer science (informatics) its due place in the US academic landscape. Several past and current officers of the CRA,  such as Willy Zwaenepoel, Ed Lazowska, Bob Constable, Andy Bernat, Jeannette Wing, Moshe Vardi and J Strother Moore gave keynotes at our early conferences and we of course asked them for the secrets of their organization’s success. One answer that struck us was the central role played by data collection. Just gathering the  facts, such as degrees and salaries, established for the first time a solid basis for serious discussions. We took this advice to heart and the report is the first result.

Gathering the information is particularly difficult for Europe given the national variations and the absence of centralized statistical data. Even the list of names under which institutions teach informatics in Europe fills a large table in the report. Cristina’s decision was, from the start, to favor quality over quantity: to focus on impeccable data for countries for which we could get it, rather than trying to cover the whole continent with data of variable credibility.

The result is the first systematic repository of basic information on informatics education in Europe: institutions, degrees offered and numbers awarded, student numbers, position titles and definitions, and (a section which will not please everyone) salaries for PhD students, postdocs and professors of various ranks.

The report is a first step; it only makes sense if we can regularly continue to update it and particularly extend it to other countries. But even in its current form (and with the obvious observation that my opinion is not neutral)  I see it as a major step forward for the discipline in Europe. We need an impeccable factual basis to convince the public at large and political decision-makers to give informatics the place it deserves in today’s educational systems.

References

[1] Informatics Europe site, see here.

[2] Cristina Pereira and Bertrand Meyer: Informatics Education in Europe: Institutions, Degrees, Students, Positions, Salaries — Key Data 2008-2012, Informatics Europe report, 30 September 2013, available here.

[3] Computing Research Association (US), see here.

.

VN:F [1.9.10_1130]
Rating: 7.8/10 (4 votes cast)
VN:F [1.9.10_1130]
Rating: 0 (from 2 votes)

Conferences: Publication, Communication, Sanction

Recycled(This article was first published in the Communications of the ACM blog.)

A healthy discussion is taking place in the computer science community on our publication culture. It was spurred by Lance Fortnow’s 2009 article [1]; now Moshe Vardi has taken the lead to prepare a report on the topic, following a workshop in Dagstuhl in November [2]. The present article and one that follows (“The Waves of Publication”)  are intended as contributions to the debate.

One of the central issues is what to do with conferences. Fortnow had strong words for the computer science practice of using conferences as the selective publication venues, instead of relying on journals as traditional scientific disciplines do. The criticism is correct, but if we look at the problem from a practical perspective it is unlikely that top conferences will lose their role as certifiers of quality. This is not a scientific matter but one of power. People in charge of POPL or OOPSLA have decisive sway over the careers (one is tempted to say the lives) of academics, particularly young academics, and it is a rare situation in human affairs that people who have critical power voluntarily renounce it. Maybe the POPL committee will see the light: maybe starting in 2014 it will accept all reasonable papers somehow related to “principles of programming languages”, turn the event itself into a pleasant multi-track community affair where everyone in the field can network, and hand over the selection and stamp-of-approval job to a journal such as TOPLAS. Dream on; it is not going to happen.

We should not, however, remain stuck with the status quo and all its drawbacks. That situation is unsustainable. As a single illustration, consider the requirement, imposed by all conferences, that having a paper pass the refereeing process is not enough: you must also register. A couple of months before the conference, authors of accepted papers (at least, they thought their paper was accepted) receive a threatening email telling them that unless they register and pay their paper will not be published after all. Now assume an author, in a field where a conference is the top token of recognition, has his visa application rejected by the country of the conference — a not so uncommon situation — and does not register. (Maybe he does not mind paying the fee, but he does not want to lie by pretending he is going to attend whereas he knows he will not.) He has lost his opportunity for publication and perhaps severely harmed this career. What have such requirements to do with science?

To understand what can be done, we need to analyze the role of conferences. In an earlier article  [3] I described four “modes and uses” of publication: Publication, Exam, Business and Ritual. From the organizers’ viewpoint, ignoring the Business and Ritual aspects although they do play a significant role, a conference has three roles: Publication, Communication and Sanction. The publication part corresponds to the proceedings of the conference, which makes articles available to the community at large, not just the conference attendees. The communication part only addresses the attendees: it includes the presentation of papers as well as all other interactions made possible by being present at a conference. The sanction part (corresponding to the “exam” part of the more general classification) is the role of a renowned conference as a stamp of approval for the best work of the moment.

What we should do is separate these roles. A conference can play all three roles, but it can also select two of them, or even just one. A well-established, prestigious conference will want to retain its sanctioning role: accepted papers get the stamp of approval. It will also remain an event, where people meet. And it may distribute proceedings. But the three roles can also be untied:

  • Publication is the least critical, and can easily be removed from the other two, since everything will be available on the Web. In fact the very notion of proceedings is quickly becoming fuzzy: more and more conferences save money by not distributing printed proceedings to attendees, sometimes not printing any proceedings at all; and some even spare themselves the production of a proceedings-on-a-stick, putting the material on the Web instead. A conference may still decide to have its own proceedings, or it might outsource that part to a journal. Each conference will make these decisions based on its own culture, tradition, ambition and constraints. For authors, the decision does not particularly matter: what counts are the sanction, which is provided by the refereeing process, and the availability of their material to the world, which will be provided in any scenario (at least in computer science where we have, thankfully, the permission to put our papers on our own web sites, an acquired right that our colleagues from other disciplines do not all enjoy).
  • Separating sanction from communication is a natural step. Acceptance and participation are two different things.

Conference organizers should not be concerned about lost revenue: most authors will still want to participate in the conference, and will get the funding since institutions are used to pay for travel to present accepted papers; some new participants might come, attracted by more interaction-oriented conference styles; and organizers can replace the requirement to register by a choice between registering and paying a publication fee.

Separating the three roles does not mean that any established conference renounces its sanctioning status, acquired through the hard work of building the conference’s reputation, often over decades. But everyone gets more flexibility. Several combinations are possible, such as:

  • Sanction without communication or publication: papers are submitted for certification through peer-review, they are available on the Web anyway, and there is no need for a conference.
  • Publication without sanction or communication: an author puts a paper on his web page or on a self-publication site such as ArXiv.
  • Sanction and communication without publication: a traditional selective conference, which does not bother to produce proceedings.
  • Communication without sanction: a working conference whose sole aim is to advance the field through presentations and discussions, and accepts any reasonable submission. It may be by invitation (a kind of advance sanction). It may have proceedings (publication) or not.

Once we understand that the three roles are not inextricably tied, the stage is clear for removal for some impediments to a more effective publication culture. Some, not all. The more general problem is the rapidly changing nature of scientific publication, what may be called the concentric waves of publication. That will be the topic of the next article.

References

[1] Lance Fortnow: Time for Computer Science to Grow Up, in Communications of the ACM, Vol. 52, no. 8, pages 33-35, 2009, available here.

[2] Dagstuhl: Perspectives Workshop: Publication Culture in Computing Research, see here.

[3] Bertrand Meyer: The Modes and Uses of Scientific Publication, article on this blog, 22 November 2011, see here.

VN:F [1.9.10_1130]
Rating: 10.0/10 (1 vote cast)
VN:F [1.9.10_1130]
Rating: +1 (from 1 vote)

Who gets what

 

The following rules are not all that’s needed to understand life, but they go a long way.

1. It’s the tenor who gets the girl. (The baritone never stood a chance1.)

2. It’s the physicists who get the money.

 

Note

1Actually I can think of a couple of exceptions (both Russian, I wonder what that means): Ruslan and Ludmilla (bass), Prince Igor (he is old though).

VN:F [1.9.10_1130]
Rating: 10.0/10 (3 votes cast)
VN:F [1.9.10_1130]
Rating: +3 (from 3 votes)

ERC Advanced Investigator Grant: Concurrency Made Easy

In April we will be starting the  “Concurrency Made Easy” research project, the result of a just announced Advanced Investigator Grant from the European Research Council. Such ERC grants are awarded to a specific person, rather than a consortium of research organizations as in the usual EU funding scheme. The usual amount, which applies in my case, is 2.5 million euros (currently almost 3 .3 million dollars) over five years, on a specific theme. According to the ERC’s own description [1],

ERC Advanced Grants allow exceptional established research leaders of any nationality and any age to pursue ground-breaking, high-risk projects that open new directions in their respective research fields or other domains.

This is the most sought-after research funding instrument of the EU, with a success rate of about 12% [2], out of a group already preselected by the host institutions. What makes ERC Advanced Investigator Grants so coveted is the flexibility of the scheme (no constraints on the topic, light administrative baggage) and the trust that an award implies in a particular researcher and his ability to carry out advanced research.

The name of the CME project clearly signals its ambition: to turn concurrent programming into a normal, unheroic part of programming. Today adding concurrency to a program, usually in the form of multithreading, is very hard, complexity and risk of all kinds. Everyone is telling us that we must rethink programming, retrain programmers and revamp curricula to put the specific reasoning modes of concurrent programming at the center. I don’t think this can work; thinking concurrently is just too hard to become the default mode. Instead, we should adapt programming languages, theories and tools so that programmers can continue to apply the reasoning schemes that have proved so successful in classical programming, especially object-oriented programming with the benefit of Design by Contract.

The starting point is the SCOOP model, to which I started an introduction in an earlier article of this blog [3], with a sequel yet to come. SCOOP is a minimal extension to the O-O framework to support concurrency, yielding very simple (the S in the acronym) solutions to concurrent programming problems. As part of the CME project we plan to develop it in many different directions and establish a sound and effective formal basis.

I have put the project description — the scientific part of the actual proposal text accepted by the ERC — online [4].

In the next few weeks I will be publishing here specific announcements for the positions we are seeking to fill very quickly; they include postdocs, PhD students, and one research engineer. We are looking for candidates with excellent knowledge and practice of concurrency, Eiffel, formal techniques etc. The formal application procedure will be Web-based and is not in place yet but you can contact me if you fit the profile and are interested.

We can defeat the curse: concurrent programming (an obligatory condition of any path towards a successful future for information technology) does not have to be black magic. It can be made simple and efficient. Such is the challenge of the CME project.

References

[1] European Research Council: Advanced Grants, available here.

[2] European Research Council: Press release on 2011 Advanced Investigator Grants, 24 January 2012, available here.

[3] Concurrent Programming is Easy, article from this blog, available here.

[4] CME Advanced Investigator Grant project description, available here.

VN:F [1.9.10_1130]
Rating: 10.0/10 (4 votes cast)
VN:F [1.9.10_1130]
Rating: +2 (from 2 votes)

Guest article: funding great research

In a blog article posted in its original version on this blog [1] and in a revised version on the Communications of the ACM blog [2], I emphasized the relevance of incremental research. Recently Mikkel Thorup sent me some interesting comments, which I am publishing here as the first Guest Column of this blog.

References

[1] Bertrand Meyer: One Cheer for Incremental Research, in the present blog, 10 August 2009, available here

[2] Bertrand Meyer: Long Live Incremental Research, in Communications of the ACM Blog, 13 June 2011, available here.

Guest article by Mikkel Thorup: Funding Great Research

Research foundations want great research projects. However, a while back Bertrand Meyer wrote an interesting blog post: Long Live Incremental Research [2]. With examples he showed that many of the greatest results of research could not possibly be the projected results of great sounding project descriptions. His conclusion is that we should drop the high-flying ambitions from project descriptions, and instead support more incremental research proposals, hoping that great stuff will happen on the way. Indeed incremental research is perfect for research projects with predictable deliverables. However, I suggest the opposite conclusion; namely that we for some of the funding drop the project description.

The basic idea is that foundations should encourage researchers to look for results far better than those that can reasonably be projected. In particular, researchers should be free to follow their inspiration when they see new exiting opportunities. This is not done by tying researchers to incremental projects. Instead we can sometimes switch to result based funding, that is, funding based on results already achieved (with emphasis on the more recent past). Such result based funding is more like rewards for great results, and it offers researchers the perfect incentive to do their very best so as to secure future funding.

Consider a researcher with a history of brilliant ideas taking research in surprising new directions. If we try casting this as a project, the referees will rightly complain: “It is not clear how the applicant will come up with a brilliant idea, nor is it clear what the surprise will be”. With such lack of focus and feasibility, a low project score is expected.  If the project description has a predefined weight of, say, 40%, then the overall score will be too low for funding, regardless of the researcher’s established track record of succeeding in unlikely situations.  However, research needs great new ideas. Therefore we need some result based funding so that we can support researchers with a proven talent for generating great new ideas even if we do not quite understand how it will happen.

The above problem is often very real in my field of theoretical computer science. Like in other fields, theoretical research is only interesting if it contains surprises (otherwise it is more like development). A project plan would make sense if the starting point was a surprising idea or approach that it would take years to develop, but in theory, the most exciting ideas are often strikingly simple. When first you have such an idea, you are typically close to done, ready to start writing a paper. Thus, if you have a great idea when you apply for a grant, you will typically be done long before you get the grant. The essence of the research is thus the unpredictable search for powerful ideas and insights. The most appropriate project description is therefore just a description of the importance of the area to be researched and the type of results aimed for. The track record shows which researchers have the talent to succeed.

Dropping the how-part of the project description will greatly increase methodological diversity, allowing researchers to use the strategy that has proved most suitable for their area and their own talent and skills.  As a simple example, Bertrand suggested funding incremental research, hoping that great surprising things would turn up on the way. My strategy is the opposite. I try to spend as much time as possible on overly ambitious targets. Most of the time I fail, but I rarely come home empty-handed, for by studying the unknown I nearly always discover something new, sometimes even more interesting than the original target. From the perspective of ambition, I see it as an advantage that I minimize time spend on easy targets, but foundations seem to prefer that you take a planned path with some guaranteed targets on the way. The point here is not to argue whether one strategy is superior to the other, but rather to embrace the diversity of strategies that may work depending on the area and the individual researcher.

Perhaps more seriously, if a target is hard to achieve, it may be because it requires a crazy approach that would not look reasonable to anyone else, but which may work for a researcher thanks to his special talents and intuition. Indeed I have often been positively surprised seeing how others succeeded using an approach I had myself dismissed.  As a project, such crazy approaches would fail on perceived feasibility, but the point in result based funding is that researchers are free to use whatever approach they find most efficient. Funding is given to those who prove successful. This gives the perfect incentive to do great work, securing future funding.

Result-based funding would also reduce resources needed to evaluate applications. It is very hard for a general panel to evaluate the methodology and success probability of a project.  Moreover, it requires an intimate knowledge of a field to evaluate how big a difference a result would make relative to what is already known. However, handling published results, we know what happened and we can rely on peer-review for the difference it made to the field. All the panel has to do is to evaluate how the successes meet with the objectives of the foundation.

Let us, as an example, take something like the ERC Advanced Investigator Grant which welcomes high risk high gain research. It would seem that aiming for surprising breakthroughs in an important area would fall well within this scope. Having researchers with proven skills explore the area and follow their inspiration may be the optimal strategy. Uncertainty about what they would find should not be worse than high risk. In fact, based on past performance, it may be safe to assume that they will discover something interesting if not ground-breaking. However, when projects are scored on focused feasibility, such projects will fail even if their expected return is very high. It has to be possible to get a high overall score for promising research even if standard project parameters like focus and feasibility would be counterproductive.  At the end of the day, what we want are results, not project descriptions, so what should determine the overall score is which proposal is expected to yield the greatest results.

Long live great research!

Mikkel Thorup

VN:F [1.9.10_1130]
Rating: 10.0/10 (1 vote cast)
VN:F [1.9.10_1130]
Rating: +2 (from 2 votes)

The Modes and Uses of Scientific Publication

p> 

Recycled(This article was initially published in the CACM blog.)
Publication is about helping the advancement of humankind. Of course.

Let us take this basis for granted and look at the other, possibly less glamorous aspects.

Publication has four modes: Publicity; Exam; Business; and Ritual.

1. Publication as Publicity

The first goal of publication is to tell the world that you have discovered something: “See how smart I am!” (and how much smarter than all the others out there!). In a world devoid of material constraints for science, or where the material constraints are handled separately, as in 19th-century German universities where professors were expected to fund their own labs, this would be the only mode and use of publication. Science today is a more complex edifice.

A good sign that Publication as Publicity is only one of the modes is that with today’s technology we could easily skip all the others. If all we cared about were to make our ideas and results known, we would simply put out our papers on ArXiv or just our own Web page. But almost no one stops there; researchers submit to conferences and journals, demonstrating how crucial the other three modes are to the modern culture of science.

2. Publication as Exam

Academic careers depend on a publication record. Actually this is not supposed to be the case; search and tenure committees are officially interested in “impact,” but any candidate is scared of showing a short publication list where competitors have tens or (commonly) hundreds of items.

We do not just publish; we want to be chosen for publication. Authors are proud of the low acceptance rates of conferences at which their papers have been accepted; in the past few years it has in fact become common practice, in publication lists attached to CVs, to list this percentage next to each accepted article. Acceptance rates are carefully tracked; see for example [2] for software engineering.

As Jeff Naughton has pointed out [1], this mode of working amounts to giving researchers the status of students forced to take exams again and again. Maybe that part is inevitable; the need to justify ourselves anew every morning may be an integral part of being a scientist, especially one funded by other people’s money. Two other consequences of this phenomenon are, I believe, more damaging.

The first risk directly affects the primary purpose of publication (remember the advancement of humankind?): a time-limited review process with low acceptance rates implies that some good papers get rejected and some flawed ones accepted. Everyone in software engineering knows (and recent PC chairs have admitted) that getting a paper accepted at the International Conference on Software Engineering is in part a lottery; with an acceptance rate hovering around 13%, this is inevitable. The mistakes occur both ways: papers accepted or even getting awards, then shown a few months later to be inaccurate; and innovative papers getting rejected because some sentence rubbed the referees the wrong way, or some paper was not cited. With a 4-month review cycle, and the next deadline coming several months later, the publication of a truly important result can be delayed significantly.

The second visible damage is publication inflation. Today’s research environment channels productive research teams towards an LPU (Least Publishable Unit) publication practice, causing an explosion of small contributions and the continuous decrease of the ratio of readers to writers. When submitting a paper I have always had, as my personal goal, to be read; but looking at the overall situation of computer science publication today suggests that this is not the dominant view: the overwhelming goal of publication is publication.

3. Publication as Business

Publishing requires an infrastructure, and money plays a role. Conferences in particular are a business. They have a budget to balance, not always an easy task, although a truly successful conference can be a big money-maker for its sponsor, commercial or non-profit. The financial side of conference publication has its consequences on authors: if you do not pay your fees, not only will you be unable to participate, but your paper will not be published.

One can deplore these practices, in particular their effect on authors from less well-endowed institutions, but they result from today’s computer science publication culture with its focus on the conference, what Lance Fortnow has called “A Journal in a Hotel”.

Sometimes the consequences border on the absurd. The ASE conference (Automated Software Engineering) accepts some contributions as “short papers”. Fair enough. At ASE 2009, “short paper” did not mean a shorter conference presentation but the permission to put up a poster and stand next to it for a while and answer passersby’s questions. For that privilege — and the real one: a publication in the conference volume — one had to register for the conference. ASE 2009 was in New Zealand, the other end of the world for a majority of authors. I ceded to the injunction: who was I to tell the PhD student whose work was the core of the submission, and who was so happy to have a paper accepted at a well-ranked conference, that he was not going to be published after all? But such practices are dubious. It would be more transparent to set up an explicit pay-for-play system, with page charges: at least the money would go to a scientific society or a university. Instead we ended up funding (in addition to the conference, which from what I heard was an excellent experience) airlines and hotels.

What makes such an example remarkable is that a reasonable justification exists for every one of its components: a highly selective refereeing process to maintain the value of the publication venue; limiting the number of papers selected for full presentation, to avoid a conference with multiple parallel tracks (and the all too frequent phenomenon of conference sessions whose audience consists of the three presenters plus the session chair); making sure that authors of published papers actually attend the event, so that it is a real conference with personal encounters, not just an opportunity to increment one’s publication count. The concrete result, however, is that authors of short papers have the impression of being ransomed without getting the opportunity to present their work in a serious way. Literally seconds as I was going to hit the “publish” button for the present article, an author of an accepted short paper for ASE 2012 (where the process appears similar) sent an email to complain, triggering a new discussion. We clearly need to find better solutions to resolve the conflicting criteria.

4. Publication as Ritual

Many of the seminal papers in science, including some of the most influential in computer science, defy classification and used a distinctive, one-of-a-kind style. Would they stand a chance in one of today’s highly ranked conferences, such as ICSE in software or VLDB in databases? It’s hard to guess. Each community has developed its own standard look-and-feel, so that after a while all papers start looking the same. They are like a classical mass with its Te Deum, Agnus Dei and Kyrie Eleison. (The “Te Deum” part is, in a conference submission, spread throughout the paper, in the form of adoring citations of the program committee members’ own divinely inspired articles, good for their H-indexes if they bless your own offering.)

All empirical software engineering papers, for example, have the obligatory “Threats to Validity” section, which is has developed into a true art form. The trick is the same as in the standard interview question “What can you say about your own deficiencies?”, to which every applicant know the key: describe a personality trait so that you superficially appear self-critical but in reality continue boasting, as in “sometimes I take my work too much to heart” [3]. The “Threats to Validity” section follows the same pattern: you try to think of all possible referee objections, the better to refute them.

Another part of the ritual is the “related work” section, treacherous because you have to make sure not to omit anything that a PC member finds important; also, you must walk a fine line between criticizing existing research too much, which could offend someone, or not enough, which enables the referee to say that you are not bringing anything significantly new. I often wonder who, besides the referees, reads those sections. But here too it is easier to lament than to fault the basic idea or propose better solutions. We do want to avoid wasting our time on papers whose authors are not aware of previous work. The related work section allows referees to perform this check. Its importance in the selection process has, however, grown out of proportion. It is one thing to make sure that a paper is state-of-the-art, but another to reject it (as often happens) because it fails to cite a particular contribution whose results would not directly affect its own. Here we move from the world of the rational to the world of the ritual. An extreme and funny recent example — funny to me, not necessarily to the coauthors — is a rejection from  APSEC 2011, the Australia-Pacific Software Engineering Conference, based on one review (the others were positive) that stated: “How novel is this? Are [there] not any cloud-based IDEs out there that have [a] similar awareness model integrated into their CM? This is something the related work [section] fails to describe precisely. [4] The ritual here becomes bizarre: as far as we know, no existing system discusses a similar model; the reviewer too does not know of any; but he blasts the paper all the same for not citing work that he thinks must have been done by someone, somehow, somewhere. APSEC is a fine conference — it has to be, from the totally unbiased criterion that it accepted another one of our submissions this year! — and this particular paper may or may not have been ready for publication; judge it for yourself [5]. Such examples suggest, however, that the ritual of computer science publication has its limits.

Publicity, Exam, Business, Ritual: to which one of the four modes of publication are you most attuned? Oh, sorry, I forgot: in your case, it is solely for the advancement of humankind.

References and notes

[1] Jeffrey F. Naughton, DBMS Research: First 50 Years, Next 50 Years, slides of keynote at 26th IEEE International Conference on Data Engineering, 2010, available at lazowska.cs.washington.edu/naughtonicde.pdf .

[2] Tao Xie, Software Engineering Conferences, at people.engr.ncsu.edu/txie/seconferences.htm .

[3] I once saw on French TV a hilarious interview of an entrepreneur who had started a software company in Vietnam, where job candidates just did not know “the code”, and moved on, in response to such a question, to tell the interviewer about being rude to their mother and all the other horrible things they had done in their lives.

[4] The words in brackets were not in the review but I added them for clarity.

[5] Martin Nordio, H.-Christian Estler, Carlo A. Furia and Bertrand Meyer: Collaborative Software Development on the Web, available at arxiv.org/abs/1105.0768 .

(This article was first published on the CACM blog in September 2011.)

VN:F [1.9.10_1130]
Rating: 9.3/10 (11 votes cast)
VN:F [1.9.10_1130]
Rating: +9 (from 9 votes)

Averaging

 

A statistical textbook [1] contains this gem of wisdom:

Only a fool would conclude that a painting that was judged as excellent by one person and contemptible by another ought therefore to be classified as mediocre.

Common sense indeed; but does the procedure not recall how the typical conference program committee works, with averages obligingly computed by the supporting web-based systems?

Reference

[1] David C. Howell: Statistical Methods for Psychology, sixth edition, Thomson-Wadsworth, 2007

VN:F [1.9.10_1130]
Rating: 10.0/10 (5 votes cast)
VN:F [1.9.10_1130]
Rating: +5 (from 5 votes)

European Computer Science Summit 2011

The program for ECSS 2011 (Milan, 7-9 November) has just been put online [1]. The European Computer Science Summit, held yearly since 2005, is the annual conference of Informatics Europe and a unique opportunity to discuss issues of interest to the computer science / informatics research and education community; much of the audience is made of deans, department heads, lab directors, researchers and senior faculty. Keynote speakers this year include Stefano Ceri, Mary Fernández, Monika Henzinger, Willem Jonker, Miron Livny, John Mylopoulos, Xavier Serra and John White.

ECSS is not a typical scientific conference; like Snowbird, its counterpart in the US, it is focused on professional and policy issues, and also a place to hear from technology leaders about their research visions. For me it is one of the most interesting events of the year.

References

[1] ECSS home page including advance program, here.

VN:F [1.9.10_1130]
Rating: 10.0/10 (1 vote cast)
VN:F [1.9.10_1130]
Rating: +1 (from 1 vote)

Scopus’s view of computer science research

I posted on the Informatics Europe blog  a short note about what Scopus sees as the hottest articles in computer science.

VN:F [1.9.10_1130]
Rating: 9.0/10 (6 votes cast)
VN:F [1.9.10_1130]
Rating: +3 (from 3 votes)

Testing insights

Lionel Briand and his group at the Simula Research Laboratory in Oslo have helped raise the standard for empirical research in testing and other software engineering practices by criticizing work that in their opinion relies on wrong assumptions or insufficiently supported evidence. In one of their latest papers [1] they take aim at “Adaptive Random Testing” (ART); one of the papers they criticize is from our group at ETH, on the ARTOO extension [2] to this testing method. Let’s examine the criticism!

We need a bit of background on random testing, ART, and ARTOO:

  • Random testing tries inputs based on a random process rather than attempting a more sophisticated strategy; it was once derided as silly [3], but has emerged in recent years as a useful technique. Our AutoTest tool [4], now integrated in EiffelStudio, has shown it to be particularly effective when applied to code equipped with contracts, which provide built-in test oracles. As a result of this combination, testing can be truly automatic: the two most tedious tasks of traditional testing, test case preparation and test oracle definition, can be performed without human intervention.
  • ART, developed by Chen and others [5], makes random testing not entirely random by ensuring that the inputs are spread reasonably evenly in the input domain.
  • ARTOO, part of Ilinca Ciupa’s PhD thesis on testing defended in 2008,   generalized ART to object-oriented programs, by defining a notion of distance between objects; the ARTOO strategy  avoids choosing objects that are too close to each other. The distance formula, which you can find in[2], combines three elementary distances: between the types of the objects involved,  the values in their primitive fields (integers etc.), and, recursively, the objects to which they have references.

Arcuri and Briand dispute the effectiveness of ART and criticize arguments that various papers have used to show its effectiveness. About the ARTOO paper they write

The authors concluded that ART was better than random testing since it needed to sample less test cases before finding the first failure. However, ART was also reported as taking on average 1.6 times longer due to the distance calculations!

To someone not having read our paper the comment and the exclamation mark would seem to suggest that the paper somehow downplays this property of random testing, but in fact it stresses it repeatedly. The property appears for example in boldface as part of the caption to Table 2: In most cases ARTOO requires significantly less tests to find a fault, but entails a time overhead, and again in boldface in the caption to Table 3: The overhead that the distance calculations introduce in the testing process causes ARTOO to require on average 1.6 times more time than RAND to find the first fault.

There is no reason, then, to criticize the paper on this point. It reports the results clearly and fairly.

If we move the focus from the paper to the method, however, Arcuri and Briand have a point. As they correctly indicate, the number of tests to first fault is not a particularly useful criterion. In fact I argued against it in another paper on testing [6]

The number of tests is not that useful to managers, who need help deciding when to stop testing and ship, or to customers, who need an estimate of fault densities. More relevant is the testing time needed to uncover the faults. Otherwise we risk favoring strategies that uncover a failure quickly but only after a lengthy process of devising the test; what counts is total time. This is why, just as flies get out faster than bees, a seemingly dumb strategy such as random testing might be better overall.

(To understand the mention of flies and bees you need to read [6].) The same article states, as its final principle:

Principle 7: Assessment criteria A testing strategy’s most important property is the number of faults it uncovers as a function of time.

The ARTOO paper, which appeared early in our testing work, used “time to first failure” because it has long been a standard criterion in the testing literature, but it should have applied our own advice and focused on more important properties of testing strategies.

The “principles” paper [6] also warned against a risk awaiting anyone looking for new test strategies:

Testing research is vulnerable to a risky thought process: You hit upon an idea that seemingly promises improvements and follow your intuition. Testing is tricky; not all clever ideas prove helpful when submitted to objective evaluation.

The danger is that the clever ideas may result in so much strategy setup time that any benefit on the rest of the testing process is lost. This danger threatens testing researchers, including those who are aware of it.

The idea of ARTOO and object distance remains attractive, but more work is needed to make it an effective contributor to automated random testing and demonstrate that effectiveness. We can be grateful to Arcuri and Briand for their criticism, and I hope they continue to apply their iconoclastic zeal to empirical software engineering work, ours included.

I have objections of my own to their method. They write that “all the work in the literature is based either on simulations or case studies with unreasonably high failure rates”. This is incorrect for our work, which does not use simulations, relying instead on actual, delivered software, where AutoTest routinely finds faults in an automatic manner.

In contrast, however, Arcuri and Briand rely on fault seeding (also known as fault introduction or fault injection):

To obtain more information on how shapes appear in actual SUT, we carried out a large empirical analysis on 11 programs. For each program, a series of mutants were generated to introduce faults in these programs in a systematic way. Faults generated through mutation [allow] us to generate a large number of faults, in an unbiased and varied manner. We generated 3727 mutants and selected the 780 of them with lower detection probabilities to carry out our empirical analysis of faulty region shapes.

In the absence of objective evidence attesting to the realism of fault seeding, I do not believe any insights into testing obtained from such a methodology. In fact we adopted, from the start of our testing work, the principle that we would never rely on fault seeding. The problem with seeded faults is that there is no guarantee they reflect the true faults that programmers make, especially the significant ones. Techniques for fault seeding are understandably good at introducing typographical mistakes, such as a misspelling or the replacement of a “+” by a “-”; but these are not interesting kinds of fault, as they are easily caught by the compiler, by inspection, by low-tech static tools, or by simple tests. Interesting faults are those resulting from a logical error in the programmer’s mind, and in my experience (I do not know of good empirical studies on this topic) seeding techniques do not generate them.

For these reasons, all our testing research has worked on real software, and all the faults that AutoTest has found were real faults, resulting from a programmer’s mistake.

We can only apply this principle because we work with software equipped with contracts, where faults will be detected through the automatic oracle of a violated assertion clause. It is essential, however, to the credibility and practicality of any testing strategy; until I see evidence to the contrary, I will continue to disbelieve any testing insights resulting from studies based on artificial fault injection.

References

[1] Andrea Arcuri and Lionel Briand: Adaptive Random Testing: An Illusion of Effectiveness, in ISSTA 2011 (International Symposium on Software Testing and Analysis), available here.

[2] Ilinca Ciupa, Andreas Leitner, Manuel Oriol and Bertrand Meyer: ARTOO: Adaptive Random Testing for Object-Oriented Software, in ICSE 2008: Proceedings of 30th International Conference on Software Engineering, Leipzig, 10-18 May 2008, IEEE Computer Society Press, 2008, also available here.

[3] Glenford J. Myers. The Art of Software Testing. Wiley, New York, 1979. Citation:

Probably the poorest methodology of all is random-input testing: the process of testing a program by selecting, at random, some subset of all possible input values. In terms of the probability of detecting the most errors, a randomly selected collection of test cases has little chance of being an optimal, or close to optimal, subset. What we look for is a set of thought processes that allow one to select a set of test data more intelligently. Exhaustive black-box and white-box testing are, in general, impossible, but a reasonable testing strategy might use elements of both. One can develop a reasonably rigorous test by using certain black-box-oriented test-case-design methodologies and then supplementing these test cases by examining the logic of the program (i.e., using white-box methods).

[4] Bertrand Meyer, Ilinca Ciupa, Andreas Leitner, Arno Fiva, Yi Wei and Emmanuel Stapf: Programs that Test Themselves, IEEE Computer, vol. 42, no. 9, pages 46-55, September 2009, available here. For practical uses of AutoTest within EiffelStudio see here.

[5] T. Y. Chen, H Leung and I K Mak: Adaptive Random Testing, in  Advances in Computer Science, ASIAN 2004, Higher-Level Decision Making,  ed. M.J. Maher,  Lecture Notes in Computer Science 3321, Springer-Verlag, pages 320-329, 2004, available here.

[6] Bertrand Meyer: Seven Principles of Software testing, in IEEE Computer, vol. 41, no. 10, pages 99-101, August 2008, also available here.

VN:F [1.9.10_1130]
Rating: 8.5/10 (6 votes cast)
VN:F [1.9.10_1130]
Rating: +1 (from 7 votes)

Incremental research again

After some updating, I republished in the Communications of the ACM blog my discussion of the value of incremental research, which first appeared as an article here .

VN:F [1.9.10_1130]
Rating: 9.8/10 (4 votes cast)
VN:F [1.9.10_1130]
Rating: +2 (from 2 votes)

Assessing concurrency models

By describing a  poorly conceived hypothetical experiment, last week’s article described the “Professor Smith syndrome” consisting of four risks that threaten the validity of empirical software engineering experiments relying on students in a course:

  • Professor Smith Risk 1: possible bias if the evaluator has a stake in the ideas or tools under assessment.
  • Professor Smith Risk 2: creating different levels of motivation in the different groups (Hawthorne effect).
  • Professor Smith Risk 3: extrapolating from students to professionals.
  • Professor Smith Risk 4: violation of educational ethics if the experiment may cause some students to learn better than others.

If you have developed a great new method or tool and would like to assess it, the best way to address Risk 1 is to find someone else to do the assessment. What if  this solution is not practical? Recently we wanted to get some empirical evidence on the merits of the SCOOP (Simple Concurrent Object-Oriented Programming) approach to concurrency [1, 2], on which I have worked for a long time and which is now part of EiffelStudio since the release of 6.8 a couple of weeks ago. We wanted to see if, despite the Professor Smith risks, we could do a credible study ourselves.

The ETH Software Architecture course[3], into which we introduced some introductory material on concurrency last year (as part of a general effort to push more concurrency into software courses at ETH), looked like a good place to try an evaluation; it is a second-year course, where students, or so we thought, would have little prior experience in concurrent software design.

The study’s authors — Sebastian Nanz, Faraz Torshizi and Michela Pedroni — paid special attention to the methodological issues. To judge for yourself whether we addressed them properly, you can read the current version of our paper to be presented at ESEM 2011 [4]. Do note that it is a draft and that we will improve the paper for final publication.

Here is some of what we did. I will not address the Professor Smith Risk 3, the use of students, which (as Lionel Briand has pointed out in a comment on the previous article) published work has studied; in a later article I will give  references to some of that work. But we were determined to tackle the other risks explicitly, so as to obtain credible results.

The basic experiment was a session in which the students were exposed to two different design methods for concurrent software: multithreaded programming in Java, which I’ll call “Java Threads”, and SCOOP. We wanted to explore whether it is easier to program in SCOOP than in Java. This is too general a hypothesis, so it was refined into three concrete hypotheses: is it easier to understand a SCOOP program? Is it easier to find errors in SCOOP programs? Do programmers using SCOOP make fewer errors?

A first step towards reducing the effect — Professor Smith Risk 1 — of any emotional attachment of the experimenters  to one of the approaches, SCOOP in our case, was to generalize the study. Although what directly interested us was to compare SCOOP against Java Threads, we designed the study as a general scheme to compare concurrency approaches; SCOOP and Java Threads are just an illustration, but anyone else interested in assessing concurrency techniques — say Erlang versus C# concurrency — can apply the same methodology. This decision had two benefits: it freed the study from dependency on the particular techniques, hence, we hope, reducing bias; and as side attraction of the kind that is hard for researchers to resist, it increased the publishability of the results.

Circumstances unexpectedly afforded us another protection against any for-SCOOP bias: unbeknownst to us at the time of the study’s design, a first-year course had newly added (in 2009, whereas our study was performed in 2010) an introduction to concurrent programming — using Java Threads! While we had thought that concurrency in any form would be new to most students, in fact almost all of them had now seen Java Threads before. (The new material in the first-year course was taken by ETH students only, but many transfer students had also already had an exposure to Java Threads.) On the other hand, students had not had any prior introduction to SCOOP. So any advantage that one of the approaches may have had because of students’ prior experience would work against our hypotheses. This unexpected development would not help if the study’s results heavily favored Java Threads, but if they favored SCOOP it would reinforce their credibility.

A particular pedagogical decision was made regarding the teaching of our concurrency material: it started with a self-study rather than a traditional lecture. One of the reasons for this decision was purely pedagogical: we felt (and the course evaluations confirmed) that at that stage of the semester the students would enjoy a break in the rhythm of the course. But another reason was to avoid any bias that might have arisen from any difference in the lecturers’ levels of enthusiasm and effectiveness in teaching the two approaches. In the first course session devoted to concurrency, students were handed study materials presenting Java Threads and SCOOP and containing a test to be taken; the study’s results are entirely based on their answers to these tests. The second session was a traditional lecture presenting both approaches again and comparing them. The purpose of this lecture was to make sure the students got the full picture with the benefit of a teacher’s verbal explanations.

The study material was written carefully and with a tone as descriptive and neutral as possible. To make comparisons meaningful, it does not follow a structure specific to Java Threads or  SCOOP  (as we might have used had we taught only one of these approaches); instead it relies in both cases on the same overall plan  (figure 2 of the paper), based on a neutral analysis of concurrency concepts and issues: threads, mutual exclusion, deadlock etc. Each section then presents, for one such general concurrency question, the solution proposed by Java Threads or SCOOP.

This self-study material, as well as everything else about the study, is freely available on the Web; see the paper for the links.

In the self-study, all students studied both the Java Threads and SCOOP materials. They were randomly assigned to two groups, for which the only difference was the order of studying the approaches. We feel that this decision addresses the ethical issue (Professor Smith Risk 4): any pedagogical effect of reading about A before B rather than the reverse, in the course of a few hours, has to be minimal if you end up reading about the two of them, and on the next day follow a lecture that also covers both.

Having all students study both approaches — a crossover approach in the terminology of [5] — should also address the Hawthorne effect (Professor Smith Risk 2): students have no particular incentive to feel that one of the approaches is more hip than the other. While they are not told that SCOOP is partly the work of the instructors, some of them may know or guess this information; the consequences, positive or negative, are limited, since they are asked in both cases to do as well as they can in answering the assessment questions.

The design of that evaluation is another crucial element in trying to avoid bias. We tried, to the extent possible, to base the assessment on objective criteria. For the first hypothesis (program understanding) the technique was to ask the students to predict the output of some simple concurrent programs. To address the risk of a binary correct/incorrect assessment, and get a more fine-grained view, we devised the programs so that they would produce output strings and measured the Levenshtein (edit) distance to the correct result. For the second hypothesis (ease of program debugging), we gave students programs exhibiting typical errors in both approaches and asked them to tell us both the line number of any error they found and an explanation. Assessing the explanation required human analysis; the idea of also assigning partial credit for pointing out a line number without providing a good explanation is that it may be meaningful that a student found that something is amiss even without being quite able to define what it is. The procedure for the third hypothesis (producing programs with fewer errors) was more complex and required two passes over the result; it requires some human analysis, as you will see in the article, but we hope that the two-pass process removes any bias.

This description of the study is only partial and you should read the article [4] for the full details of the procedure.

So what did we find in the end? Does SCOOP really makes concurrency easier to learn, concurrent programs easier to debug, and concurrent programmers less error-prone? Here too  I will refer you to the article. Let me simply mention that the results held some surprises.

In obtaining these results we tried very hard to address the Professor Smith syndrome and its four risks. Since all of our materials, procedures and data are publicly accessible, described in some detail in the paper, you can determine for yourself how well we met this objective, and whether it is possible to perform credible assessments even of one’s own work.

References

Further reading: for general guidelines on how to conduct empirical research see [5]; for ethical guidelines, applied to psychological research but generalizable, see [6].

[1] SCOOP Eiffel documentation, available here.

[2] SCOOP project documentation at ETH, available here.

[3] Software Architecture course at ETH, course page (2011).

[4] Sebastian Nanz, Faraz Torshizi, Michela Pedroni and Bertrand Meyer: Design of an Empirical Study for Comparing the Usability of Concurrent Programming Languages, to appear in ESEM 2011 (ACM/IEEE International Symposium on Empirical Software Engineering and Measurement), 22-23 September 2011. Draft available here.

[5] Barbara A. Kitchenham, Shari L. Pfleeger, Lesley M. Pickard, Peter W. Jones, David C. Hoaglin, Khaled El-Emam and Jarrett Rosenberg: Preliminary Guidelines for Empirical Research in Software Engineering, national Research Council Canada (NRC-CNRC), Report ERB-1082, 2001, available here.

[6] Robert Rosenthal: Science and ethics in conducting, analyzing, and reporting psychological research, in  Psychological Science, 5, 1994, p127-134. I found a copy cached by a search engine here.

VN:F [1.9.10_1130]
Rating: 10.0/10 (8 votes cast)
VN:F [1.9.10_1130]
Rating: +7 (from 7 votes)

The Professor Smith syndrome: Part 2

As stated in the Quiz of a few days ago (“Part 1 ”), we consider the following hypothetical report in experimental software engineering ([1], [2]):

Professor Smith has developed a new programming technique, “Suspect-Oriented Programming” (SOP). To evaluate SOP, he directs half of the students in his “Software Methodology” class to do the project using traditional techniques, and the others to use SOP.

He finds that projects by the students using SOP have, on the average, 15% fewer bugs than the others, and reports that SOP increases software reliability.

What’s wrong with this story?

Professor Smith’s attempt at empirical software engineering is problematic for at least four reasons. Others could arise, but we do not need to consider them if Professor Smith has applied the expected precautions: the number of students should be large enough (standard statistical theory will tell us how much to trust the result for various sample sizes); the students should be assigned to one of the two groups on a truly random basis; the problem should be amenable to both SOP and non-SOP techniques; and the assessment of the number of bugs should in the results should be based on fair and if possible automated evaluation. Some respondents to the quiz cited these problems, but they would apply to any empirical study and we can assume they are being taken care of.

The first problem to consider is that the evaluator and the author of the concept under evaluation are the same person. This is an approach fraught with danger. We have no reason to doubt Professor Smith’s integrity, but he is human. Deep down, he wants SOP to be better than the alternative. That is bound to affect the study. It would be much more credible if someone else, with no personal stake in SOP, had performed it.

The second problem mirrors the first on the students’ side. The students from group 1 were told that they used Professor Smith’s great idea, those from group 2 that they had to use old, conventional, boring stuff. Did both groups apply the same zeal to their work? After all, the students know that Professor Smith created SOP, and maybe he is an convincing advocate, so group 1 students will (consciously or not) do their best; those from group 2 have less incentive to go the extra mile. What we may have at play here is a phenomenon known as the Hawthorne effect [3]: if you know you are being tested for a new technique, you typically work harder and better — and may produce better results even if the technique is worthless! Experiments dedicated to studying this effect show that even  a group that is in reality using the same technique as another does better, at least at the beginning, if it is told that it is using a new, sexy technique.

The first and second problems arise in all empirical studies, software-related or not. They are the reason why medical experiments use placebos and double-blind techniques (where neither the subjects nor the experimenters themselves know who is using which variant). These techniques often do not directly transpose to software experiments, but we should all the same be careful about empirical studies of assessments of one’s own work and about possible Hawthorne effects.

The third problem, less critical, is the validity of a study relying on students. To what extent can we extrapolate from the results to a situation in industry? Software engineering students are on their way to becoming software professionals, but they are not professionals yet. This is a difficult issue because universities, rather than industry, are usually and understandably the place where experiments take place, an sometimes there is no other choice than using students. But then one can question the validity of the results. It depends on the nature of the questions being asked: if the question under study is whether a certain idea is easy to learn, using students is reasonable. But if it is, for example, whether a technique produces less buggy programs, the results can depend significantly on the subjects’ experience, which is different for students and professionals.

The last problem does not by itself affect the validity of the results, but it is a show-stopper nonetheless: Professor Smith’s experiment is unethical! If is is indeed true that SOP is better than the alternative, he is harming students from group 2; in the reverse case, he is harming students from group 1. Only in the case of the null hypothesis (using SOP makes no statistically significant difference) is the experiment ethical, but then it is also profoundly uninteresting. The rule in course-related experiments is a variant of the Hippocratic oath: before all, do not harm. The first purpose of a course is to enrich the students’ knowledge and skills; secondary aims, such as helping the professor’s research, are definitely acceptable, but must never impede the first. The setup described above is all the less acceptable that the project results presumably count towards the course grade, so the students who were forced to use the less good technique, if there demonstrably was one, have grounds to complain.

Note that Professor Smith could partially address this fairness problem by letting students choose their group, instead of assigning them randomly to group 1 or group 2 (based for example on the first letter of their names). But then the results would lose credibility, because this technique introduces self-selection and hence bias: the students who choose SOP may be the more intellectually curious students, and hence possibly the ones who do better anyway.

If Professor Smith cannot ensure fairness, he can still use students for his experiment, but has to run it outside of a course, for example by paying students, or running the experiment as a competition with some prizes for those who produce the programs with fewest bugs. This technique can work, although it introduces further dangers of self-selection. As part of a course, however, you just cannot assign students, on your own authority, to different techniques that might have an different effect on the core goal of the course: the learning experience.

So Professor Smith has a long way to go before he can run experiments that will convey a significant argument in favor of SOP.

Over the years I have seen, as a reader and sometimes as a referee, many Professor Smith papers: “empirical” evaluation of a technique by its own authors, using questionable techniques and not applying the necessary methodological precautions.

A first step is, whenever possible, to use experimenters who are from a completely different group from the developers of the ideas, as in two studies [4] [5] about the effectiveness of pair programming.

And yet! Sometimes no one else is available, and you do want to obtain objective empirical evidence about the merits of your own ideas. You are aware of the risk, and ready to face the cold reality, including if the results are unfavorable. Can you do it?

A recent attempt of ours seems to suggest that this is possible if you exert great care. It will presented in a paper at the next ESEM (Empirical Software Engineering and Measurement) and even though it discusses assessing some aspects of our own designs, using students, as part of the course project which counts for grading, and separating them into groups, we feel it was fair and ethical, and </modesty_filter_on>an ESEM referee wrote: “this is one of the best designed, conducted, and presented empirical studies I have read about recently”<modesty_filter_on>.

How did we proceed? How would you have proceeded? Think about it; feel free to express your ideas as comments to this post. In the next installment of this blog (The Professor Smith Syndrome: Part 3), I will describe our work, and let you be the judge.

References

[1] Bertrand Meyer: The rise of empirical software engineering (I): the good news, this blog, 30 July 2010, available here.
[2] Bertrand Meyer: The rise of empirical software engineering (II): what we are still missing, this blog, 31 July 2010, available here.

[3] On the Hawthorne effect, there is a good Wikipedia entry. Acknowledgment: I first heard about the Hawthorne effect from Barry Boehm.

[4] Jerzy R. Nawrocki, Michal Jasinski, Lukasz Olek and Barbara Lange: Pair Programming vs. Side-by-Side Programming, in EuroSPI 2005, pages 28-38. I do not have a URL for this article.

[5] Matthias Müller: Two controlled Experiments concerning the Comparison of Pair Programming to Peer Review, in  Journal of Systems and Software, vol. 78, no. 2, pages 166-179, November 2005; and Are Reviews an Alternative to Pair Programming ?, in  Journal of Empirical Software Engineering, vol. 9, no. 4, December 2004. I don’t have a URL for either version. I am grateful to Walter Tichy for directing me to this excellent article.

VN:F [1.9.10_1130]
Rating: 10.0/10 (2 votes cast)
VN:F [1.9.10_1130]
Rating: +2 (from 2 votes)

From duplication to duplicity: a short history of the technology of knowledge transfer

1440:

Gutenberg discovers the secret of producing, out of one text, many books for the benefit of many people.

2007:

Von und Zu Guttenberg discovers the secret of producing, out of many texts, one book for the benefit of one person.

 

 

VN:F [1.9.10_1130]
Rating: 9.9/10 (16 votes cast)
VN:F [1.9.10_1130]
Rating: +10 (from 12 votes)

The rise of empirical software engineering (II): what we are still missing

p> 

Recycled(This article was initially published in the CACM blog.)

The previous post under  the heading of empirical software engineering hailed the remarkable recent progress of this field, made possible in particular by the availability of large-scale open-source repositories and by the opening up of some commercial code bases.

Has the empirical side of software engineering become a full member of empirical sciences? One component of the experimental method is still not quite there: reproducibility. It is essential to the soundness of natural sciences; when you publish a result there, the expectation is that others will be able to replicate it. Perhaps such duplication does not happen as often and physicists and biologists would have us believe, but it does happen, and the mere possibility that someone could check your results (and make a name for himself, especially if you are famous, by disproving them) keeps experimenters on their toes. 

If we had the same norms in empirical software engineering, empirical papers would all contain a clause such as

Hampi’s source code and documentation, experimental data, and additional results are available at http://people.csail.mit.edu/akiezun/hampi

This example is, in fact, a real quote, from a paper [1] at the 2009 ISSTA conference. It shows exactly what we expect for an experimental software engineering publication: below are my results, if you want to rerun the experiments here is the URL where you will find the code (source and binary) and the data.

Unfortunately, such professionalism is the exception rather than the rule. I performed a quick check — entirely informal, as this is a blog post, not an empirical research paper! — in the ISSTA ’09 proceedings. ISSTA, an ACM conference is a good sample point, since it covers testing (plus other approaches to program analysis) and almost every paper has an  “experiment” section. I found only a very small number that, like the one cited above, give explicit reproducibility information. (Disclosure: one of those papers is ours [2].)

I believe that the situation will change dramatically and that in a few years it will be impossible to submit an empirical paper without including such information. Computer science, or at least some areas of software engineering, should actually consider themselves privileged when it comes to allowing reproducibility: all that we have to do to reproduce a result, in testing for example, is to run a program. That is easier than for a zoologist — wishing to reproduce a colleague’s experiment precisely — to gather in his lab the appropriate number of flies, chimpanzees or killer whales.

In some types of empirical software research, such as the assessment of process models or design techniques, reproducing an experiment’s setup is harder than when all you have to do is to rerun a program. But regardless of the area we must develop a true  culture of reproducibility. It is not yet there. I have personally come to take experimental results with a grain of salt; not that I particulary suspect foul play, but I simply know how easy it is, in the absence of external validation, to make a mistake in the experiments and, unwittingly, publish a paper with wrong results.

Developing a culture of reproducibility also has an effect on the refereeing process. In submitting papers with precise instructions to reproduce our results, we have sometimes remarked that referees never contact us. I hope this means they always succeed; I suspect, however, that in many cases they just do not try. If you think further about the implications, providing reproducibility instructions for a submitted paper is scary: after all a software run may fail to run for marginal reasons, such as the wrong hardware configuration or a misunderstanding of the instructions. You do not want to perform all the extra work (of making your results reproducible) just to have the paper summarily rejected because the referee is running Windows 95. Ideally, then, referees should have the possibility to ask technical questions — but anonymously, since this is the way most refereeing works. Conferences and journals generally do not support such a process.

These obstacles are implementation issues, however, and will go away. What matters for the growth of the discipline is that it needs, like experimental sciences before it, to embrace a true culture of reproducibility.

References

[1] Adam Kieun, Vijay Ganesh, Philip J. Guo, Pieter Hooimeijer, Michael D. Ernst: HAMPI: A Solver for String Constraints, Proceedings of the 2009 ACM/SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’09), July 19-23, 2009, Chicago.

[2] Nadia Polikarpova, Ilinca Ciupa  and Bertrand Meyer: A Comparative Study of Programmer-Written and Automatically Inferred Contracts, Proceedings of the 2009 ACM/SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’09), July 19-23, 2009, Chicago.

VN:F [1.9.10_1130]
Rating: 6.0/10 (3 votes cast)
VN:F [1.9.10_1130]
Rating: +2 (from 4 votes)

The rise of empirical software engineering (I): the good news

 

RecycledIn the next few days I will post a few comments about a topic of particular relevance to the future of our field: empirical software engineering. I am starting by reposting two entries originally posted in the CACM blog. Here is the first. Let me use this opportunity to mention the LASER summer school [1] on this very topic — it is still possible to register.

Empirical software engineering papers, at places like ICSE (the International Conference on Software Engineering), used to be terrible.

There were exceptions, of course, most famously papers by Basili, Zelkowitz, Rombach, Tichy, Berry, Humphrey, Gilb, Boehm, Lehmann, Belady and a few others, who kept hectoring the community about the need to base our opinions and practices on evidence rather than belief. But outside of these cases the typical ICSE empirical paper — I sat through a number of them — was depressing: we made these measurements in our company, found these results, just believe us. A question here in the back? Can you reproduce our results? Access our code? We’d love you to, but unfortunately we work for a company — the Call for Papers said industry contributions were welcome, didn’t it? — and we can’t give you the details. So sorry. But trust us, we checked our results.

Actually, there was another kind of empirical paper, which did not suffer from such secrecy: the university study. Hi, I am professor Bright, the well-known author of the Bright method of software development. Everyone knows it’s the best, but we wanted to assess it scientifically through a rigorous empirical study. I gave the same programming problem to two groups of third-year undergraduates; one group was told to use the Bright method, the other not. Guess what? The Bright group performed 67.94% better! I see the session chair wanting to move to the next speaker; see the details in the paper.

For years, this was most of what we had: unverifiable industry reports and unconvincing student experiments.

And suddenly the scene has changed. Empirical software engineering studies are in full bloom; the papers are flowing, and many are good!

What triggered this radical change is the availability of open-source repositories. Projects such as Linux, Eclipse, Apache, EiffelStudio and many others have records going back 10, 15, sometimes 20 years. These records contain the true history of the project: commits (into the configuration management system), bug reports, bug fixes, test runs and their results, developers involved, and many more elements of project data. All of a sudden empirical research has what any empirical science needs: a large corpus of objects to analyze.

Open-source projects have given the decisive jolt, but now we can rely on industrial data as well: Microsoft and other companies have started making their own records selectively available to researchers. In the work of authors such as Zeller from Sarrebruck, Gall from Uni. Zurich or Nagappan from Microsoft, systematic statistical techniques yield answers, sometimes surprising, to questions on which we could only speculate. Do novices or experts cause more bugs? Does test coverage correlate with software quality, and if so, positively or negatively? Little by little, we are learning about the true properties of software products and processes, based not on fantasies but on quantitative analysis of meaningful samples.

The trend is unmistakable, and irreversible.

Not all is right yet; in the second installment of this post I will describe some of what still needs to be improved for empirical software engineering to achieve full scientific rigor.

Reference

[1] LASER summer school 2010, at http://se.ethz.ch/laser.

VN:F [1.9.10_1130]
Rating: 4.5/10 (2 votes cast)
VN:F [1.9.10_1130]
Rating: 0 (from 4 votes)

The other impediment to software engineering research

In the decades since structured programming, many of the advances in software engineering have come out of non-university sources, mostly of four kinds:

  • Start-up technology companies  (who played a large role, for example, in the development of object technology).
  • Industrial research labs, starting with Xerox PARC and Bell Labs.
  • Independent (non-university-based) author-consultants. 
  • Independent programmer-innovators, who start open-source communities (and often start their own businesses after a while, joining the first category).

 Academic research has had its part, honorable but limited.

Why? In earlier posts [1] [2] I analyzed one major obstacle to software engineering research: the absence of any obligation of review after major software disasters. I will come back to that theme, because the irresponsible attitude of politicial authorities hinders progress by depriving researchers of some of their most important potential working examples. But for university researchers there is another impediment: the near-impossibility of developing serious software.

If you work in theory-oriented parts of computer science, the problem is less significant: as part of a PhD thesis or in preparation of a paper you can develop a software prototype that will support your research all the way to the defense or the publication, and can be left to wither gracefully afterwards. But software engineering studies issues that arise for large systems, where  “large” encompasses not only physical size but also project duration, number of users, number of changes. A software engineering researcher who only ever works on prototypes will be denied the opportunity to study the most significant and challenging problems of the field. The occasional consulting job is not a substitute for this hands-on experience of building and maintaining large software, which is, or should be, at the core of research in our field.

The bodies that fund research in other sciences understood this long ago for physics and chemistry with their huge labs, for mechanical engineering, for electrical engineering. But in computer science or any part of it (and software engineering is generally viewed as a subset of computer science) the idea that we would actually do something , rather than talk about someone else’s artifacts, is alien to the funding process.

The result is an absurd situation that blocks progress. Researchers in experimental physics or mechanical engineering employ technicians: often highly qualified personnel who help researchers set up experiments and process results. In software engineering the equivalent would be programmers, software engineers, testers, technical writers; in the environments that I have seen, getting financing for such positions from a research agency is impossible. If you have requested a programmer position as part of a successful grant request, you can be sure that this item will be the first to go. Researchers quickly understand the situation and learn not even to bother including such requests. (I have personally never seen a counter-example. If you have a different experience, I will be interested to learn who the enlightened agency is. )

The result of this attitude of funding bodies is a catastrophe for software engineering research: the only software we can produce, if we limit ourselves to official guidelines, is demo software. The meaningful products of software engineering (large, significant, usable and useful open-source software systems) are theoretically beyond our reach. Of course many of us work around the restrictions and do manage to produce working software, but only by spending considerable time away from research on programming and maintenance tasks that would be far more efficiently handled by specialized personnel.

The question indeed is efficiency. Software engineering researchers should program as part of their normal work:  only by writing programs and confronting the reality of software development can we hope to make relevant contributions. But in the same way that an experimental physicist is helped by professionals for the parts of experimental work that do not carry a research value, a software engineering researcher should not have to spend time on porting the software to other architectures, performing configuration management, upgrading to new releases of the operating system, adapting to new versions of the libraries, building standard user interfaces, and all the other tasks, largely devoid of research potential, that software-based innovation requires.

Until  research funding mechanisms integrate the practical needs of software engineering research, we will continue to be stymied in our efforts to produce a substantial effect on the quality of the world’s software.

References

[1] The one sure way to advance software engineering: this blog, see here.
[2] Dwelling on the point: this blog, see here.

VN:F [1.9.10_1130]
Rating: 8.3/10 (18 votes cast)
VN:F [1.9.10_1130]
Rating: +6 (from 8 votes)

Rejection letter classic

Part of the experience of being a scientist, in the industrial age of publication, is the rejection letter; especially the damning review whose author, anonymous of course, does not appear particularly competent. I have my own treasured collection, which I will publish one day. For a fiction so artfully designed as to be almost as good as the real thing, you can check  Simone Santini’s hilarious parody [1], a true classic.

Although there are a few references to it around the Web, I do not think it is as well known as it deserves to be. What Santini did was to imagine rejection letters for famous papers. He stated [2] that:

The reviews are a collage of reviews that I have seen of some papers (mine and of other people) that have been rejected because, I thought, the reviewer had completely misunderstood the paper. After a rejection at a database conference for what I thought were completely preposterous reasons, I had the idle thought that today even Codd’s paper on relational data bases (the foundation of the whole field) would never make it into a major data base conference…Many of the sentences that I use in the article are from actual reviews.

A sample from the imaginary Codd rejection letter:

E.F. CODD “A Relational Model of Data for Large Shared Data Banks.”  … The formalism is needlessly complex and mathematical, using concepts and notation with which the average data bank practitioner is unfamiliar. The paper doesn’t tell us how to translate its arcane operations into executable block access.

Adding together the lack of any real-world example, performance experiment, and implementation indication or detail, we are left with an obscure exercise using unfamiliar mathematics and of little or no practical consequence. It can be safely rejected.

All the others are gems too: Turing’s Entscheidungsproblem paper (“If the article is accepted, Turing should remember that the language of this journal is English and change the title accordingly”); Dijstra’s Goto considered harmful; Hoare’s 1969 axiomatic semantics paper (the author “should also extend the method to be applicable to a standard programming language such as COBOL or PL/I and provide the details of his implementation, possibly with a few graphics to show how the system works in practice”) etc.

To avoid a spoiler I will  cite no more;  you should read the paper if you do not know it yet. It rings so true.

References

[1] Simone Santini: We Are Sorry to Inform You …, in IEEE Computer, vol. 38, no. 12, pp. 128, 126-127, December 2005,  online on the IEEE site. There is also a copy here.

 [2] http://www.omlettesoft.com/newjournal.php3?who=Lord+Omlette&id=1134629858.

VN:F [1.9.10_1130]
Rating: 10.0/10 (1 vote cast)
VN:F [1.9.10_1130]
Rating: +2 (from 2 votes)

One cheer for incremental research

[Note: an updated version of this article (June 2011) appears in the Communications of the ACM blog.]

The world of research funding, always a little strange, has of late been prey to a new craze: paradigm-shift mania. We will only fund twenty curly-haired cranky-sounding visionaries in the hope that one of them will invent relativity. The rest of you — bit-players! Petty functionaries! Slaves toiling at incremental research!  — should be ashamed of even asking.

Take this from the US National Science Foundation’s current description of funding for Computer Systems Research [1]:

CSR-funded projects will enable significant progress on challenging high-impact problems, as opposed to incremental progress on familiar problems.

 The European Research Council is not to be left behind [2]:

Projects being highly ambitious, pioneering and unconventional

Research proposed for funding to the ERC should aim high, both with regard to the ambition of the envisaged scientific achievements as well as to the creativity and originality of proposed approaches, including unconventional methodologies and investigations at the interface between established disciplines. Proposals should rise to pioneering and far-reaching challenges at the frontiers of the field(s) addressed, and involve new, ground-breaking or unconventional methodologies, whose risky outlook is justified by the possibility of a major breakthrough with an impact beyond a specific research domain/discipline.

Frontiers! Breakthrough! Rise! Aim high! Creativity! Risk! Impact! Pass me the adjective bottle. Ground-breaking! Unconventional! Highly ambitious! Major! Far-reaching! Pioneering! Galileo and Pasteur only please — others need not apply.

As everyone knows including the people who write such calls, this is balderdash. First, 99.97% of all research (precise statistic derived from my own ground-breaking research, further funding welcome) is incremental. Second, when a “breakthrough” does happen — the remaining 0.03%  — it was often not planned as a breakthrough.

Incremental research is a most glorious (I have my own supply of adjectives) mode of doing science. Beginning PhD students can be forgiven for believing the myth of the lone genius who penetrates the secrets of time and space by thinking aloud during long walks with his best friend [3]; we all, at some stage, shared that delightful delusion. But every researcher, presumably including those who go on to lead research agencies,  quickly grows up and learns that it is not how things happen. You read someone else’s solution to a problem, and you improve on it. Any history of science will tell you that for every teenager who from getting hit by a falling apple intuits the structure of the universe there are hundreds of great researchers who look at the state of the art and decide they can do a trifle better.

Here is a still recent example, particularly telling because we have the account from the scientist himself. It would not be much of an exaggeration to characterize the entire field of program proving over the past four decades as a long series of variations on Tony Hoare’s 1969 Axiomatic Semantics paper [4]. Here Hoare’s recollection, from his Turing Award lecture [5]:

In October 1968, as I unpacked my papers in my new home in Belfast, I came across an obscure preprint of an article by Bob Floyd entitled “Assigning Meanings to Programs.” What a stroke of luck! At last I could see a way to achieve my hopes for my research. Thus I wrote my first paper on the axiomatic approach to computer programming, published in the Communications of the ACM in October 1969.

(See also note [6].) Had the research been submitted for funding, we can imagine the reaction: “Dear Sir, as you yourself admit, Floyd has had the basic idea [7] and you are just trying to present the result better. This is incremental research; we are in the paradigm-shift business.” And yet if Floyd had the core concepts right it is Hoare’s paper that reworked and extended them into a form that makes practical semantic specifications and proofs possible. Incremental research at its best.

The people in charge of research programs at the NSF and ERC are themselves scientists and know all this. How come they publish such absurd pronouncements? There are two reasons. One is the typical academic’s fascination with industry and its models. Having heard that venture capitalists routinely fund ten projects and expect one to succeed, they want to transpose that model to science funding; hence the emphasis on “risk”. But the transposition is doubtful because venture capitalists assess their wards all the time and, as soon as they decide a venture is not going to break out, they cut the funding overnight, often causing the company to go under. This does not happen in the world of science: most projects, and certainly any project that is supposed to break new ground, gets funded for a minimum of three to five years. If the project peters out, the purse-holder will only realize it after spending all the money.

The second reason is a sincere desire to avoid mediocrity. Here we can sympathize with the funding executives: they have seen too many “here is my epsilon addition to the latest buzzword” proposals. The last time I was at ECOOP, in 2005, it seemed every paper was about bringing some little twist to aspect-oriented programming. This kind of research benefits no one and it is understandable that the research funders want people to innovate. But telling submitters that every project has to be epochal (surprisingly, “epochal” is missing from the adjectives in the descriptions above  — I am sure this will soon be corrected) will not achieve this result.

It achieves something else, good neither for research nor for research funding: promise inflation. Being told that they have to be Darwin or nothing, researchers learn the game and promise the moon; they also get the part about “risk” and emphasize how uncertain the whole thing is and how high the likelihood it will fail. (Indeed, since — if it works — it will let cars run from water seamlessly extracted from the ambient air, and with the excedent produce free afternoon tea.)

By itself this is mostly entertainment, as no one believes the hyped promises. The real harm, however, is to honest scientists who work in the normal way, proposing to bring an important contribution to the  solution of an important problem. They risk being dismissed as small-timers with no vision.

Some funding agencies have kept their heads cool. How refreshing, after the above quotes, to read the general description of funding by the Swiss National Science Foundation [8]:

The central criteria for evaluation are the scientific quality, originality and project methodology as well as qualifications and track record of the applicants. Grants are awarded on a competitive basis.

In a few words, it says all there is to say. Quality, originality, methodology, and track record. Will the research be “ground-breaking” or “incremental”? We will know when it is done.

I am convinced that the other agencies will come to their senses and stop the paradigm-shift nonsense. One reason for hope is in the very excesses of the currently fashionable style. The European Research Council quote includes, by my count, nineteen ways of saying that proposals must be daring. Now it is a pretty universal rule of life that someone who finds it necessary to say the same thing nineteen times in a single paragraph does not feel sure about it. He is trying to convince himself. At some point the people in charge will realize that such hype does not breed breakthroughs; it breeds more hype.

Until that happens there is something that some of us can do: refuse to play the game. Of course we are all convinced that our latest idea is the most important one ever conceived by humankind, and we want to picture it in the most favorable light. But we should resist the promise inflation. Such honesty comes at a risk. (I still remember a project proposal, many years ago, which came back with glowing reviews: the topic was important, the ideas right, the team competent. The agency officer’s verdict: reject. The proposers are certain to succeed, so it’s not research.) For some people, there is really no choice but to follow the lead: if your entire career depends on getting external funding, no amount of exhortation will prevent you from saying what the purse-holders want to hear. But those of us who do have a choice (that is to say, will survive even if a project is rejected) should refuse the compromission. We should present our research ideas for what they are.

So: one cheer for incremental research.

Wait, isn’t the phrase supposed to be “two cheers” [9]?

All right, but let’s go at it incrementally. One and one-tenth cheer for incremental research. 

References

 

[1]  National Science Foundation, Division of Computer and Network Systems: Computer Systems Research  (CSR), at http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=13385.

[2] European Research Council: Advanced Investigators Grant, at http://erc.europa.eu/index.cfmfuseaction=page.display&topicID=66.

[3] The Berne years; see any biography of Albert Einstein.

[4] C.A.R. Hoare: An axiomatic basis for computer programming, in Communications of the ACM, vol. 12, no 10, pages 576–580,583, October 1969.

[5] C.A.R. Hoare: The Emperor’s Old Clothes, in Communications of the ACM, vol. 24, no.  2, pages 75 – 83, February 1981.

[6] In the first version of this essay I wrote “Someone should celebrate the anniversary!”. Moshe Vardi, editor of Communications of the ACM, has informed me that the October 2009 issue will include a retrospective by Hoare on the 1969 paper. I cannot wait to see it.

[7] Robert W. Floyd: Assigning meanings to programs, in Proceedings of the American Mathematical Society Symposia on Applied Mathematics, vol. 19, pp. 19–31, 1967.

[8] Swiss National Science Foundation:  Projects – Investigator-Driven Research, at http://www.snf.ch/E/funding/projects/Pages/default.aspx. Disclosure: The SNSF kindly funds some of my research.

[9] E.M. Forster: Two Cheers for Democracy, Edward Arnold, 1951.

VN:F [1.9.10_1130]
Rating: 8.3/10 (3 votes cast)
VN:F [1.9.10_1130]
Rating: +1 (from 3 votes)

The good and the ugly

Once in a while one hits a tool that is just right. An example worth publicizing is the EasyChair system for conference management [1], which  — after a first experience as reviewer —  I have selected whenever I was in a position to make the choice for a new conference in recent years.

At first sight, a conference management system does not seem so hard to put together; it is in fact a traditional project topic for software engineering courses. But this apparent simplicity is deceptive, as a usable system must accommodate countless small and large needs. To take just one example, you can be a member of a program committee for a conference and also submit a paper to it; this implies strict rules about what you can see, for example reviews of other people’s papers with the referees’ names, and what you should not see. Taking care of myriad such rules and requirements requires in-depth domain knowledge about conferences, and a thorough analysis.

EasyChair is based on such an analysis. It knows what a conference is, and understands what its users need. Here for example is my login screen on EasyChair:

easychair

EasyChair knows about me: I only have one user name and one password. It knows the conferences in which I have been involved (and found them by itself). It knows about my various roles: chair, author etc., and will let me do different things depending on the role I choose.

The rest of the tool is up to the standards set by this initial screen. Granted, the Web design is very much vintage 1994; a couple of hours on the site by a professional graphics designer would not hurt, but, really, who cares? What matters is the functionality, and it is not by accident that EasyChair’s author is a brilliant logician [2]. Here is someone who truly understands the business of organizing and refereeing a conference, has translated this understanding into a solid logical model, and has at every step put himself in the shoes of the participants in the process. As a user you feel that everything has been done to make you feel comfortable  and perform efficiently, while protecting you from hassle.

Because this is all so simple and natural, you might forget that the system required extensive design. If you need proof, it suffices to consider, by contrast, the ScholarOne system, which as punishment for our sins both ACM and IEEE use for their journals.

Even after the last user still alive has walked away, ScholarOne will remain in the annals of software engineering, as a textbook illustration of how not to design a system and its user interface. Not the visuals; no doubt that site had a graphics designer. But everything is designed to make the system as repellent as possible for its users. You keep being asked for information that you have already entered. If you are a reviewer for Communications of the ACM and submit a paper to an IEEE Computer Society journal, the system does not remember you, since CACM has its own sub-site; you must re-enter everything. Since your identifier is your email address, you will have two passwords with the same id, which confuses the browser. (I keep forgetting the appropriate password, which the site obligingly emails me, in clear.) IEEE publications have a common page, but here is how it looks:

scholarone-detail

See the menu on the right? It is impossible  to see the full names of each of the “Transactio…”. (No tooltips, of course.) Assume you just want to know what one of them is, for example “th-cs”: if you select it you are prompted to provide all kinds of information (which you have entered before for other publications), before you can even proceed.

This user interface design (the minuscule menu, an example of what Scott Meyers calls the “Keyhole problem” [3]) is only a small part of usability flaws that plague the system. The matter is one of design: the prevailing viewpoint is that of the  designers and administrators, not the users. I was not really surprised when I found out that the system comes from the same source as the ISI Web of Science system (which should never be used for computer science, see [4]).

It is such a pleasure in contrast to see a system like EasyChair  — for all I know a one-man effort — with its attention to user needs, its profound understanding of the problem domain, and its constant improvements over the years.

References

[1] EasyChair system, at http://www.easychair.org.

[2] Andrei Voronkov, http://www.voronkov.com/.

[3] Scott Meyers, The Keyhole Problem, at http://www.aristeia.com/TKP/draftPaper.pdf; see also slides at http://se.ethz.ch/~meyer/publications/OTHERS/scott_meyers/keyhole.pdf

[4]  Bertrand Meyer, Christine Choppy, Jan van Leeuwen, Jørgen Staunstrup: Research Evaluation for Computer Science, in  Communications  of the ACM, vol. 52, no. 4, pages 131-134, online at http://portal.acm.org/citation.cfm?id=1498765.1498780 (requires subscription). Longer version, available at http://www.informatics-europe.org/docs/research_evaluation.pdf (free access).

VN:F [1.9.10_1130]
Rating: 0.0/10 (0 votes cast)
VN:F [1.9.10_1130]
Rating: +1 (from 1 vote)