The Professor Smith syndrome: Part 2

As stated in the Quiz of a few days ago (“Part 1”), we consider the following hypothetical report in experimental software engineering ([1], [2]):

Professor Smith has developed a new programming technique, “Suspect-Oriented Programming” (SOP). To evaluate SOP, he directs half of the students in his “Software Methodology” class to do the project using traditional techniques, and the others to use SOP.

He finds that projects by the students using SOP have, on the average, 15% fewer bugs than the others, and reports that SOP increases software reliability.

What’s wrong with this story?

Professor Smith’s attempt at empirical software engineering is problematic for at least four reasons. Others could arise, but we do not need to consider them if Professor Smith has applied the expected precautions: the number of students should be large enough (standard statistical theory will tell us how much to trust the result for various sample sizes); the students should be assigned to one of the two groups on a truly random basis; the problem should be amenable to both SOP and non-SOP techniques; and the assessment of the number of bugs in the results should be based on fair and, if possible, automated evaluation. Some respondents to the quiz cited these problems, but they would apply to any empirical study and we can assume they are being taken care of.
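
Incidentally, the first two precautions lend themselves to a concrete illustration. Here is a minimal sketch, in Python, of truly random assignment and of a standard power calculation for choosing the sample size; the class roster and the target effect size are invented, and the statsmodels library is assumed to be available.

```python
# A minimal sketch of the first two precautions, assuming Python with the
# statsmodels package; the roster and the effect size are hypothetical.
import random
from statsmodels.stats.power import TTestIndPower

students = ["student_%d" % i for i in range(1, 41)]   # hypothetical class of 40

random.shuffle(students)                  # truly random assignment
half = len(students) // 2
sop_group, control_group = students[:half], students[half:]

# Sample size: how many students per group are needed to detect a
# "medium" effect (Cohen's d = 0.5) with 80% power at the 5% level?
n = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print("students needed per group: %.0f" % n)   # roughly 64
```

For a hypothetical class of forty, the answer (roughly 64 students per group to detect a medium-sized effect) already suggests how much trust a single classroom experiment deserves.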

The first problem to consider is that the evaluator and the author of the concept under evaluation are the same person. This is an approach fraught with danger. We have no reason to doubt Professor Smith’s integrity, but he is human. Deep down, he wants SOP to be better than the alternative. That is bound to affect the study. It would be much more credible if someone else, with no personal stake in SOP, had performed it.

The second problem mirrors the first on the students’ side. The students from group 1 were told that they were using Professor Smith’s great idea, those from group 2 that they had to use old, conventional, boring stuff. Did both groups apply the same zeal to their work? After all, the students know that Professor Smith created SOP, and maybe he is a convincing advocate, so group 1 students will (consciously or not) do their best; those from group 2 have less incentive to go the extra mile. What we may have at play here is a phenomenon known as the Hawthorne effect [3]: if you know you are being tested for a new technique, you typically work harder and better — and may produce better results even if the technique is worthless! Experiments dedicated to studying this effect show that even a group that is in reality using the same technique as another does better, at least at the beginning, if it is told that it is using a new, sexy technique.

The first and second problems arise in all empirical studies, software-related or not. They are the reason why medical experiments use placebos and double-blind techniques (where neither the subjects nor the experimenters themselves know who is using which variant). These techniques often do not directly transpose to software experiments, but we should all the same be careful about empirical assessments of one’s own work and about possible Hawthorne effects.

The third problem, less critical, is the validity of a study relying on students. To what extent can we extrapolate from the results to a situation in industry? Software engineering students are on their way to becoming software professionals, but they are not professionals yet. This is a difficult issue because universities, rather than industry, are usually and understandably the place where experiments take place, and sometimes there is no other choice than using students. But then one can question the validity of the results. It depends on the nature of the questions being asked: if the question under study is whether a certain idea is easy to learn, using students is reasonable. But if it is, for example, whether a technique produces programs with fewer bugs, the results can depend significantly on the subjects’ experience, which is different for students and professionals.

The last problem does not by itself affect the validity of the results, but it is a show-stopper nonetheless: Professor Smith’s experiment is unethical! If it is indeed true that SOP is better than the alternative, he is harming students from group 2; in the reverse case, he is harming students from group 1. Only in the case of the null hypothesis (using SOP makes no statistically significant difference) is the experiment ethical, but then it is also profoundly uninteresting. The rule in course-related experiments is a variant of the Hippocratic oath: before all, do not harm. The first purpose of a course is to enrich the students’ knowledge and skills; secondary aims, such as helping the professor’s research, are definitely acceptable, but must never impede the first. The setup described above is all the less acceptable in that the project results presumably count towards the course grade, so the students who were forced to use the inferior technique, if there demonstrably was one, have grounds to complain.
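
To make the phrase “statistically significant difference” concrete: the standard tool for comparing the two groups’ bug counts is a two-sample test. The sketch below, with invented bug counts and assuming Python with scipy, applies Welch’s t-test; obtaining a small p-value is what “rejecting the null hypothesis” amounts to, and it is a necessary but, as argued above, by no means sufficient condition for Professor Smith’s claim.

```python
# A purely illustrative sketch, assuming Python with scipy; the per-project
# bug counts below are invented, not data from any actual experiment.
from scipy import stats

bugs_sop = [11, 9, 14, 8, 12, 10, 13, 9, 11, 10]             # hypothetical
bugs_traditional = [13, 11, 15, 10, 14, 12, 16, 11, 13, 12]  # hypothetical

# Welch's two-sample t-test (no equal-variance assumption): can we reject
# the null hypothesis that the technique makes no difference?
t_stat, p_value = stats.ttest_ind(bugs_sop, bugs_traditional, equal_var=False)
print("t = %.2f, p = %.3f" % (t_stat, p_value))
```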

Note that Professor Smith could partially address this fairness problem by letting students choose their group, instead of assigning them to group 1 or group 2 by some arbitrary rule (based for example on the first letter of their names). But then the results would lose credibility, because this technique introduces self-selection and hence bias: the students who choose SOP may be the more intellectually curious students, and hence possibly the ones who would do better anyway.
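
The self-selection bias is easy to demonstrate. In the simulation sketched below (every parameter is invented; Python with numpy is assumed), SOP is given no effect whatsoever on bug counts; a latent “curiosity” trait makes a student both more likely to pick SOP and less likely to produce bugs, and the self-selected SOP group still comes out ahead.

```python
# An illustrative simulation of self-selection bias; all parameters are
# invented. SOP never enters the bug model, yet its group looks better.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
curiosity = rng.normal(size=n)                      # latent trait per student
p_choose_sop = 1 / (1 + np.exp(-curiosity))         # curious students prefer SOP
chooses_sop = rng.random(n) < p_choose_sop
bugs = rng.poisson(np.exp(2.5 - 0.3 * curiosity))   # curious students make fewer bugs

print("mean bugs, self-selected SOP group:", round(bugs[chooses_sop].mean(), 1))
print("mean bugs, traditional group:      ", round(bugs[~chooses_sop].mean(), 1))
# Any gap here is pure selection bias, since SOP plays no role in the model.
```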

If Professor Smith cannot ensure fairness, he can still use students for his experiment, but he has to run it outside of a course, for example by paying students, or by running the experiment as a competition with prizes for those who produce the programs with the fewest bugs. This technique can work, although it introduces further dangers of self-selection. As part of a course, however, you just cannot assign students, on your own authority, to different techniques that might have a different effect on the core goal of the course: the learning experience.

So Professor Smith has a long way to go before he can run experiments that will convey a significant argument in favor of SOP.

Over the years I have seen, as a reader and sometimes as a referee, many Professor Smith papers: “empirical” evaluation of a technique by its own authors, using questionable techniques and not applying the necessary methodological precautions.

A first step is, whenever possible, to use experimenters who come from a completely different group than the developers of the ideas, as in two studies [4], [5] about the effectiveness of pair programming.

And yet! Sometimes no one else is available, and you do want to obtain objective empirical evidence about the merits of your own ideas. You are aware of the risk, and ready to face the cold reality, including if the results are unfavorable. Can you do it?

A recent attempt of ours seems to suggest that this is possible if you exert great care. It will be presented in a paper at the next ESEM (Empirical Software Engineering and Measurement), and even though it discusses assessing some aspects of our own designs, using students, as part of the course project which counts for grading, and separating them into groups, we feel it was fair and ethical, and <modesty_filter_off>an ESEM referee wrote: “this is one of the best designed, conducted, and presented empirical studies I have read about recently”<modesty_filter_on>.

How did we proceed? How would you have proceeded? Think about it; feel free to express your ideas as comments to this post. In the next installment of this blog (The Professor Smith Syndrome: Part 3), I will describe our work, and let you be the judge.

References

[1] Bertrand Meyer: The rise of empirical software engineering (I): the good news, this blog, 30 July 2010, available here.
[2] Bertrand Meyer: The rise of empirical software engineering (II): what we are still missing, this blog, 31 July 2010, available here.

[3] On the Hawthorne effect, there is a good Wikipedia entry. Acknowledgment: I first heard about the Hawthorne effect from Barry Boehm.

[4] Jerzy R. Nawrocki, Michal Jasinski, Lukasz Olek and Barbara Lange: Pair Programming vs. Side-by-Side Programming, in EuroSPI 2005, pages 28-38. I do not have a URL for this article.

[5] Matthias Müller: Two Controlled Experiments Concerning the Comparison of Pair Programming to Peer Review, in Journal of Systems and Software, vol. 78, no. 2, pages 166-179, November 2005; and Are Reviews an Alternative to Pair Programming?, in Journal of Empirical Software Engineering, vol. 9, no. 4, December 2004. I don’t have a URL for either version. I am grateful to Walter Tichy for directing me to this excellent article.


8 Comments

  1. Werner Lioen says:

    I see four important parameters which could affect the accuracy of measurements on software development:

    P1. The quality of the content of the exercise: how well it addresses what must be measured.
    P2. The quantity of work to be done for the exercise: the larger the exercise, the better the accuracy of the measurement.
    P3. The quality of the test group: how representative the group is (the most interesting group is of course the industry).
    P4. The size or quantity of the test group.

    Another parameter is of importance for a measurement.

    P5. The number of exercises to be performed by a single member of the group before the Hawthorne effect drops away. This parameter of course depends on parameter P2.

    Enlarging any of these parameters will improve the accuracy of the measurement. Maybe a kind of accuracy volume AV can be defined, something like:

    AV = P1 * P2 * P3 * P4

    The higher the AV, the more work has to be done by the test group and/or testers. For cost effectiveness, the accuracy to be obtained should be well tuned. I can imagine too that for small effects the accuracy must be higher.

    Another parameter which is of importance is:

    P6. The duration of the Hawthorne effect.

    For performing measurements I would select as large as possible a population of software developers, preferably not knowing about each other and geographically distributed like the target group. I would give the software developers a contract in which they have to state that they will not share information with others about their exercises. The selection of the test groups can then be done by a good pseudo-random generator, running on a computer. There will probably always be a Hawthorne effect, because some exercises are more interesting than others and the test persons may get bored. I would give each test person several exercises, one by one, in a random order, which will probably take away this bias. I would make a study of the Hawthorne effect and especially how it behaves in time (number of exercises done by a test person). Several exercises have to be done before the Hawthorne effect flows away. An interpolation with a known Hawthorne formula could predict the stable measurement values with a certain accuracy.

    *&@$<+=%! Did I describe the Hawthorne effect?

    • Werner Lioen says:

      I accidentally made it a description of a real-world experiment. I am not always pragmatic ;-). I have an addition though.

      There is maybe something with parameter P2. The larger P2, the better the quality; yet, taking into account the Hawthorne effect in the experiment described, P2 must not be too large, so that it remains possible to see the Hawthorne effect flow away. So the “granularity” of P2 is of importance.

    • Werner Lioen says:

      I made a mistake in supplying the exercises in a random order. When one is interested in the Hawthorne effect, the order should always be the same.

    • Werner Lioen says:

      Concerning the ethical issue, I would give the most beautiful exercises to the persons who have been harmed most in their lives.

  2. ccachero says:

    From my point of view, the main problem of the experiment lies in its lack of external validity; that is, it is impossible to claim that technique 1 is better than technique 2 when you have tried them on just a couple of systems. The experiment should be replicated with different systems from different domains and of different complexity: the challenge here is to build a sound family of experiments, more than a single experiment.

    Regarding the ethical issue, I agree, although only in the case where the teacher is completely sure that one technique is definitely better than another, and still decides to prevent some students from learning it for ‘research’ reasons. Fortunately, this is not the case most of the time: we are faced with a huge set of techniques, modeling languages, etc., and we simply cannot teach them all in a typical course. Making different sets of students learn different techniques is a way to spread the knowledge without overwhelming the students with too much content. Given the fact that students talk to each other, it is likely that interested students will end up learning both techniques, even if we have not explicitly taught them both. Even more interestingly, if we disclose the results of the experiment to our students after the analysis, we make them aware of the importance of choosing the right method based on empirical data (hopefully, in their case, taken from their own working environment), instead of just taking the first one at hand, or the fancy one. That is, we are teaching scientific thought.

    This notwithstanding, in this case, and provided that the systems are not too complex, from my point of view an intra-subject design could overcome this problem: just make students apply technique 1 on one system and technique 2 on the other, and vary order and systems (factorial design) so that threats to validity due to order, fatigue or system complexity are mitigated.
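
    For concreteness, a small sketch (in Python; the student names, class size and labels are all invented) of the counterbalanced assignment such a 2x2 factorial design requires:

    ```python
    # Hypothetical counterbalancing: each student applies both techniques,
    # one per system, with order and pairing rotated across four conditions.
    import itertools
    import random

    techniques = ["technique_1", "technique_2"]
    systems = ["system_A", "system_B"]

    # The four conditions of the 2x2 design: (first task, second task),
    # where each task is a (technique, system) pair.
    conditions = [((t1, s1), (t2, s2))
                  for t1, t2 in itertools.permutations(techniques)
                  for s1, s2 in itertools.permutations(systems)]

    students = ["student_%d" % i for i in range(1, 21)]
    random.shuffle(students)
    for i, student in enumerate(students):
        first, second = conditions[i % len(conditions)]
        print(student, "-> first:", first, " then:", second)
    ```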

    Many other details should be known in order to perform a deeper analysis: was the experiment supervised or unsupervised? How long did it take the students to complete the assignment? And so on…

    Cheers!
    Cris

  3. Werner Lioen says:

    I have thought about how such an experiment could be done at a university, taking into account the practical and ethical issues.

    When testing, for instance, two software methodologies X and Y, I would again divide the students into two groups A and B. I would try hard to create two exercises, exercise 1 and exercise 2, which the students will find approximately equally interesting. Group A I would give exercise 1 applying methodology X and exercise 2 applying methodology Y; group B I would give exercise 1 applying methodology Y and exercise 2 applying methodology X. The students should be assigned to group A or B at random. I think the ethical issue will be solved in this way and no harm will be done to the students.

    I would take care that no “Professor Smith” is present in the experiment, and I would let the experiment be led by a committee consisting of persons with different views on software development. I would select students who are in the last phase of their studies. They will be less easily influenced, because most of them have developed their own view on software development. They will also be more experienced and will come closer to what is going on in industry.

    In the experiment as I suggested it, the measurement is in fact done twice, because of the ethical constraint. Averaging over the two measurements will make the results more trustworthy!

    What I would also do is look at the reproducibility of the results. I would do the experiment at least at two universities; in the ideal case, at more than two. In the latter case, a statistical distribution (a normal distribution?) could be seen in the results. The width of that distribution would tell much about the reproducibility of the measurement. As I understand it, reproducibility is an important issue in empirical software engineering.

    • Werner Lioen says:

      When testing more methodologies at once, there must of course be more groups and exercises, because of the ethical issue. With n methodologies, there must be n groups and n exercises. The results must then be averaged over the n measurements.
