Bertrand Meyer's technology+ blog Essay Archives - Bertrand Meyer's technology+ blog

Archive for the ‘Essay’ Category.

Logical beats sequential

12 December 2022, 10:08

Often, “we do this and then we do that” is just a lazy way of stating “to do that, we must have achieved this.” The second form is more general than the first, since there may be many things you can “do” to achieve a certain condition.

The extra generality is welcome for software requirements, which should describe essential properties without over-specifying, in particular without prescribing a specific ordering of operations when it is only one possible sequence among several, thereby restricting the flexibility of designers and implementers.

This matter of logical versus sequential constraints is at the heart of the distinction between scenario-based techniques — use cases, user stories… — and object-oriented requirements. This article analyzes the distinction. It is largely extracted from my recent textbook, the Handbook of Requirements and Business Analysis [1], which contains a more extensive discussion.

1. Scenarios versus OO

Scenario techniques, most significantly use cases and user stories, have become dominant in requirements. They obviously fill a need and are intuitive to many people. As a general requirement technique, however, they lack abstraction. Assessed against object-oriented requirements techniques, they suffer from the same limitations as procedural (pre-OO) techniques against their OO competitors in the area of design and programming. The same arguments that make object technology subsume non-OO approaches in those areas transpose to requirements.

Scenario techniques describe system properties in terms of a particular sequence of interactions with the system. A staple example of a use case is ordering a product through an e-commerce site, going through a number of steps. In contrast, an OO specification presents a certain number of abstractions and operations on them, chracterized by their logical properties. This description may sound vague, so we move right away to examples.

2. Oh no, not stacks again

Yes, stacks. This example is rather computer-sciency so it is not meant to convince anyone but just to explain the ideas. (An example more similar to what we deal with in the requirements of industry projects is coming next.)

A stack is a LIFO (Last-In, First-Out) structure. You insert and remove elements at the same end.

Think of a stack of plates, where you can deposit one plate at a time, at the top, and retrieve one plate at a time, also at the top. We may call the two operations put and remove. Both are commands (often known under the alternative names push and pop). We will also use an integer query count giving the number of elements.

Assume we wanted to specify the behavior of a stack through use cases. Possible use cases (all starting with an empty stack) are:

/1/

put
put ; put
put ; put ; put — etc.: any number of successive put (our stacks are not bounded)
put ; remove
put ; put ; remove
put ; put ; remove ; remove
put ; put ; remove ; remove ; put ; remove

We should also find a way to specify that the system does not support such use cases as

/2/

remove ; put

or even just

/3/

remove

We could keep writing such use cases forever — some expressing normal sequences of operations, others describing erroneous cases — without capturing the fundamental rule that at any stage, the number of put so far has to be no less than the number of remove.

A simple way to capture this basic requirement is through logical constraints, also known as contracts, relying on assertions: preconditions which state the conditions under which an operation is permitted, and postconditions which describe properties of its outcome. In the example we can state that:

put has no precondition, and the postcondition

count = old count + 1

using the old notation to refer to the value of an expression before the operation (here, the postcondition states that put increases count by one).

remove has the precondition

count > 0

and the postcondition

count = old count – 1

since it is not possible to remove an element from an empty stack. More generally the LIFO discipline implies that we cannot remove more than we have put.(Such illegal usage sequences are sometimes called “misuse cases.”)

(There are other properties, but the ones just given suffice for this discussion.)

The specification states what can be done with stacks (and what cannot) at a sufficiently high level of abstraction to capture all possible use cases. It enables us to keep track of the value of count in the successive steps of a use case; it tells us for example that all the use cases under /1/ above observe the constraints: with count starting at 0, taking into account the postconditions of put and remove, the precondition of every operation will be satisfied prior to all of its calls. For /2/ and /3/ that is not the case, so we know that these use cases are incorrect.

Although this example covers a data structure, not requirements in the general sense, it illustrates how logical constraints are more general than scenarios:

Use cases, user stories and other forms of scenario only describe specific instances of behavior.
An OO model with contracts yields a more abstract specification, to which individual scenarios can be shown to conform, or not.

3. Avoiding premature ordering decisions

As the stack example illustrates, object-oriented specifications stay away from premature time-order decisions by focusing on object types (classes) and their operations (queries and commands), without making an early commitment to the order of executing these operations.

In the book, I use in several places a use-case example from one of the best books about use cases (along with Ivar Jacobson’s original one of course): Alistair Cockburn’s Writing Effective Use Cases (Pearson Education, 2001). A simplified form of the example is:

1. A reporting party who is aware of the event registers a loss to the insurance company.

2. A clerk receives and assigns claim to a claims agent.

3. The assigned claims adjuster:

3.1 Conducts an investigation.
3.2 Evaluates damages.
3.3 Sets reserves.
3.4 Negotiates the claim.
3.5 Resolves the claim and closes it.

(A reserve in the insurance business is an amount that an insurer, when receiving a claim, sets aside as to cover the financial liability that may result from the claim.)

As a specification, this scenario is trying to express useful things; for example, you must set reserves before starting to negotiate the claim. But it expresses them in the form of a strict sequence of operations, a temporal constraint which does not cover the wide range of legitimate scenarios. As in the stack example, describing a few such scenarios is helpful as part of requirements elicitation, but to specify the resulting requirements it is more effective to state the logical constraints.

Here is a sketch (in Eiffel) of how a class INSURANCE_CLAIM could specify them in the form of contracts. Note the use of require to introduce a precondition and ensure for postconditions.

class INSURANCE_CLAIM feature

— Boolean queries (all with default value False):
is_investigated, is_evaluated, is_reserved,is_agreed,is_imposed, is_resolved:

BOOLEAN

   investigate
                — Conduct investigation on validity of claim. Set is_investigated.
        deferred
        ensure
            is_investigated
        end

   evaluate
                — Assess monetary amount of damages.
        require
            is_investigated
        deferred
        ensure
            is_evaluated
            — Note: is_investigated still holds (see the invariant at the end of the class text).
        end

   set_reserve
                — Assess monetary amount of damages. Set is_reserved.
        require
            is_investigated
            — Note: we do not require is_evaluated.
        deferred
        ensure
            is_reserved
        end

    negotiate
                — Assess monetary amount of damages. Set is_agreed only if negotiation
                — leads to an agreement with the claim originator.
        require
       is_reserved
is_evaluated
        deferred
        ensure
            is_reserved
            — See the invariant for is_evaluated and is_investigated.
        end

   impose (amount: INTEGER)
                — Determine amount of claim if negotiation fails. Set is_imposed.
        require
            not is_agreed
            is_reserved
        deferred
        ensure
            is_imposed
        end

   resolve
                — Finalize handling of claim. Set is_resolved.
        require
            is_agreed or is_imposed
        deferred
        ensure
            is_resolved
        end

invariant — “⇒” is logical implication.

is_evaluated ⇒ is_investigated
is_reserved ⇒ is_evaluated
is_resolved ⇒ is_agreed or is_imposed
is_agreed ⇒ is_evaluated
is_imposed ⇒ is_evaluated
is_imposed ⇒ not is_agreed
— Hence, by laws of logic, is_agreed ⇒ not is_imposed

end

Notice the interplay between the preconditions, postconditions and class invariant, and the various boolean-valued queries they involve (is_investigated, is_evaluated, is_reserved…). You can specify a strict order of operations o₁, o₂ …, as in a use case, by having a sequence of assertions p_i such that operation o_i has the contract clauses require p_i and ensure p_i+1; but assertions also enable you to specify a much broader range of allowable orderings as all acceptable.
The class specification as given is only a first cut and leaves many aspects untouched. It will be important in practice, for example, to include a query payment describing the amount to be paid for the claim; then impose has the postcondition payment = amount, and negotiate sets a certain amount for payment.
Even in this simplified form, the specification includes a few concepts that the original use case left unspecified, in particular the notion of imposing a payment (through the command impose) if negotiation fails. Using a logical style typically uncovers such important questions and provides a framework for answering them, helping to achieve one of the principal goals of requirements engineering.

4. Logical constraints are more general than sequential orderings

The specific sequence of actions described in the original use case (“main success scenario”) is compatible with the logical constraints: you can check that in the sequence

investigate
evaluate
set_reserve
negotiate
resolve

the postcondition of each step implies the precondition of the next one (the first has no precondition). In other words, the temporal specification satisfies the logical one. But you can also see that prescribing this order is a case of overspecification: other orderings also satisfy the logical specification. It may be possible for example — subject to confirmation by Subject-Matter Experts — to change the order of evaluate and set_reserve, or to perform these two operations in parallel.

The specification does cover the fundamental sequencing constraints; for example, the pre- and postcondition combinations imply that investigation must come before evaluation and resolution must be preceded by either negotiation or imposition. But they avoid the non-essential constraints which, in the use case, were only an artifact of the sequential style of specification, not a true feature of the problem.

The logical style is also more conducive to conducting a fruitful dialogue with domain experts and stakeholders:

With a focus on use cases, the typical question from a requirements engineer (business analyst) is “do you do A before doing B?” Often the answer will be contorted, as in “usually yes, but only if C, oh and sometimes we might start with B if D holds, or we might work on A and B in parallel…“, leading to vagueness and to more complicated requirements specifications.
With logic-based specifications, the two fundamental question types are: “what conditions do you need before doing B?” and “does doing A ensure condition C?”. They force stakeholders to assess their own practices and specify precisely the relations between operations of interest.

5. What use for scenarios?

Use-cases and more generally scenarios, while more restrictive than logical specifications, remain important as complements to specifications. They serve as both input and output to more abstract requirements specifications (such as OO specifications with contracts):

As input to requirements: initially at least, stakeholders and Subject-Matter Experts often find it intuitive to describe typical system interactions, and their own activities, in the form of scenarios. Collecting such scenarios is an invaluable requirements elicitation technique. The requirements engineer must remember that any such scenario is just one example walk through the system, and must abstract from these examples to derive general logical rules.
As output from requirements: from an OO specification with its contracts, the requirements engineers can produce valid use cases. “Valid” means that the operation at every step satisfies the applicable precondition, as a consequence of the previous steps’ postconditions and of the class invariant. The requirements engineers can then submit these use cases to the SMEs and through them to stakeholders to confirm that they make sense, update the logical conditions if they do not (to rule out bad use cases), and check the results they are expected to produce.

6. Where do scenarios fit?

While many teams will prefer to write scenarios (for the purposes just described) in natural language, it is possible to go one step further and, in an object-oriented approach to requirements, gather scenarios in classes. But that point exceeds the scope of the present sketch. We will limit ourselves here to the core observation: logical constraints subsume sequential specifications; you can deduce the ltter from the former, but not the other way around; and focusing on abstract logical specifications leads to a better understanding of the requirements.

Reference

Bertrand Meyer: Handbook of Requirements and Business Analysis, Springer, 2022. See the book page with sample chapters and further material here.

(This article was first published on the Communications of the ACM blog.)

Rating: 10.0/10 (3 votes cast)

Rating: +1 (from 1 vote)

Category: Design by Contract, Eiffel, Essay, Publication announcement, Requirements, Software engineering, Software process, Software design | Comment

Mr. and Mrs. Bei Uns

19 December 2021, 22:44

It is customary for an article to carry some kind of lesson or moral. This one does not. Or to be more exact it does have a lesson, several perhaps, but they are left to the reader to draw.

It is also customary, for an article that is written as a tribute to deceased people, that the writer would have known them. I never knew the protagonists of my chronicle. But my sister and I — along with a dear cousin, and I hope her children and grandchildren — are among the few people who still know they existed. Hence the need for a tribute lest the last traces of their stay on earth vanish forever.

They were German: Louis Bernheimer, born on 5 December 1875 in Issenhausen in Alsace, then part of Germany, and his wife Paola, born in Bayreuth on 12 February 1879, yes, that same Bayreuth where Wagner had premièred his quintessential German opera, Das Rheingold, three years before. They were German and seemingly, as we shall see, very German.

I know little about them, nothing else in fact than reported in this little note. One thing I do know is the nickname by which people in Paris called them behind their backs: “Mr. and Mrs. Bei Uns”. I know it because my father mentioned it to me. Just once, a long time ago, but I remember.

We need a bit of context. Herr und Frau Bernheimer flew Nazi Germany in the thirties with their son Fred, a young professional photographer, and settled to a safe place, or a place they thought was safe: Paris. There Fred met my father’s sister Éliane and married her; they had two children, my cousins. Now we are talking about the only people in this story whom I did know. Éliane was a strong personality, a dedicated feminist and activist. When her husband was hit with cancer and she abruptly found herself a widow with two young children and no resources, she took over his photography studio, learned the trade — about which she had known nothing — and made the business prosper. After the war, Studio Bernheim (the name shortened so that it would sound less German) became one of the fashionable addresses in Paris, thanks to both Éliane and her son Marc who trained himself to become its chief photographer while still a teenager.

Bei uns in German means the same as “chez nous” in French and translates as “at our home”, although that is not a good translation because English lacks a preposition that would accurately reflect the French “chez” or (in that sense) the German “bei”, which mean something like “in the home of”. “Home” in a very intense and cozy sense, not just the physical house, but encompassing culture, country, community. Russians similarly say “U Nas”, literally “at ours” and, when talking about people, “Nashi”, literally “ours”, our people that is. In a Chekhov novella entitled A boring story (full text available online here), when the antisemitic narrator wants to mention that at the theater last night the person seated in front of him was a Jew, he says that he was sitting behind an “iz nasikh”, a deformation with a fake Jewish accent of “iz nashikh”, one from our own, to suggest — ironically in light of the rest of my own boring story — a member of a tightly knit community.

It seems that the Bernheimers (to come back to them) were seen by their new extended family as stuffy Germans fitting the stereotypes. Not just stuffy: critical. Apparently, they went around commenting that whatever was being done in their new French surroundings was not being done right, and explaining the way it was done back home, “bei uns”. If so, it was perhaps not the best way to ingratiate themselves with their hosts, and it is not surprising that people in the family started referring to them acerbically, according to my father, as Mr. and Mrs. Bei Uns.

As noted, I never knew the Bernheimers, although in a different turn of the story I would — I should — have known them as a child. Therefore I cannot guess whether I would have yielded to family opinions and found them insufferable, or liked them as delightful, exotic older relatives having gone through hard times and now doting on their children, grandchildren, nephews and nieces. Maybe both. I feel a certain remote sympathy for them in any case, having probably been resented, like anyone who has lived in countries where people insist on the “korrekt” way of doing things and comes back to more lackadaisical cultures, as a bit of a Mr. Bei Uns myself.

The irony is that in the eyes of many people, including many who would never consider themselves antisemites, Jews still have the reputation of harboring a feeling of solidarity with their own kin that transcends borders and trumps national allegiance. Here we have the reverse. Highly assimilated families on both sides, French Jews and German Jews, getting into a cultural conflict because some were French and some were German. Ever since the revolution emancipated French Jews, they have been passionately French. German Jews were just as passionately German (in the style of Heinrich Heine’s I think of Germany in the night, the poem entitled Nachtgedanken, written in exile in Paris, see its text here). French Jews do not ask themselves how French their are, since their Frenchness is as obvious to them as the air they breathe; it’s others who want them to prove it again and again — something that no one ever seems to require of people from certain regions of France such as Brittany whose inhabitants have a loudly proclaimed attachment to their terroir of origin. Unbelievably, the question still resurfaces regularly; it is even a theme in the current presidential campaign.

Why did I never get to decide by myself who Mr. and Mrs. Bei Uns really were: chauvinistic scolds, or a charming old-world couple? If they thought of themselves as German, as part of “uns”, the “uns” ruling Germany had a different understanding. When Germany invaded France in 1940, the Bernheimers flew, like many others, to the South of France, which until 1942 remained a supposedly “free” zone. Then the Germans invaded the “free” zone too. In August of 1943 Mr. and Mrs Bei Uns were rounded up near Bayonne. The town is close to the Spanish border; I do not know if they had hoped to cross over, as others managed to do. They were interned in the Mérignac camp, where Bordeaux airport lies today. From Bordeaux they were transferred to the infamous camp at Drancy, near Paris. From there they were put on convoy number 26 to Auschwitz, where they were murdered.

Rating: 6.2/10 (17 votes cast)

Rating: +6 (from 6 votes)

Category: Essay, History, Personal, Politics | Comment

Getting a program right, in nine episodes

26 March 2020, 18:03

About this article: it originated as a series of posts on the Communications of the ACM blog. I normally repost such articles here. (Even though copy-paste is usually not good, there are three reasons for this duplication: the readership seems to be largely disjoint; I can use better formatting, since their blog software is more restrictive than WordPress; and it is good to have a single repository for all my articles, including both those who originated on CACM and those who did not.) The series took the form of nine articles, where each of the first few ended with a quiz, to which the next one, published a couple of days later, provided an answer. Since all these answers are now available it would make no sense to use the same scheme, so I am instead publishing the whole thing as a single article with nine sections, slightly adapted from the original.

I was too lazy so far to collect all the references into a single list, so numbers such as [1] refer to the list at the end of the corresponding section.

A colleague recently asked me to present a short overview of axiomatic semantics as a guest lecture in one of his courses. I have been teaching courses on software verification for a long time (see e.g. here), so I have plenty of material; but instead of just reusing it, I decided to spend a bit of time on explaining why it is good to have a systematic approach to software verification. Here is the resulting tutorial.

1. Introduction and attempt #1

Say “software verification” to software professionals, or computer science students outside of a few elite departments, and most of them will think “testing”. In a job interview, for example, show a loop-based algorithm to a programmer and ask “how would you verify it?”: most will start talking about devising clever test cases.

Far from me to berate testing [1]; in fact, I have always thought that the inevitable Dijkstra quote about testing — that it can only show the presence of errors, not their absence [2] — which everyone seems to take as an indictment and dismissal of testing (and which its author probably intended that way) is actually a fantastic advertisement for testing: a way to find bugs? Yes! Great! Where do I get it? But that is not the same as verifying the software, which means attempting to ascertain that it has no bugs.

Until listeners realize that verification cannot just mean testing, the best course material on axiomatic semantics or other proof techniques will not attract any interest. In fact, there is somewhere a video of a talk by the great testing and public-speaking guru James Whittaker where he starts by telling his audience not to worry, this won’t be a standard boring lecture, he will not start talking about loop invariants [3]! (Loop invariants are coming in this article, in fact they are one of its central concepts, but in later sections only, so don’t bring the sleeping bags yet.) I decided to start my lecture by giving an example of what happens when you do not use proper verification. More than one example, in fact, as you will see.

A warning about this article: there is nothing new here. I am using an example from my 1990 book Introduction to the Theory of Programming Languages (exercise 9.12). Going even further back, a 1983 “Programming Pearls” Communications of the ACM article by Jon Bentley [4] addresses the same example with the same basic ideas. Yet almost forty years later these ideas are still not widely known among practitioners. So consider these articles as yet another tutorial on fundamental software engineering stuff.

The tutorial is a quiz. We start with a program text:

from

i := 1 ; j := n — Result initialized to 0.

until i = j loop

m := (i + j) // 2 — Integer division

if t [m] ≤ x then i := m else j := m end

end

if x = t [i] then Result := i end

All variables are of integer type. t is an up-sorted array of integers, indexed from 1 to n . We do not let any notation get between friends. A loop from p until e loop q end executes p then, repeatedly: stops if e (the exit condition) is true, otherwise executes q. (Like {p ; while not e do {q}} in some other notations.) “:=” is assignment, “=” equality testing. “//” is integer division, e.g. 6 //3 = 7 //3 = 2. Result is the name of a special variable whose final value will be returned by this computation (as part of a function, but we only look at the body). Result is automatically initialized to zero like all integer variables, so if execution does not assign anything to Result the function will return zero.

First question: what is this program trying to do?

OK, this is not the real quiz. I assume you know the answer: it is an attempt at “binary search”, which finds an element in the array, or determines its absence, in a sequence of about log₂ (n) steps, rather than n if we were use sequential search. (Remember we assume the array is sorted.) Result should give us a position where x appears in the array, if it does, and otherwise be zero.

Now for the real quiz: does this program meet this goal?

The answer should be either yes or no. (If no, I am not asking for a correct version, at least not yet, and in any case you can find some in the literature.) The situation is very non-symmetric, we might say Popperian:

To justify a no answer it suffices of a single example, a particular array t and a particular value x, for which the program fails to set Result as it should.
To justify a yes answer we need to provide a credible argument that for every t and x the program sets Result as it should.

Notes to section 1

[1] The TAP conference series (Tests And Proofs), which Yuri Gurevich and I started, explores the complementarity between the two approaches.

[2] Dijkstra first published his observation in 1969. He did not need consider the case of infinite input sets: even for a trivial finite program that multiplies two 32-bit integers, the number of cases to be examined, 2⁶⁴, is beyond human reach. More so today with 64-bit integers. Looking at this from a 2020 perspective, we may note that exhaustive testing of a finite set of cases, which Dijkstra dismissed as impossible in practice, is in fact exactly what the respected model checking verification technique does; not on the original program, but on a simplified — abstracted — version precisely designed to keep the number of cases tractable. Dijkstra’s argument remains valid, of course, for the original program if non-trivial. And model-checking does not get us out of the woods: while we are safe if its “testing” finds no bug, if it does find one we have to ensure that the bug is a property of the original program rather than an artifact of the abstraction process.

[3] It is somewhere on YouTube, although I cannot find it right now.

[4] Jon Bentley: Programming Pearls: Writing Correct Programs, in Communications of the ACM, vol. 26, no. 12, pp. 1040-1045, December 1983, available for example here.

2. Attempt #2

Was program #1 correct? If so it should yield the correct answer. (An answer is correct if either Result is the index in t of an element equal to x, or Result = 0 and x does not appear in t.)

This program is not correct. To prove that it is not correct it suffices of a single example (test case) for which the program does not “yield the correct answer”. Assume x = 1 and the array t has two elements both equal to zero (n = 2, remember that arrays are indexed from 1):

t = [0 0]

The successive values of the variables and expressions are:

m i j i + j + 1

After initialization: 1 2 3

i ≠ j, so enter loop: 1 1 2 6 — First branch of “if” since t [1] ≤ x
— so i gets assigned the value of m

But then neither of the values of i and j has changed, so the loop will repeat its body identically (taking the first branch) forever. It is not even that the program yields an incorrect answer: it does not yield an answer at all!

Note (in reference to the famous Dijkstra quote mentioned in the first article), that while it is common to pit tests against proofs, a test can actually be a proof: a test that fails is a proof that the program is incorrect. As valid as the most complex mathematical proof. It may not be the kind of proof we like most (our customers tend to prefer a guarantee that the program is correct), but it is a proof all right.

We are now ready for the second attempt:

— Program attempt #2.

from

i := 1 ; j := n

until i = j or Result > 0 loop

m := (i + j) // 2 — Integer division

if t [m] ≤ x then

i := m + 1

elseif t [m] = x then

Result := m

else — In this case t [m] > x

j := m – 1

end

Unlike the previous one this version always changes i or j, so we may hope it does not loop forever. It has a nice symmetry between i and j.

Same question as before: does this program meet its goal?

3. Attempt #3

The question about program #2, as about program #1: was: it right?

Again no. A trivial example disproves it: n = 1, the array t contains a single element t [1] = 0, x = 0. Then the initialization sets both i and j to 1, i = j holds on entry to the loop which stops immediately, but Result is zero whereas it should be 1 (the place where x appears).

Here now is attempt #3, let us see it if fares better:

— Program attempt #3.

from

i := 1 ; j := n

until i = j loop

m := (i + j + 1) // 2

if t [m] ≤ x then

i := m + 1

else

j := m

end

if 1 ≤ i and i ≤ n then Result := i end
— If not, Result remains 0.

What about this one?

3. Attempt #4 (also includes 3′)

The first two program attempts were wrong. What about the third?

I know, you have every right to be upset at me, but the answer is no once more.

Consider a two-element array t = [0 0] (so n = 2, remember that our arrays are indexed from 1 by convention) and a search value x = 1. The successive values of the variables and expressions are:

m i j i + j + 1

After initialization: 1 2 4

i ≠ j, so enter loop: 2 3 2 6 — First branch of “if” since t [2] < x

i ≠ j, enter loop again: 3 ⚠ — Out-of-bounds memory access!
— (trying to access non-existent t [3])

Oops!

Note that we could hope to get rid of the array overflow by initializing i to 0 rather than 1. This variant (version #3′) is left as a bonus question to the patient reader. (Hint: it is also not correct. Find a counter-example.)

OK, this has to end at some point. What about the following version (#4): is it right?

— Program attempt #4.

from

i := 0 ; j := n + 1

until i = j loop

m := (i + j) // 2

if t [m] ≤ x then

i := m + 1

else

j := m

end

if 1 ≤ i and i ≤ n then Result := i end

5. Attempt #5

Yes, I know, this is dragging on. But that’s part of the idea: witnessing how hard it is to get a program right if you just judging by the seat of your pants. Maybe we can get it right this time?

Are we there yet? Is program attempt #4 finally correct?

Sorry to disappoint, but no. Consider a two-element array t = [0 0], so n = 2, and a search value x = 1 (yes, same counter-example as last time, although here we could also use x = 0). The successive values of the variables and expressions are:

m i j i + j

After initialization: 0 3 3

i ≠ j, so enter loop: 1 2 3 5 — First branch of “if”

i ≠ j, enter loop again: 2 3 3 6 — First branch again

i = j, exit loop

The condition of the final “if” is true, so Result gets the value 3. This is quite wrong, since there is no element at position 3, and in any case x does not appear in t.

But we are so close! Something like this should work, should it not?

So patience, patience, let us tweak it just one trifle more, OK?

— Program attempt #5.

from

i := 1 ; j := n + 1

until i ≥ j or Result > 0 loop

m := (i + j) // 2

if t [m] < x then

i := m + 1

elseif t [m] > x then

j := m

else

Result := m

end

Does it work now?

6. Attempt #6

The question about program #5 was the same as before: is it right, is it wrong?

Well, I know you are growing more upset at me with each section, but the answer is still that this program is wrong. But the way it is wrong is somewhat specific; and it applies, in fact, to all previous variants as well.

This particular wrongness (fancy word for “bug”) has a history. As I pointed out in the first article, there is a long tradition of using binary search to illustrate software correctness issues. A number of versions were published and proved correct, including one in the justly admired Programming Pearls series by Jon Bentley. Then in 2006 Joshua Bloch, then at Google, published a now legendary blog article [2] which showed that all these versions suffered from a major flaw: to obtain m, the approximate mid-point between i and j, they compute

(i + j) // 2

which, working on computer integers rather than mathematical integers, might overflow! This in a situation in which both i and j, and hence m as well, are well within the range of the computer’s representable integers, 2^-n to 2ⁿ (give or take 1) where n is typically 31 or, these days, 63, so that there is no conceptual justification for the overflow.

In the specification that I have used for this article, i starts at 1, so the problem will only arise for an array that occupies half of the memory or more, which is a rather extreme case (but still should be handled properly). In the general case, it is often useful to use arrays with arbitrary bounds (as in Eiffel), so we can have even a small array, with high indices, for which the computation will produce an overflow and bad results.

The Bloch gotcha is a stark reminder that in considering the correctness of programs we must include all relevant aspects and consider programs as they are executed on a real computer, not as we wish they were executed in an ideal model world.

(Note that Jon Bentley alluded to this requirement in his original article: while he did not explicitly mention integer overflow, he felt it necessary to complement his proof by the comment that that “As laborious as our proof of binary search was, it is still unfinished by some standards. How would you prove that the program is free of runtime errors (such as division by zero, word overflow, or array indices out of bounds)?” Prescient words!)

It is easy to correct the potential arithmetic overflow bug: instead of (i + j) // 2, Bloch suggested we compute the average as

i + (j – i) // 2

which is the same from a mathematician’s viewpoint, and indeed will compute the same value if both variants compute one, but will not overflow if both i and j are within range.

So we are ready for version 6, which is the same as version 5 save for that single change:

— Program attempt #6.

from

i := 1 ; j := n + 1

until i ≥ j or Result > 0 loop

m := i + (j – i) // 2

if t [m] < x then

i := m + 1

elseif t [m] > x then

j := m

else

Result := m

end

Now is probably the right time to recall the words by which Donald Knuth introduces binary search in the original 1973 tome on Sorting and Searching of his seminal book series The Art of Computer Programming: knuth

Although the basic idea of binary search is comparatively straightforward, the details can be somewhat tricky, and many good programmers have done it wrong the first few times they tried.

Do you need more convincing? Be careful what you answer, I have more variants up my sleeve and can come up with many more almost-right-but-actually-wrong program attempts if you nudge me. But OK, even the best things have an end. This is not the last section yet, but that was the last program attempt. To the naturally following next question in this running quiz, “is version 6 right or wrong”, I can provide the answer: it is, to the best of my knowledge, a correct program. Yes! [3].

But the quiz continues. Since answers to the previous questions were all that the programs were not correct, it sufficed in each case to find one case for which the program did not behave as expected. Our next question is of a different nature: can you find an argument why version #6 is correct?

References for section 6

[1] (In particular) Jon Bentley: Programming Pearls — Writing Correct Programs, in Communications of the ACM, vol. 26, no. 12, December 1983, pages 1040-1045, available here.

[2] Joshua Bloch: Extra, Extra — Read All About It: Nearly All Binary Searches and Mergesorts are Broken, blog post, on the Google AI Blog, 2 June 2006, available here.

[3] A caveat: the program is correct barring any typos or copy-paste errors — I am starting from rigorously verified programs (see the next posts), but the blogging system’s UI and text processing facilities are not the best possible for entering precise technical text such as code. However carefully I check, I cannot rule out a clerical mistake, which of course would be corrected as soon as it is identified.

7. Using a program prover

Preceding sections presented candidate binary search algorithms and asked whether they are correct. “Correct” means something quite precise: that for an array t and a value x, the final value of the variable Result is a valid index of t (that is to say, is between 1 and n, the size of t) if and only if x appears at that index in t.

The last section boldly stated that program attempt #6 was correct. The question was: why?

In the case of the preceding versions, which were incorrect, you could prove that property, and I do mean prove, simply by exhibiting a single counter-example: a single t and x for which the program does not correctly set Result. Now that I asserting the program to be correct, one example, or a million examples, do not suffice. In fact they are almost irrelevant. Test as much as you like and get correct results every time, you cannot get rid of the gnawing fear that if you had just tested one more time after the millionth test you would have produced a failure. Since the set of possible tests is infinite there is no solution in sight [1].

We need a proof.

I am going to explain that proof in the next section, but before that I would like to give you an opportunity to look at the proof by yourself. I wrote in one of the earlier articles that most of what I have to say was already present in Jon Bentley’s 1983 Programming Pearls contribution [2], but a dramatic change did occur in the four decades since: the appearance of automated proof system that can handle significant, realistic programs. One such system, AutoProof, was developed at the Chair of Software engineering at ETH Zurich [3] (key project members were Carlo Furia, Martin Nordio, Nadia Polikarpova and Julian Tschannen, with initial contributions by Bernd Schoeller) on the basis of the Boogie proof technology from Microsoft Research).

AutoProof is available for online use, and it turns out that one of the basic tutorial examples is binary search. You can go to the corresponding page and run the proof.

I am going to let you try this out (and, if you are curious, other online AutoProof examples as well) without too many explanations; those will come in the next section. Let me simply name the basic proof technique: loop invariant. A loop invariant is a property INV associated with a loop, such that:

A. After the loop’s initialization, INV will hold.
B. One execution of the loop’s body, if started with INV satisfied (and the loop’s exit condition not satisfied, otherwise we wouldn’t be executing the body!), satisfies INV again when it terminates.

This idea is of course the same as that of a proof by induction in mathematics: the initialization corresponds to the base step (proving that P (0) holds) and the body property to the induction step (proving that from P (n) follows P (n + 1). With a traditional induction proof we deduce that the property (P (n)) holds for all integers. For the loop, we deduce that when the loop finishes its execution:

The invariant still holds, since executing the loop means executing the initialization once then the loop body zero or more times.
And of course the exit condition also holds, since otherwise we would still be looping.

That is how we prove the correctness of a loop: the conjunction of the invariant and the exit condition must yield the property that we seek (in the example, the property, stated above of Result relative to t and x).

We also need to prove that the loop does terminate. This part involves another concept, the loop’s variant, which I will explain in the next section.

For the moment I will not say anything more and let you look at the AutoProof example page (again, you will find it here), run the verification, and read the invariant and other formal elements in the code.

To “run the verification” just click the Verify button on the page. Let me emphasize (and emphasize again and again and again) that clicking Verify will not run the code. There is no execution engine in AutoProof, and the verification does not use any test cases. It processes the text of the program as it appears on the page and below. It applies mathematical techniques to perform the proof; the core property to be proved is that the proposed loop invariant is indeed invariant (i.e. satisfies properties A and B above).

The program being proved on the AutoProof example page is version #6 from the last section, with different variable names. So far for brevity I have used short names such as i, j and m but the program on the AutoProof site applies good naming practices with variables called low, up, middle and the like. So here is that version again with the new variable names:

— Program attempt #7 (identical to #6 with different variable names) .

from

low := 0 ; up := n

until low ≥ up or Result > 0 loop

middle := low + ((up – low) // 2)

if a [middle] < value then — The array is now called a rather than t

low := middle + 1

elseif a [middle] > value then

up := middle

else

Result := middle

end

This is exactly the algorithm text on the AutoProof page, the one that you are invited to let AutoProof verify for you. I wrote “algorithm text” rather than “program text” because the actual program text (in Eiffel) includes variant and invariant clauses which do not affect the program’s execution but make the proof possible.

Whether or not these concepts (invariant, variant, program proof) are completely new to you, do try the prover and take a look at the proof-supporting clauses. In the next article I will remove any remaining mystery.

Note and references for section 7

[1] Technically the set of possible [array, value] pairs is finite, but of a size defying human abilities. As I pointed out in the first section, the “model checking” and “abstract interpretation” verification techniques actually attempt to perform an exhaustive test anyway, after drastically reducing the size of the search space. That will be for some other article.

[2] Jon Bentley: Programming Pearls: Writing Correct Programs, in Communications of the ACM, vol. 26, no. 12, pp. 1040-1045, December 1983, available for example here.

[3] The AutoProof page contains documentations and numerous article references.

8. Understanding the proof

The previous section invited you to run the verification on the AutoProof tutorial page dedicated to the example. AutoProof is an automated proof system for programs. This is just a matter of clicking “Verify”, but more importantly, you should read the annotations added to the program text, particularly the loop invariant, which make the verification possible. (To avoid any confusion let me emphasize once more that clicking “Verify” does not run the program, and that no test cases are used; the effect is to run the verifier, which attempts to prove the correctness of the program by working solely on the program text.)

Here is the program text again, reverting for brevity to the shorter identifiers (the version on the AutoProof page has more expressive ones):

from

i := 1 ; j := n + 1

until i ≥ j or Result > 0 loop

m := i + (j – i) // 2

if t [m] < x then

i := m + 1

elseif t [m] > x then

j := m

else

Result := m

end

Let us now see what makes the proof possible. The key property is the loop invariant, which reads

A: 1 ≤ i ≤ j ≤ n + 1
B:   0 ≤ Result ≤ n
C:   ∀ k: 1 .. i –1 | t [k] < x
D: ∀ k: j .. n | t [k] > x
E:    (Result > 0) ⇒   (t [Result] = x)

The notation is slightly different on the Web page to adapt to the Eiffel language as it existed at the time it was produced; in today’s Eiffel you can write the invariant almost as shown above. Long live Unicode, allowing us to use symbols such as ∀ (obtained not by typing them but by using smart completion, e.g. you start typing “forall” and you can select the ∀ symbol that pops up), ⇒ for “implies” and many others

Remember that the invariant has to be established by the loop’s initialization and preserved by every iteration. The role of each of its clauses is as follows:

A: keep the indices in range.
B: keep the variable Result, whose final value will be returned by the function, in range.
C and D: eliminate index intervals in which we have determined that the sought value, x, does not appear. Before i, array values are smaller; starting at j, they are greater. So these two intervals, 1..i and j..n, cannot contain the sought value. The overall idea of the algorithm (and most other search algorithms) is to extend one of these two intervals, so as to narrow down the remaining part of 1..n where x may appear.
E: express that as soon as we find a positive (non-zero) Result, its value is an index in the array (see B) where x does appear.

Why is this invariant useful? The answer is that on exit it gives us what we want from the algorithm. The exit condition, recalled above, is

i ≥ j or Result > 0

Combined with the invariant, it tells us that on exit one of the following will hold:

Result > 0, but then because of E we know that x appears at position Result.
i < j, but then A, C and D imply that x does not appear anywhere in t. In that case it cannot be true that Result > 0, but then because of B Result must be zero.

What AutoProof proves, mechanically, is that under the function’s precondition (that the array is sorted):

The initialization ensures the invariant.
The loop body, assuming that the invariant is satisfied but the exit condition is not, ensures the loop invariant again after it executes.
The combination of the invariant and the exit condition ensures, as just explained, the postcondition of the function (the property that Result will either be positive and the index of an element equal to x, or zero with the guarantee that x appears nowhere in t).

Such a proof guarantees the correctness of the program if it terminates. We (and AutoProof) must prove separately that it does terminate. The technique is simple: find a “loop variant”, an integer quantity v which remains non-negative throughout the loop (in other words, the loop invariant includes or implies v ≥ 0) and decreases on each iteration, so that the loop cannot continue executing forever. An obvious variant here is j – i + 1 (where the + 1 is needed because j – i may go down to -1 on the last iteration if x does not appear in the array). It reflects the informal idea of the algorithm: repeatedly decrease an interval i .. j – 1 (initially, 1 .. n) guaranteed to be such that x appears in t if and only if it appears at an index in that interval. At the end, either we already found x or the interval is empty, implying that x does not appear at all.

A great reference on variants and the techniques for proving program termination is a Communications of the ACM article of 2011: [3].

The variant gives an upper bound on the number of iterations that remain at any time. In sequential search, j – i + 1 would be our best bet; but for binary search it is easy to show that log₂(j – i + 1) is also a variant, extending the proof of correctness with a proof of performance (the key goal of binary search being to ensure a logarithmic rather than linear execution time).

This example is, I hope, enough to highlight the crucial role of loop invariants and loop variants in reasoning about loops. How did we get the invariant? It looks like I pulled it out of a hat. But in fact if we go the other way round (as advocated in classic books [1] [2]) and develop the invariant and the loop together the process unfolds itself naturally and there is nothing mysterious about the invariant.

Here I cannot resist quoting (thirty years on!) from my own book Introduction to the Theory of Programming Languages [4]. It has a chapter on axiomatic semantics (also known as Hoare logic, the basis for the ideas used in this discussion), which I just made available: see here [5]. Its exercise 9.12 is the starting point for this series of articles. Here is how the book explains how to design the program and the invariant [6]:

In the general case [of search, binary or not] we aim for a loop body of the form

m := ‘‘Some value in 1.. n such that i ≤ m < j’’;

if t [m] ≤ x then

i := m + 1

else

j := m

end

It is essential to get all the details right (and easy to get some wrong):

The instruction must always decrease the variant j – i, by increasing i or decreasing j. If the the definition of m specified just m ≤ j rather than m < j, the second branch would not meet this goal.

This does not transpose directly to i: requiring i < m < j would lead to an impossibility when j – i is equal to 1. So we accept i ≤ m but then we must take m + 1, not m, as the new value of i in the first branch.

The conditional’s guards are tests on t [m], so m must always be in the interval 1 . . n. This follows from the clause 0 ≤ i ≤ j ≤ n + 1 which is part of the invariant.

If this clause is satisfied, then m ≤ n and m > 0, so the conditional instruction indeed leaves this clause invariant.

You are invited to check that both branches of the conditional also preserve the rest of the invariant.

Any policy for choosing m is acceptable if it conforms to the above scheme. Two simple choices are i and j – 1; they lead to variants of the sequential search algorithm [which the book discussed just before binary search].

For binary search, m will be roughly equal to the average of i and j.

“Roughly” because we need an integer, hence the // (integer division).

In the last section, I will reflect further on the lessons we can draw from this example, and the practical significance of the key concept of invariant.

References and notes for section 8

[1] E.W. Dijkstra: A Discipline of Programming, Prentice Hall, 1976.

[2] David Gries: The Science of Programming, Springer, 1989.

[3] Byron Cook, Andreas Podelski and Andrey Rybalchenko: Proving program termination, in Communications of the ACM, vol. 54, no. 11, May 2011, pages 88-98, available here.

[4] Bertrand Meyer, Introduction to the Theory of Programming Languages, Prentice Hall, 1990. The book is out of print but can be found used, e.g. on Amazon. See the next entry for an electronic version of two chapters.

[5] Bertrand Meyer Axiomatic semantics, chapter 9 from [3], available here. Note that the PDF was reconstructed from an old text-processing system (troff); the figures could not be recreated and are missing. (One of these days I might have the patience of scanning them from a book copy and adding them. Unless someone wants to help.) I also put online, with the same caveat, chapter 2 on notations and mathematical basis: see here.

[6] Page 383 of [4] and [5]. The text is verbatim except a slight adaptation of the programming notation and a replacement of the variables: i in the book corresponds to i – 1 here, and j to j – 1. As a matter of fact I prefer the original conventions from the book (purely as a matter of taste, since the two are rigorously equivalent), but I changed here to the conventions of the program as it appears in the AutoProof page, with the obvious advantage that you can verify it mechanically. The text extract is otherwise exactly as in the 1990 book.

9. Lessons learned

What was this journey about?

We started with a succession of attempts that might have “felt right” but were in fact all wrong, each in its own way: giving the wrong answer in some cases, crashing (by trying to access an array outside of its index interval) in some cases, looping forever in some cases. Always “in some cases”, evidencing the limits of testing, which can never guarantee that it exercises all the problem cases. A correct program is one that works in all cases. The final version was correct; you were able to prove its correctness with an online tool and then to understand (I hope) what lies behind that proof.

To show how to prove such correctness properties, I have referred throughout the series to publications from the 1990s (my own Introduction to The Theory of Programming Languages), the 1980s (Jon Bentley’s Programming Pearls columns, Gries’s Science of Programming), and even the 1970s (Dijkstra’s Discipline of Programming). I noted that the essence of my argument appeared in a different form in one of Bentley’s Communications articles. What is the same and what has changed?

The core concepts have been known for a long time and remain applicable: assertion, invariant, variant and a few others, although they are much better understood today thanks to decades of theoretical work to solidify the foundation. Termination also has a more satisfactory theory.

On the practical side, however, the progress has been momentous. Considerable engineering has gone into making sure that the techniques scaled up. At the time of Bentley’s article, binary search was typical of the kind of programs that could be proved correct, and the proof had to proceed manually. Today, we can tackle much bigger programs, and use tools to perform the verification.

Choosing binary search again as an example today has the obvious advantage that everyone can understand all the details, but should not be construed as representative of the state of the art. Today’s proof systems are far more sophisticated. Entire operating systems, for example, have been mechanically (that is to say, through a software tool) proved correct. In the AutoProof case, a major achievement was the proof of correctness [1] of an entire data structure (collections) library, EiffelBase 2. In that case, the challenge was not so much size (about 8,000 source lines of code), but the complexity of both:

The scope of the verification, involving the full range of mechanisms of a modern object-oriented programming language, with classes, inheritance (single and multiple), polymorphism, dynamic binding, generics, exception handling etc.
The code itself, using sophisticated data structures and algorithms, involving in particular advanced pointer manipulations.

In both cases, progress has required advances on both the science and engineering sides. For example, the early work on program verification assumed a bare-bones programming language, with assignments, conditionals, loops, routines, and not much more. But real programs use many other constructs, growing ever richer as programming languages develop. To cover exception handling in AutoProof required both theoretical modeling of this construct (which appeared in [2]) and implementation work.

More generally, scaling up verification capabilities from the small examples of 30 years ago to the sophisticated software that can be verified today required the considerable effort of an entire community. AutoProof, for example, sits at the top of a tool stack relying on the Boogie environment from Microsoft Research, itself relying on the Z3 theorem prover. Many person-decades of work make the result possible.

tool_stack

Beyond the tools, the concepts are esssential. One of them, loop invariants, has been illustrated in the final version of our program. I noted in the first article the example of a well-known expert and speaker on testing who found no better way to announce that a video would not be boring than “relax, we are not going to talk about loop invariants.” Funny perhaps, but unfair. Loop invariants are one of the most beautiful concepts of computer science. Not so surprisingly, because loop invariants are the application to programming of the concept of mathematical induction. According to the great mathematician Henri Poincaré, all of mathematics rests on induction; maybe he exaggerated, maybe not, but who would think of teaching mathematics without explaining induction? Teaching programming without explaining loop invariants is no better.

Below is an illustration (if you will accept my psychedelic diagram) of what a loop is about, as a problem-solving technique. Sometimes we can get the solution directly. Sometimes we identify several steps to the solution; then we use a sequence (A ; B; C). Sometimes we can find two (or more) different ways of solving the problem in different cases; then we use a conditional (if c then A else B end). And sometimes we can only get a solution by getting closer repeatedly, not necessarily knowing in advance how many times we will have to advance towards it; then, we use a loop.

loop_strategy

We identify an often large (i.e. very general) area where we know the solution will lie; we call that area the loop invariant. The solution or solutions (there may be more than one) will have to satisfy a certain condition; we call it the exit condition. From wherever we are, we shoot into the invariant region, using an appropriate operation; we call it the initialization. Then we execute as many times as needed (maybe zero if our first shot was lucky) an operation that gets us closer to that goal; we call it the loop body. To guarantee termination, we must have some kind of upper bound of the distance to the goal, decreasing each time discretely; we call it the loop variant.

This explanation is only an illustration, but I hope it makes the ideas intuitive. The key to a loop is its invariant. As the figure suggests, the invariant is always a generalization of the goal. For example, in binary search (and many other search algorithms, such as sequential search), our goal is to find a position where either x appears or, if it does not, we can be sure that it appears nowhere. The invariant says that we have an interval with the same properties (either x appears at a position belonging to that interval or, if it does not, it appears nowhere). It obviously includes the goal as a special case: if the interval has length 1, it defines a single position.

An invariant should be:

Strong enough that we can devise an exit condition which in the end, combined with the invariant, gives us the goal we seek (a solution).
Weak enough that we can devise an initialization that ensures it (by shooting into the yellow area) easily.
Tuned so that we can devise a loop body that, from a state satifying the invariant, gets us to a new one that is closer to the goal.

In the example:

The exit condition is simply that the interval’s length is 1. (Technically, that we have computed Result as the single interval element.) Then from the invariant and the exit condition, we get the goal we want.
Initialization is easy, since we can just take the initial interval to be the whole index range of the array, which trivially satisfies the invariant.
The loop body simply decreases the length of the interval (which can serve as loop variant to ensure termination). How we decrease the length depends on the search strategy; in sequential search, each iteration decreases the length by 1, correct although not fast, and binary search decreases it by about half.

The general scheme always applies. Every loop algorithm is characterized by an invariant. The invariant may be called the DNA of the algorithm.

To demonstrate the relevance of this principle, my colleagues Furia, Velder, and I published a survey paper [6] in ACM Computing Surveys describing the invariants of important algorithms in many areas of computer science, from search algorithms to sorting (all major algorithms), arithmetic (long integer addition, squaring), optimization and dynamic programming (Knapsack, Levenshtein/Edit distance), computational geometry (rotating calipers), Web (Page Rank)… I find it pleasurable and rewarding to go deeper into the basis of loop algorithms and understand their invariants; like a geologist who does not stop at admiring the mountain, but gets to understand how it came to be.

Such techniques are inevitable if we want to get our programs right, the topic of this article. Even putting aside the Bloch average-computation overflow issue, I started with 5 program attempts, all kind of friendly-looking but wrong in different ways. I could have continued fiddling with the details, following my gut feeling to fix the flaws and running more and more tests. Such an approach can be reasonable in some cases (if you have an algorithm covering a well-known and small set of cases), but will not work for non-trivial algorithms.

Newcomers to the concept of loop invariant sometimes panic: “this is all fine, you gave me the invariants in your examples, how do I find my own invariants for my own loops?” I do not have a magic recipe (nor does anyone else), but there is no reason to be scared. Once you have understood the concept and examined enough examples (just a few of those in [6] should be enough), writing the invariant at the same time as you are devising a loop will come as a second nature to you.

As the fumbling attempts in the first few sections should show, there is not much of an alternative. Try this approach. If you are reaching these final lines after reading what preceded them, allow me to thank you for your patience, and to hope that this rather long chain of reflections on verification will have brought you some new insights into the fascinating challenge of writing correct programs.

References

[1] Nadia Polikarpova, Julian Tschannen, and Carlo A. Furia: A Fully Verified Container Library, in Proceedings of 20th International Symposium on Formal Methods (FM 15), 2015. (Best paper award.)

[2] Martin Nordio, Cristiano Calcagno, Peter Müller and Bertrand Meyer: A Sound and Complete Program Logic for Eiffel, in Proceedings of TOOLS 2009 (Technology of Object-Oriented Languages and Systems), Zurich, June-July 2009, eds. M. Oriol and B. Meyer, Springer LNBIP 33, June 2009.

[3] Boogie page at MSR, see here for publications and other information.

[4] Z3 was also originally from MSR and has been open-sourced, one can get access to publications and other information from its Wikipedia page.

[5] Carlo Furia, Bertrand Meyer and Sergey Velder: Loop invariants: Analysis, Classification and Examples, in ACM Computing Surveys, vol. 46, no. 3, February 2014. Available here.

[6] Dynamic programming is a form of recursion removal, turning a recursive algorithm into an iterative one by using techniques known as “memoization” and “bottom-up computation” (Berry). In this transformation, the invariant plays a key role. I will try to write this up some day as it is a truly elegant and illuminating explanation.

Rating: 10.0/10 (8 votes cast)

Rating: +4 (from 4 votes)

Tags: AutoProof, Donald Knuth, Eiffel, Jon Bentley, Joshua Bloch
Category: Algorithms, Computer science, Design by Contract, Education, Eiffel, Essay, Programming techniques, Software engineering, Software verification, Testing, Theory | Comment

In the scary land of irrational discourse

4 March 2020, 21:51

A chemistry researcher published a paper in Science with two junior collaborators and, a few months later, found flaws and retracted the article.

She commented “I am totally bummed to announce that we have retracted last year’s paper on enzymatic synthesis of beta-lactams” and “it is painful to admit, but important to do so” and “the work has not been reproducible” and “I apologize to all” and “I was a bit busy when this was submitted, and did not do my job well”.

Not very unusual news; this kind of thing happens all the time as part of the normal process of research and publication. (This just in! Scientists are human! They make mistakes once in a while! Full story at 11!)

Perhaps this one is slightly more worthy of notice because the lead author is a Nobel prize winner. Time for some rejoicing (Schadenfreude as it is called in good English) for anyone who is not a Nobel prize winner: haha, you think you are so smart but you mess up too. It never hurts to have an occasional reminder that we should not deify anyone. But hardly prime-time news.

Well, it is prime-time news for Fox News, which devotes a whole article to the matter. OK, I know, Fox News. And yes, it does pain me to include a hyperlink to a foxnews.com page in this otherwise perfectly decent, civilized, family-safe blog. But in fact that particular article is not by itself outrageous. Suspicious, yes: why such a sudden focus on a minor scientific episode in a news source not particularly famous (I hope you admire my gift for euphemism) for its extensive coverage of the frontlines of scientific research? But whatever its ultimate agenda the article itself is factual, not judgmental.

What is striking is the avalanche of reader comments on that page. If you go and take a look at them, be prepared; put on your parka. Reading these comments will be, for many of us, a peek into a completely different world. A world that we vaguely know exists, but do not actually visit.

It is not a nice world to venture into: full of bile, frustration, resentment, jealousy, conspiracy theories, slander, attacks on anyone trying to take a rational approach to issues, with hardly a pleasant or optimistic note ever. It is not a world one wants to visit often, but reading such a page is an eye-opener for anyone who accepts the premises of rational thinking and might believe that they are universally accepted.

“Striking”, I wrote. Scary is a more apposite word. With the kind of nonsense-spouting and science-bashing that appears in countless messages in the comments section of the page, one can fear the worst regarding questions that face our society, for which rational, science-based advice is critical. (Yes, coronavirus, I am looking at you!)

Very few of the comments on the page says the obvious: it is not good to make errors, but errors will occur, and the scientist should be commended for checking further and coming out with the admission that her study had flaws. As far as we know the initiative came from her, spontaneously. It is one of the signs of the healthiness of science that we always question results. We question those of other people (there are plenty of sites, such as pubpeer and forbetterscience, entirely devoted to tracking and debunking flawed research). We also question our own: partly to avoid the humiliation of having someone else report one of our mistakes before we do; but also because of the good scientist’s natural search for intellectual honesty.

Most of the article commenters do not mention this key lesson of the incident; the Nobel prize winner’s integrity. For them, the article retraction demonstrates that… the entire edifice of science is flawed! For example:

She’s a liberal… I thought her being wrong was understood.

Now we need to find an honest Climate Change researcher to admit that their computer models are faulty and much of their “data” is fake.

Integrity! Now if the “scientists” who have fabricated Global Warming/ Climate Change, whatever, “research” would come forward with admissions about their flawed, fallacious “research” we would be golden.

Now if we could get the climate change “scientists” to do the same maybe some credibility could be restored to the field.

and so on ad nauseam. (Not a figure of style — reading these comments is truly nauseating.) In reality the retraction demonstrates, or rather illustrates (one example is not a demonstration), the reverse of these assertions: that the scientific process includes its own correction mechanisms. To use a computer scientist’s terminology, it is not fault-free (no scientist ever claimed anything like that) but fault-tolerant.

Of course the reason the Fox News crowd is suddenly so interested in science is not (one imagines) science per se but the science of climate change. Comment after comment uses the article, as illustrated by the above examples, to dismiss the scientific consensus on the reports of the United Nations’ Intergovernmental Panel on Climate Change. In other words: the retraction of one three-author paper on beta-lactams proves that the the work of hundreds of scientists producing thousands of articles on climatology over several decades is flawed? The logic of such a deduction is… shaky.

The modern world is based, through technology, on science. To post on the Web their absurd rejections of scientifically established facts, the Fox News readers couldn’t do without relying on mobile phones, mobile networks, software systems, computers and other extraordinary achievements of human intelligence, the result of centuries of patient cumulative application of the same scientific principles and techniques that these posts ridicule. They are stuck in a pre-scientific mindset, dominated by the kind of magical thinking that the founders of modern thought had to overcome between the 16th and 18th century, as brilliantly analyzed by Gaston Bachelard’s Formation of the Scientific Mind.

Somehow they skipped what the rest of us learn in grade school (that two plus two equals four, cause precedes effect and so on). They are many, they vote, they think they are right and the rest of the world is wrong, hold these beliefs very strongly (Dunning-Kruger effect), and put the world at risk.

Rating: 10.0/10 (4 votes cast)

Rating: +1 (from 1 vote)

Category: Essay, General technology | Comment

A theorem of software engineering

1 November 2019, 09:48

Some of the folk wisdom going around in software engineering, often cluessly repeated for decades, is just wrong. It can be particularly damaging when it affects key aspects of software development and is contradicted by solid scientific evidence. The present discussion covers a question that meets both of these conditions: whether it makes sense to add staff to a project to shorten its delivery time.

My aim is to popularize a result that is well known in the software engineering literature, going back to the early work of Barry Boehm [1], and explained with great clarity by Steve McConnell in his 2006 book on software cost estimation [2] under the name “Shortest Possible Schedule”. While an empirical rather than a logical result, I believe it deserves to be called a theorem (McConnell stays shy of using the term) because it is as close as we have in the area of software engineering management to a universal property, confirmed by numerous experimental studies.

This article contributes no new concept since McConnell’s chapter 20 says all there is to say about the topic; my aim is simply to make the Shortest Possible Schedule Theorem better known, in particular to practitioners.

The myth about shortening project times begins with an observation that is clearly correct, at least in an extreme form. Everyone understands that if our project has been evaluated, through accepted cost estimation techniques, to require three developers over a year we cannot magically hire 36 people to complete it in one month. Productivity does not always scale up.

But neither does common sense. Too often the conclusion from the preceding trival observation takes the form of an old saw, “Brooks’ Law”: adding people to a late project delays it further. The explanation is that the newcomers cost more through communication overhead than they bring through actual contributions. While a few other sayings of Brooks’ Mythical Man-Month have stood the test of time, this one has always struck me as describing, rather than any actual law, a definition of bad management. Of course if you keep haplessly throwing people at deadlines you are just going to add communication problems and make things worse. But if you are a competent manager expanding the team size is one of the tools at your disposal to improve the state of a project, and it would be foolish to deprive yourself of it. A definitive refutation of the supposed law, also by McConnell, was published 20 years ago [3].

For all the criticism it deserves, Brooks’s pronouncement was at least limited in its scope: it addressed addition of staff to a project that is already late. It is even wronger to apply it to the more general issue of cost-estimating and staffing software projects, at any stage of their progress. Forty-year-old platitudes have even less weight here. As McConnell’s book shows, cost estimation is no longer a black art. It is not an exact science either, but techniques exist for producing solid estimates.

The Shortest Possible Schedule theorem is one of the most interesting results. Much more interesting than Brooks’s purported law, because it is backed by empirical studies (rather than asking us to believe one person’s pithy pronouncement), and instead of just a general negative view it provides a positive result complemented by a limitation of that result; and both are expressed quantitatively.

Figure 1 gives the general idea of the SPS theorem. General idea only; Figure 2 will provide a more precise view.

Figure 1: General view of the Shortest Possible Schedule theorem.

The “nominal project” is the result of a cost and schedule estimation yielding the optimum point. The figure and the theorem provide project managers with both a reason to rejoice and a reason to despair:

Rejoice: by putting in more money, i.e. more people (in software engineering, project costs are essentially people costs [4]), you can bring the code to fruition faster.
Despair: whatever you do, there is a firm limit to the time you can gain: 25%. It seems to be a kind of universal constant of software engineering.

The “despair” part typically gets the most attention at first, since it sets an absolute value on how much money can buy (so to speak) in software: try as hard as you like, you will never get below 75% of the nominal (optimal) value. The “impossible zone” in Figure 1 expresses the fundamental limitation. This negative result is the reasoned and precise modern replacement for the older folk “law”.

The positive part, however, is just as important. A 75%-empty glass is also 25%-full. It may be disappointing for a project manager to realize that no amount of extra manpower will make it possible to guarantee to higher management more than a 25% reduction in time. But it is just as important to know that such a reduction, not at all insignificant, is in fact reachable given the right funding, the right people, the right tools and the right management skills. The last point is critical: money by itself does not suffice, you need management; Brooks’ law, as noted, is mostly an observation of the effects of bad management.

Figure 1 only carries the essential idea, and is not meant to provide precise numerical values. Figure 2, the original figure from McConnell’s book, is. It plots effort against time rather than the reverse but, more importantly, it shows several curves, each corresponding to a published empirical study or cost model surveyed by the book.

Figure 2: Original illustration of the Shortest Possible Schedule
(figure 2-20 of [3], reproduced with the author’s permission)

On the left of the nominal point, the curves show how, according to each study, increased cost leads to decreased time. They differ on the details: how much the project needs to spend, and which maximal reduction it can achieve. But they all agree on the basic Shortest Possible Schedule result: spending can decrease time, and the maximal reduction will not exceed 25%.

The figure also provides an answer, although a disappointing one, to another question that arises naturally. So far this discussion has assumed that time was the critical resource and that we were prepared to spend more to get a product out sooner. But sometimes it is the other way around: the critical resource is cost, or, concretely, the number of developers. Assume that nominal analysis tells us that the project will take four developers for a year and, correspondingly, cost 600K (choose your currency). We only have a budget of 400K. Can we spend less by hiring fewer developers, accepting that it will take longer?

On that side, right of the nominal point in Figure 2, McConnell’s survey of surveys shows no consensus. Some studies and models do lead to decreased costs, others suggest that with the increase in time the cost will actually increase too. (Here is my interpretation, based on my experience rather than on any systematic study: you can indeed achieve the original goal with a somewhat smaller team over a longer period; but the effect on the final cost can vary. If the new time is t’= t + T and the new team size s’= s – S, t and s being the nominal values, the cost difference is proportional to Ts – t’S. It can be positive as well as negative depending on the values of the original t and s and the precise effect of reduced team size on project duration.)

The firm result, however, is the left part of the figure. The Shortest Possible Schedule theorem confirms what good project managers know: you can, within limits, shorten delivery times by bringing all hands on deck. The precise version deserves to be widely known.

References and note

[1] Barry W. Boehm: Software Engineering Economics, Prentice Hall, 1981.

[2] Steve McConnell: Software Estimation ― Demystifying the Black Art, Microsoft Press, 2006.

[3] Steve McConnell: Brooks’ Law Repealed, in IEEE Software, vol. 16, no. 6, pp. 6–8, November-December 1999, available here.

[4] This is the accepted view, even though one might wish that the industry paid more attention to investment in tools in addition to people.

A version of this article was first published on the Comm. ACM blog under the title The Shortest Possible Schedule Theorem: Yes, You Can Throw Money at Software Deadlines

Rating: 9.8/10 (6 votes cast)

Rating: +2 (from 2 votes)

Category: Computer science, Education, Empirical Software Engineering, Essay, General technology, Project management, Recycled, Software engineering | Comment

Ten traits of exceptional innovators

8 December 2018, 19:17

Imagine having had coffee, over the years, with each of Euclid, Galileo, Descartes, Marie Curie, Newton, Einstein, Lise Leitner, Planck and de Broglie. For a computer scientist, if we set aside the founding generation (the Turings and von Neumanns), the equivalent is possible. I have had the privilege of meeting and in some cases closely interacting with pioneer scientists, technologists and entrepreneurs, including Nobel, Fields and Turing winners, Silicon-Valley-type founders and such. It is only fair that I should share some of the traits I have observed in them.

Clarification and disclaimer:

This discussion is abstract and as a result probably boring because I am not citing anyone by name (apart from a few famous figures, most of whom are dead and none of whom I have met). It would be more concrete and lively if I buttressed my generalities by actual examples, of which I have many. The absence of any name-dropping is a matter of courtesy and respect for people who have interacted with me unguardedly as a colleague, not a journalist preparing a tell-all book. I could of course cite the names for positive anecdotes only, but that would bias the story (see point 4). So, sorry, no names (and I won’t relent even if you ask me privately — mumm like a fish).
I am looking at truly exceptional people. They are drawn from a more general pool of brilliant, successful scientists and technologists, of which they form only a small subset. Many of their traits also apply to this more general community and to highly successful people in any profession. What interests me is the extra step from brilliant to exceptional. It would not be that difficult to identify fifty outstanding mathematics researchers in, say, 1900, and analyze their psychological traits. The question is: why are some of them Hilbert and Poincaré, and others not?
Of course I do not even begin to answer that question. I only offer a few personal remarks.
More generally, cargo cult does not work. Emulating every one of the traits listed below will not get you a Nobel prize. You will not turn into a great composer by eating lots of Tournedos Rossini. (Well, you might start looking like the aging Rossini.) This note presents some evidence; it does not present any conclusion, let alone advice. Any consequence is for you to draw, or not.
The traits obviously do not universally characterize the population observed. Not all of the people exhibit all of the traits. On the other hand, my impression is that most exhibit most.

1 Idiosyncratic

“Idiosyncratic” is a high-sounding synonym for “diverse,” used here to deflect the ridicule of starting a list of what is common to those people by stating that they are different from each other. The point is important, though, and reassuring. Those people come in all stripes, from the stuffy professor to the sandals-shorts-and-Hawaiian-shirt surfer. Their ethnic backgrounds vary. And (glad you asked) some are men and some are women.

Consideration of many personality and lifestyle features yields no pattern at all. Some of the people observed are courteous, a delight to deal with, but there are a few jerks too. Some are voluble, some reserved. Some boastful, some modest. Some remain for their full life married to the same person, some have been divorced many times, some are single. Some become CEOs and university presidents, others prefer the quieter life of a pure researcher. Some covet honors, others are mostly driven by the pursuit of knowledge. Some wanted to become very rich and did, others care little about money. It is amazing to see how many traits appear irrelevant, perhaps reinforcing the value of those that do make a difference.

2 Lucky

In trying to apply a cargo-cult-like recipe, this one would be the hardest to emulate. We all know that Fleming came across penicillin thanks to a petri dish left uncleaned on the window sill; we also know that luck favors only the well-prepared: someone other than Fleming would have grumbled at the dirtiness of the place and thrown the dish into the sink. But I am not just talking about that kind of luck. You have to be at the right place at the right time.

Read the biographies, and you will see that almost always the person happened to study with a professor who just then was struggling with a new problem, or did an internship in a group that had just invented a novel technique, or heard about recent results before everyone else did.

Part of what comes under “luck” is luck in obtaining the right education. Sure, there are a few autodidacts, but most of the top achievers studied in excellent institutions.

Success comes from a combination of nature and nurture. The perfect environment, such as a thriving laboratory or world-class research university, is not enough; but neither is individual brilliance. In most cases it is their combination that produces the catalysis.

3 Smart

Laugh again if you wish, but I do not just mean the obvious observation that those people were clever in what they did. In my experience they are extremely intelligent in other ways too. They often possess deep knowledge beyond their specialties and have interesting conversations.

You approach them because of the fame they gained in one domain, and learn from them about topics far beyond it.

4 Human

At first, the title of this section is another cause for ridicule: what did you expect, extraterrestrials? But “human” here means human in their foibles too. You might expect, if not an extraterrestrial, someone of the oracle-of-Delphi or wizard-on-a-mountain type, who after a half-hour of silence makes a single statement perfect in its concision and exactitude.

Well, no. They are smart, but they say foolish things too. And wrong things. Not only do they say them, they even publish them. (Newton wasted his brilliance on alchemy. Voltaire — who was not a scientist but helped promote science, translating Newton and supporting the work of Madame du Châtelet — wasted his powerful wit to mock the nascent study of paleontology: so-called fossils are just shells left over by picnicking tourists! More recently, a very famous computer scientist wrote a very silly book — of which I once wrote, fearlessly, a very short and very disparaging review.)

So what? It is the conclusion of the discussion that counts, not the meanderous path to it, or the occasional hapless excursion into a field where your wisdom fails you. Once you have succeeded, no one will care how many wrong comments you made in the process.

It is fair to note that the people under consideration probably say fewer stupid things than most. (The Erich Kästner ditty from an earlier article applies.) But no human, reassuringly perhaps, is right 100% of the time.

What does set them apart from many people, and takes us back to the previous trait (smart), is that even those who are otherwise vain have no qualms recognizing mistakes in their previous thinking. They accept the evidence and move on.

5 Diligent

Of two people, one an excellent, top-ranked academic, the other a world-famous pioneer, who is the more likely to answer an email? In my experience, the latter.

Beyond the folk vision of the disheveled, disorganized, absent-minded professor lies the reality of a lifetime of rigor and discipline.

This should not be a surprise. There is inspiration, and there is perspiration. Think of it as the dual of the broken-windows theory, or of the judicial view that a defendant who lies in small things probably lies in big things: the other way around, if you do huge tasks well, you probably do small tasks well too.

6 Focused

Along with diligence comes focus, carried over from big matters to small matters. It is the lesser minds that pretend to multiplex. Great scientists, in my experience, do not hack away at their laptops during talks, and they turn off their cellphones. They choose carefully what they do (they are deluged with requests and learn early to say no), but what they accept to do they do. Seriously, attentively, with focus.

A fascinating spectacle is a world-famous guru sitting in the first row at a conference presentation by a beginning Ph.D. student, and taking detailed notes. Or visiting an industrial lab and quizzing a junior engineer about the details of the latest technology.

For someone who in spite of the cargo cult risk is looking for one behavior to clone, this would be it. Study after study has shown that we only delude ourselves in thinking we can multiplex. Top performers understand this. In the seminar room, they are not the ones doing email. If they are there at all, then watch and listen.

7 Eloquent

Top science and technology achievers are communicators. In writing, in speaking, often in both.

This quality is independent from their personal behavior, which can cover the full range from shy to boisterous. It is the quality of being articulate. They know how to convey their results — and often do not mind crossing the line to self-advertising. It is not automatically the case that true value will out: even the most impressive advances need to be pushed to the world.

The alternative is to become Gregor Mendel: he single-handedly discovered the laws of genetics, and was so busy observing the beans in his garden that no one heard about his work until some twenty years after his death. Most of us prefer to get the recognition earlier. (Mendel was a monk, so maybe he believed in an afterlife; yet again maybe he, like everyone else, might have enjoyed attracting interest in this world first.)

In computer science it is not surprising that many of the names that stand out are of people who have written seminal books that are a pleasure to read. Many of them are outstanding teachers and speakers as well.

8 Open

Being an excellent communicator does not mean that you insist on talking. The great innovators are excellent listeners too.

Some people keep talking about themselves. They exist in all human groups, but this particular trait is common among scientists, particularly junior scientists, who corner you and cannot stop telling you about their ideas and accomplishments. That phenomenon is understandable, and in part justified by an urge to avoid the Mendel syndrome. But in a conversation involving some less and some more recognized professionals it is often the most accomplished members of the group who talk least. They are eager to learn. They never forget that the greatest insighs can start with a casual observation from an improbable source. They know when to talk, and when to shut up and listen.

Openness also means intellectual curiosity, willingness to have your intellectual certainties challenged, focus on the merit of a comment rather than the commenter’s social or academic status, and readiness to learn from disciplines other than your own.

9 Selfish

People having achieved exceptional results were generally obsessed with the chase and the prey. They are as driven as an icebreaker ship in the Sea of Barents. They have to get through; the end justifies the means; anything in the way is collateral damage.

So it is not surprising, in the case of academics, to hear colleagues from their institutions mumble that X never wanted to do his share, leaving it to others to sit in committees, teach C++ to biology majors and take their turn as department chair. There are notable exceptions, such as the computer architecture pioneer who became provost then president at Stanford before receiving the Turing Award. But you do not achieve breakthroughs by doing what everything else is doing. When the rest of the crowd is being sociable and chatty at the conference party long into the night, they go back to their hotel to be alert for tomorrow’s session. A famous if extreme case is Andrew Wiles, whom colleagues in the department considered a has-been, while he was doing the minimum necessary to avoid trouble while working secretly and obsessively to prove Fermat’s last theorem.

This trait is interesting in light of the soothing discourse in vogue today. Nothing wrong with work-life balance, escaping the rat race, perhaps even changing your research topic every decade (apparently the rule in some research organizations). Sometimes a hands-off, zen-like attitude will succeed where too much obstination would get stuck. But let us not fool ourselves: the great innovators never let go of the target.

10. Generous

Yes, selfishness can go with generosity. You obsess over your goals, but it does not mean you forget other people.

Indeed, while there are a few solo artists in the group under observation, a striking feature of the majority is that in addition to their own achievements they led to the creation of entire communities, which often look up to them as gurus. (When I took the comprehensive exam at Stanford, the first question was what the middle initial “E.” of a famous professor stood for. It was a joke question, counting for maybe one point out of a hundred, helpfully meant to defuse students’ tension in preparation for the hard questions that followed. But what I remember is that every fellow student whom I asked afterwards knew the answer. Me too. Such was the personality cult.) The guru effect can lead to funny consequences, as with the famous computer scientist whose disciples you could spot right away in conferences by their sandals and beards (I do not remember how the women coped), carefully patterned after the master’s.

The leader is often good at giving every member of that community flattering personal attention. In a retirement symposium for a famous professor, almost every person I talked too was proud of having developed a long-running, highly personal and of course unique relationship with the honoree. One prestigious computer scientist who died in the 80’s encouraged and supported countless young people in his country; 30 years later, you keep running into academics, engineers and managers who tell you that they owe their career to him.

Some of this community-building can be self-serving and part of a personal strategy for success. There has to be more to it, however. It is not just that community-building will occur naturally as people discover the new ideas: since these ideas are often controversial at first, those who understood their value early band together to defend them and support their inventor. But there is something else as well in my observation: the creators’ sheer, disinterested generosity.

These people are passionate in their quest for discovery and creation and genuinely want to help others. Driven and self-promoting they may be, but the very qualities that led to their achievements — insight, intellectual courage, ability to think beyond accepted ideas — are at the antipodes of pettiness and narrow-mindedness. A world leader cannot expect any significant personal gain from spotting and encouraging a promising undergraduate, telling a first-time conference presenter that her idea is great and worth pushing further, patiently explaining elementary issues to a beginning student, or responding to a unknown correspondent’s emails. And still, as I have observed many times, they do all of this and more, because they are in the business of advancing knowledge.

These are some of the traits I have observed. Maybe there are more but, sorry, I have to go now. The pan is sizzling and I don’t like my tournedos too well-done.

(Originally published on CACM blog.)

Rating: 9.3/10 (8 votes cast)

Rating: +4 (from 4 votes)

Category: Education, Essay, General technology, History, Memoir, Personal | Comment

Why not program right?

24 May 2018, 20:05

(Originally published on CACM blog.)

Most of the world programs in a very strange way. Strange to me. I usually hear the reverse question: people ask us, the Eiffel community, to explain why we program our way. I hardly understand the question, because the only mystery is how anyone can even program in any other way.

The natural reference is the beginning of One Flew Over the Cuckoo’s Nest: when entering an insane asylum and wondering who is an inmate and who a doctor, you may feel at a loss for objective criteria. Maybe the rest of the world is right and we are the nut cases. Common sense suggests it.

But sometimes one can go beyond common sense and examine the evidence. So lend me an ear while I explain my latest class invariant. Here it is, in Figure 1. (Wait, do not just run away yet.)

multigraph_invariant

Figure 1: From the invariant of class MULTIGRAPH

This is a program in progress and by the time you read this note the invariant and enclosing class will have changed. But the ideas will remain.

Context: multigraphs

The class is called MULTIGRAPH and describes a generalized notion of graph, illustrated in Figure 2. The differences are that: there can be more than one edge between two nodes, as long as they have different tags (like the spouse and boss edges between 1 and 2); and there can be more than one edge coming out of a given node and with a given tag (such as the two boss edges out of 1, reflecting that 1’s boss might be 2 in some cases and 3 in others). Some of the nodes, just 1 here, are “roots”.

The class implements the notion of multigraph and provides a wide range of operations on multigraphs.

multigraph_example

Figure 2: A multigraph

Data structures

Now we turn to the programming and software engineering aspects. I am playing with various ways of accessing multigraphs. For the basic representation of a multigraph, I have chosen a table of triples:

triples_table: HASH_TABLE [TRIPLE, TUPLE [source: INTEGER; tag: INTEGER; target: INTEGER]] — Table of triples, each retrievable through its `source’, `tag’ and `target’.

where the class TRIPLE describes [source, tag, target] triples, with a few other properties, so they are not just tuples. It is convenient to use a hash table, where the key is such a 3-tuple. (In an earlier version I used just an ARRAY [TRIPLE], but a hash table proved more flexible.)

Sources and targets are nodes, also called “objects”; we represent both objects and tags by integers for efficiency. It is easy to have structures that map symbolic tag names such as “boss” to integers.

triples_table is the core data structure but it turns out that for the many needed operations it is convenient to have others. This technique is standard: for efficiency, provide different structures to access and manipulate the same underlying information, with some redundancy. So I also have:

triples_from: ARRAYED_LIST [LIST [TRIPLE]]
— Triples starting from a given object. Indexed by object numbers.

triples_with: HASH_TABLE [LIST [TRIPLE], INTEGER]
— Triples labeled by a given tag. Key is tag number.

triples_to: ARRAYED_LIST [LIST [TRIPLE]]
— Triples leading into a given object. Indexed by object numbers.

Figure 3 illustrates triples_from and Figures 4 illustrates triples_with. triples_to is similar.

triples_from

Figure 3: The triples_from array of lists and the triples_table

triples_with

Figure 4: The triples_with array of lists and the triples_table

It is also useful to access multigraphs through yet another structure, which gives us the targets associated with a given object and tag:

successors: ARRAY [HASH_TABLE [LIST [TRIPLE], INTEGER]]
— successors [obj] [t] includes all o such that there is a t- reference from obj to o.

For example in Figure 1 successors [1] [spouse] is {2, 3}, and in Figures 3 and 4 successors [26] [t] is {22, 55, 57}. Of course we can obtain the “successors” information through the previously defined structures, but since this is a frequently needed operation I decided to include a specific data structure (implying that every operation modifying the multigraph must update it). I can change my mind later on and decide to make “successors” a function rather than a data structure; it is part of the beauty of OO programming, particularly in Eiffel, that such changes are smooth and hardly impact client classes.

There is similar redundancy in representing roots:

roots: LINKED_SET [INTEGER]
— Objects that are roots.

is_root: ARRAY [BOOLEAN]
— Which objects are roots? Indexed by object numbers.

If o is a root, then it appears in the “roots” set and is_root [o] has value True.

Getting things right

These are my data structures. Providing such a variety of access modes is a common programming technique. From a software engineering perspective ― specification, implementation, verification… ― it courts disaster. How do we maintain their consistency? It is very easy for a small mistake to slip into an operation modifying the graph, causing one of the data structures to be improperly updated, but in a subtle and rare enough way that it will not manifest itself during testing, coming back later to cause strange behavior that will be very hard to debug.

For example, one of the reasons I have a class TRIPLE and not just 3-tuples is that a triple is not exactly the same as an edge in the multigraph. I have decided that by default the operation that removes and edge would not remove the corresponding triple from the data structure, but leave it in and mark it as “inoperative” (so class TRIPLE has an extra “is_inoperative” boolean field). There is an explicit GC-like mechanism to clean up deleted edges occasionally. This approach brings efficiency but makes the setup more delicate since we have to be extremely careful about what a triple means and what removal means.

This is where I stop understanding how the rest of the world can work at all. Without some rigorous tools I just do not see how one can get such things right. Well, sure, spend weeks of trying out test cases, printing out the structures, manually check everything (in the testing world this is known as writing lots of “oracles”), try at great pains to find out the reason for wrong results, guess what program change will fix the problem, and start again. Stop when things look OK. When, as Tony Hoare once wrote, there are no obvious errors left.

Setting aside the minuscule share of projects (typically in embedded life-critical systems) that use some kind of formal verification, this process is what everyone practices. One can only marvel that systems, including many successful ones, get produced at all. To take an analogy from another discipline, this does not compare to working like an electrical engineer. It amounts to working like an electrician.

For a short time I programmed like that too (one has to start somewhere, and programming methodology was not taught back then). I no longer could today. Continuing with the Hoare citation, the only acceptable situation is to stop when there are obviously no errors left.

How? Certainly not, in my case, by always being right the first time. I make mistakes like everyone else does. But I have the methodology and tools to avoid some, and, for those that do slip through, to spot and fix them quickly.

Help is available

First, the type system. Lots of inconsistencies, some small and some huge, which in an untyped language would only hit during execution, do not make it past compilation. We are not just talking here about using REAL instead of INTEGER. With a sophisticated type system involving multiple inheritance, genericity, information hiding and void safety, a compiler error message can reflect a tricky logical mistake. You are using a SET as if it were a LIST (some operations are common, but others not). You are calling an operation on a reference that may be void (null) at run time. And so on.

By the way, about void-safety: for a decade now, Eiffel has been void-safe, meaning a compile-time guarantee of no run-time null pointer dereferencing. It is beyond my understanding how the rest of the world can still live with programs that run under myriad swords of Damocles: x.op (…) calls that might any minute, without any warning or precedent, hit a null x and crash.

Then there is the guarantee of logical consistency, which is where my class invariant (Figure 1) comes in. Maybe it scared you, but in reality it is all simple concepts, intended to make sure that you know what you are doing, and rely on tools to check that you are right. When you are writing your program, you are positing all kinds, logical assumptions, large and (mostly) small, all the time. Here, for the structure triples_from [o] to make sense, it must be a list such that:

It contains all the triples t in the triples_table such that t.source = o.
It contains only those triples!

You know this when you write the program; otherwise you would not be having a “triples_from” structure. Such gems of knowledge should remain an integral part of the program. Individually they may not be rocket science, but accumulated over the lifetime of a class design, a subsystem design or a system design they collect all the intelligence that makes the software possible. Yet in the standard process they are gone the next minute! (At best, some programmers may write a comment, but that does not happen very often, and a comment has no guarantee of precision and no effect on testing or correctness.)

Anyone who takes software development seriously must record such fundamental properties. Here we need the following invariant clause:

across triples_from as tf all

across tf.item as tp all tp.item.source = tf.cursor_index end

end

(It comes in the class, as shown in Figure 1, with the label “from_list_consistent”. Such labels are important for documentation and debugging purposes. We omit them here for brevity.)

What does that mean? If we could use Unicode (more precisely, if we could type it easily with our keyboards) we would write things like “∀ x: E | P (x)”: for all x in E, property P holds of x. We need programming-language syntax and write this as across E as x all P (x.item) end. The only subtlety is the “.item” part, which gives us generality beyond the ∀ notation: x in the across is not an individual element of E but a cursor that moves over E. The actual element at cursor position is x.item, one of the properties of that cursor. The advantage is that the cursor has more properties, for example x.cursor_index, which gives its position in E. You do not get that with the plain “ of mathematics.

If instead of ∀ you want ∃ (there exists), use some instead of all. That is pretty much all you need to know to understand all the invariant clauses of class MULTIGRAPH as given in Figure 1.

So what the above invariant clause says is: take every position tf in triples_from; its position is tf.cursor_index and its value is tf.item. triples_from is declared as ARRAYED_LIST [LIST [TRIPLE]], so tf.cursor_index is an integer representing an object o, and tf.item is a list of triples. That list should consist of the triples having tf.cursor_index as their source. This is the very property that we are expressing in this invariant clause, where the innermost across says: for every triple tp.item in the list, the source of that triple is the cursor index (of the outside across). Simple and straightforward, I think (although such English explanations are so much more verbose than formal versions, such as the Eiffel one here, and once you get the hang of it you will not need them any more).

How can one ever include a structure such as triples_from without expressing such a property? To put the question slightly differently: am I inside the asylum looking out, or outside the asylum looking in? Any clue would be greatly appreciated.

More properties

For the tag ( “with_”) and target lists, the properties are similar:

across triples_with as tw all across tw.item as tp all tp.item.tag = tw.key end end

across triples_to as tt all across tt.item as tp all tp.item.target = tt.cursor_index end end

We also have some properties of array bounds:

is_root.lower = 1 and is_root.upper = object_count

triples_from.lower = 1 and triples_from.upper = object_count

triples_to.lower = 1 and triples_to.upper = object_count

where object_count is the number of objects (nodes), and for an array a (whose bounds in Eiffel are arbitrary, not necessarily 0 or 1, and set on array creation), a.lower and a.upper are the bounds. Here we number the arrays from 1.

There are, as noted, two ways to represent rootness. We must express their consistency (or risk trouble). Two clauses of the invariant do the job:

across roots as t all is_root [t.item] end

across is_root as t all (t.item = roots.has (t.cursor_index)) end

The first one says that if we go through the list “roots” we only find elements whose “is_root” value is true; the second, that if we go through the array “is_root” we find values that are true where and only where the corresponding object, given by the cursor index, is in the “roots” set. Note that the “=” in that second property is between boolean values (if in doubt, check the type instantly in the EIffelStudio IDE!), so it means “if and only if”.

Instead of these clauses, a more concise version, covering them both, is just

roots ~ domain (is_root)

with a function domain that gives the domain of a function represented by a boolean array. The ~ operator denotes object equality, redefined in many classes, and in particular in the SET classes (“roots” is a LINKED_SET) to cover equality between sets, i.e. the property of having the same elements.

The other clauses are all similarly self-explanatory. Let us just go through the most elaborate one, successors_consistent, involving three levels of across:

across successors as httpl all — httpl.item: hash table of list of triples

across httpl.item as tpl all — tpl.item: list of triples (tpl.key: key (i.e. tag) in hash table (tag)

across tpl.item as tp all — tp.item: triple

tp.item.tag = tpl.key

and tp.item.source = httpl.cursor_index

end

You can see that I struggled a bit with this one and made provisions for not having to struggle again when I would look at the code again 10 minutes, 10 days or 10 months later. I chose (possibly strange but consistent) names such as httpl for hash-table triple, and wrote comments (I do not usually need any in invariant and other contract clauses) to remind me of the type of everything. That was not strictly needed since once again the IDE gives me the types, but it does not cost much and could help.

What this says: go over “successors”; which as you remember is an ARRAY, indexed by objects, of HASH_TABLE, where each entry of such a hash table has an element of type [LIST [TRIPLE] and a key of type INTEGER, representing the tag of a number of outgoing edges from the given object. Go over each hash table httpl. Go over the associated list of triples tpl. Then for each triple tp in this list: the tag of the triple must be the key in the hash table entry (remember, the key does denote a tag); and the source of the triple must the object under consideration, which is the current iteration index in the array of the outermost iteration.

I hope I am not scaring you at this point. Although the concepts are simple, this invariant is more sophisticated than most of those we typically write. Many invariant clauses (and preconditions, and postconditions) are very simple properties, such as x > 0 or x ≠ y. The reason this one is more elaborate is not that I am trying to be fussy but that without it I would be the one scared to death. What is elaborate here is the data structure and programming technique. Not rocket science, not anything beyond programmers typically do, but elaborate. The only way to get it right is to buttress it by the appropriate logical properties. As noted, these properties are there anyway, in the back of your head, when you write the program. If you want to be more like an electrical engineer than an electrician, you have to write them down.

There is more to contracts

Invariants are not the only kind of such “contract” properties. Here for example, from the same class, is a (slightly abbreviated) part of the postcondition (output property) of the operation that tells us, through a boolean Result, if the multigraph has an edge of given components osource, t (the tag) and otarget :

Result =

(across successors [osource] [t] as tp some

not tp.item.is_inoperative and tp.item.target = otarget

end)

In words, this clause expresses the compatibility of the operation with the “successors” view: it must answer yes if and only if otarget appears in the successor set of osource for t, and the corresponding triple is not marked inoperative.

The concrete benefits

And so? What do we get out of making these logical properties explicit? Just the intellectual satisfaction of doing things right, and the methodological guidance? No! Once you have done this work, it is all downhill. Turn on the run-time assertion monitoring option (tunable separately for preconditions, postconditions, invariants etc., and on by default in development mode), and watch your tests run. If you are like almost all of us, you will have made a few mistakes, some which will seem silly when or rather if you find them in time (but there is nothing funny about a program that crashes during operation) and some more subtle. Sit back, and just watch your contracts be violated. For example if I change “<=” to “<” in the invariant property “tw.key <= max_tag”, I get the result of Figure 5. I see the call stack that I can traverse, the object run-time structure that I can explore, and all the tools of a modern debugger for an OO language. Finding and correcting the logical flaw will be a breeze.

debugger

Figure 5: An invariant violation brings up the debugger

The difference

It will not be a surprise that I did not get all the data structures and algorithms of the class MULTIGRAPH right the first time. The Design by Contract approach (the discipline of systematically expressing, whenever you write any software element, the associated logical properties) does lead to fewer mistakes, but everyone occasionally messes up. Everyone also looks at initial results to spot and correct mistakes. So what is the difference?

Without the techniques described here, you execute your software and patiently examine the results. In the example, you might output the content of the data structures, e.g.

List of outgoing references for every object:

1: 1-1->1|D, 1-1->2|D, 1-1->3|D, 1-2->1|D, 1-2->2|D, 1-25->8|D, 1-7->1|D, 1-7->6|D,

1-10->8|D, 1-3->1|D, 1-3->2|D, 1-6->3|D, 1-6->4|D, 1-6->5|D

3: 3-6->3, 3-6->4, 3-6->5, 3-9->14, 3-9->15, 3-9->16, 3-1->3, 3-1->2, 3-2->3, 3-2->2,

3-25->8, 3-7->3, 3-7->6, 3-10->8, 3-3->3, 3-3->2

List of outgoing references for every object:

1: 1-1->1|D, 1-1->2|D, 1-1->3|D, 1-2->1|D, 1-2->2|D, 1-25->8|D, 1-7->1|D, 1-7->6|D,

1-10->8|D, 1-3->1|D, 1-3->2|D, 1-6->3|D, 1-6->4|D, 1-6->5|D

3: 3-6->3, 3-6->4, 3-6->5, 3-9->14, 3-9->15, 3-9->16, 3-1->3, 3-1->2, 3-2->3, 3-2->2,

3-25->8, 3-7->3, 3-7->6, 3-10->8, 3-3->3, 3-3->2

and so on for all the structures. You check the entries one by one to ascertain that they are as expected. The process nowadays has some automated support, with tools such as JUnit, but it is still essentially manual, tedious and partly haphazard: you write individual test oracles for every relevant case. (For a more automated approach to testing, taking advantage of contracts, see [1].) Like the logical properties appearing in contracts, these oracles are called “assertions” but the level of abstraction is radically different: an oracle describes the desired result of one test, where a class invariant, or routine precondition, or postcondition expresses the properties desired of all executions.

Compared to the cost of writing up such contract properties (simply a matter of formalizing what you are thinking anyway when you write the code), their effect on testing is spectacular. Particularly when you take advantage of “across” iterators. In the example, think of all the checks and crosschecks automatically happening across all the data structures, including the nested structures as in the 3-level across clause. Even with a small test suite, you immediately get, almost for free, hundreds or thousands of such consistency checks, each decreasing the likelihood that a logical flaw will survive this ruthless process.

Herein lies the key advantage. Not that you will magically stop making mistakes; but that the result of such mistakes, in the form of contract violations, directly points to logical properties, at the level of your thinking about the program. A wrong entry in an output, whether you detect it visually or through a Junit clause, is a symptom, which may be far from the cause. (Remember Dijkstra’s comment, the real point of his famous Goto paper, about the core difficulty of programming being to bridge the gap between the static program text, which is all that we control, and its effect: the myriad possible dynamic executions.) Since the cause of a bug is always a logical mistake, with a contract violation, which expresses a logical inconsistency, you are much close to that cause.

(About those logical mistakes: since a contract violation reflects a discrepancy between intent, expressed by the contract, and reality, expressed by the code, the mistake may be on either side. And yes, sometimes it is the contract that is wrong while the implementation in fact did what is informally expected. There is partial empirical knowledge [1] of how often this is the case. Even then, however, you have learned something. What good is a piece of code of which you are not able to say correctly what it is trying to do?)

The experience of Eiffel programmers reflects these observations. You catch the mistakes through contract violations; much of the time, you find and correct the problem easily. When you do get to producing actual test output (which everyone still does, of course), often it is correct.

This is what has happened to me so far in the development of the example. I had mistakes, but converging to a correct version was a straightforward process of examining violations of invariant violations and other contract elements, and fixing the underlying logical problem each time.

By the way, I believe I do have a correct version (in the sense of the second part of the Hoare quote), on the basis not of gut feeling or wishful thinking but of solid evidence. As already noted it is hard to imagine, if the code contains any inconsistencies, a test suite surviving all the checks.

Tests and proofs

Solid evidence, not perfect; hard to imagine, not impossible. Tests remain only tests; they cannot exercise all cases. The only way to achieve demonstrable correctness is to rely on mathematical proofs performed mechanically. We have this too, with the AutoProof proof system for Eiffel, developed in recent years [1]. I cannot overstate my enthusiasm for this work (look up the Web-based demo), its results (automated proof of correctness of a full-fledged data structures and algorithms library [2]) and its potential, but it is still a research effort. The dynamic approach (meaning test-based rather than proof-based) presented above is production technology, perfected over several decades and used daily for large-scale mission-critical applications. Indeed (I know you may be wondering) it scales up without difficulty:

The approach is progressive. Unlike fully formal methods (and proofs), it does not require you to write down every single property down to the last quantifier. You can start with simple stuff like x > 0. The more you write, the more you get, but it is the opposite of an all-or-nothing approach.
On the practical side, if you are wondering about the consequences on performance of a delivered system: there is none. Run-time contract monitoring is a compilation option, tunable for different kinds of contracts (invariants, postconditions etc.) and different parts of a system. People use it, as discussed here, for development, testing and debugging. Most of the time, when you deliver a debugged system, you turn it off.
It is easy to teach. As a colleague once mentioned, if you can write an if-then-else you can write a precondition. Our invariants in the above example where a bit more sophisticated, but programmers do write loops (in fact, the Eiffel loop for iterating over a structure also uses across, with “loop” and instructions instead of “all” or “some” and boolean expressions). If you can write a loop over an array, you can write a property of the array’s elements.
A big system is an accumulation of small things. In a blog article [5] I recounted how I lost a full day of producing a series of technical diagrams of increasing complexity, using one of the major Web-based collaborative development tools. A bug of the system caused all the diagrams to reproduce the first, trivial one. I managed to get through to the developers. My impression (no more than an educated guess resulting from this interaction) is that the data structures involved were far simpler than the ones used in the above discussion. One can surmise that even simple invariants would have uncovered the bug during testing rather than after deployment.
Talking about deployment and tools used directly on the cloud: the action in software engineering today is in DevOps, a rapid develop-deploy loop scheme. This is where my perplexity becomes utter cluelessness. How can anyone even consider venturing into that kind of exciting but unforgiving development model without the fundamental conceptual tools outlined above?

We are back then to the core question. These techniques are simple, demonstrably useful, practical, validated by years of use, explained in professional books (e.g. [6]), introductory programming textbooks (e.g. [7]), EdX MOOCs (e.g. [8]), YouTube videos, online tutorials at eiffel.org, and hundreds of articles cited thousands of times. On the other hand, most people reading this article are not using Eiffel. On reflection, a simple quantitative criterion does exist to identify the inmates: there are far more people outside the asylum than inside. So the evidence is incontrovertible.

What, then, is wrong with me?

References

(Nurse to psychiatrist: these are largely self-references. Add “narcissism” to list of patient’s symptoms.)

1. Ilinca Ciupa, Andreas Leitner, Bertrand Meyer, Manuel Oriol, Yu Pei, Yi Wei and others: AutoTest articles and other material on the AutoTest page.

2. Bertrand Meyer, Ilinca Ciupa, Lisa (Ling) Liu, Manuel Oriol, Andreas Leitner and Raluca Borca-Muresan: Systematic evaluation of test failure results, in Workshop on Reliability Analysis of System Failure Data (RAF 2007), Cambridge (UK), 1-2 March 2007 available here.

3. Nadia Polikarpova, Ilinca Ciupa and Bertrand Meyer: A Comparative Study of Programmer-Written and Automatically Inferred Contracts, in ISSTA 2009: International Symposium on Software Testing and Analysis, Chicago, July 2009, available here.

4. Carlo Furia, Bertrand Meyer, Nadia Polikarpova, Julian Tschannen and others: AutoProof articles and other material on the AutoProof page. See also interactive web-based online tutorial here.

5. Bertrand Meyer, The Cloud and Its Risks, blog article, October 2010, available here.

6. Bertrand Meyer: Object-Oriented Software Construction, 2^nd edition, Prentice Hall, 1997.

7. Bertrand Meyer: Touch of Class: Learning to Program Well Using Objects and Contracts, Springer, 2009, see touch.ethz.ch and Amazon page.

8. MOOCs (online courses) on EdX : Computer: Art, Magic, Science, Part 1 and Part 2. (Go to “archived versions” to follow the courses.)

Rating: 9.9/10 (12 votes cast)

Rating: +8 (from 10 votes)

Category: Computer science, Design by Contract, Eiffel, Essay, Language design, Object technology, Programming techniques, Software engineering, Software process, Software design | 3 Comments

Towards empirical answers to important software engineering questions

26 January 2018, 13:53

(Adapted from a two-part article on the Communications of the ACM blog.)

1 The rise of empirical software engineering

One of the success stories of software engineering research in recent decades has been the rise of empirical studies. Visionaries such as Vic Basili, Marvin Zelkowitz and Walter Tichy advocated empirical techniques early [1, 2, 3]; what enabled the field to take off was the availability of software repositories for such long-running projects as Apache, Linux and Eclipse [4], which researchers started mining using modern data analysis techniques.

These studies have yielded many insights including surprises. More experienced developers can produce more buggy code (Schröter, Zimmermann, Premraj, Zeller). To predict whether a module has bugs, intrinsic properties such as complexity seem to matter less than how many changes it went through (Moser, Pedrycz, Succi). Automatic analysis of user reports seems much better at identifying bugs than spotting feature requests (Panichella, Di Sorbo, Guzman, Visaggio, Canfora, Gall). More extensively tested modules tend to have more bugs (Mockus, Nagappan, Dinh-Trong). Eiffel programmers do use contracts (Estler, Furia, Nordio, Piccioni and me). Geographical distance between team members negatively affects the amount of communication in projects (Nordio, Estler, Tschannen, Ghezzi, Di Nitto and me). And so on.

The basic observation behind empirical software engineering is simple: if software products and processes are worthy of discussion, they must be worthy of quantitative discussion just like any natural artifact or human process. Usually at that point the advocacy cites Lord Kelvin:”If you cannot measure it, you cannot improve it” [5].

Not that advocacy is much needed today, at least for publishing research in software engineering and in computer science education. The need for empirical backing of conceptual proposals has achieved consensus. The so-called a “Marco Polo paper” [6] (I traveled far and saw wonderful things, thank you very much for your attention) no longer suffices for program committees; today they want numbers (and also, thankfully, a “threats to validity” section which protects you against suspicions that the numbers are bogus by stating why they might be). Some think this practice of demanding empirical backing for anything you propose has gone too far; see Jeff Ullman’s complaint [7], pertaining to database research rather than software engineering, but reflecting some of the same discussions. Here we can counter Kelvin with another quote (for more effect attributed to Einstein, albeit falsely): not everything that can be counted counts, and not everything that counts can be counted.

2 Limits of empirical research

There can indeed be too much of a good thing. Still, no one would seriously deny the fundamental role that empirical research has gained in modern software engineering. Which does not prevent us from considering the limits of what it has achieved; not in a spirit of criticism for its own sake, but to help researchers define an effective agenda for the next steps. There are in my opinion two principal limitations of current empirical results in software engineering.

The first has to do with the distinction introduced above between the two kinds of possible targets for empirical assessment: products (artifacts) versus processes.

Both aspects are important, but one is much easier to investigate than the other. For software products, the material of study is available in the form of repositories mentioned above, with their wealth of information about lines of code, control and data structures, commits, editing changes, bug reports and bug fixes. Processes are harder to grasp. You gain some information on processes from the repositories (for example, patterns and delays of bug fixing), but processes deserve studies of their own. For example, agile teams practice iterations (sprints) of widely different durations, from a few days to a few weeks; what is the ideal length? A good empirical answer would help many practitioners. But this example illustrates how difficult empirical studies of processes can be: you would need to try many variations with teams of professional programmers (not students) in different projects, different application areas, different companies; for the results to be believable the projects should be real ones with business results at stake, there should be enough samples in each category to ensure statistical significance, and the companies should agree to publication of some form, possibly anonymized, of the outcomes. The difficulties are formidable.

This issue of how to obtain project-oriented metrics is related to the second principal limitation of some of the initial empirical software engineering work: the risk of indulging in lamppost research. The term refers to the well-known joke about the drunkard who, in the dark of the night, searches for his lost keys next to the lamp post, not because he has lost them there but because it is the only place where one can see anything. To a certain extent all research is lamppost research: by definition, if you succeed in studying something, it will be because it can be studied. But the risk is to choose to work on a problem only, or principally, because it is easy to set up an empirical study — regardless of its actual importance. To cite an example that I have used elsewhere, one may suspect that the reason there are so many studies of pair programming is not that it’s of momentous relevance but that it is not hard to set up an experiment.

3 Beyond the lamppost

As long as empirical software engineering was a young, fledgling discipline, it made good sense to start with problems that naturally lended themselves to empirical investigation. But now that the field has matured, it may be time to reverse the perspective and start from the consumer’s perspective: for practitioners of software engineering, what problems, not yet satisfactorily answered by software engineering theory, could benefit, in the search for answers, from empirical studies?

Indeed, this is what we are entitled to expect from empirical studies: guidance. The slogan of empirical software engineering is that software is worthy of study just like geological strata, photons, and lilies-of-the-valley; OK, sure, but we are talking about human artifacts rather than wonders of the natural world, and the idea should be to help us produce better software and produce software better.

4 A horror story

Whenever we call for guidance from empirical studies, we should immediately include a caveat: every empirical study has its limitations (politely called “threats to validity”) and one must be careful about any generalization. The following horror story serves as caution [9]. The fashion today in programming language design is to use the semicolon not as separator in the Algol tradition (instruction1 ; instruction2) but as a terminator in the C tradition (instruction1; instruction2;). The original justification, particularly in the case of Ada [10], is an empirical paper by Gannon and Horning [11], which purported to show that the terminator convention led to fewer errors. (The authors themselves not only give their experimental results but, departing from the experimenter’s reserve, explicitly jump to the conclusion that terminators are better.) This view defies reason: witness, among others, the ever-recommenced tragedy of if c then a; else; b where the semicolon after else is an error (a natural one, since one gets into the habit of adding semicolons just in case) but the code compiles, with the result that b will be executed in all cases rather than (as intended) just when c is false [12].

How in the world could an empirical study come up with such a bizarre conclusion? Go back to the original Gannon-Horning paper and the explanation becomes clear: the experiments used subjects who were familiar with the PL/I programming language, where semicolons are used generously and an extra semicolon is harmless, as it is in all practical languages (two successive semicolons being simply interpreted as the insertion of an empty instruction, causing no harm); but the experimental separator-based language and compiler used to the experiment treated an extra semicolon as an error! As if this were not enough, checking the details of the article reveals that the terminator language is terminator-based for both declarations and instructions, whereas the example delimiter language is only delimiter-based for instructions, but terminator-based for declarations. Talk about a biased experiment! The experiment was bogus and so are the results.

One should not be too harsh about a paper from 1975, when the very idea of systematic experimental studies of programming was novel, and some of its other results are worthy of consideration. But the sad terminator story, even though it only affected a syntax property, should serve as a reminder that we should not accept a view blindly just because someone invokes some empirical study to justify it. We should assess the study itself, its methods and its credibility.

5 Addressing the issues that matter

With this warning in mind, we should still expect empirical software engineering to help us practitioners. It should help address important software engineering problems.

Ideally, I should now list the open issues of software engineering, but I am in no position even to start such a list. All I can do is to give a few examples. They may not be important to you, but they give an idea:

What are the respective values of upfront design and refactoring? How best can we combine these approaches?
Specification and testing are complementary techniques. Specifications are in principles superior to testing in general, but testing remains necessary. What combination of specification and testing works best?
What is the best commit/release technique, and in particular should we use RTC (Review Then Commit, as with Apache originally then Google) or CTR (Commit To Review, as Apache later) [13]?
What measure of code properties best correlates with effort? Many fancy metrics have appeared in the literature over the years, but there is still a nagging feeling among many of us that for all its obvious limitations the vulgar SLOC metrics (Source Lines Of Code) still remains the least bad.
When can a manager decide to stop testing? We did some work on the topic [14], but it is only a start.
Is test coverage a good measure of test quality [15] (spoiler: it is not, but again we need more studies)?

And so on. These examples may not be the cases that you consider most important; indeed what we need is input from many software engineers to help steer empirical software engineering towards the topics that truly matter to the community.

To provide a venue for that discussion, a workshop will take place 10-12 September 2018 (provisional dates) in the Toulouse area, involving many of the tenors in empirical software engineering, with the same title as these two articles: Empirical Answers to Important Software Engineering Questions. The key idea is to start not from the solutions side (the lamppost) but from the actual challenges facing software engineers. It will not just be a traditional publication-oriented meeting but will also include ample time for discussions and joint work.

If you would like to contribute your example “important questions”, please use any appropriate support (responses to this blog, email to me, Facebook, LinkedIn, anything as long as one can find it). Suggestions will be taken into consideration for the workshop. Empirical software engineering has already established itself as a core area of research; it is time feed that research with problems that actually matter to software developers, managers and users

Acknowledgments

These reflections originated in a keynote that I gave at ESEM in Bolzano in 2010 (I am grateful to Barbara Russo and Giancarlo Succi for the invitation). I never wrote up the talk but I dug up the slides [8] since they might contain a few relevant observations. I used some of these ideas in a short panel statement at ESEC/FSE 2013 in Saint Petersburg, and I am grateful to Moshe Vardi for suggesting I should write them up for Communications of the ACM, which I never did.

References and notes

[1] Victor R. Basili: The role of experimentation in software engineering: past, present and future, in 18th ICSE (International Conference on Software Engineering), 1996, see here.

[2] Marvin V. Zelkowitz and Dolores Wallace: Experimental validation in software engineering, International Conference on Empirical Assessment and Evaluation in Software Engineering, March 1997, see here.

[3] Walter F. Tichy: Should computer scientists experiment more?, in IEEE Computer, vol. 31, no. 5, pages 32-40, May 1998, see here.

[4] And EiffelStudio, whose repository goes back to the early 90s and has provided a fertile ground for numerous empirical studies, some of which appear in my publication list.

[5] This compact sentence is how the Kelvin statement is usually abridged, but his thinking was more subtle.

[6] Raymond Lister: After the Gold Rush: Toward Sustainable Scholarship in Computing, Proceedings of 10th conference on Australasian Computing Education Conference, pages 3-17, see here.

[7] Jeffrey D. Ullman: Experiments as research validation: have we gone too far?, in Communications of the ACM, vol. 58, no. 9, pages 37-39, 2015, see here.

[8] Bertrand Meyer, slides of a talk at ESEM (Empirical Software Engineering and Measurement), Bozen/Bolzano, 2010, available here. (Provided as background material only, they are not a paper but just slide support for a 45-minute talk, and from several years ago.)

[9] This matter is analyzed in more detail in section 26.5 of my book Object-Oriented Software Construction, 2nd edition, Prentice Hall. No offense to the memory of Jim Horning, a great computer scientist and a great colleague. Even great computer scientists can be wrong once in a while.

[10] I know this from the source: Jean Ichbiah, the original designer of Ada, told me explicitly that this was the reason for his choice of the terminator convention for semicolons, a significant decision since it was expected that the language syntax would be based on Pascal, a delimiter language.

[11] Gannon & Horning, Language Design for Programming Reliability, IEEE Transactions on Software Engineering, vol. SE-1, no. 2, June 1975, pages 179-191, see here.

[12] This quirk of C and similar languages is not unlike the source of the Apple SSL/TLS bug discussed earlier in this blog under the title Code matters.

[13] Peter C. Rigby, Daniel M. German, Margaret-Anne Storey: Open Source Software Peer Review Practices: a Case study of the Apache Server, in ICSE (International Conference on Software Engineering) 2008, pages 541-550, see here.

[14] Carlo A. Furia, Bertrand Meyer, Manuel Oriol, Andrey Tikhomirov and Yi Wei:The Search for the Laws of Automatic Random Testing, in Proceedings of the 28th ACM Symposium on Applied Computing (SAC 2013), Coimbra (Portugal), ACM Press, 2013, see here.

[15] Yi Wei, Bertrand Meyer and Manuel Oriol: Is Coverage a Good Measure of Testing Effectiveness?, in Empirical Software Engineering and Verification (LASER 2008-2010), eds. Bertrand Meyer and Martin Nordio, Lecture Notes in Computer Science 7007, Springer, February 2012, see here.

Rating: 9.8/10 (4 votes cast)

Rating: +1 (from 1 vote)

Category: Conference, Design by Contract, Empirical Software Engineering, Essay, Language design, Software engineering | Comment

Lampsort

7 December 2014, 15:50

In support of his view of software methodology, Leslie Lamport likes to use the example of non-recursive Quicksort. Independently of the methodological arguments, his version of the algorithm should be better known. In fact, if I were teaching “data structures and algorithms” I would consider introducing it first.

As far as I know he has not written down his version in an article, but he has presented it in lectures; see [1]. His trick is to ask the audience to give a non-recursive version of Quicksort, and of course everyone starts trying to remove the recursion, for example by making the stack explicit or looking for invertible functions in calls. But his point is that recursion is not at all fundamental in Quicksort. The recursive version is a specific implementation of a more general idea.

Lamport’s version — let us call it Lampsort —is easy to express in Eiffel. We may assume the following context:

a: ARRAY [G -> COMPARABLE]        — The array to be sorted.
pivot: INTEGER                                    — Set by partition.
picked: INTEGER_INTERVAL       — Used by the sorting algorithm, see below.
partition (i, j: INTEGER)
……..require      — i..j is a sub-interval of the array’s legal indexes:
……..……..i < j
……..……..i >= a.lower
……..……..j <= a.upper
……..do
……..……..… Usual implementation of partition …
……..ensure     — The expected effect of partition:
……..……..pivot >= i
……..……..pivot < j
……..……..— a [i..j] has been reshuffled so that elements in i..pivot are less than
……..……..— or equal to those in pivot+1 .. j.
……..end

We do not write the implementation of partition since the point of the present discussion is the overall algorithm. In the usual understanding, that algorithm consists of doing nothing if the array has no more than one element, otherwise performing a partition and then recursively calling itself on the two resulting intervals. The implementation can take advantage of parallelism by forking the recursive calls out to different processors. That presentation, says Lamport, describes only a possible implementation. The true Quicksort is more general. The algorithm works on a set not_sorted of integer intervals i..j such that the corresponding array slices a [i..j] are the only ones possibly not sorted; the goal of the algorithm is to make not_sorted empty, since then we know the entire array is sorted. In Eiffel we declare this set as:

not_sorted: SET [INTEGER_INTERVAL]

The algorithm initializes not_sorted to contain a single element, the entire interval; at each iteration, it removes an interval from the set, partitions it if that makes sense (i.e. the interval has more than one element), and inserts the resulting two intervals into the set. It ends when not_sorted is empty. Here it is:

……..from                                 — Initialize interval set to contain a single interval, the array’s entire index range:
……..…..create not_sorted.make_one (a.lower |..| a.upper)….         ..……..
……..invariant
……..…..— See below
……..until
……..…..not_sorted.is_empty                                                            — Stop when there are no more intervals in set
……..loop
……..…..picked := not_sorted.item                                                 — Pick an interval from (non-empty) interval set.
……..……if picked.count > 1 then                                                    — (The precondition of partition holds, see below.)
……..……..…..partition (picked.lower, picked.upper)             — Split, moving small items before & large ones after pivot.
……..……..…..not_sorted.extend (picked.lower |..| pivot)            — Insert new intervals into the set of intervals: first
……..……....…not_sorted.extend (pivot + 1 |..| picked.upper)   — and second.
……..……end
……..…...not_sorted.remove (picked)                                             — Remove interval that was just partitioned.
…….end

Eiffel note: the function yielding an integer interval is declared in the library class INTEGER using the operator |..| (rather than just ..).

The query item from SET, with the precondition not is_empty, returns an element of the set. It does not matter which element. In accordance with the Command-Query Separation principle, calling item does not modify the set; to remove the element you have to use the command remove. The command extend adds an element to the set.

The abstract idea behind Lampsort, explaining why it works at all, is the following loop invariant (see [2] for a more general discussion of how invariants provide the basis for understanding loop algorithms). We call “slice” of an array a non-empty contiguous sub-array; for adjacent slices we may talk of concatenation; also, for slices s and t, s <= t means that every element of s is less than or equal to every element of t. The invariant is:

a is the concatenation of the members of a set slices of disjoint slices, such that:
– The elements of a are a permutation of its original elements.
– The index range of any member of slices having more than one element is in not_sorted.
– For any adjacent slices s and t (with s before t), s <= t.

The first condition (conservation of the elements modulo permutation) is a property of partition, the only operation that can modify the array. The rest of the invariant is true after initialization (from clause) with slices made of a single slice, the full array. The loop body maintains it since it either removes a one-element interval from not_sorted (slices loses the corresponding slice) or performs partition with the effect of partitioning one slice into two adjacent ones satisfying s <= t, whose intervals replace the original one in not_sorted. On exit, not_sorted is empty, so slices is a set of one-element slices, each less than or equal to the next, ensuring that the array is sorted.

The invariant also ensures that the call to partition satisfies that routine’s precondition.

The Lampsort algorithm is a simple loop; it does not use recursion, but relies on an interesting data structure, a set of intervals. It is not significantly longer or more difficult to understand than the traditional recursive version

sort (i, j: INTEGER)
……..require
……..……..i <= j
……..……..i >= a.lower
……..……..j <= a.upper
……..do
……..……if j > i then                    — Note that precondition of partition holds.
……..……..…..partition (i, j)         — Split into two slices s and t such that s <= t.
……..……..…..sort (i, pivot)    — Recursively sort first slice.
……..……..…..sort (pivot+1, j)    — Recursively sort second slice.
……..……end……..…..
……..end

Lampsort, in its author’s view, captures the true idea of Quicksort; the recursive version, and its parallelized variants, are only examples of possible implementations.

I wrote at the start that the focus of this article is Lampsort as an algorithm, not issues of methodology. Let me, however, give an idea of the underlying methodological debate. Lamport uses this example to emphasize the difference between algorithms and programs, and to criticize the undue attention being devoted to programming languages. He presents Lampsort in a notation which he considers to be at a higher level than programming languages, and it is for him an algorithm rather than a program. Programs will be specific implementations guided in particular by efficiency considerations. One can derive them from higher-level versions (algorithms) through refinement. A refinement process may in particular remove or restrict non-determinism, present in the above version of Lampsort through the query item (whose only official property is that it returns an element of the set).

The worldview underlying the Eiffel method is almost the reverse: treating the whole process of software development as a continuum; unifying the concepts behind activities such as requirements, specification, design, implementation, verification, maintenance and evolution; and working to resolve the remaining differences, rather than magnifying them. Anyone who has worked in both specification and programming knows how similar the issues are. Formal specification languages look remarkably like programming languages; to be usable for significant applications they must meet the same challenges: defining a coherent type system, supporting abstraction, providing good syntax (clear to human readers and parsable by tools), specifying the semantics, offering modular structures, allowing evolution while ensuring compatibility. The same kinds of ideas, such as an object-oriented structure, help on both sides. Eiffel as a language is the notation that attempts to support this seamless, continuous process, providing tools to express both abstract specifications and detailed implementations. One of the principal arguments for this approach is that it supports change and reuse. If everything could be fixed from the start, maybe it could be acceptable to switch notations between specification and implementation. But in practice specifications change and programs change, and a seamless process relying on a single notation makes it possible to go back and forth between levels of abstraction without having to perform repeated translations between levels. (This problem of change is, in my experience, the biggest obstacle to refinement-based approaches. I have never seen a convincing description of how one can accommodate specification changes in such a framework without repeating the whole process. Inheritance, by the way, addresses this matter much better.)

The example of Lampsort in Eiffel suggests that a good language, equipped with the right abstraction mechanisms, can be effective at describing not only final implementations but also abstract algorithms. It does not hurt, of course, that these abstract descriptions can also be executable, at the possible price of non-optimal performance. The transformation to an optimal version can happen entirely within the same method and language.

Quite apart from these discussions of software engineering methodology, Lamport’s elegant version of Quicksort deserves to be known widely.

References

[1] Lamport video here, segment starting at 0:32:34.
[2] Carlo Furia, Bertrand Meyer and Sergey Velder: Loop invariants: Analysis, Classification and Examples, in ACM Computing Surveys, September 2014, preliminary text here.

Rating: 7.0/10 (27 votes cast)

Rating: +5 (from 11 votes)

Tags: Lamport, Quicksort, Single-Model Principle
Category: Algorithms, Computer science, Concurrency, Design by Contract, Education, Eiffel, Essay, Programming techniques, Reading notes, Software engineering, Theory | 7 Comments

Empirical answers to fundamental software engineering questions

29 August 2013, 11:10

This is a slightly reworked version of an article in the CACM blog, which also served as the introduction to a panel which I moderated at ESEC/FSE 2013 last week; the panelists were Harald Gall, Mark Harman, Giancarlo Succi (position paper only) and Tony Wasserman.

For all the books on software engineering, and the articles, and the conferences, a remarkable number of fundamental questions, so fundamental indeed that just about every software project runs into them, remain open. At best we have folksy rules, some possibly true, others doubtful, and others — such as “adding people to a late software project delays it further” [1] — wrong to the point of absurdity. Researchers in software engineering should, as their duty to the community of practicing software practitioners, try to help provide credible answers to such essential everyday questions.

The purpose of this panel discussion is to assess what answers are already known through empirical software engineering, and to define what should be done to get more.

“Empirical software engineering” applies the quantitative methods of the natural sciences to the study of software phenomena. One of its tasks is to subject new methods — whose authors sometimes make extravagant and unsupported claims — to objective scrutiny. But the benefits are more general: empirical software engineering helps us understand software construction better.

There are two kinds of target for empirical software studies: products and processes. Product studies assess actual software artifacts, as found in code repositories, bug databases and documentation, to infer general insights. Project studies assess how software projects proceed and how their participants work; as a consequence, they can share some properties with studies in other fields that involve human behavior, such as sociology and psychology. (It is a common attitude among computer scientists to express doubts: “Do you really want to bring us down to the standards of psychology and sociology?” Such arrogance is not justified. These sciences have obtained many results that are both useful and sound.)

Empirical software engineering has been on a roll for the past decade, thanks to the availability of large repositories, mostly from open-source projects, which hold information about long-running software projects and can be subjected to data mining techniques to identify important properties and trends. Such studies have already yielded considerable and often surprising insights about such fundamental matters as the typology of program faults (bugs), the effectiveness of tests and the value of certain programming language features.

Most of the uncontested successes, however, have been from the product variant of empirical software engineering. This situation is understandable: when analyzing a software repository, an empirical study is dealing with a tangible and well-defined artifact; if any of the results seems doubtful, it is possible and sometimes even easy for others to reproduce the study, a key condition of empirical science. With processes, the object of study is more elusive. If I follow a software project working with Scrum and another using a more traditional lifecycle, and find that one does better than the other, how do I know what other factors may have influenced the outcome? And even if I bring external factors under control how do I compare my results with those of another researcher following other teams in other companies? Worse, in a more realistic scenario I do not always have the luxury of tracking actual industry projects since few companies are enlightened enough to let researchers into their developments; how do I know that I can generalize to industry the conclusions of experiments made with student groups?

Such obstacles do not imply that sound results are impossible; studies involving human behavior in psychology and sociology face many of the same difficulties and yet do occasionally yield insights. But these obstacles explain why there are still few incontrovertible results on process aspects of software engineering. This situation is regrettable since it means that projects large and small embark on specific methods, tools and languages on the basis of hearsay, opinions and sometimes hype rather than solid knowledge.

No empirical study is going to give us all-encompassing results of the form “Agile methods yield better products” or “Object-oriented programming is better than functional programming”. We are entitled to expect, however, that they help practitioners assess some of the issues that await every project. They should also provide a perspective on the conventional wisdom, justified or not, that pervades the culture of software engineering. Here are some examples of general statements and questions on which many people in the field have opinions, often reinforced by the literature, but crying for empirical backing:

The effect of requirements faults: the famous curve by Boehm is buttressed by old studies on special kinds of software (large mission-critical defense projects). What do we really lose by not finding an error early enough?
The cone of uncertainty: is that idea just folklore?
What are the successful techniques for shortening delivery time by adding manpower?
The maximum compressibility factor: is there a nominal project delivery time, and how much can a project decrease it by throwing in money and people?
Pair programming: when does it help, when does it hurt? If it has any benefits, are there in quality or in productivity (delivery time)?
In iterative approaches, what is the ideal time for a sprint under various circumstances?
How much requirements analysis should be done at the beginning of a project, and how much deferred to the rest of the cycle?
What predictors of size correlate best with observed development effort?
What predictors of quality correlate best with observed quality?
What is the maximum team size, if any, beyond which a team should be split?
Is it better to use built-in contracts or just to code assertions in tests?

When asking these and other similar questions relating to core aspects of practical software development, I sometimes hear “Oh, but we know the answer conclusively, thanks to so-and-so’s study“. This may be true in some cases, but in many others one finds, in looking closer, that the study is just one particular experiment, fraught with the same limitations as any other.

The principal aim of the present panel is to find out, through the contributions of the panelists which questions have useful and credible empirical answers already available, whether or not widely known. The answers must indeed be:

Empirical: obtained through objective quantitative studies of projects.
Useful: providing answers to questions of interest to practitioners.
Credible: while not necessarily absolute (a goal difficult to reach in any matter involving human behavior), they must be backed by enough solid evidence and confirmation to be taken as a serious input to software project decisions.

An auxiliary outcome of the panel should be to identify fundamental questions on which credible, useful empirical answers do not exist but seem possible, providing fuel for researchers in the field.

To mature, software engineering must shed the folkloric advice and anecdotal evidence that still pervade the field and replace them with convincing results, established with all the limitations but also the benefits of quantitative, scientific empirical methods.

Note

[1] From Brooks’s Mythical Man-Month.

Rating: 8.7/10 (10 votes cast)

Rating: +3 (from 5 votes)

Category: Empirical Software Engineering, Essay, General technology, Software engineering | 2 Comments

Conferences: Publication, Communication, Sanction

6 February 2013, 18:18

(This article was first published in the Communications of the ACM blog.)

A healthy discussion is taking place in the computer science community on our publication culture. It was spurred by Lance Fortnow’s 2009 article [1]; now Moshe Vardi has taken the lead to prepare a report on the topic, following a workshop in Dagstuhl in November [2]. The present article and one that follows (“The Waves of Publication”) are intended as contributions to the debate.

One of the central issues is what to do with conferences. Fortnow had strong words for the computer science practice of using conferences as the selective publication venues, instead of relying on journals as traditional scientific disciplines do. The criticism is correct, but if we look at the problem from a practical perspective it is unlikely that top conferences will lose their role as certifiers of quality. This is not a scientific matter but one of power. People in charge of POPL or OOPSLA have decisive sway over the careers (one is tempted to say the lives) of academics, particularly young academics, and it is a rare situation in human affairs that people who have critical power voluntarily renounce it. Maybe the POPL committee will see the light: maybe starting in 2014 it will accept all reasonable papers somehow related to “principles of programming languages”, turn the event itself into a pleasant multi-track community affair where everyone in the field can network, and hand over the selection and stamp-of-approval job to a journal such as TOPLAS. Dream on; it is not going to happen.

We should not, however, remain stuck with the status quo and all its drawbacks. That situation is unsustainable. As a single illustration, consider the requirement, imposed by all conferences, that having a paper pass the refereeing process is not enough: you must also register. A couple of months before the conference, authors of accepted papers (at least, they thought their paper was accepted) receive a threatening email telling them that unless they register and pay their paper will not be published after all. Now assume an author, in a field where a conference is the top token of recognition, has his visa application rejected by the country of the conference — a not so uncommon situation — and does not register. (Maybe he does not mind paying the fee, but he does not want to lie by pretending he is going to attend whereas he knows he will not.) He has lost his opportunity for publication and perhaps severely harmed this career. What have such requirements to do with science?

To understand what can be done, we need to analyze the role of conferences. In an earlier article [3] I described four “modes and uses” of publication: Publication, Exam, Business and Ritual. From the organizers’ viewpoint, ignoring the Business and Ritual aspects although they do play a significant role, a conference has three roles: Publication, Communication and Sanction. The publication part corresponds to the proceedings of the conference, which makes articles available to the community at large, not just the conference attendees. The communication part only addresses the attendees: it includes the presentation of papers as well as all other interactions made possible by being present at a conference. The sanction part (corresponding to the “exam” part of the more general classification) is the role of a renowned conference as a stamp of approval for the best work of the moment.

What we should do is separate these roles. A conference can play all three roles, but it can also select two of them, or even just one. A well-established, prestigious conference will want to retain its sanctioning role: accepted papers get the stamp of approval. It will also remain an event, where people meet. And it may distribute proceedings. But the three roles can also be untied:

Publication is the least critical, and can easily be removed from the other two, since everything will be available on the Web. In fact the very notion of proceedings is quickly becoming fuzzy: more and more conferences save money by not distributing printed proceedings to attendees, sometimes not printing any proceedings at all; and some even spare themselves the production of a proceedings-on-a-stick, putting the material on the Web instead. A conference may still decide to have its own proceedings, or it might outsource that part to a journal. Each conference will make these decisions based on its own culture, tradition, ambition and constraints. For authors, the decision does not particularly matter: what counts are the sanction, which is provided by the refereeing process, and the availability of their material to the world, which will be provided in any scenario (at least in computer science where we have, thankfully, the permission to put our papers on our own web sites, an acquired right that our colleagues from other disciplines do not all enjoy).
Separating sanction from communication is a natural step. Acceptance and participation are two different things.

Conference organizers should not be concerned about lost revenue: most authors will still want to participate in the conference, and will get the funding since institutions are used to pay for travel to present accepted papers; some new participants might come, attracted by more interaction-oriented conference styles; and organizers can replace the requirement to register by a choice between registering and paying a publication fee.

Separating the three roles does not mean that any established conference renounces its sanctioning status, acquired through the hard work of building the conference’s reputation, often over decades. But everyone gets more flexibility. Several combinations are possible, such as:

Sanction without communication or publication: papers are submitted for certification through peer-review, they are available on the Web anyway, and there is no need for a conference.
Publication without sanction or communication: an author puts a paper on his web page or on a self-publication site such as ArXiv.
Sanction and communication without publication: a traditional selective conference, which does not bother to produce proceedings.
Communication without sanction: a working conference whose sole aim is to advance the field through presentations and discussions, and accepts any reasonable submission. It may be by invitation (a kind of advance sanction). It may have proceedings (publication) or not.

Once we understand that the three roles are not inextricably tied, the stage is clear for removal for some impediments to a more effective publication culture. Some, not all. The more general problem is the rapidly changing nature of scientific publication, what may be called the concentric waves of publication. That will be the topic of the next article.

References

[1] Lance Fortnow: Time for Computer Science to Grow Up, in Communications of the ACM, Vol. 52, no. 8, pages 33-35, 2009, available here.

[2] Dagstuhl: Perspectives Workshop: Publication Culture in Computing Research, see here.

[3] Bertrand Meyer: The Modes and Uses of Scientific Publication, article on this blog, 22 November 2011, see here.

Rating: 10.0/10 (1 vote cast)

Rating: +1 (from 1 vote)

Category: Computer science, Conference, Essay, Publications, Research evaluation | 1 Comment

A fundamental duality of software engineering

14 October 2012, 20:29

A couple of weeks ago I proposed a small quiz. (I also stated that the answer would come “on Wednesday” — please understand any such promise as “whenever I find the time”. Sorry.) Here is the answer.

The quiz was:

I have a function:

For 0 it yields 0.

For 1 it yields 1.

For 2 it yields 4.

For 3 it yields 9.

For 4 it yields 16.

What is the value for 5?

Shortly thereafter I added a hint: the value for 5 is 25, and changed the question to: “What is the value for 6?”. For good measure we can also ask about the value for 1000. Now compare your answer to what follows.

A good answer for the value at 6 is: 34 . The function in this case is -10 + 5 x + |2 x – 3| + |2 x -7|. It matches the values for the given inputs.

The value for 1000 is 8980:

Another good answer at position 6 is 35.6. It comes up if we assume the function is over reals rather than integers; then a possible formula, which correlates very well (R-square of 0.9997) with the values at the given inputs, is:

869.42645566111 (1 – 0.4325853145802 e^{-.0467615868913719 (x – 17.7342512233011)})^{2.3116827277657443}

with a quite different asymptotic behavior, giving the value 869.4 at position 1000:

Some readers might have thought of another possibility, the square function x², which again matches all the given values:

So which of these answers is right? Each is as good as the others, and as bad. There is in particular no reason to believe that the values given in the quiz’s statement suggest the square function. Any function that fits the given values, exactly (if we stick to integers) or approximately (with reals as simulated on a computer) is an equally worthy candidate. Six inputs, or six thousand, do not resolve the question. At best they are hints.

This difference between a hint and a solution is at the core of software engineering. It is, for example, the difference between a test and a specification. A test tells us that the program works for some values; as Dijkstra famously pointed out, and anyone who has developed a serious program has experienced, it does not tell us that it will work for others. The more successful tests, the more hints; but they are still only hints. I have always wondered whether Dijkstra was explicitly thinking of the Popperian notion of falsifiability: no number of experiments will prove a physical theory (although a careful experiment may boost the confidence in the theory, especially if competing theories fail to explain it, as the famous Eddington expedition did for relativity in 1919 [1]); but a single experiment can disprove a theory. Similarly, being told that our function’s value at 6 is 34 disqualifies the square function and the last one (the exponential), but does not guarantee that the first function (the linear combination) is the solution.

The specification-testing duality is the extension to computer science of the basic duality of logic. It starts with the elementary boolean operators: to prove a or b it suffices to establish a or to establish b; and to disprove a and b it suffices to show that a does not hold or to show that b does not hold. The other way round, to disprove a or b we have to show that a does not hold and to show that b does not hold; to prove that a and b holds, we have to show that a holds and to show that b holds.

Predicate calculus generalizes or to ∃, “there exists”, and and to ∀, “for all”. To prove ∃ x | p (x) (there is an x of which p holds) it suffices to find one value a such that p (a); let’s be pretentious and say we have “skolemized” x. To disprove∀ x | p (x) (p holds of all x) it suffices to find one value for which p does not hold.

In software engineering the corresponding duality is between proofs and tests, or (equivalently) specifications and use cases. A specification is like a “for all”: it tells us what must happen for all envisioned inputs. A test is like a “there exists”: it tells us what happens for a particular input and hence, as in predicate calculus, it is interesting as a disproof mechanism:

A successful test brings little information (like learning the value for 5 when trying to figure out what a function is, or finding one true value in trying to prove a ∀ or a false value in trying to prove a ∃).
An unsuccessful test brings us decisive information (like a false value for a ∀): the program is definitely not correct. It skolemizes incorrectness.

A proof, for its part, brings the discussion to an end when it is successful. In practice, testing may still be useful in this case, but only testing that addresses issues not covered by the proof:

Correctness of the compiler and platform, if not themselves proved correct.
Correctness the proof tools themselves, since most practical proofs require software support.
Aspects not covered by the specification such as, typically, performance and usability.

But for the properties it does cover the proof is final.

It is as foolish, then, to use tests in lieu of specifications as it would be to ignore the limitations of a proof. Agile approaches have caused much confusion here; as often happens in the agile literature [2], the powerful insight is mixed up with harmful advice. The insight, which has significantly improved the practice of software development, is that the regression test suite is a key asset of a project and that tests should be run throughout. The bad advice is to ditch upfront requirements and specifications in favor of tests. The property that tests lack and specifications possess is generality. A test is an instance; a thousand tests can never be more than a thousand instances. As I pointed out in a short note in EiffelWorld (the precursor to this blog) a few years ago [3], the relationship is not symmetric: one can generate tests from a specification, but not the other way around.

The same relationship holds between use cases and requirements. It is stunning to see how many people think that use cases (scenarios) are a form of requirements. As requirements they are as useless as one or ten values are to defining a function. Use cases are a way to complement the requirements by describing the system’s behavior in selected important cases. A kind of reality check, to ensure that whatever abstract aims have been defined for the system it still covers the cases known to be of immediate interest. But to rely on use cases as requirements means that you will get a system that will satisfy the use cases — and possibly little else.

When I use systems designed in recent years, in particular Web-based systems, I often find myself in a stranglehold: I am stuck with the cases that the specifiers thought of. Maybe it’s me, but my needs tend, somehow, to fall outside of these cases. Actually it is not just me. Not long ago, I was sitting close to a small-business owner who was trying to find her way through an insurance site. Clearly the site had a planned execution path for employees, and another for administrators. Problem: she was both an employee and the administrator. I do not know how the session ended, but it was a clear case of misdesign: a system built in terms of standard scenarios. Good specification performs an extra step of abstraction (for example using object-oriented techniques and contracts, but this is for another article). Skipping this step means forsaking the principal responsibility of the requirements phase: to generalize from an analysis of the behavior in known cases to a definition of the desired behaviors in all relevant cases.

Once more, as everywhere else in computer science [4], abstraction is the key to solid results that stand the test of time. Definitely better than judging a book by its cover, inferring a function by its first few values, verifying a program by its tests, or specifying a system by its use cases.

References

[1] See e.g. a blog article: Einstein and Eddington, here.

[2] Bertrand Meyer: Agile! The Good, the Hype and the Ugly, 2013, to appear.

[3] Bertrand Meyer: Test or spec? Test and spec? Test from spec!, EiffelWorld column, 2004 available here.

[4] Jeff Kramer: Is Abstraction the Key to Computer Science?, in Communications of the ACM, vol. 50, no. 4, April 2007, pages 36-42, available from CiteSeer here

Rating: 9.5/10 (32 votes cast)

Rating: +14 (from 14 votes)

Category: Computer science, Design by Contract, Essay, Formal methods and proofs, Software verification, Testing | 11 Comments

The most beautiful monument of Europe

4 June 2012, 15:29

The most beautiful of all monuments in Europe is not the palace of Versailles, notwithstanding the Hall of Mirrors with its endless reflections of chandeliers and pillars, notwithstanding the fairy-tale grace of the Trianons, notwithstanding the sumptuous Hall of Congresses where the 1919 peace conference put a formal end … read the entire text.

Le plus beau des monuments d’Europe n’est pas Versailles, malgré sa Galerie des Glaces où se reflètent à l’infini les lustres et les pilastres, malgré ses Trianons, malgré son imposante Salle du Congrès où prit officiellement fin, en 1919, … lire le texte complet en français.

Yes, I know, this is supposed to be a technology blog.

There are, however, times like right now when intellectuals should not remain silent — especially engineers and scientists.

I wrote the text referenced above several years ago; I don’t remember the exact date but it sounds very much Maastricht-aftermath. I have circulated it to a few friends, but think the time has come to publish it.

I am quite aware that unfolding events may make it look ridiculous. And then what? I will have done my tiny bit to bring people back to reason.

Note: I do not remember the provenance of the photograph. If informed, I would be happy to add the proper acknowledgment.

Rating: 9.3/10 (13 votes cast)

Rating: +12 (from 12 votes)

Tags: Monument
Category: Essay, Europe, Politics, Software engineering | 1 Comment

Domain Theory: the forgotten step in program verification

11 April 2012, 15:04

Program verification is making considerable progress but is hampered by a lack of abstraction in specifications. A crucial step is, almost always, absent from the process; this omission is the principal obstacle to making verification a standard component of everyday software development.

1. Steps in software verification

In the first few minutes of any introduction to program verification, you will be told that the task requires two artifacts: a program, and a specification. The program describes what executions will do; the specification, what they are supposed to do. To verify software is to ascertain that the program matches the specification: that it does is what it should.

The consequence usually drawn is that verification consists of three steps: write a specification, write a program, prove that the program satisfies the specification. The practical process is of course messier, if only because the first two steps may occur in the reverse order and, more generally, all three steps are often intertwined: the specification and the program influence each other, in particular through the introduction of “verification conditions” into the program; and initial proof attempts will often lead to changes in both the specification and the program. But by and large these are the three accepted steps.

Such a description misses a fourth step, a prerequisite to specification that is essential to a scalable verification process: Domain Theory. Any program addresses a specific domain of discourse, be it the domain of network access and communication for a mobile phone system, the domain of air travel for a flight control system, of companies and shares for a stock exchange system and so on. Even simple programs with a limited scope, such as the computation of the maximum of an array, use a specific domain beyond elementary mathematics. In this example, it is the domain of arrays, with their specific properties: an array has a range, a minimum and maximum indexes in that range, an associated sequence of values; we may define a slice a [i..j], ask for the value associated with a given index, replace an element at a given index and so on. The Domain Theory provides a formal model for any such domain, with the appropriate mathematical operations and their properties. In the example the operations are the ones just mentioned, and the properties will include the axiom that if we replace an element at a certain index i with a value v then access the element at an index j, the value we get is v if i = j, and otherwise the earlier value at j.

2. The role of a Domain Theory

The task of devising a Domain Theory is to describe such a domain of reference, in the spirit of abstract data types: by listing the applicable operations and their properties. If we do not treat this task as a separate step, we end up with the kind of specification that works for toy examples but quickly becomes unmanageable for real-life applications. Most of the verification literature, unfortunately, relies on such specifications. They lack abstraction since they keep using the lowest-level mathematical objects and constructs, such as numbers and quantified expressions. They are to specification what assembly language is to modern programming.

Dines Bjørner has for a long time advocated a closely related idea, domain engineering; see for example his book in progress [1]. Unfortunately, he does not take advantage of modularization through abstract data types; the book is an example of always-back-to-the-basics specification, resorting time and again to fully explicit specifications based on a small number of mathematical primitives, and as a consequence making formal specification look difficult.

3. Maximum computed from both ends

As a simple example of modeling through an abstract theory consider an algorithm for computing the maximum of an array. We could use the standard technique that goes through the array one-way, but for variety let us take the algorithm that works from both ends, moving two integer cursors towards each other until they meet. (This example was used in a verification competition at a recent conference, I forgot which one.) The code looks like this:

The specification, expressed by the postcondition (ensure) should state that Result is the maximum of the array; the loop invariant will be closely related to it. How do we express these properties? The obvious way is not the right way. It states the postcondition as something like

∀ k: Z | (k ≥ a.lower ∧ k ≤ a.upper) ⇒ a [k] ≤ Result

∃ k: Z | k ≥ a.lower ∧ k ≤ a.upper ∧ a [k] = Result

In words, Result is at least as large as every element of the array, and is equal to at least one of the elements of the array. The invariant can also be expressed in this style (try it).

The preceding specification expresses the desired property, but it is of an outrageously lower level than called for. The notion of maximum is a general one for arrays over an ordered type. It can be computed through many different algorithms in addition to the one shown above, and exists independently of these algorithms. The detailed, assembly-language-like definition of its properties should not have to be repeated in every case. It should be part of the Domain Theory for the underlying notion, arrays.

4. A specification at the right level of abstraction

In a Domain Theory for arrays of elements from an ordered set, one of the principal operations is maximum, satisfying the above properties. The definition of maximum through these properties belongs at the Domain Theory level. The Domain Theory should include that definition, independent of any particular computational technique such as two_way_max. Then the routine’s postcondition, relying on this notion from the Domain Theory, becomes simply

Result = a.maximum

The application of this approach to the loop invariant is particularly interesting. If you tried to write it at the lowest level, as suggested above, you should have produced something like this:

a.lower ≤ i ≤ j ≤ a.upper

∃ k: Z | k ≥ i ∧ k ≤ j ∧ (∀ l: Z | l ≥ a.lower ∧ l ≤ a.upper ⇒ a [l] ≤ a [k])

The first clause is appropriate but the rest is horrible! With its nested quantified expressions it gives an impression of great complexity for a property that is in fact straightforward, simple enough in fact to be explained to a 10-year-old: the maximum of the entire array can be found between indexes i and j. In other words, it is also the maximum of the array slice going from i to j. The Domain Theory will define the notion of slice and enable us to express the invariant as just

a.lower ≤ i ≤ j ≤ a.upper — This bounding clause remains

a.maximum = (a [i..j ]).maximum

(where we will write the slice a [i..j ] as a.slice (i, j ) if we do not have mechanisms for defining special syntax). To verify the routine becomes trivial: on loop exit the invariant still holds and i = j, so the maximum of the entire array is given by the maximum of the single-element slice a [i..i ], which is the value of its single element a [i ]. This last property — the maximum of a single-element array is its single value — is independent of the verification of any particular program and should be proved as a little theorem of the Domain Theory for arrays.

The comparison between the two versions is striking: without Domain Theory, we are back to the most tedious mathematical manipulations again and again; simple, clear properties look complicated and obscure. This just for a small example on basic data structures; now think what it will be for a complex application domain. Without a first step of formal modeling to develop a Domain Theory, no realistic specification and verification process is realistic.

Although the idea is illustrated here through examples of individual routines, the construction of a Domain Theory should usually occur, in an object-oriented development process, at the level of a class: the embodiment of an abstract data type, which is at the appropriate level of granularity. The theory applies to objects of a given type, and hence will be used for the verification of all operations of that type. This observation justifies the effort of devising a Domain Theory, since it will benefit a whole set of software elements.

5. Components of a Domain Theory

The Domain Theory should include the three ingredients illustrated in the example:

Operations, modeled as mathematical functions (no side effects of course, we are in the world of specification).
Axioms characterizing the defining properties of these operations.
Theorems, characterizing other important properties.

This approach is of course nothing else than abstract data types (the same thing, although few people realize it, as object-oriented analysis). Even though ADTs are a widely popularized notion, supported for example by tools such as CafeOBJ [2] and Maude [3], it is generally not taken to its full conclusions; in particular there is too often a tendency to define every new ADT from scratch, rather than building up libraries of reusable high-level mathematical components in the O-O spirit of reuse.

6. Results, not just definitions

In devising a Domain Theory with the three kinds of ingredient listed above, we should not forget the last one, the theorems! The most depressing characteristic of much of the work on formal specification is that it is long on definitions and short on results, while good mathematics is supposed to be the reverse. I think people who have seriously looked at formal methods and do not adopt them are turned off not so much by the need to use mathematics but by the impression they get little value for it.

That is why Eiffel contracts do get adopted: even if it’s just for testing and debugging, people see immediate returns. It suffices for a programmer to have caught one bug as the violation of a simple postcondition to be convinced for life and lose any initial math-phobia.

7. Quantifiers are evil

As we go beyond simple contract properties — this argument must be positive, this reference will not be void — the math needs to be at the same level of abstraction to which, as modern programmers, we are accustomed. For example, one should always be wary of program specifications relying directly on quantified expressions, as in the low-level variants of the postcondition and loop invariant of the two_way_max routine.

This is not just a matter of taste, as in the choice in logic [4] between lambda expressions (more low-level but also more immediately understandable) and combinators (more abstract but, for many, more abstruse). We are talking here about the fundamental software engineering problem of scalability; more generally, of the understandability, extendibility and reusability of programs, and the same criteria for their specification and verification. Quantifiers are of course needed to express fundamental properties of a structure but in general should not directly appear in program assertions: as the example illustrated, their level of abstraction is lower than the level of discourse of a modern object-oriented program. If the rule — Quantifiers Considered Harmful — is not absolute, it must be pretty close.

Quantified expressions, “All elements of this structure possess this property” and “Some element of this structure possesses this property” — belong in the description of the structure and not in the program. They should appear in the Domain Theory, not in the verification. If you want to express that a hash table search found an element of key K, you should not write

(Result = Void ∧ (∀ i: Z | i ≥ a.lower ∧ i ≤ a.upper ⇒ a.item (i).key ≠ K))
∨
(Result ≠ Void ∧ (∀ i: Z | i ≥ a.lower ∧ i ≤ a.upper ∧ a.item (i).key = K ∧ Result = a.item (i))

but

Result /= Void ⇔ (Result ∈ a.elements_of_key (K))

The quantified expressions will appear in the Domain Theory for the corresponding structure, in the definition of such domain properties as elements_of_key. Then the program’s specification — the contracts to be verified — can rely on concepts that make sense to the programmer; the verification will take advantage of theorems that have been proved independently since they belong to the Domain Theory and do not depend on individual programs.

8. Even the simplest examples…

Practical software verification requires Domain Theory even in the simplest cases, including those often used as purely academic examples. Perhaps the most common (and convenient) way to explain the notion of loop invariant is Euclid’s algorithm to compute the greatest common divisor (gcd) of two numbers (with a structure remarkably similar to that of two_way_max):

I have expressed the postcondition using a concept from an assumed Domain Theory for the underlying problem: gcd, the mathematical function that yields the greatest common divisor of two integers. Many specifications I have seen go back to the basics, with something like this (using \\ for integer remainder):

a \\ Result = 0 ∧ b \\ Result = 0 ∧ ∀ i: N | (a \\ i = 0) ∧ (b \\ i = 0) ⇒ i ≤ Result

This is indeed the definition of what it means for Result to be the gcd of a and b (it divides a, it divides b, and is greater than any other integer that also has these two properties). But it makes no sense to include such a detailed mathematical property in the specification of a program element. It belongs in the domain theory, where it will serve as the definition of a function gcd, which we can then use directly in the specification of the program.

Note how the invariant makes the necessity of the Domain Theory approach even more clear: try to express it in the basic mathematical form, not using the function gcd, It can be done, but the result is typical of the high complexity to usefulness ratio of traditional formal specifications mentioned above. Instead, the invariant that I have included in the program text above says exactly what there is to say, clearly and concisely: at each iteration, the gcd of our two temporary values, i and j, is the result that we are seeking, the gcd of the original values a and b. On exit from the loop, when i and j are equal, their common value is that result.

It is also thanks to the Domain Theory modeling that the verification of the program — consisting of proving that the stated property is indeed invariant — will be so simple: as part of the theory, we should have the two little theorems

i > j > 0 ⇒ gcd (i, j) = gcd (i –j, j)
gcd (i, i) = i

which immediately show the implementation to be correct.

Inside of any big, fat, messy, quantifier-ridden specification there is a simple, elegant and clear Domain-Theory-based specification desperately trying to get out. Find it and use it.

9. From Domain Theory to domain library

One of the reasons most people working on program verification have not used the division into levels of discourse described here, with a clear role for developing a Domain Theory, is that they lack the appropriate notational support. Mathematical notation is of course available, but we are talking about programs a general verification framework cannot resort to a new special notation for every new application domain.

This is one of the places where Eiffel provides a consistent solution, through its seamless approach to integrating programs and specifications in a single notation. Thanks to mechanisms such as deferred classes (classes that describe concepts through detailed specifications without committing to an implementation), Eiffel is as much for specification as for design and implementation; a Domain Theory can be expressed though a set of deferred Eiffel classes, which we may call a domain library. The classes in a domain library should not just be deferred, meaning devoid of implementation; they should in addition describe stateless operations only — queries, not commands — since they are modeling purely mathematical concepts.

An earlier article in this blog [5] outlined the context of our verification work: the EVE project (Eiffel Verification Environment), a practical approach to integrating software verification in the day-to-day practice of modern software development, with the slogan ““Verification As a Matter Of Course”. In this project we have applied the idea of Domain Theory by building a domain library covering fundamental concepts of set theory, including functions and relations. This is the Mathematical Model Library (MML) [6, 7], which we use to verify the new data structure library EiffelBase 2 using specifications at the appropriate level of abstraction.

MML is in fact useful for the specification of a wide variety of programs, since almost every application area can benefit from the general concepts of set, subset, relation and such. But to cover a specific application domain, say flight traffic control, MML will generally not suffice; you will need to devise a Domain Theory that mathematically models the target domain, and may express it in the form of a domain library written in the same general spirit as MML: all deferred, stateless, focused on high-level abstractions.

It is one of the attractions of Eiffel that you can express such a theory and library in the same notation as the programs that will use it — more precisely in a subset of that notation, since the specification classes do not need the imperative constructs of the language such as instructions and attributes. Then both the development process and the verification use a seamlessly integrated set of notations and techniques, and all use the same tools from a modern IDE, in our case EiffelStudio, for browsing, editing, working with graphical repreentation, metrics etc.

10. DSL libraries for specifications

A mechanism to express Domain Theories is to a general specification mechanism essentially like a Domain Specific Language (DSL) is to a general programming language: a specialization for a particular domain. Domain libraries make the approach practical by:

Embedding the specification language in the programming language.
Fundamentally relying on reuse, in the best spirit of object technology.

This approach is in line with the one I presented for handling DSLs in an earlier article of this blog [8] (thanks, by the way, for the many comments received, some of them posted here and some on Facebook and LinkedIn where the post triggered long discussions). It is usually a bad idea to invent a new language for a new application domain. A better solution is to rely on libraries, by taking advantage of the power of object-oriented mechanisms to model (in domain libraries) and implement (for DSLs) the defining features of such a domain, and to make the result widely reusable. The resulting libraries are purely descriptive in the case of a domain library expressing a Domain Theory, and directly usable by programs in the case of a library embodying a DSL, but the goal is the same.

11. A sound and necessary engineering practice

Many ideas superficially look similar to Domain Theory: domain engineering as mentioned above, “domain analysis” as widely discussed in the requirements literature, model-driven development, abstract data type specification… They all start from some of the same observations, but Domain Theory as described in this article is something different: a systematic approach to modeling an arbitrary application domain mathematically, which:

Describes the concepts through applicable operations, axioms and (most importantly) theorems.
Expresses these elements in an applicative (side-effect free, i.e. equivalent to pure mathematics) subset of the programming language, for direct embedding in program specifications.
Relies on the class mechanism to structure the results.
Collects the specifications into specification libraries and promotes the reuse of specifications in the same way we promote software reuse.
Uses the combination of these techniques to ensure that program specifications are at a high level of abstraction, compatible with the programmers’ view of their software.
Promotes a clear and effective verification process.

The core idea is in line with standard engineering practices in disciplines other than software: to build a bridge, a car or a chip you need first to develop a sound model of the future system and its environment, using any useful models developed previously rather than always going back to elementary textbook mathematics.

It seems in fact easier to justify doing Domain Analysis than to justify not doing it. The power of expression and abstraction of our programs has grown by leaps and bounds; it’s time for our specifications to catch up.

References

[1] Dines Bjørner: From Domains to Requirements —The Triptych Approach to Software Engineering, “to be submitted to Springer”, available here.

[2] Kokichi Futatsugi and others: CafeObj page, here.

[3] José Meseguer and others: Maude publication page, here.

[4] J. Roger Hindley, J. P. Seldin: Introduction to Combinators and l-calculus, Cambridge University Press, 1986.

[5] Verification As a Matter Of Course, earlier article on this blog (March 2010), available here.

[6] Bernd Schoeller, Tobias Widmer and Bertrand Meyer. Making specifications complete through models, in Architecting Systems with Trustworthy Components, eds. Ralf Reussner, Judith Stafford and Clemens Szyperski, Lecture Notes in Computer Science, Springer-Verlag, 2006, pages 48-70, available here.

[7] Nadia Polikarpova, Carlo A. Furia and Bertrand Meyer: Specifying Reusable Components, in VSTTE’10: Verified Software: Theories, Tools and Experiments, Edinburgh, August 2010, Lecture Notes in Computer Science, Springer-Verlag, available here.

[8] Never Design a Language, earlier article on this blog (January 2012), available here.

Rating: 9.3/10 (12 votes cast)

Rating: +7 (from 7 votes)

Category: Education, Essay, Language design, Object technology, Programming techniques, Software engineering, Software process, Software verification, Software design | 5 Comments

Never design a language

31 January 2012, 04:26

It is a common occurrence in software development. Someone says: “We should design a language”. The usual context is that some part of the development requires a rich functionality set, and it appears appropriate to provide a flexible solution through a specialized language. As an example, in the development of an airline’s frequent flyer program on which I once worked the suggestion came to design a “Flyer Award Language” , with instructions appropriate for that application domain: record a trip, redeem an award, provide a statement of available miles and so on. A common term for such notations is DSL, for Domain-Specific Language.

Designing a language in such a context is almost always a bad idea (and I am not sure why I wrote “almost”). Languages are endless objects of discussion, usually on the least important aspects, which are also the most visible and those on which everyone has a strong opinion: concrete syntactic properties. People might pretend otherwise (“let’s not get bogged down on syntax, this is just one possible form”) but syntax is what the discussions will get bogged down to — keywords or symbols, this order or that order of operands, one instruction with several variants vs. several instructions… — at the expense of discussing the fundamental issues of functionality.

Worse yet, even if a language will be part of the solution it is usually just one facet to the solution. As was already explained in detail in [1], any useful functionality set will naturally be useful through several interfaces: a textual notation with concrete syntax may be one of them, but other possible ones include an API (Abstract Program Interface) for use from other software elements, a Graphical User Interface, a web user interface, yet another for web services (typically WSDL or some other XML or JSON format).

In such cases, starting with a concrete textual language is pretty silly, since it cannot yield the others directly (it would have to be parsed and further analyzed, which does not make sense). Of all the kinds of interface listed, the most fundamental one is the API: it describes the raw functionality, excluding any choice of syntax but including, thanks to contracts, elements of semantics. For example, a class AWARD in our frequent flyer application might include the feature

             redeem_for_upgrade (c: CUSTOMER; f : FLIGHT)
                   — Upgrade c to next class of service on f.
                       require
                                    c /= holder implies holder.allowed_substitute (c)
                                    f.permitted_for_upgrade (Current)
                                    c.booked ( f )
                        ensure
                                    c.class_of_service ( f ) = old c.class_of_service ( f ) + 1

There is of course no implementation as this declaration only specifies an interface, but it says what needs to be said: to redeem the award for an upgrade, the intended customer must be either the holder of the award or an allowed substitute; the flight must be available for an upgrade with the current award (including the availability of enough miles); the intended customer must already be booked on the flight; and the upgrade will be for the next class of service.

These details are the kind of things that need to be discussed and agreed before the API is finalized. Then one can start discussing about a textual form (a DSL), a graphical interface, a web services interface. They all consist of relatively simple layers to be superimposed on a solidly defined and precisely specified basis. Once you have that basis, you can have all the fun you like arguing over everyone’s favorite forms of concrete syntax; it cannot hurt the project any more. Having these discussions early, at the expense of the more fundamental issues, is a great danger.

One of the key rules for successful software construction — as for many other ventures of course, especially in science and technology — is to distinguish the essential from the auxiliary, and consequently to devote proper attention to the essential issues while avoiding disputations of auxiliary issues. To define functionality, API is essential; language is auxiliary.

So when should you design a language? Never. Well, hardly ever.

Reference

[1] Bertrand Meyer: Introduction to the Theory of Programming Languages, Prentice Hall, 1990.

Rating: 7.9/10 (18 votes cast)

Rating: +8 (from 16 votes)

Tags: API, Domain-Specific Language, DSL
Category: Computer science, Design by Contract, Eiffel, Essay, Language design, Programming techniques, Software engineering, Software design, Theory, User interface design | 5 Comments

The Modes and Uses of Scientific Publication

22 November 2011, 23:26

(This article was initially published in the CACM blog.)
Publication is about helping the advancement of humankind. Of course.

Let us take this basis for granted and look at the other, possibly less glamorous aspects.

Publication has four modes: Publicity; Exam; Business; and Ritual.

1. Publication as Publicity

The first goal of publication is to tell the world that you have discovered something: “See how smart I am!” (and how much smarter than all the others out there!). In a world devoid of material constraints for science, or where the material constraints are handled separately, as in 19^th-century German universities where professors were expected to fund their own labs, this would be the only mode and use of publication. Science today is a more complex edifice.

A good sign that Publication as Publicity is only one of the modes is that with today’s technology we could easily skip all the others. If all we cared about were to make our ideas and results known, we would simply put out our papers on ArXiv or just our own Web page. But almost no one stops there; researchers submit to conferences and journals, demonstrating how crucial the other three modes are to the modern culture of science.

2. Publication as Exam

Academic careers depend on a publication record. Actually this is not supposed to be the case; search and tenure committees are officially interested in “impact,” but any candidate is scared of showing a short publication list where competitors have tens or (commonly) hundreds of items.

We do not just publish; we want to be chosen for publication. Authors are proud of the low acceptance rates of conferences at which their papers have been accepted; in the past few years it has in fact become common practice, in publication lists attached to CVs, to list this percentage next to each accepted article. Acceptance rates are carefully tracked; see for example [2] for software engineering.

As Jeff Naughton has pointed out [1], this mode of working amounts to giving researchers the status of students forced to take exams again and again. Maybe that part is inevitable; the need to justify ourselves anew every morning may be an integral part of being a scientist, especially one funded by other people’s money. Two other consequences of this phenomenon are, I believe, more damaging.

The first risk directly affects the primary purpose of publication (remember the advancement of humankind?): a time-limited review process with low acceptance rates implies that some good papers get rejected and some flawed ones accepted. Everyone in software engineering knows (and recent PC chairs have admitted) that getting a paper accepted at the International Conference on Software Engineering is in part a lottery; with an acceptance rate hovering around 13%, this is inevitable. The mistakes occur both ways: papers accepted or even getting awards, then shown a few months later to be inaccurate; and innovative papers getting rejected because some sentence rubbed the referees the wrong way, or some paper was not cited. With a 4-month review cycle, and the next deadline coming several months later, the publication of a truly important result can be delayed significantly.

The second visible damage is publication inflation. Today’s research environment channels productive research teams towards an LPU (Least Publishable Unit) publication practice, causing an explosion of small contributions and the continuous decrease of the ratio of readers to writers. When submitting a paper I have always had, as my personal goal, to be read; but looking at the overall situation of computer science publication today suggests that this is not the dominant view: the overwhelming goal of publication is publication.

3. Publication as Business

Publishing requires an infrastructure, and money plays a role. Conferences in particular are a business. They have a budget to balance, not always an easy task, although a truly successful conference can be a big money-maker for its sponsor, commercial or non-profit. The financial side of conference publication has its consequences on authors: if you do not pay your fees, not only will you be unable to participate, but your paper will not be published.

One can deplore these practices, in particular their effect on authors from less well-endowed institutions, but they result from today’s computer science publication culture with its focus on the conference, what Lance Fortnow has called “A Journal in a Hotel”.

Sometimes the consequences border on the absurd. The ASE conference (Automated Software Engineering) accepts some contributions as “short papers”. Fair enough. At ASE 2009, “short paper” did not mean a shorter conference presentation but the permission to put up a poster and stand next to it for a while and answer passersby’s questions. For that privilege — and the real one: a publication in the conference volume — one had to register for the conference. ASE 2009 was in New Zealand, the other end of the world for a majority of authors. I ceded to the injunction: who was I to tell the PhD student whose work was the core of the submission, and who was so happy to have a paper accepted at a well-ranked conference, that he was not going to be published after all? But such practices are dubious. It would be more transparent to set up an explicit pay-for-play system, with page charges: at least the money would go to a scientific society or a university. Instead we ended up funding (in addition to the conference, which from what I heard was an excellent experience) airlines and hotels.

What makes such an example remarkable is that a reasonable justification exists for every one of its components: a highly selective refereeing process to maintain the value of the publication venue; limiting the number of papers selected for full presentation, to avoid a conference with multiple parallel tracks (and the all too frequent phenomenon of conference sessions whose audience consists of the three presenters plus the session chair); making sure that authors of published papers actually attend the event, so that it is a real conference with personal encounters, not just an opportunity to increment one’s publication count. The concrete result, however, is that authors of short papers have the impression of being ransomed without getting the opportunity to present their work in a serious way. Literally seconds as I was going to hit the “publish” button for the present article, an author of an accepted short paper for ASE 2012 (where the process appears similar) sent an email to complain, triggering a new discussion. We clearly need to find better solutions to resolve the conflicting criteria.

4. Publication as Ritual

Many of the seminal papers in science, including some of the most influential in computer science, defy classification and used a distinctive, one-of-a-kind style. Would they stand a chance in one of today’s highly ranked conferences, such as ICSE in software or VLDB in databases? It’s hard to guess. Each community has developed its own standard look-and-feel, so that after a while all papers start looking the same. They are like a classical mass with its Te Deum, Agnus Dei and Kyrie Eleison. (The “Te Deum” part is, in a conference submission, spread throughout the paper, in the form of adoring citations of the program committee members’ own divinely inspired articles, good for their H-indexes if they bless your own offering.)

All empirical software engineering papers, for example, have the obligatory “Threats to Validity” section, which is has developed into a true art form. The trick is the same as in the standard interview question “What can you say about your own deficiencies?”, to which every applicant know the key: describe a personality trait so that you superficially appear self-critical but in reality continue boasting, as in “sometimes I take my work too much to heart” [3]. The “Threats to Validity” section follows the same pattern: you try to think of all possible referee objections, the better to refute them.

Another part of the ritual is the “related work” section, treacherous because you have to make sure not to omit anything that a PC member finds important; also, you must walk a fine line between criticizing existing research too much, which could offend someone, or not enough, which enables the referee to say that you are not bringing anything significantly new. I often wonder who, besides the referees, reads those sections. But here too it is easier to lament than to fault the basic idea or propose better solutions. We do want to avoid wasting our time on papers whose authors are not aware of previous work. The related work section allows referees to perform this check. Its importance in the selection process has, however, grown out of proportion. It is one thing to make sure that a paper is state-of-the-art, but another to reject it (as often happens) because it fails to cite a particular contribution whose results would not directly affect its own. Here we move from the world of the rational to the world of the ritual. An extreme and funny recent example — funny to me, not necessarily to the coauthors — is a rejection from APSEC 2011, the Australia-Pacific Software Engineering Conference, based on one review (the others were positive) that stated: “How novel is this? Are [there] not any cloud-based IDEs out there that have [a] similar awareness model integrated into their CM? This is something the related work [section] fails to describe precisely. [4]” The ritual here becomes bizarre: as far as we know, no existing system discusses a similar model; the reviewer too does not know of any; but he blasts the paper all the same for not citing work that he thinks must have been done by someone, somehow, somewhere. APSEC is a fine conference — it has to be, from the totally unbiased criterion that it accepted another one of our submissions this year! — and this particular paper may or may not have been ready for publication; judge it for yourself [5]. Such examples suggest, however, that the ritual of computer science publication has its limits.

Publicity, Exam, Business, Ritual: to which one of the four modes of publication are you most attuned? Oh, sorry, I forgot: in your case, it is solely for the advancement of humankind.

References and notes

[1] Jeffrey F. Naughton, DBMS Research: First 50 Years, Next 50 Years, slides of keynote at 26^th IEEE International Conference on Data Engineering, 2010, available at lazowska.cs.washington.edu/naughtonicde.pdf .

[2] Tao Xie, Software Engineering Conferences, at people.engr.ncsu.edu/txie/seconferences.htm .

[3] I once saw on French TV a hilarious interview of an entrepreneur who had started a software company in Vietnam, where job candidates just did not know “the code”, and moved on, in response to such a question, to tell the interviewer about being rude to their mother and all the other horrible things they had done in their lives.

[4] The words in brackets were not in the review but I added them for clarity.

[5] Martin Nordio, H.-Christian Estler, Carlo A. Furia and Bertrand Meyer: Collaborative Software Development on the Web, available at arxiv.org/abs/1105.0768 .

(This article was first published on the CACM blog in September 2011.)

Rating: 9.3/10 (11 votes cast)

Rating: +9 (from 9 votes)

Tags: Publication
Category: Computer science, Essay, Recycled, Research evaluation, Writing and style | 3 Comments

Nastiness in computer science

27 August 2011, 18:51

(This article was originally published in the CACM blog.)

Are we malevolent grumps? Nothing personal, but as a community computer scientists sometimes seem to succumb to negativism.

They admit it themselves. A common complaint in the profession (at least in academia) is that instead of taking a cue from our colleagues in more cogently organized fields such as physics, who band together for funds, promotion, and recognition, we are incurably fractious. In committees, for example, we damage everyone’s chances by badmouthing colleagues with approaches other than ours. At least this is a widely perceived view (“Circling the wagons and shooting inward,” as Greg Andrews put it in a recent discussion). Is it accurate?

One statistic that I have heard cited is that in 1-to-5 evaluations of projects submitted to the U.S. National Science Foundation the average grade of computer science projects is one full point lower than the average for other disciplines. This is secondhand information, however, and I would be interested to know if readers with direct knowledge of the situation can confirm or disprove it.

More such examples can be found in the material from a recent keynote by Jeffrey Naughton, full of fascinating insights (see his Powerpoint slides ). Naughton, a database expert, mentions that only one paper out of 350 submissions to SIGMOD 2010 received a unanimous “accept” from its referees, and only four had an average accept recommendation. As he writes, “either we all suck or something is broken!”

Much of the other evidence I have seen and heard is anecdotal, but persistent enough to make one wonder if there is something special with us. I am reminded of a committee for a generously funded CS award some time ago, where we came close to not giving the prize at all because we only had “good” proposals, and none that a committee member was willing to die for. The committee did come to its senses, and afterwards several members wondered aloud what was the reason for this perfectionism that almost made us waste a great opportunity to reward successful initiatives and promote the discipline.

We come across such cases so often—the research project review that gratuitously but lethally states that you have “less than a 10% chance” of reaching your goals, the killer argument “I didn’t hear anything that surprised me” after a candidate’s talk—that we consider such nastiness normal without asking any more whether it is ethical or helpful. (The “surprise” comment is particularly vicious. Its real purpose is to make its author look smart and knowledgeable about the ways of the world, since he is so hard to surprise; and few people are ready to contradict it: Who wants to admit that he is naïve enough to have been surprised?)

A particular source of evidence is refereeing, as in the SIGMOD example. I keep wondering at the sheer nastiness of referees in CS venues.

We should note that the large number of rejected submissions is not by itself the problem. Naughton complains that researchers spend their entire careers being graded, as if passing exams again and again. Well, I too like acceptance better than rejection, but we have to consider the reality: with acceptance rates in the 8%-20% range at good conferences, much refereeing is bound to be negative. Nor can we angelically hope for higher acceptance rates overall; research is a competitive business, and we are evaluated at every step of our careers, whether we like it or not. One could argue that most papers submitted to ICSE and ESEC are pretty reasonable contributions to software engineering, and hence that these conferences should accept four out of five submissions; but the only practical consequence would be that some other venue would soon replace ICSE and ESEC as the publication place that matters in software engineering. In reality, rejection remains a frequent occurrence even for established authors.

Rejecting a paper, however, is not the same thing as insulting the author under the convenient cover of anonymity.

The particular combination of incompetence and arrogance that characterizes much of what Naughton calls “bad refereeing” always stings when you are on the receiving end, although after a while it can be retrospectively funny; one day I will publish some of my own inventory, collected over the years. As a preview, here are two comments on the first paper I wrote on Eiffel, rejected in 1987 by the IEEE Transactions on Software Engineering (it was later published, thanks to a more enlightened editor, Robert Glass, in the Journal of Systems and Software, 8, 1988, pp. 199-246 ). The IEEE rejection was on the basis of such review gems as:

I think time will show that inheritance (section 1.5.3) is a terrible idea.
Systems that do automatic garbage collection and prevent the designer from doing his own memory management are not good systems for industrial-strength software engineering.

One of the reviewers also wrote: “But of course, the bulk of the paper is contained in Part 2, where we are given code fragments showing how well things can be done in Eiffel. I only read 2.1 arrays. After that I could not bring myself to waste the time to read the others.” This is sheer boorishness passing itself off as refereeing. I wonder if editors in other, more established disciplines tolerate such attitudes. I also have the impression that in non-CS journals the editor has more personal leverage. How can the editor of IEEE-TSE have based his decision on such a biased an unprofessional review? Quis custodiet ipsoes custodes?

“More established disciplines”: Indeed, the usual excuse is that we are still a young field, suffering from adolescent aggressiveness. If so, it may be, as Lance Fortnow has argued in a more general context, “time for computer science to grow up.” After some 60 or 70 years we are not so young any more.

What is your experience? Is the grass greener elsewhere? Are we just like everyone else, or do we truly have a nastiness problem in computer science?

Rating: 9.5/10 (31 votes cast)

Rating: +19 (from 19 votes)

Tags: Refereeing, Research
Category: Computer science, Essay, General technology | 4 Comments

All Bugs Great and Small

23 August 2011, 08:38

(Acknowledgment: this article came out of a discussion with Manuel Oriol, Carlo Furia and Yi Wei. The material is largely theirs but the opinions are mine.)

A paper on automatic testing, submitted some time ago, received the following referee comment:

The case study seems unrealistic and biased toward the proposed technique. 736 unique faults found in 92 classes means at least 8 unique faults per class at the same time. I have never seen in all my life a published library with so many faults …

This would be a good start for a discussion of what is wrong with refereeing in computer science today (on the negativism of our field see [1]); we have a referee who mistakes experience for expertise, prejudice for truth, and refuses to accept carefully documented evidence because “in all his life”, presumably a rich and rewarding life, he has never seen anything of the sort. That is not the focus of the present article, however; arrogant referees eventually retire and good papers eventually get published. The technical problems are what matters. The technical point here is about testing.

Specifically, what bugs are worth finding, and are high bug rates extraordinary?

The paper under review was a step in the work around the automatic testing tool AutoTest (see [2] for a slightly older overall description and [3] for the precise documentation). AutoTest applies a fully automatic strategy, exercising classes and their routines without the need to provide test cases or test oracles. What makes such automation possible is the combination of random generation of tests and reliance on contracts to determine the success of tests.

For several years we have regularly subjected libraries, in particular the EiffelBase data structure library, to long AutoTest sessions, and we keep finding bugs (the better term is faults). The fault counts are significant; here they caught the referee’s eye. In fact we have had such comments before: I don’t believe your fault counts for production software; your software must be terrible!

Well, maybe.

My guess is that in fact EiffelBase has no more bugs, and possibly far fewer bugs, than other “production” code. The difference is that the AutoTest framework performs far more exhaustive tests than usually practiced.

This is only a conjecture; unlike the referee I do not claim any special powers that make my guesses self-evident. Until we get test harnesses comparable to AutoTest for environments other than Eiffel and, just as importantly, libraries that are fully equipped with contracts, enabling the detection of bugs that otherwise might not come to light, we will not know whether the explanation is the badness of EiffelBase or the goodness of AutoTest.

What concrete, incontrovertible evidence demonstrates is that systematic random testing does find faults that human testers typically do not. In a 2008 paper [4] with Ilinca Ciupa, Manuel Oriol and Alexander Pretschner, we ran AutoTest on some classes and compared the results with those of human testers (as well as actual bug reports from the field, since this was released software). We found that the two categories are complementary: human testers find faults that are still beyond the reach of automated tools, but they typically never find certain faults that AutoTest, with its stubborn dedication to leaving no stone unturned, routinely uncovers. We keep getting surprised at bugs that AutoTest detects and which no one had sought to test before.

A typical set of cases that human programmers seldom test, but which frequently lead to uncovering bugs, involves boundary values. AutoTest, in its “random-plus” strategy, always exercises special values of every type, such as MAXINT, the maximum representable integer. Programmers don’t. They should — all testing textbooks tell them so — but they just don’t, and perhaps they can’t, as the task is often too tedious for a manual process. It is remarkable how many routines using integers go bezerk when you feed them MAXINT or its negative counterpart. Some of the fault counts that seem so outrageous to our referee directly come from trying such values.

Some would say the cases are so extreme as to be insignificant. Wrong. Many documented software failures and catastrophes are due to untested extreme values. Perhaps the saddest is the case of the Patriot anti-missile system, which at the beginning of the first Gulf war was failing to catch Scud missiles, resulting in one case in the killing of twenty-eight American soldiers in an army barrack. It was traced to a software error [5]. To predict the position of the incoming missile, the computation multiplied time by velocity. The time computation used multiples of the time unit, a tenth of a second, stored in a 24-bit register and hence approximated. After enough time, long enough to elapse on the battlefield, but longer than what the tests had exercised, the accumulated error became so large as to cause a significant — and in the event catastrophic — deviation. The unique poser of automatic testing is that unlike human testers it is not encumbered by a priori notions of a situation being extreme or unlikely. It tries all the possibilities it can.

The following example, less portentous in its consequences but just as instructive, is directly related to AutoTest. For his work on model-based contracts [6] performed as part of his PhD completed in 2008 at ETH, Bernd Schoeller developed classes representing the mathematical notion of set. There were two implementations; it turned out that one of them, say SET1, uses data structures that make the subset operation ⊆ easy to program efficiently; in the corresponding class, the superset operation, a ⊇ b, is then simply implemented as b ⊆ a. In the other implementation, say SET2, it is the other way around: ⊇ is directly implemented, and a ⊆ b, is implemented as b ⊇ a. This all uses a nice object-oriented structure, with a general class SET defining the abstract notion and the two implementations inheriting from it.

Now you may see (if you have developed a hunch for automated testing) where this is heading: AutoTest knows about polymorphism and dynamic binding, and tries all the type combinations that make sense. One of the generated test cases has two variables s1 and s2 of type SET, and tries out s2 ⊆ s1; in one of the combinations that AutoTest tries, s1 is dynamically and polymorphically of type SET1 and s2 of type SET2. The version of ⊆ that it will use is from SET2, so it actually calls s1 ⊇ s2; but this tests the SET1 version of ⊇, which goes back to SET2. The process would go on forever, were it not for a timeout in AutoTest that uncovers the fault. Bernd Schoeller had tried AutoTest on these classes not in the particular expectation of finding bugs, but more as a favor to the then incipient development of AutoTest, to see how well the tool could handle model-based contracts. The uncovering of the fault, testament to the power of relentless, systematic automatic testing, surprised us all.

In this case no contract was violated; the problem was infinite recursion, due to a use of O-O techniques that for all its elegance had failed to notice a pitfall. In most cases, AutoTest finds the faults through violated postconditions or class invariants. This is one more reason to be cautious about sweeping generalizations of the kind “I do not believe these bug rates, no serious software that I have seen shows anything of the sort!”. Contracts express semantic properties of the software, which the designer takes care of stating explicitly. In run-of-the-mill code that does not benefit from such care, lots of things can go wrong but remain undetected during testing, only to cause havoc much later during some actual execution.

When you find such a fault, it is irrelevant that the case is extreme, or special, or rare, or trivial. When a failure happens it no longer matter that the fault was supposed to be rare; and you will only know how harmful it is when you deal with the consequences. Testing, single-mindedly devoted to the uncovering of faults [7], knows no such distinction: it hunts all bugs large and small.

References

[1] The nastiness problem in computer science, article on the CACM blog, 22 August 2011, available here.

[2] Bertrand Meyer, Ilinca Ciupa, Andreas Leitner, Arno Fiva, Yi Wei and Emmanuel Stapf: Programs that Test Themselves, IEEE Computer, vol. 42, no. 9, pages 46-55, September 2009, also available here.

[3] Online AutoTest documentation, available here at docs.eiffel.com.

[4] Ilinca Ciupa, Bertrand Meyer, Manuel Oriol and Alexander Pretschner: Finding Faults: Manual Testing vs. Random+ Testing vs. User Reports, in ISSRE ’08, Proceedings of the 19th IEEE International Symposium on Software Reliability Engineering, Redmond, November 2008, available here.

[5] US General Accounting Office: GAO Report: Patriot Missile Defense– Software Problem Led to System Failure at Dhahran, Saudi Arabia, February 4, 1992, available here.

[6] Bernd Schoeller, Tobias Widmer and Bertrand Meyer: Making Specifications Complete Through Models, in Architecting Systems with Trustworthy Components, eds. Ralf Reussner, Judith Stafford and Clemens Szyperski, Lecture Notes in Computer Science, Springer-Verlag, 2006, available here.

[7] Bertrand Meyer: Seven Principles of Software testing, in IEEE Computer, vol. 41, no. 10, pages 99-101, August 2008available here.

Rating: 9.1/10 (9 votes cast)

Rating: +6 (from 6 votes)

Tags: AutoTest, Design by Contract, Eiffel
Category: Design by Contract, Essay, Software engineering, Testing | 4 Comments

Testing insights

11 July 2011, 16:25

Lionel Briand and his group at the Simula Research Laboratory in Oslo have helped raise the standard for empirical research in testing and other software engineering practices by criticizing work that in their opinion relies on wrong assumptions or insufficiently supported evidence. In one of their latest papers [1] they take aim at “Adaptive Random Testing” (ART); one of the papers they criticize is from our group at ETH, on the ARTOO extension [2] to this testing method. Let’s examine the criticism!

We need a bit of background on random testing, ART, and ARTOO:

Random testing tries inputs based on a random process rather than attempting a more sophisticated strategy; it was once derided as silly [3], but has emerged in recent years as a useful technique. Our AutoTest tool [4], now integrated in EiffelStudio, has shown it to be particularly effective when applied to code equipped with contracts, which provide built-in test oracles. As a result of this combination, testing can be truly automatic: the two most tedious tasks of traditional testing, test case preparation and test oracle definition, can be performed without human intervention.
ART, developed by Chen and others [5], makes random testing not entirely random by ensuring that the inputs are spread reasonably evenly in the input domain.
ARTOO, part of Ilinca Ciupa’s PhD thesis on testing defended in 2008, generalized ART to object-oriented programs, by defining a notion of distance between objects; the ARTOO strategy avoids choosing objects that are too close to each other. The distance formula, which you can find in[2], combines three elementary distances: between the types of the objects involved, the values in their primitive fields (integers etc.), and, recursively, the objects to which they have references.

Arcuri and Briand dispute the effectiveness of ART and criticize arguments that various papers have used to show its effectiveness. About the ARTOO paper they write

The authors concluded that ART was better than random testing since it needed to sample less test cases before finding the first failure. However, ART was also reported as taking on average 1.6 times longer due to the distance calculations!

To someone not having read our paper the comment and the exclamation mark would seem to suggest that the paper somehow downplays this property of random testing, but in fact it stresses it repeatedly. The property appears for example in boldface as part of the caption to Table 2: In most cases ARTOO requires significantly less tests to find a fault, but entails a time overhead, and again in boldface in the caption to Table 3: The overhead that the distance calculations introduce in the testing process causes ARTOO to require on average 1.6 times more time than RAND to find the first fault.

There is no reason, then, to criticize the paper on this point. It reports the results clearly and fairly.

If we move the focus from the paper to the method, however, Arcuri and Briand have a point. As they correctly indicate, the number of tests to first fault is not a particularly useful criterion. In fact I argued against it in another paper on testing [6]

The number of tests is not that useful to managers, who need help deciding when to stop testing and ship, or to customers, who need an estimate of fault densities. More relevant is the testing time needed to uncover the faults. Otherwise we risk favoring strategies that uncover a failure quickly but only after a lengthy process of devising the test; what counts is total time. This is why, just as flies get out faster than bees, a seemingly dumb strategy such as random testing might be better overall.

(To understand the mention of flies and bees you need to read [6].) The same article states, as its final principle:

Principle 7: Assessment criteria A testing strategy’s most important property is the number of faults it uncovers as a function of time.

The ARTOO paper, which appeared early in our testing work, used “time to first failure” because it has long been a standard criterion in the testing literature, but it should have applied our own advice and focused on more important properties of testing strategies.

The “principles” paper [6] also warned against a risk awaiting anyone looking for new test strategies:

Testing research is vulnerable to a risky thought process: You hit upon an idea that seemingly promises improvements and follow your intuition. Testing is tricky; not all clever ideas prove helpful when submitted to objective evaluation.

The danger is that the clever ideas may result in so much strategy setup time that any benefit on the rest of the testing process is lost. This danger threatens testing researchers, including those who are aware of it.

The idea of ARTOO and object distance remains attractive, but more work is needed to make it an effective contributor to automated random testing and demonstrate that effectiveness. We can be grateful to Arcuri and Briand for their criticism, and I hope they continue to apply their iconoclastic zeal to empirical software engineering work, ours included.

I have objections of my own to their method. They write that “all the work in the literature is based either on simulations or case studies with unreasonably high failure rates”. This is incorrect for our work, which does not use simulations, relying instead on actual, delivered software, where AutoTest routinely finds faults in an automatic manner.

In contrast, however, Arcuri and Briand rely on fault seeding (also known as fault introduction or fault injection):

To obtain more information on how shapes appear in actual SUT, we carried out a large empirical analysis on 11 programs. For each program, a series of mutants were generated to introduce faults in these programs in a systematic way. Faults generated through mutation [allow] us to generate a large number of faults, in an unbiased and varied manner. We generated 3727 mutants and selected the 780 of them with lower detection probabilities to carry out our empirical analysis of faulty region shapes.

In the absence of objective evidence attesting to the realism of fault seeding, I do not believe any insights into testing obtained from such a methodology. In fact we adopted, from the start of our testing work, the principle that we would never rely on fault seeding. The problem with seeded faults is that there is no guarantee they reflect the true faults that programmers make, especially the significant ones. Techniques for fault seeding are understandably good at introducing typographical mistakes, such as a misspelling or the replacement of a “+” by a “-”; but these are not interesting kinds of fault, as they are easily caught by the compiler, by inspection, by low-tech static tools, or by simple tests. Interesting faults are those resulting from a logical error in the programmer’s mind, and in my experience (I do not know of good empirical studies on this topic) seeding techniques do not generate them.

For these reasons, all our testing research has worked on real software, and all the faults that AutoTest has found were real faults, resulting from a programmer’s mistake.

We can only apply this principle because we work with software equipped with contracts, where faults will be detected through the automatic oracle of a violated assertion clause. It is essential, however, to the credibility and practicality of any testing strategy; until I see evidence to the contrary, I will continue to disbelieve any testing insights resulting from studies based on artificial fault injection.

References

[1] Andrea Arcuri and Lionel Briand: Adaptive Random Testing: An Illusion of Effectiveness, in ISSTA 2011 (International Symposium on Software Testing and Analysis), available here.

[2] Ilinca Ciupa, Andreas Leitner, Manuel Oriol and Bertrand Meyer: ARTOO: Adaptive Random Testing for Object-Oriented Software, in ICSE 2008: Proceedings of 30th International Conference on Software Engineering, Leipzig, 10-18 May 2008, IEEE Computer Society Press, 2008, also available here.

[3] Glenford J. Myers. The Art of Software Testing. Wiley, New York, 1979. Citation:

Probably the poorest methodology of all is random-input testing: the process of testing a program by selecting, at random, some subset of all possible input values. In terms of the probability of detecting the most errors, a randomly selected collection of test cases has little chance of being an optimal, or close to optimal, subset. What we look for is a set of thought processes that allow one to select a set of test data more intelligently. Exhaustive black-box and white-box testing are, in general, impossible, but a reasonable testing strategy might use elements of both. One can develop a reasonably rigorous test by using certain black-box-oriented test-case-design methodologies and then supplementing these test cases by examining the logic of the program (i.e., using white-box methods).

[4] Bertrand Meyer, Ilinca Ciupa, Andreas Leitner, Arno Fiva, Yi Wei and Emmanuel Stapf: Programs that Test Themselves, IEEE Computer, vol. 42, no. 9, pages 46-55, September 2009, available here. For practical uses of AutoTest within EiffelStudio see here.

[5] T. Y. Chen, H Leung and I K Mak: Adaptive Random Testing, in Advances in Computer Science, ASIAN 2004, Higher-Level Decision Making, ed. M.J. Maher, Lecture Notes in Computer Science 3321, Springer-Verlag, pages 320-329, 2004, available here.

[6] Bertrand Meyer: Seven Principles of Software testing, in IEEE Computer, vol. 41, no. 10, pages 99-101, August 2008, also available here.

Rating: 8.5/10 (6 votes cast)

Rating: +1 (from 7 votes)

Tags: ARTOO, AutoTest, Random testing
Category: Computer science, Design by Contract, Empirical Software Engineering, Essay, Research evaluation, Software engineering, Software verification, Testing | Comment

If I’m not pure, at least my functions are

4 July 2011, 22:11

If I’m not pure, at least my jewels are [1].

We often need to be reassured that a routine, usually a function, is “pure”, meaning that it does not change the state of the computation. For example, a function used in a contract element (precondition, postcondition, class invariant, loop invariant) should be purely descriptive, since it is a specification element; evaluating it, typically for testing and debugging, should not create a change of behavior — a “Heisenberg effect” — in the very computation that it is intended to assess. Another application is in a concurrency context, particularly in SCOOP (see earlier posts and forthcoming ones): if one or more functions are pure, several of their executions can run concurrently on the same object.

The notion of purity admits variants. The usual notion is what [2] calls weak purity, which precludes changes to previously existing objects but allow creating new objects. In the EiffelBase library we also encounter routines that have another form of purity, which we may call “relative” purity: they can leave the same state on exit as they found on entry, but in-between they might change the state. For the rest of this discussion we will rely on the standard notion of weak purity: no changes permitted on existing objects.

It is often suggested that the programming language should support specifying that a routine is pure; many people have indeed proposed the addition of a keyword such as pure to Eiffel. One of the reasons this is not — in my opinion — such a great idea is that purity is just a special case of the more general problem of framing: specifying and verifying what a routine does not change. If we can specify an arbitrary frame property, then we can, as a special case covered by the general mechanism, specify that a routine changes nothing.

To see why framing is so important, consider a class ACCOUNT with a routine deposit that has the postcondition

balance = old balance + sum………..— Where sum is the argument of deposit

Framing here means stating that nothing else than balance changes; for example the account’s owner and its number should remain the same. It is not practical to write all individual postcondition clauses such as

owner= old owner
number= old number

and so on. But we do need to specify these properties and enforce them, if only to avoid that a descendant class (maybe MAFIA_ACCOUNT) distort the rules defined by the original.

One technique is to add a so-called “modifies clause”, introduced by verification tools such as ESC-Java [3] and JML [4]. Modifies clauses raise some theoretical issues; in particular, the list of modified expressions is often infinite, so we must restrict ourselves to an abstract-data-type view where we characterize a class by commands and queries and the modifies clause only involves queries of the class. Many people find this hard to accept at first, since anything that is not talked about can change, but it is the right approach. A modifies clause of sorts, included in the postcondition, appeared in an earlier iteration of the Eiffel specification, with the keyword only (which is indeed preferable to modifies, which in the Eiffel style favoring grammatically simple keywords would be modify, since what we want to express is not that the routine must change anything at all but that it may only change certain specified results!). The convention worked well with inheritance, since it included the rule that a clause such as only balance, in class ACCOUNT, prescribes that the routine may not, in its modifies version as well as in any version redefined in descendants, change any other query known at the level of ACCOUNT; but a descendant version may change, subject to its own ACCOUNT clauses, any new query introduced by a descendant.

To declare a routine as pure, it would suffice to use an empty only clause (not very elegant syntactically — “only” what? — but one can get used to it).

This construct has been discarded, as it places too heavy a burden on the programmer-specifier. Here the key observation resulted from a non-scientific but pretty extensive survey I made of all the JML code I could get my hands on. I found that every time a query appeared in a modifies clause it was also listed in the postcondition! On reflection, this seems reasonable: if you are serious about specification, as anyone bothering to write such a clause surely is, you will not just express that something changes and stop there; you will write something about how it may change. Not necessarily the exact result, as in

my_query = precise_final_value

but at least some property of that result, as in

some_property (my_query)

This observation has, however, an inescapable consequence for language design: modifies or only clauses should be inferred by the compiler from the postcondition, not imposed on the programmer as an extra burden. The convention, which we may call the Implicit Framing Rule, is simple:

A routine may change the value of a query only if the query is specified in the routine’s postcondition

(or, if you like double negation, “no routine may change the value of a query other than those specified in its postcondition”). Here we say that a feature is “specified” in a postcondition if it appears there outside of the scope of an old expression. (Clearly, an occurrence such as old balance does not imply that balance can be modified, hence this restriction to occurrences outside of an old.)

With this convention the only clause becomes implicit: it would simply list all the queries specified in the postcondition, so there is no need for the programmer to write it. For the rare case of wanting to specify that a query q may change, but not wanting to specify how, it is easy to provide a library function, say involved, that always return true and can be used in postconditions, as in involved (q).

The convention is of course not just a matter of programming methodology but, in an IDE supporting verification, such as the EVE “Verification As a Matter Of Course” environment which we are building for Eiffel [5], the compiler will enforce the definition above — no change permitted to anything not specified in the postcondition — and produce an error in case of a violation. This also means that we can easily specify that a routine is pure: it must simply not specify any query in its postcondition. It may still list it in an old clause, as happens often in practice, e.g.

Result = old balance – Minimum_balance………..— In the postcondition of a function available_funds

Note the need to use old here. Apart from this addition of old to some postconditions, a considerable advantage of the convention is that most existing code using pure functions will be suitable to the new purity enforcement without any need to provide new annotations.

I believe that this is the only sustainable convention. It does not, of course, solve the frame problem by itself (for attempts in this direction see [6, 7]), but it is a necessary condition for a solution that is simple, easily taught, fairly easily implemented, and effective. It goes well with model-based specifications [8, 9], which I believe are the technique of most promise for usable specifications of object-oriented software. And it provides a straightforward, no-frills way to enforce purity where prescribed by the Command-Query Separation principle [10, 11]: if I’m not pure, at least my functions must be.

References

[1] From the lyrics of the aria Glitter and Be Gay in Leonard Bernstein’s Candide, text by Lillian Hellman and others. Youtube offers several performances, including by Diana Damrau (here) and Natalie Dessay (here) . For the text see e.g. here.

[2] Adam Darvas and Peter Müller: Reasoning About Method Calls in Interface Specifications, in Journal of Object Technology, Volume 5, no. 5, jUNE 2006, pages 59-85, doi:10.5381/jot.2006.5.5.a3, available here.

[3] C. Flanagan, K.R.M. Leino, M. Lillibridge, G. Nelson, J. B. Saxe and R. Stata: Extended static checking for Java, in PLDI 2002 (Programming Language Design and Implementation), pages 234–245, 2002.

[4] Gary Leavens et al.: Java Modeling Language, see references here.

[5] Julian Tschannen, Carlo A. Furia, Martin Nordio, and Bertrand Meyer: Verifying Eiffel Programs with Boogie, to appear in Boogie 2011, First International Workshop on Intermediate Verification Languages, Wroclaw, August 2011. See documentation about the EVE project on the project page.

[6] Ioannis Kassios: Dynamic Frames: Support for Framing, Dependencies and Sharing Without Restrictions, in Formal Methods 2006, eds. J. Misra, T. Nipkow and E. Sekerinski, Lecture Notes in Computer Science 4085, Springer Verlag, 2006, pages 268-283.

[7] Bernd Schoeller: Making Classes Provable through Contracts, Models and Frames, PhD thesis, ETH Zurich, 2007, available here

[8] Bernd Schoeller, Tobias Widmer and Bertrand Meyer: Making Specifications Complete Through Models, in Architecting Systems with Trustworthy Components, eds. Ralf Reussner, Judith Stafford and Clemens Szyperski, Lecture Notes in Computer Science, Springer-Verlag, 2006, available here.

[9] Nadia Polikarpova, Carlo Furia and Bertrand Meyer: Specifying Reusable Components, in Verified Software: Theories, Tools, Experiments (VSTTE ’10), Edinburgh, UK, 16-19 August 2010, Lecture Notes in Computer Science, Springer Verlag, 2010, available here.

[10] Bertrand Meyer: Object-Oriented Software Construction, first (1988) and second (1997) editions, Prentice Hall.

[11] Bertrand Meyer: Touch of Class: An Introduction to Programming Well, Using Objects and Contracts, Springer Verlag, 2009.

Rating: 10.0/10 (7 votes cast)

Rating: +3 (from 5 votes)

Tags: Framing, Purity
Category: Design by Contract, Eiffel, Essay, Formal methods and proofs, Language design, Programming techniques, Software engineering, Software verification | 13 Comments

Assessing concurrency models

12 June 2011, 16:57

By describing a poorly conceived hypothetical experiment, last week’s article described the “Professor Smith syndrome” consisting of four risks that threaten the validity of empirical software engineering experiments relying on students in a course:

Professor Smith Risk 1: possible bias if the evaluator has a stake in the ideas or tools under assessment.
Professor Smith Risk 2: creating different levels of motivation in the different groups (Hawthorne effect).
Professor Smith Risk 3: extrapolating from students to professionals.
Professor Smith Risk 4: violation of educational ethics if the experiment may cause some students to learn better than others.

If you have developed a great new method or tool and would like to assess it, the best way to address Risk 1 is to find someone else to do the assessment. What if this solution is not practical? Recently we wanted to get some empirical evidence on the merits of the SCOOP (Simple Concurrent Object-Oriented Programming) approach to concurrency [1, 2], on which I have worked for a long time and which is now part of EiffelStudio since the release of 6.8 a couple of weeks ago. We wanted to see if, despite the Professor Smith risks, we could do a credible study ourselves.

The ETH Software Architecture course[3], into which we introduced some introductory material on concurrency last year (as part of a general effort to push more concurrency into software courses at ETH), looked like a good place to try an evaluation; it is a second-year course, where students, or so we thought, would have little prior experience in concurrent software design.

The study’s authors — Sebastian Nanz, Faraz Torshizi and Michela Pedroni — paid special attention to the methodological issues. To judge for yourself whether we addressed them properly, you can read the current version of our paper to be presented at ESEM 2011 [4]. Do note that it is a draft and that we will improve the paper for final publication.

Here is some of what we did. I will not address the Professor Smith Risk 3, the use of students, which (as Lionel Briand has pointed out in a comment on the previous article) published work has studied; in a later article I will give references to some of that work. But we were determined to tackle the other risks explicitly, so as to obtain credible results.

The basic experiment was a session in which the students were exposed to two different design methods for concurrent software: multithreaded programming in Java, which I’ll call “Java Threads”, and SCOOP. We wanted to explore whether it is easier to program in SCOOP than in Java. This is too general a hypothesis, so it was refined into three concrete hypotheses: is it easier to understand a SCOOP program? Is it easier to find errors in SCOOP programs? Do programmers using SCOOP make fewer errors?

A first step towards reducing the effect — Professor Smith Risk 1 — of any emotional attachment of the experimenters to one of the approaches, SCOOP in our case, was to generalize the study. Although what directly interested us was to compare SCOOP against Java Threads, we designed the study as a general scheme to compare concurrency approaches; SCOOP and Java Threads are just an illustration, but anyone else interested in assessing concurrency techniques — say Erlang versus C# concurrency — can apply the same methodology. This decision had two benefits: it freed the study from dependency on the particular techniques, hence, we hope, reducing bias; and as side attraction of the kind that is hard for researchers to resist, it increased the publishability of the results.

Circumstances unexpectedly afforded us another protection against any for-SCOOP bias: unbeknownst to us at the time of the study’s design, a first-year course had newly added (in 2009, whereas our study was performed in 2010) an introduction to concurrent programming — using Java Threads! While we had thought that concurrency in any form would be new to most students, in fact almost all of them had now seen Java Threads before. (The new material in the first-year course was taken by ETH students only, but many transfer students had also already had an exposure to Java Threads.) On the other hand, students had not had any prior introduction to SCOOP. So any advantage that one of the approaches may have had because of students’ prior experience would work against our hypotheses. This unexpected development would not help if the study’s results heavily favored Java Threads, but if they favored SCOOP it would reinforce their credibility.

A particular pedagogical decision was made regarding the teaching of our concurrency material: it started with a self-study rather than a traditional lecture. One of the reasons for this decision was purely pedagogical: we felt (and the course evaluations confirmed) that at that stage of the semester the students would enjoy a break in the rhythm of the course. But another reason was to avoid any bias that might have arisen from any difference in the lecturers’ levels of enthusiasm and effectiveness in teaching the two approaches. In the first course session devoted to concurrency, students were handed study materials presenting Java Threads and SCOOP and containing a test to be taken; the study’s results are entirely based on their answers to these tests. The second session was a traditional lecture presenting both approaches again and comparing them. The purpose of this lecture was to make sure the students got the full picture with the benefit of a teacher’s verbal explanations.

The study material was written carefully and with a tone as descriptive and neutral as possible. To make comparisons meaningful, it does not follow a structure specific to Java Threads or SCOOP (as we might have used had we taught only one of these approaches); instead it relies in both cases on the same overall plan (figure 2 of the paper), based on a neutral analysis of concurrency concepts and issues: threads, mutual exclusion, deadlock etc. Each section then presents, for one such general concurrency question, the solution proposed by Java Threads or SCOOP.

This self-study material, as well as everything else about the study, is freely available on the Web; see the paper for the links.

In the self-study, all students studied both the Java Threads and SCOOP materials. They were randomly assigned to two groups, for which the only difference was the order of studying the approaches. We feel that this decision addresses the ethical issue (Professor Smith Risk 4): any pedagogical effect of reading about A before B rather than the reverse, in the course of a few hours, has to be minimal if you end up reading about the two of them, and on the next day follow a lecture that also covers both.

Having all students study both approaches — a crossover approach in the terminology of [5] — should also address the Hawthorne effect (Professor Smith Risk 2): students have no particular incentive to feel that one of the approaches is more hip than the other. While they are not told that SCOOP is partly the work of the instructors, some of them may know or guess this information; the consequences, positive or negative, are limited, since they are asked in both cases to do as well as they can in answering the assessment questions.

The design of that evaluation is another crucial element in trying to avoid bias. We tried, to the extent possible, to base the assessment on objective criteria. For the first hypothesis (program understanding) the technique was to ask the students to predict the output of some simple concurrent programs. To address the risk of a binary correct/incorrect assessment, and get a more fine-grained view, we devised the programs so that they would produce output strings and measured the Levenshtein (edit) distance to the correct result. For the second hypothesis (ease of program debugging), we gave students programs exhibiting typical errors in both approaches and asked them to tell us both the line number of any error they found and an explanation. Assessing the explanation required human analysis; the idea of also assigning partial credit for pointing out a line number without providing a good explanation is that it may be meaningful that a student found that something is amiss even without being quite able to define what it is. The procedure for the third hypothesis (producing programs with fewer errors) was more complex and required two passes over the result; it requires some human analysis, as you will see in the article, but we hope that the two-pass process removes any bias.

This description of the study is only partial and you should read the article [4] for the full details of the procedure.

So what did we find in the end? Does SCOOP really makes concurrency easier to learn, concurrent programs easier to debug, and concurrent programmers less error-prone? Here too I will refer you to the article. Let me simply mention that the results held some surprises.

In obtaining these results we tried very hard to address the Professor Smith syndrome and its four risks. Since all of our materials, procedures and data are publicly accessible, described in some detail in the paper, you can determine for yourself how well we met this objective, and whether it is possible to perform credible assessments even of one’s own work.

References

Further reading: for general guidelines on how to conduct empirical research see [5]; for ethical guidelines, applied to psychological research but generalizable, see [6].

[1] SCOOP Eiffel documentation, available here.

[2] SCOOP project documentation at ETH, available here.

[3] Software Architecture course at ETH, course page (2011).

[4] Sebastian Nanz, Faraz Torshizi, Michela Pedroni and Bertrand Meyer: Design of an Empirical Study for Comparing the Usability of Concurrent Programming Languages, to appear in ESEM 2011 (ACM/IEEE International Symposium on Empirical Software Engineering and Measurement), 22-23 September 2011. Draft available here.

[5] Barbara A. Kitchenham, Shari L. Pfleeger, Lesley M. Pickard, Peter W. Jones, David C. Hoaglin, Khaled El-Emam and Jarrett Rosenberg: Preliminary Guidelines for Empirical Research in Software Engineering, national Research Council Canada (NRC-CNRC), Report ERB-1082, 2001, available here.

[6] Robert Rosenthal: Science and ethics in conducting, analyzing, and reporting psychological research, in Psychological Science, 5, 1994, p127-134. I found a copy cached by a search engine here.

Rating: 10.0/10 (8 votes cast)

Rating: +7 (from 7 votes)

Tags: Empirical study, ESEM, Java threads, Multithreading, SCOOP
Category: Computer science, Concurrency, Education, Eiffel, Empirical Software Engineering, Essay, Programming techniques, Research evaluation, Software engineering | Comment

Stendhal on abstraction

29 May 2011, 19:35

This week we step away from our usual sources of quotations — the Hoares and Dijkstras and Knuths — in favor an author who might seem like an unlikely inspiration for a technology blog: Stendhal. A scientist may like anyone else be fascinated by Balzac, Flaubert, Tolstoy or Dostoevsky, but they live in an entirely different realm; Stendhal is the mathematician’s novelist. Not particularly through the themes of his works (as could be the case with Borges or Eco), but because of their clear structure and elegant style, impeccable in its conciseness and razor-like in its precision. Undoubtedly his writing was shaped by his initial education; he prepared for the entrance exam of the then very young École Polytechnique, although at the last moment he yielded instead to the call of the clarion.

The scientific way of thinking was not just an influence on his writing; he understood the principles of scientific reasoning and knew how to explain them. Witness the following text, which explains just about as well as anything I know the importance of abstraction. In software engineering (see for example [1]), abstraction is the key talent, a talent of a paradoxical nature: the basic ideas take a few minutes to explain, and a lifetime to master. In this effort, going back to the childhood memories of Henri Beyle (Stendhal’s real name) is not a bad start.

Stendhal’s Life of Henri Brulard is an autobiography, with only the thinnest of disguises into a novel (compare the hero’s name with the author’s). In telling the story of his morose childhood in Grenoble, the narrator grumbles about the incompetence of his first mathematics teacher, a Mr. Dupuy, who taught mathematics “as a set of recipes to make vinegar” (comme une suite de recettes pour faire du vinaigre) and tells how his father found a slightly better one, Mr. Chabert. Here is the rest of the story, already cited in [2]. The translation is mine; you can read the original below, as well as a German version. Instead of stacks and circles — or a university’s commencement day, see last week’s posting — the examples invoke eggs and cheese, but wouldn’t you agree that this paragraph is as good a definition of abstraction, directly applicable to software abstractions, and specifically to abstract data types and object abstractions (yes, it does discuss “objects”!), as any other?

So I went to see Mr. Chabert. Mr. Chabert was indeed less ignorant than Mr. Dupuy. Through him I discovered Euler and his problems on the number of eggs that a peasant woman brings to the market where a scoundrel steals a fifth of them, then she leaves behind the entire half of the remainder and so forth. This opened my mind, I glimpsed what it means to use the tool called algebra. I’ll be damned if anyone had ever explained it to me; endlessly Mr. Dupuy spun pompous sentences on the topic, but never did he say this one simple thing: it is a division of labor, and like every division of labor it creates wonders by allowing the mind to concentrate all its forces on just one side of objects, on just one of their qualities. What difference it would have made if Mr. Dupuy had told us: This cheese is soft or is it hard; it is white, it is blue; it is old, it is young; it is mine, it is yours; it is light or it is heavy. Of so many qualities, let us only consider the weight. Whatever that weight is, let us call it A. And now, no longer thinking of cheese, let us apply to A everything we know about quantities. Such a simple thing; and yet no one was explaining it to us in that far-away province [3]. Since that time, however, the influence of the École Polytechnique and Lagrange’s ideas may have trickled down to the provinces.

References

[1] Jeff Kramer: Is abstraction the key to computing?, in Communications of The ACM, vol. 50, 2007, pages 36-42.
[2] Bertrand Meyer and Claude Baudoin: Méthodes de Programmation, Eyrolles, 1978, third edition, 1982.
[3] No doubt readers from Grenoble, site of great universities and specifically one of the shrines of French computer science, will appreciate how Stendhal calls it “that backwater” (cette province reculée).

Original French text

J’allai donc chez M. Chabert. M. Chabert était dans le fait moins ignare que M. Dupuy. Je trouvai chez lui Euler et ses problèmes sur le nombre d’œufs qu’une paysanne apportait au marché lorsqu’un méchant lui en vole un cinquième, puis elle laisse toute la moitié du reste, etc., etc. Cela m’ouvrit l’esprit, j’entrevis ce que c’était que se servir de l’instrument nommé algèbre. Du diable si personne me l’avait jamais dit ; sans cesse M. Dupuy faisait des phrases emphatiques sur ce sujet, mais jamais ce mot simple : c’est une division du travail qui produit des prodiges comme toutes les divisions du travail et permet à l’esprit de réunir toutes ses forces sur un seul côté des objets, sur une seule de leurs qualités. Quelle différence pour nous si M. Dupuy nous eût dit : Ce fromage est mou ou il est dur ; il est blanc, il est bleu ; il est vieux, il est jeune ; il est à moi, il est à toi ; il est léger ou il est lourd. De tant de qualités ne considérons absolument que le poids. Quel que soit ce poids, appelons-le A. Maintenant, sans plus penser absolument au fromage, appliquons à A tout ce que nous savons des quantités. Cette chose si simple, personne ne nous la disait dans cette province reculée ; depuis cette époque, l’École polytechnique et les idées de Lagrange auront reflué vers la province.

German translation (by Benjamin Morandi)

Deshalb ging ich zu Herrn Chabert. In der Tat war Herr Chabert weniger ignorant als Herr Dupuy. Bei ihm fand ich Euler und seine Probleme über die Zahl von Eiern, die eine Bäuerin zum Markt brachte, als ein Schurke ihr ein Fünftel stahl, sie dann die Hälfte des Restes hinterliest u.s.w. Es hat mir die Augen geöffnet. Ich sah was es bedeutet, das Algebra genannte Werkzeug zu benutzen. Unaufhörlich machte Herr Dupuy emphatische Sätze über dieses Thema, aber niemals dieses einfache Wort: Es ist eine Arbeitsteilung, die wie alle Arbeitsteilungen Wunder herstellt und dem Geist ermöglicht seine Kraft ganz auf eine einzige Seite von Objekten zu konzentrieren, auf eine Einzige ihrer Qualitäten. Welch Unterschied für uns, wenn uns Herr Dupuy gesagt hätte: Dieser Käse ist weich oder er ist hart; er ist weiss, er ist blau; er ist alt, er ist jung; er gehört dir, er gehört mir; er ist leicht oder er ist schwer. Bei so vielen Qualitäten betrachten wir unbedingt nur das Gewicht. Was dieses Gewicht auch sei, nennen wir es A. Jetzt, ohne unbedingt weiterhin an Käse denken zu wollen, wenden wir auf A alles an, was wir über Mengen wissen. Diese einfach Sache sagte uns niemand in dieser zurückgezogenen Provinz; von dieser Epoche an werden die École Polytechnique und die Ideen von Lagrange in die Provinz zurückgeflossen sein.

Rating: 9.5/10 (6 votes cast)

Rating: +4 (from 4 votes)

Tags: Abstraction, Cheese, Grenoble, Polytechnique, Stendhal, Style
Category: Education, Essay, General technology, Software engineering, Writing and style | 1 Comment

The rise of empirical software engineering (II): what we are still missing

31 July 2010, 18:41

(This article was initially published in the CACM blog.)

The previous post under the heading of empirical software engineering hailed the remarkable recent progress of this field, made possible in particular by the availability of large-scale open-source repositories and by the opening up of some commercial code bases.

Has the empirical side of software engineering become a full member of empirical sciences? One component of the experimental method is still not quite there: reproducibility. It is essential to the soundness of natural sciences; when you publish a result there, the expectation is that others will be able to replicate it. Perhaps such duplication does not happen as often and physicists and biologists would have us believe, but it does happen, and the mere possibility that someone could check your results (and make a name for himself, especially if you are famous, by disproving them) keeps experimenters on their toes.

If we had the same norms in empirical software engineering, empirical papers would all contain a clause such as

Hampi’s source code and documentation, experimental data, and additional results are available at http://people.csail.mit.edu/akiezun/hampi

This example is, in fact, a real quote, from a paper [1] at the 2009 ISSTA conference. It shows exactly what we expect for an experimental software engineering publication: below are my results, if you want to rerun the experiments here is the URL where you will find the code (source and binary) and the data.

Unfortunately, such professionalism is the exception rather than the rule. I performed a quick check — entirely informal, as this is a blog post, not an empirical research paper! — in the ISSTA ’09 proceedings. ISSTA, an ACM conference is a good sample point, since it covers testing (plus other approaches to program analysis) and almost every paper has an “experiment” section. I found only a very small number that, like the one cited above, give explicit reproducibility information. (Disclosure: one of those papers is ours [2].)

I believe that the situation will change dramatically and that in a few years it will be impossible to submit an empirical paper without including such information. Computer science, or at least some areas of software engineering, should actually consider themselves privileged when it comes to allowing reproducibility: all that we have to do to reproduce a result, in testing for example, is to run a program. That is easier than for a zoologist — wishing to reproduce a colleague’s experiment precisely — to gather in his lab the appropriate number of flies, chimpanzees or killer whales.

In some types of empirical software research, such as the assessment of process models or design techniques, reproducing an experiment’s setup is harder than when all you have to do is to rerun a program. But regardless of the area we must develop a true culture of reproducibility. It is not yet there. I have personally come to take experimental results with a grain of salt; not that I particulary suspect foul play, but I simply know how easy it is, in the absence of external validation, to make a mistake in the experiments and, unwittingly, publish a paper with wrong results.

Developing a culture of reproducibility also has an effect on the refereeing process. In submitting papers with precise instructions to reproduce our results, we have sometimes remarked that referees never contact us. I hope this means they always succeed; I suspect, however, that in many cases they just do not try. If you think further about the implications, providing reproducibility instructions for a submitted paper is scary: after all a software run may fail to run for marginal reasons, such as the wrong hardware configuration or a misunderstanding of the instructions. You do not want to perform all the extra work (of making your results reproducible) just to have the paper summarily rejected because the referee is running Windows 95. Ideally, then, referees should have the possibility to ask technical questions — but anonymously, since this is the way most refereeing works. Conferences and journals generally do not support such a process.

These obstacles are implementation issues, however, and will go away. What matters for the growth of the discipline is that it needs, like experimental sciences before it, to embrace a true culture of reproducibility.

References

[1] Adam Kieun, Vijay Ganesh, Philip J. Guo, Pieter Hooimeijer, Michael D. Ernst: HAMPI: A Solver for String Constraints, Proceedings of the 2009 ACM/SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’09), July 19-23, 2009, Chicago.

[2] Nadia Polikarpova, Ilinca Ciupa and Bertrand Meyer: A Comparative Study of Programmer-Written and Automatically Inferred Contracts, Proceedings of the 2009 ACM/SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’09), July 19-23, 2009, Chicago.

Rating: 6.0/10 (3 votes cast)

Rating: +2 (from 4 votes)

Category: Conference, Empirical Software Engineering, Essay, Research evaluation, Software engineering | 3 Comments

One cheer for incremental research

10 August 2009, 01:53

[Note: an updated version of this article (June 2011) appears in the Communications of the ACM blog.]

The world of research funding, always a little strange, has of late been prey to a new craze: paradigm-shift mania. We will only fund twenty curly-haired cranky-sounding visionaries in the hope that one of them will invent relativity. The rest of you — bit-players! Petty functionaries! Slaves toiling at incremental research! — should be ashamed of even asking.

Take this from the US National Science Foundation’s current description of funding for Computer Systems Research [1]:

CSR-funded projects will enable significant progress on challenging high-impact problems, as opposed to incremental progress on familiar problems.

The European Research Council is not to be left behind [2]:

Projects being highly ambitious, pioneering and unconventional

Research proposed for funding to the ERC should aim high, both with regard to the ambition of the envisaged scientific achievements as well as to the creativity and originality of proposed approaches, including unconventional methodologies and investigations at the interface between established disciplines. Proposals should rise to pioneering and far-reaching challenges at the frontiers of the field(s) addressed, and involve new, ground-breaking or unconventional methodologies, whose risky outlook is justified by the possibility of a major breakthrough with an impact beyond a specific research domain/discipline.

Frontiers! Breakthrough! Rise! Aim high! Creativity! Risk! Impact! Pass me the adjective bottle. Ground-breaking! Unconventional! Highly ambitious! Major! Far-reaching! Pioneering! Galileo and Pasteur only please — others need not apply.

As everyone knows including the people who write such calls, this is balderdash. First, 99.97% of all research (precise statistic derived from my own ground-breaking research, further funding welcome) is incremental. Second, when a “breakthrough” does happen — the remaining 0.03% — it was often not planned as a breakthrough.

Incremental research is a most glorious (I have my own supply of adjectives) mode of doing science. Beginning PhD students can be forgiven for believing the myth of the lone genius who penetrates the secrets of time and space by thinking aloud during long walks with his best friend [3]; we all, at some stage, shared that delightful delusion. But every researcher, presumably including those who go on to lead research agencies, quickly grows up and learns that it is not how things happen. You read someone else’s solution to a problem, and you improve on it. Any history of science will tell you that for every teenager who from getting hit by a falling apple intuits the structure of the universe there are hundreds of great researchers who look at the state of the art and decide they can do a trifle better.

Here is a still recent example, particularly telling because we have the account from the scientist himself. It would not be much of an exaggeration to characterize the entire field of program proving over the past four decades as a long series of variations on Tony Hoare’s 1969 Axiomatic Semantics paper [4]. Here Hoare’s recollection, from his Turing Award lecture [5]:

In October 1968, as I unpacked my papers in my new home in Belfast, I came across an obscure preprint of an article by Bob Floyd entitled “Assigning Meanings to Programs.” What a stroke of luck! At last I could see a way to achieve my hopes for my research. Thus I wrote my first paper on the axiomatic approach to computer programming, published in the Communications of the ACM in October 1969.

(See also note [6].) Had the research been submitted for funding, we can imagine the reaction: “Dear Sir, as you yourself admit, Floyd has had the basic idea [7] and you are just trying to present the result better. This is incremental research; we are in the paradigm-shift business.” And yet if Floyd had the core concepts right it is Hoare’s paper that reworked and extended them into a form that makes practical semantic specifications and proofs possible. Incremental research at its best.

The people in charge of research programs at the NSF and ERC are themselves scientists and know all this. How come they publish such absurd pronouncements? There are two reasons. One is the typical academic’s fascination with industry and its models. Having heard that venture capitalists routinely fund ten projects and expect one to succeed, they want to transpose that model to science funding; hence the emphasis on “risk”. But the transposition is doubtful because venture capitalists assess their wards all the time and, as soon as they decide a venture is not going to break out, they cut the funding overnight, often causing the company to go under. This does not happen in the world of science: most projects, and certainly any project that is supposed to break new ground, gets funded for a minimum of three to five years. If the project peters out, the purse-holder will only realize it after spending all the money.

The second reason is a sincere desire to avoid mediocrity. Here we can sympathize with the funding executives: they have seen too many “here is my epsilon addition to the latest buzzword” proposals. The last time I was at ECOOP, in 2005, it seemed every paper was about bringing some little twist to aspect-oriented programming. This kind of research benefits no one and it is understandable that the research funders want people to innovate. But telling submitters that every project has to be epochal (surprisingly, “epochal” is missing from the adjectives in the descriptions above — I am sure this will soon be corrected) will not achieve this result.

It achieves something else, good neither for research nor for research funding: promise inflation. Being told that they have to be Darwin or nothing, researchers learn the game and promise the moon; they also get the part about “risk” and emphasize how uncertain the whole thing is and how high the likelihood it will fail. (Indeed, since — if it works — it will let cars run from water seamlessly extracted from the ambient air, and with the excedent produce free afternoon tea.)

By itself this is mostly entertainment, as no one believes the hyped promises. The real harm, however, is to honest scientists who work in the normal way, proposing to bring an important contribution to the solution of an important problem. They risk being dismissed as small-timers with no vision.

Some funding agencies have kept their heads cool. How refreshing, after the above quotes, to read the general description of funding by the Swiss National Science Foundation [8]:

The central criteria for evaluation are the scientific quality, originality and project methodology as well as qualifications and track record of the applicants. Grants are awarded on a competitive basis.

In a few words, it says all there is to say. Quality, originality, methodology, and track record. Will the research be “ground-breaking” or “incremental”? We will know when it is done.

I am convinced that the other agencies will come to their senses and stop the paradigm-shift nonsense. One reason for hope is in the very excesses of the currently fashionable style. The European Research Council quote includes, by my count, nineteen ways of saying that proposals must be daring. Now it is a pretty universal rule of life that someone who finds it necessary to say the same thing nineteen times in a single paragraph does not feel sure about it. He is trying to convince himself. At some point the people in charge will realize that such hype does not breed breakthroughs; it breeds more hype.

Until that happens there is something that some of us can do: refuse to play the game. Of course we are all convinced that our latest idea is the most important one ever conceived by humankind, and we want to picture it in the most favorable light. But we should resist the promise inflation. Such honesty comes at a risk. (I still remember a project proposal, many years ago, which came back with glowing reviews: the topic was important, the ideas right, the team competent. The agency officer’s verdict: reject. The proposers are certain to succeed, so it’s not research.) For some people, there is really no choice but to follow the lead: if your entire career depends on getting external funding, no amount of exhortation will prevent you from saying what the purse-holders want to hear. But those of us who do have a choice (that is to say, will survive even if a project is rejected) should refuse the compromission. We should present our research ideas for what they are.

So: one cheer for incremental research.

Wait, isn’t the phrase supposed to be “two cheers” [9]?

All right, but let’s go at it incrementally. One and one-tenth cheer for incremental research.

References

[1] National Science Foundation, Division of Computer and Network Systems: Computer Systems Research (CSR), at http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=13385.

[2] European Research Council: Advanced Investigators Grant, at http://erc.europa.eu/index.cfmfuseaction=page.display&topicID=66.

[3] The Berne years; see any biography of Albert Einstein.

[4] C.A.R. Hoare: An axiomatic basis for computer programming, in Communications of the ACM, vol. 12, no 10, pages 576–580,583, October 1969.

[5] C.A.R. Hoare: The Emperor’s Old Clothes, in Communications of the ACM, vol. 24, no. 2, pages 75 – 83, February 1981.

[6] In the first version of this essay I wrote “Someone should celebrate the anniversary!”. Moshe Vardi, editor of Communications of the ACM, has informed me that the October 2009 issue will include a retrospective by Hoare on the 1969 paper. I cannot wait to see it.

[7] Robert W. Floyd: Assigning meanings to programs, in Proceedings of the American Mathematical Society Symposia on Applied Mathematics, vol. 19, pp. 19–31, 1967.

[8] Swiss National Science Foundation: Projects – Investigator-Driven Research, at http://www.snf.ch/E/funding/projects/Pages/default.aspx. Disclosure: The SNSF kindly funds some of my research.

[9] E.M. Forster: Two Cheers for Democracy, Edward Arnold, 1951.

Rating: 8.3/10 (3 votes cast)

Rating: +1 (from 3 votes)

Tags: Darwin, Einstein, Galileo, Ground-breaking, Hoare, Incremental, Research, Research funding
Category: Computer science, Essay, Policy, Research evaluation | 4 Comments

The good and the ugly

6 August 2009, 22:51

Once in a while one hits a tool that is just right. An example worth publicizing is the EasyChair system for conference management [1], which — after a first experience as reviewer — I have selected whenever I was in a position to make the choice for a new conference in recent years.

At first sight, a conference management system does not seem so hard to put together; it is in fact a traditional project topic for software engineering courses. But this apparent simplicity is deceptive, as a usable system must accommodate countless small and large needs. To take just one example, you can be a member of a program committee for a conference and also submit a paper to it; this implies strict rules about what you can see, for example reviews of other people’s papers with the referees’ names, and what you should not see. Taking care of myriad such rules and requirements requires in-depth domain knowledge about conferences, and a thorough analysis.

EasyChair is based on such an analysis. It knows what a conference is, and understands what its users need. Here for example is my login screen on EasyChair:

EasyChair knows about me: I only have one user name and one password. It knows the conferences in which I have been involved (and found them by itself). It knows about my various roles: chair, author etc., and will let me do different things depending on the role I choose.

The rest of the tool is up to the standards set by this initial screen. Granted, the Web design is very much vintage 1994; a couple of hours on the site by a professional graphics designer would not hurt, but, really, who cares? What matters is the functionality, and it is not by accident that EasyChair’s author is a brilliant logician [2]. Here is someone who truly understands the business of organizing and refereeing a conference, has translated this understanding into a solid logical model, and has at every step put himself in the shoes of the participants in the process. As a user you feel that everything has been done to make you feel comfortable and perform efficiently, while protecting you from hassle.

Because this is all so simple and natural, you might forget that the system required extensive design. If you need proof, it suffices to consider, by contrast, the ScholarOne system, which as punishment for our sins both ACM and IEEE use for their journals.

Even after the last user still alive has walked away, ScholarOne will remain in the annals of software engineering, as a textbook illustration of how not to design a system and its user interface. Not the visuals; no doubt that site had a graphics designer. But everything is designed to make the system as repellent as possible for its users. You keep being asked for information that you have already entered. If you are a reviewer for Communications of the ACM and submit a paper to an IEEE Computer Society journal, the system does not remember you, since CACM has its own sub-site; you must re-enter everything. Since your identifier is your email address, you will have two passwords with the same id, which confuses the browser. (I keep forgetting the appropriate password, which the site obligingly emails me, in clear.) IEEE publications have a common page, but here is how it looks:

See the menu on the right? It is impossible to see the full names of each of the “Transactio…”. (No tooltips, of course.) Assume you just want to know what one of them is, for example “th-cs”: if you select it you are prompted to provide all kinds of information (which you have entered before for other publications), before you can even proceed.

This user interface design (the minuscule menu, an example of what Scott Meyers calls the “Keyhole problem” [3]) is only a small part of usability flaws that plague the system. The matter is one of design: the prevailing viewpoint is that of the designers and administrators, not the users. I was not really surprised when I found out that the system comes from the same source as the ISI Web of Science system (which should never be used for computer science, see [4]).

It is such a pleasure in contrast to see a system like EasyChair — for all I know a one-man effort — with its attention to user needs, its profound understanding of the problem domain, and its constant improvements over the years.

References

[1] EasyChair system, at http://www.easychair.org.

[2] Andrei Voronkov, http://www.voronkov.com/.

[3] Scott Meyers, The Keyhole Problem, at http://www.aristeia.com/TKP/draftPaper.pdf; see also slides at http://se.ethz.ch/~meyer/publications/OTHERS/scott_meyers/keyhole.pdf

[4] Bertrand Meyer, Christine Choppy, Jan van Leeuwen, Jørgen Staunstrup: Research Evaluation for Computer Science, in Communications of the ACM, vol. 52, no. 4, pages 131-134, online at http://portal.acm.org/citation.cfm?id=1498765.1498780 (requires subscription). Longer version, available at http://www.informatics-europe.org/docs/research_evaluation.pdf (free access).

Rating: 0.0/10 (0 votes cast)

Rating: +1 (from 1 vote)

Tags: ACM, Conferences, IEEE, Journals
Category: Computer science, Essay, Research evaluation, Software engineering, User interface design | Comment

Computer technology: making mozzies out of betties

2 August 2009, 22:11

Are you a Beethoven or a Mozart? If you’ll pardon the familarity, are you more of a betty or more of a mozzy? I am a betty. I am not referring to my musical abilities but to my writing style; actually, not the style of my writings (I haven’t completed any choral fantasies yet) but the style of my writing process. Mozart is famous for impeccable manuscripts; he could be writing in a stagecoach bumping its way through the Black Forest, on the kitchen table in the miserable lodgings of his second, ill-fated Paris trip, or in the antechamber of Archbishop Colloredo — no matter: the score comes out immaculate, not reflecting any of the doubts, hesitations and remorse that torment mere mortals.

Beethoven’s music, note-perfect in its final form, came out of a very different process. Manuscripts show notes overwritten, lines struck out in rage, pages torn apart. He wrote and rewrote and gave up and tried again and despaired and came back until he got it the way it had to be.

How I sympathize! I seldom get things right the first time, and when I had to use a pen and paper I almost never could produce a clean result; there always was one last detail to change. As soon as I could, I got my hands on typewriters, which removed the effects of ugly handwriting, but did not solve the problem of second thoughts followed by third thoughts and many more. Only with computers did it become possible to work sensibly. Even with a primitive text editor, the ability to try out ideas then correct and correct and correct is a profound change of the creation process. Once you have become used to the electronic medium, using a pen and paper seems as awkard and insufferable as, for someone accustomed to driving a car, being forced to travel in an oxen cart.

This liberating effect, the ability to work on your creations as a sculptor kneading an infinitely malleable material, is one of the greatest contributions of computer technology. Here we are talking about text, but the effect is just as profound on other media, as any architect or graphic artist will testify.

The electronic medium does not just give us more convenience; it changes the nature of writing (or composing, or designing). With paper, for example, there is a great practical difference between introducing new material at the end of the existing text, which is easy, and inserting it at some unforeseen position, which is cumbersome and sometimes impossible. With computerized tools, it doesn’t matter. The change of medium changes the writing process and ultimately the writing: with paper the author ends up censoring himself to avoid practically painful revisions; with software tools, you work in whatever order suits you.

Technical texts, with their numbered sections and subsections, are another illustration of the change: with a text processor you do not need to come up with the full plan first, in an effort to avoid tedious renumbering later. You will use such a top-down scheme if it fits your natural way of working, but you can use any other one you like, and renumber the existing sections at the press of a key. And just think of the pain it must have been to produce an index in the old days: add a page (or, worse, a paragraph, since it moves the following ones in different ways) and you would have to recheck every single entry.

Recent Web tools have taken this evolution one step further, by letting several people revise a text collaboratively and concurrently (and, thanks to the marvels of longest-common-subsequence algorithms and the resulting diff tools, retreat to an earlier version if in our enthusiasm to change our design we messed it up) . Wikis and Google Docs are the most impressive examples of these new techniques for collective revision.

Whether used by a single writer or in a collaborative development, computer tools have changed the very process of creation by freeing us from the tyranny of physical media and driving to zero the logistic cost of one or a million changes of mind. For the betties among us, not blessed with an inborn ability to start at A, smoothly continue step by step, and end at Z, this is a life-changer. We can start where we like, continue where we like, and cover up our mistakes when we discover them. It does not matter how messy the process is, how many virtual pages we tore away, how much scribbling it took to bring a paragraph to a state that we like: to the rest of the world, we can present a result as pristine as the manuscript of a Mozart concerto.

These advances are not appreciated enough; more importantly, we do not take take enough advantage of them. It is striking, for example, to see that blogs and other Web pages too often remain riddled with typos and easily repairable mistakes. This is undoubtedly because the power of computer technology tempts us to produce ever more documents and in the euphoria to neglect the old ones. But just as importantly that technology empowers us to go back and improve. The old schoolmaster’s advice — revise and revise again [1] — can no longer be dismissed as an invitation to fruitless perfectionism; it is right, it is fun to apply, and at long last it is feasible.

Reference

[1] “Vingt fois sur le métier remettez votre ouvrage” (Twenty times back to the loom shall you bring your design), Nicolas Boileau

Rating: 8.7/10 (3 votes cast)

Rating: +2 (from 2 votes)

Tags: Beethoven, Change, Computer-Aided Design, Mozart, Technology
Category: Essay, General technology, Writing and style | 1 Comment

The understated breakthrough

31 July 2009, 02:15

You must have seen articles in that genre: the author collects a few imprudent technology predictions from the past and pokes fun at their authors, the more prestigious the better. Favorites — copy-pasted, with or without fact-checking, from one such piece to another — are Lord Kelvin’s 1895 pronouncement that “Heavier-than-air flying machines are impossible” and Bill Gates’s comment (apocryphal, but who cares?) that no one needs more than 640KB of memory. Run a Bing search for something like “wrong technology predictions” and you will find dozens of such collections.

A few years ago they often derided turn-of-the-previous-century visions of communication through a videophone, as in this depiction by Villemard:

A bit one-sided in its view of gender roles, but delightful and amusing. “Look what in 1910 they thought the future would bring!” In 2009 it is no longer laughable: the future is here, and — other than display techniques and women’s fashions — it is exactly this.

I am amazed by the lack of hype that has been associated with Skype. With close to 500 million accounts, and 17 million of them connected at a typical time, it is not exactly a secret; but it is remarkable how little buzz it gets in the media and in our collective consciousness. The company itself seems to be more interested in getting the job done than in glamour. If we look at the substance, however, few technologies come to mind from the past few decades that have influenced people’s lives so directly and beneficially. From the launch of Skype in 2003 it became possible to have free calls worldwide; then in 2005 free video was included. Suddenly this videotelephony, fodder for visionaries and cartoonists, became available to anyone with an Internet connection — for free! Almost as unbelievable as the technical feat is that this all happened without headlines, without grandiose pronouncements, and without any forewarning by the technology pundits. Almost overnight we take a giant collective step, and we act as if nothing happened.

With and without video, Skype has had a profound effect on the daily communications of countless people. It also has many professional applications; in an article of last year [2] I described how we use it, together with other technologies such as Google Docs, to turn the old “Code inspection” of software engineering into something far more useful to the project and attractive to the participants.

As always with a breakthrough technology, you find the naysayers; in the universities of a large Western country, system administrators have — can anyone believe this? — banned the use of Skype, invoking some mysterious and unsubstantiated security risks. And certainly Skype is not perfect; we still get the occasional dropped call. What is more relevant than the occasional annoyance is how well the technology is designed; I have used the audio part, with quite reasonable performance, on a dial-up line from a remote location. And successive versions keep bringing in new wonders.

If there was an award for the highest usefulness to brouhaha ratio, Skype would be the favorite. Cheers for the most practical, least hyped technology of our time.

References

[1] Futuristic postcards by Villemard, from an exhibition.

[2] Design and Code Reviews in the Age of the Internet (preprint of an article in Communications of the ACM, Sept. 2008).

Rating: 8.5/10 (4 votes cast)

Rating: +4 (from 4 votes)

Tags: Breakthroughs, Hype, Skype, Technology, Videophone
Category: Essay, General technology | 2 Comments

The French School of Programming (1)
- JSoftwareEngineer: I read the foreword and preface available as a sample on Amazon’s page. This sounds of great...
Creative writing (1)
- Alcofibras: et traduire “eventually” par “éventuellement ça peut servir en popolitique. Etudiant,...
Le français dans le monde (1)
- Alcofibras: et “méthodologie” ! “méthode” c’est trop raide, contraignant, pas assez...
Descente aux enfers (1)
- Alcofibras: Bonjour, Je suis un vieux lecteur du Baudouin/Meyer et de tous vos livres. Je ne sais si vous avez eu les...
A remarkable group photo (1)
- JSoftwareEngineer: Remarkable, indeed. It would be nice if someone (doesn’t have to be Dr. Meyer) would provide...
The “NATO expansion” canard (2)
- jtgm: I would agree with Bybor’s comment that NATO isn’t a peaceful organization and that Russia is...
- bybor@yandex.ru: I am Russian, I hate this “NATO expansion” argument, but I curios that you just defy it....
Introduction to the Theory of Programming Languages: full book now freely available (1)
- mas: I have created a slightly modified version of ITPL posted as of 2022.09.30 by: extracting the embedded PDF...
Introduction to axiomatic semantics (2)
- Bertrand Meyer: Thank you very much for noticing this! I was playing with my ETH site and made a false move....
- nikita: Both pdf links are broken for me (land to 404 page).

1. Scenarios versus OO

2. Oh no, not stacks again

3. Avoiding premature ordering decisions

4. Logical constraints are more general than sequential orderings

5. What use for scenarios?

6. Where do scenarios fit?

Reference

1. Introduction and attempt #1

Notes to section 1

2. Attempt #2

3. Attempt #3

3. Attempt #4 (also includes 3′)

5. Attempt #5

6. Attempt #6

References for section 6

7. Using a program prover

Note and references for section 7

8. Understanding the proof

References and notes for section 8

9. Lessons learned

References

References and note

1 Idiosyncratic

2 Lucky

3 Smart

4 Human

5 Diligent

6 Focused

7 Eloquent

8 Open

9 Selfish

10. Generous

Context: multigraphs

Data structures

Getting things right

Help is available

More properties

There is more to contracts

The concrete benefits

The difference

Tests and proofs

References

1 The rise of empirical software engineering

2 Limits of empirical research

3 Beyond the lamppost

4 A horror story

5 Addressing the issues that matter

Acknowledgments

References and notes

References

Note

References

References

1. Steps in software verification

2. The role of a Domain Theory

3. Maximum computed from both ends

4. A specification at the right level of abstraction

5. Components of a Domain Theory

6. Results, not just definitions

7. Quantifiers are evil

8. Even the simplest examples…

9. From Domain Theory to domain library

10. DSL libraries for specifications

11. A sound and necessary engineering practice

References

Reference

References

References

References

References

References

Original French text

German translation (by Benjamin Morandi)

References

Projects being highly ambitious, pioneering and unconventional

References

References

Reference

References

Sites