Three new papers on AI, verification, formal methods and the risks of vibe coding

In this issue: three new articles: deployed and verified; tests and proofs; LLMs and humans.

New article: formally verified, deployed systems

Drafts of the first of the three articles have been made public before (on arXiv), but it is still “new” in the sense of having been accepted for publication only recently, in ACM Computing Surveys, for which it was meant from the start since it is indeed a far-reaching survey article.

Its goal is to help address a key question of software engineering: are formal methods, particularly formal verification, practical? (We make no claim that it answers that question, but it provides a considerable amount of concrete evidence to support its discussion.) Formal verification uses mathematical techniques, implemented in verification software tools, “provers”, to prove the correctness of systems. Defenders and opponents throw abstract arguments at each other (“formal methods are inevitable if you want reliable software and they have been widely shown to be usable” versus “they are academic and cannot be applied to realistic systems!”). Although the authors are passionate about formal verification, we did not set out to prove anything (as we are aware of the limitations), but simply went around to collect and scrutinize the available facts.

Over two years, 2021-2022, we reviewed the literature and circulated a survey asking for factual information about systems that have been both formally verified and deployed for good. We applied strict selection criteria, described in the article. In the end 32 systems are reviewed in the extended version (reference below). The article also contains extended analyses of the state of the technology, the technologies involved, the success factors, the obstacles, the lessons for the future.

The issues it covers are becoming ever more relevant since (as I have argued in several places including earlier issues of this newsletter) the rise of AI-based coding makes verification critical.

The article was submitted in early 2023 based on the work performed over the previous two years. The refereeing unfortunately took a long time, but we made sure, without adding any system to the collection, to keep the information on the selected systems up to date. We have created a site (linked to in the article) to support future updates as we obtain new information, particularly as we learn of new relevant developments, formally verified and industrially deployed.

The fantastic team, which carried out enormous work, includes Li Huang (Constructor), Sophie Ebersold (Toulouse), Alexander Kogtenkov (then Eiffel Software and Constructor), Yinling Liu (Toulouse then Lorraine) and me. The full title is Lessons from Formally Verified Deployed Software Systems.

The journal version only covers, for space reasons, 11 systems. I do not have the final URL. (The paper has been accepted by ACM Computing Surveys, and we have produced and proofread the final version, but the publication information is not available yet. I will provide it when it is.) Fortunately, the full version — same basic material, just many more systems covered — is accessible, with its slightly formidable 93 pages, here on arXiv. (Thanks, Alexander! LaTeX conditional compilation facility was instrumental in allowing us to maintain both a short and a long versions, through countless revisions over some 5 years, without any copy-paste or the resulting heart attacks.)

I believe this large factual, information-rich and emotion-free synthesis will be helpful to many people wanting to form their own opinions about formal methods and their practicality.

New article: tests and proofs made complementary

Another recently accepted article will appear in Communications of the ACM. Again, I do not yet know when. (Another CACM paper waiting to be published is my viewpoint article on the combination of AI coding and verification, “From Probable to Provable”, draft available here.) The co-authors are Li Huang and Manuel Oriol. The paper is a synthesis of Li Huang's PhD work, which resulted in a good half-dozen papers on separate topics; the new one present a cohesive view, with the essential results from these previous publications and a number of newer developments.

The shortened title (for the arXiv version) is Combining Tests and Proofs for Better Software Verification. The core argument is simple: proofs (static verification of correctness of programs, performed on the program text only, without execution, and ascertaining formally whether a program satisfies its specification) and tests (executing the program on specified input and looking at the results to see if they violate certain properties) are naturally seen as opposite approaches, and have long been considered competitors, each with its partisans and opponents. The view of the article is that they can be combined for the greater benefit of verification.

In particular, modern program-proving tools often work by trying to generate counterexamples, normally hoping not to find any; but in the process of reaching this goal they do find some, which are particularly fruitful as a source of tests.

In the first application of this observation, we turn a failed proof, which is very disconcerting to a programmer (since it does not give concretely useful information) into a failed test, which is concrete and directly relevant to the programmer.

From there we go on to the automatic generation of test suites guaranteeing 100% coverage, to automatic program repair guaranteed to be correct, to better tests using loop unrolling, and even to an analysis of the widely recommended but controversial MCDC testing criterion, whose results might surprise you. Unlike the previous one, this article is short, and I think it provides important insights on modern verification.

An arXiv version is available here. I will provide a link to the official CACM version when available.

Should you trust LLMs for programming?

Contrary to the previous two, the last paper cited here has not been published. In fact it has already been rejected by two reputable conferences. I understand the reasons but am also a bit frustrated because I think this is the kind of paper that brings a significant contribution, even with its limitations (which it clearly states itself). It is under submission again and of course there is always arXiv.

This article needed a large group of authors for reasons that you will readily see. It is entitled Do AI models help produce verified bug fixes? and is by Li Huang (do we see a pattern here?), Ilgiz Mustafin, Marco Piccioni, Alessandro Schena, Reto Weber and me.We wanted to see how good AI-based techniques are at helping write programs, more specifically at correcting buggy programs.

We gave some buggy programs to a not insignificant set of programmers — 25 of them — divided into two groups, one using LLMs and the other not. We had a bit of bad luck because 29 people had firmly committed to participating (the test could be run any time over a period of several days) but 4 disappeared without warning and they were all in the LLM group (unpredictably, since the assignment was random), which introduced some imbalance. Still, the results were quite telling.

Apart from the inevitable idiot who rejected the paper because the work used Eiffel instead of Python or Java, the referees' criticism was about obvious limitations clearly analyzed in the paper: the sample size (although it is not ridiculous) and the use of the tools available at the time, namely GPT 4o-mini). Balancing these issues, however, is the depth of the analysis: all sessions (typically two hours) were fully recorded, and the authors collectively watched the entire video recordings. This Herculean effort (even with a sizable group of authors) enabled us to analyze in detail the path and development process of all the participants. The analysis was also very careful, slicing and dicing the participants' characteristics (including their programming experience) and their influence on the results. This feature also explains why it would not be possible, or at least not easy, to redo an experiment correcting the features (in my view minor) that lead people to questioning the results. LLM principles, for example, are still the same as those of a year ago, even if tools have improved.

I believe that even with the issues mentioned the paper highlights important information about AI-based (“vibe”) software development and its limits. We identified a number of important phenomena which I have not seen analyzed anywhere else in any depth. One of them is the “hallucination loop”, where the LLM gets off on the wrong tangent, producing results that do not work, then attempts to correct them, without questioning the validity of its flawed initial idea, and digs up a deeper and deeper hole, hallucinating even more, with the inevitable result that the user eventually gives up in disgust. (What do I hear? Holes do not have tangents? DON'T YOU DARE MESS UP WITH MY METAPHORS!)

This paper may or may not eventually find a publication venue but I believe that its lessons are significant and that anyone considering deploying vibe coding had better become familiar with them.

Cover photo: Bring us our daily bread. Dourdan, France, 2014