More sizzle than steak? Using an LLM to produce verified bug fixes (new preprint)

New article: “Do AI models help produce verified bug fixes?” (Li Huang, Ilgiz Mustafin, Marco Piccioni, Alessandro Schena, Reto Weber and Bertrand Meyer), submitted for publication, preprint available on arXiv.

Automatic Program Repair (APR) involves four steps:

1. Locating the bug.
2. Producing candidate corrections.
3. Validating them (to make sure that they do correct the problem).
4. Selecting the best.
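
The four steps above can be sketched as a small pipeline. This is only a toy illustration, not the tooling used in the paper: the routine being repaired (an absolute-value function), the `Candidate` class, and the functions `locate`, `propose`, `validate`, `select` and `repair` are all hypothetical names introduced here, and validation is a bounded check of a postcondition rather than a proof.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate fix: its source text and an executable form."""
    source: str
    run: Callable[[int], int]


def locate(program: str) -> str:
    # Step 1: locate the bug. Hard-coded here; a real tool would analyze the program.
    return "body of absolute"


def propose(location: str) -> List[Candidate]:
    # Step 2: produce candidate corrections (hand-written for this toy example).
    return [
        Candidate("x", lambda x: x),
        Candidate("-x", lambda x: -x),
        Candidate("x if x >= 0 else -x", lambda x: x if x >= 0 else -x),
        Candidate("max(x, -x)", lambda x: max(x, -x)),
    ]


def validate(candidate: Candidate) -> bool:
    # Step 3: keep only candidates satisfying the postcondition
    # "result >= 0 and result is x or -x", checked on a bounded range only.
    return all(
        candidate.run(x) >= 0 and candidate.run(x) in (x, -x)
        for x in range(-50, 51)
    )


def select(valid: List[Candidate]) -> Optional[Candidate]:
    # Step 4: select the best; "best" is assumed here to mean shortest source.
    return min(valid, key=lambda c: len(c.source)) if valid else None


def repair(program: str) -> Optional[Candidate]:
    location = locate(program)
    candidates = propose(location)
    valid = [c for c in candidates if validate(c)]
    return select(valid)
```

Calling `repair("absolute")` discards the first two candidates in step 3 and picks the shorter of the two remaining ones in step 4. In a study like the one described here, step 2 is the part an LLM would take over.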

We have been working on the topic for many years, producing some of the earliest articles and recently continuing with new ideas [1]. A team member remarked at some point, somewhat dejectedly, that for step 2, producing fixes, our most carefully crafted techniques are not necessarily better than what we get by just asking ChatGPT. I tried it, and indeed the first results were mind-blowing. We decided to settle the question and conducted a study.

The result is the article linked to above. It suffers from some obvious limitations: the study sample (25 participants), while respectable, is small; partly as a result, there is no advanced statistical analysis; and participants used only one LLM (GPT-4o mini). Still, I believe the results are interesting; they were definitely a surprise (and a disappointment) for me.

The paper’s contributions also include: the identification of 7 “personalities” of LLM use (the “collaborator”, the “copy-paster” etc.); practical advice on how best to use an LLM for debugging; the methodology that we used to assess the results.

Two distinctive features are worth pointing out:

The use of a program prover for fix validation (step 3 above). The sample programs are written with Eiffel contracts and verified through the AutoProof prover. So we can say without any qualms whether a proposed bug fix is correct or not. (Typical APR work uses tests for that purpose, but a passing test only shows that the fix works on the inputs tried; it does not establish correctness.)
Full recordings. We recorded every participant’s shared screen session. That approach meant a lot of manual work (watching some 50 hours of video altogether) and might not scale to larger studies, but it gave us insights, which we tried to summarize in the paper, into how people make use of AI support.
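
To see why a passing test suite is weaker evidence than a check against a contract, consider the following minimal sketch. It is not Eiffel and does not use AutoProof: `candidate_max`, `postcondition`, `passes_tests` and `find_counterexample` are hypothetical names, and the exhaustive search over small inputs is only a crude stand-in for a prover, which reasons about all inputs rather than enumerating some of them.

```python
from itertools import product
from typing import Callable, List, Optional


def candidate_max(xs: List[int]) -> int:
    """A plausible-looking but wrong 'fix': only looks at the first and last elements."""
    return max(xs[0], xs[-1])


def postcondition(xs: List[int], result: int) -> bool:
    # Contract for "maximum of a non-empty list":
    # the result is an element of xs and no element exceeds it.
    return result in xs and all(result >= x for x in xs)


def passes_tests(f: Callable[[List[int]], int]) -> bool:
    # A small hand-written test suite, as test-based validation might use.
    tests = [([1, 2, 3], 3), ([5, 1], 5), ([4], 4)]
    return all(f(xs) == expected for xs, expected in tests)


def find_counterexample(
    f: Callable[[List[int]], int], max_len: int = 3
) -> Optional[List[int]]:
    # Check the postcondition on every list of length 1..max_len over {0, 1, 2}.
    # Returns the first violating input, or None if there is none in that domain.
    for n in range(1, max_len + 1):
        for combo in product(range(3), repeat=n):
            xs = list(combo)
            if not postcondition(xs, f(xs)):
                return xs
    return None
```

Here `passes_tests(candidate_max)` is `True`, yet `find_counterexample(candidate_max)` returns `[0, 1, 0]`, an input on which the contract is violated. The built-in `max` passes both checks.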

This study is not the final word, but it raises important considerations about the use of AI support for programming and verification in the current state of AI technology.

References

[1] Li Huang, Bertrand Meyer, Ilgiz Mustafin and Manuel Oriol: Execution-free Program Repair, in FSE 2024, Proceedings of the ACM International Conference on the Foundations of Software Engineering, Porto de Galinhas, Brazil, 15-19 July 2024, preprint available.
