On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot

Cited by: 23
Authors
Mastropaolo, Antonio [1 ]
Pascarella, Luca [1 ]
Guglielmi, Emanuela [2 ]
Ciniselli, Matteo [1 ]
Scalabrino, Simone [2 ]
Oliveto, Rocco [2 ]
Bavota, Gabriele [1 ]
Affiliations
[1] Univ Svizzera Italiana USI, SEART Software Inst, Lugano, Switzerland
[2] Univ Molise, STAKE Lab, Campobasso, Italy
Funding
European Research Council
Keywords
Empirical Study; Recommender Systems; Usage
DOI
10.1109/ICSE48619.2023.00181
Chinese Library Classification
TP31 [Computer Software]
Discipline Codes
081202; 0835
Abstract
Software engineering research has always been concerned with improving code completion approaches, which suggest the next tokens a developer will likely type while coding. The release of GitHub Copilot constitutes a big step forward, also because of its unprecedented ability to automatically generate even entire functions from their natural language description. While the usefulness of Copilot is evident, it is still unclear to what extent it is robust. Specifically, we do not know the extent to which semantic-preserving changes in the natural language description provided to the model affect the generated code. In this paper we present an empirical study aimed at understanding whether different but semantically equivalent natural language descriptions result in the same recommended function. A negative answer would raise questions about the robustness of deep learning (DL)-based code generators, since it would imply that developers using different wordings to describe the same code would obtain different recommendations. We asked Copilot to automatically generate 892 Java methods starting from their original Javadoc description. Then, we generated different semantically equivalent descriptions for each method, both manually and automatically, and analyzed the extent to which the predictions generated by Copilot changed. Our results show that modifying the description results in different code recommendations in ~46% of cases. Also, differences in the semantically equivalent descriptions might impact the correctness of the generated code (±28%).
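The core measurement in the abstract is whether two semantically equivalent descriptions yield the same recommended function. A minimal, hypothetical sketch of that comparison step (the `normalize` and `same_prediction` helpers are illustrative assumptions, not the paper's actual pipeline) might look like this:

```python
# Hypothetical sketch: decide whether Copilot returned the "same"
# prediction for two semantically equivalent descriptions, by comparing
# the generated Java methods after a simple normalization.
import re


def normalize(code: str) -> str:
    """Strip Java comments and collapse whitespace so that trivially
    different but otherwise identical predictions compare equal."""
    code = re.sub(r"//.*", "", code)                    # drop line comments
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.S)   # drop block comments
    return " ".join(code.split())                       # collapse whitespace


def same_prediction(code_a: str, code_b: str) -> bool:
    """True if two generated methods are identical after normalization."""
    return normalize(code_a) == normalize(code_b)


# Two predictions for a description like "returns the maximum of two ints":
a = "int max(int a, int b) { return a > b ? a : b; }  // ternary"
b = "int max(int a, int b) {\n    return a > b ? a : b;\n}"
print(same_prediction(a, b))  # → True
```

A stricter comparison (e.g., token- or AST-level equality, or running the paper's test suites to check correctness) would change the ~46% figure's granularity; this sketch only illustrates the idea of normalizing away superficial differences before comparing.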
Pages: 2149-2160
Page count: 12
Related Papers
50 records total
  • [1] An Empirical Evaluation of GitHub Copilot's Code Suggestions
    Nguyen, Nhan
    Nadi, Sarah
    2022 MINING SOFTWARE REPOSITORIES CONFERENCE (MSR 2022), 2022, : 1 - 5
  • [2] Assessing the Quality of GitHub Copilot's Code Generation
    Yetistiren, Burak
    Ozsoy, Isik
    Tuzun, Eray
    PROCEEDINGS OF THE 18TH INTERNATIONAL CONFERENCE ON PREDICTIVE MODELS AND DATA ANALYTICS IN SOFTWARE ENGINEERING, PROMISE 2022, 2022, : 62 - 71
  • [3] Using GitHub Copilot for Test Generation in Python: An Empirical Study
    El Haji, Khalid
    Brandt, Carolin
    Zaidman, Andy
    PROCEEDINGS OF THE 2024 IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATION OF SOFTWARE TEST, AST 2024, 2024, : 45 - 55
  • [4] CodexLeaks: Privacy Leaks from Code Generation Language Models in GitHub Copilot
    Niu, Liang
    Mirza, Shujaat
    Maradni, Zayd
    Popper, Christina
    PROCEEDINGS OF THE 32ND USENIX SECURITY SYMPOSIUM, 2023, : 2133 - 2150
  • [5] Is GitHub Copilot a Substitute for Human Pair-programming? An Empirical Study
    Imai, Saki
    2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING: COMPANION PROCEEDINGS (ICSE-COMPANION 2022), 2022, : 319 - 321
  • [6] Is GitHub's Copilot as bad as humans at introducing vulnerabilities in code?
    Asare, Owura
    Nagappan, Meiyappan
    Asokan, N.
    EMPIRICAL SOFTWARE ENGINEERING, 2023, 28 (06)
  • [7] Students' Use of GitHub Copilot for Working with Large Code Bases
    Shah, Anshul
    Chernova, Anya
    Tomson, Elena
    Porter, Leo
    Griswold, William G.
    Raj, Adalbert Gerald Soosai
    PROCEEDINGS OF THE 56TH ACM TECHNICAL SYMPOSIUM ON COMPUTER SCIENCE EDUCATION, SIGCSE TS 2025, VOL 1, 2025, : 1050 - 1056
  • [10] Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions
    Pearce, Hammond
    Ahmad, Baleegh
    Tan, Benjamin
    Dolan-Gavitt, Brendan
    Karri, Ramesh
    43RD IEEE SYMPOSIUM ON SECURITY AND PRIVACY (SP 2022), 2022, : 754 - 768