Generation-based Code Review Automation: How Far Are We?

Cited by: 6
Authors
Zhou, Xin [1 ]
Kim, Kisub [1 ]
Xu, Bowen [1 ]
Han, DongGyun [2 ]
He, Junda [1 ]
Lo, David [1 ]
Affiliations
[1] Singapore Management Univ, Singapore, Singapore
[2] Royal Holloway Univ London, London, England
Source
2023 IEEE/ACM 31ST INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION, ICPC | 2023
Funding
National Research Foundation, Singapore;
DOI
10.1109/ICPC58990.2023.00036
Chinese Library Classification (CLC)
TP31 [Computer Software];
Discipline Classification Code
081202; 0835;
Abstract
Code review is an effective software quality assurance activity; however, it is labor-intensive and time-consuming. Thus, a number of generation-based automatic code review (ACR) approaches have been proposed recently, which leverage deep learning techniques to automate various activities in the code review process (e.g., code revision generation and review comment generation). We find that previous works carry three main limitations. First, each ACR approach has been shown to be beneficial in its own evaluation, but these methods have not been comprehensively compared with one another to demonstrate superiority over their peer ACR approaches. Second, general-purpose pre-trained models such as CodeT5 have proven effective in a wide range of Software Engineering (SE) tasks, yet no prior work has investigated their effectiveness in ACR tasks. Third, prior works rely heavily on the Exact Match (EM) metric, which counts only perfect predictions and ignores the positive progress made by incomplete answers. To fill this research gap, we conduct a comprehensive study comparing the effectiveness of recent ACR tools as well as general-purpose pre-trained models. The results show that the general-purpose pre-trained model CodeT5 outperforms other models in most cases; specifically, CodeT5 outperforms the prior state-of-the-art by 13.4%-38.9% on two code revision generation tasks. In addition, we introduce a new metric, Edit Progress (EP), to quantify the partial progress made by ACR tools. The results show that the rankings of models on each task can change depending on whether EM or EP is used. Lastly, we derive several insightful lessons from the experimental results and reveal future research directions for generation-based code review automation.
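To make the EM-versus-EP contrast concrete, below is a minimal Python sketch. The abstract does not state the paper's exact EP formula, so the token-level Levenshtein distance and the normalization used here are illustrative assumptions: EP is read as the fraction of the edits separating the original code from the target revision that the prediction actually accomplishes.

# Illustrative sketch only: a token-level Levenshtein distance and an
# assumed normalization for Edit Progress (EP); the paper's exact EP
# definition is not given in this record.

def levenshtein(a, b):
    """Token-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        curr = [i]
        for j, tok_b in enumerate(b, 1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def exact_match(prediction, target):
    """EM: credits only perfect predictions."""
    return int(prediction.split() == target.split())

def edit_progress(source, prediction, target):
    """EP (assumed form): share of the required edits the model achieved.
    Negative values mean the prediction moved the code away from the target."""
    d_src = levenshtein(source.split(), target.split())
    d_pred = levenshtein(prediction.split(), target.split())
    return (d_src - d_pred) / d_src if d_src else 1.0

# A partially correct revision: 3 of the 4 needed edits are made.
src = "int add ( int a , int b ) { return a - b ; }"
tgt = "long add ( long a , long b ) { return a + b ; }"
pred = "long add ( long a , int b ) { return a + b ; }"
print(exact_match(pred, tgt))                    # 0: EM discards partial progress
print(round(edit_progress(src, pred, tgt), 2))   # 0.75: EP credits it

Under this reading, a revision that makes three of the four required edits scores 0 under EM but 0.75 under EP, which is precisely the partial progress that EM discards.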
Pages: 215 - 226
Number of pages: 12
Related Papers
50 records in total
  • [41] How Far We Have Come, How Far We Have Yet to Go in Atherosclerosis Research
    Libby, Peter
    Bornfeldt, Karin E.
    CIRCULATION RESEARCH, 2020, 126 (09) : 1107 - 1111
  • [42] Software engineering education: How far we've come and how far we have to go
    Mead, Nancy R.
    JOURNAL OF SYSTEMS AND SOFTWARE, 2009, 82 (04) : 571 - 575
  • [43] Software engineering education: How far we've come and how far we have to go
    Mead, Nancy R.
    21ST CONFERENCE ON SOFTWARE ENGINEERING EDUCATION AND TRAINING, PROCEEDINGS, 2008, : 18 - 22
  • [44] Measuring drinking practices: How far we've come and how far we need to go
    Room, R
    ALCOHOLISM-CLINICAL AND EXPERIMENTAL RESEARCH, 1998, 22 (02) : 70S - 75S
  • [45] GLQA: A Generation-based Method for Legal Question Answering
    Zhang, Weiqi
    Shen, Hechuan
    Lei, Tianyi
    Wang, Qian
    Peng, Dezhong
    Wang, Xu
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [46] Dependability benchmarking: How far are we?
    Kanoun, K
    DEPENDABLE COMPUTING, 2003, 2847 : 1 - 1
  • [47] How did we come this far?
    De Waresquiel, Emmanuel
    HISTORIA, 2019, (874) : 45 - 45
  • [48] How Far Will We See in the Future?
    Nasmyth, Kim
    MOLECULAR BIOLOGY OF THE CELL, 2010, 21 (22) : 3813 - 3814
  • [49] HOW FAR HAVE WE COME
    [Anonymous]
    HUMAN ORGANIZATION, 1956, 15 (02) : 1 - 2
  • [50] How far did we get?
    Bosman, Fred T.
    Virchows Archiv, 2013, 462 : 129 - 130