Correlating Automated and Human Evaluation of Code Documentation Generation Quality

Cited by: 10

Authors:

Hu, Xing [1]

Chen, Qiuyuan [2]

Wang, Haoye [2]

Xia, Xin [3]

Lo, David [4]

Zimmermann, Thomas [5]

Affiliations:

[1] Zhejiang Univ, Sch Software Technol, 1689 Jiangnan Rd, Ningbo 315048, Zhejiang, Peoples R China

[2] Zhejiang Univ, Coll Comp Sci & Technol, Rd 38 West Lake Dist, Hangzhou 310027, Zhejiang, Peoples R China

[3] Monash Univ, Fac Informat Technol, Bldg 6,29 Ancora Imparo Way,Clayton Campus, Clayton, Vic 3800, Australia

[4] Singapore Management Univ, Sch Informat Syst, 80 Stamford Rd, Singapore 178902, Singapore

[5] Microsoft Res, 1 Microsoft Way, Redmond, WA 98052 USA

Source:

ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY | 2022 / Vol. 31 / Issue 04

Funding:

National Research Foundation, Singapore; National Science Foundation (USA);

Keywords:

Code documentation generation; evaluation metrics; empirical study;

DOI:

10.1145/3502853

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Automatic code documentation generation has been a crucial task in the field of software engineering. It not only relieves developers from writing code documentation but also helps them understand programs better. In particular, deep-learning-based techniques that leverage large-scale source code corpora have been widely used in code documentation generation. These works tend to use automatic metrics (such as BLEU, METEOR, ROUGE, CIDEr, and SPICE) to evaluate different models. These metrics compare generated documentation to reference texts by measuring the overlapping words. Unfortunately, there is no evidence demonstrating the correlation between these metrics and human judgment. We conduct experiments on two popular code documentation generation tasks, code comment generation and commit message generation, to investigate the presence or absence of correlations between these metrics and human judgments. For each task, we replicate three state-of-the-art approaches, and the generated documentation is evaluated automatically in terms of BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. We also ask 24 participants to rate the generated documentation on three aspects (i.e., language, content, and effectiveness). Each participant is given Java methods or commit diffs along with the target documentation to be rated. The results show that the ranking of generated documentation produced by automatic metrics differs from the ranking produced by human annotators. Thus, these automatic metrics are not reliable enough to replace human evaluation for code documentation generation tasks. In addition, METEOR shows the strongest correlation with human evaluation metrics (a moderate Pearson correlation, r about 0.7). However, this is still much lower than the correlation observed between different annotators (a high Pearson correlation, r about 0.8) and the correlations reported in the literature for other tasks (e.g., Neural Machine Translation [39]).
Our study points to the need to develop specialized automated evaluation metrics that can correlate more closely to human evaluation metrics for code generation tasks.
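
The Pearson correlation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `pearson_r` and the sample scores below are hypothetical, chosen only to show how an automatic metric's per-sample scores might be compared with averaged human ratings.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one automatic-metric score and one human rating
# (e.g. on a 1-5 scale) per generated documentation sample.
metric_scores = [0.10, 0.25, 0.40, 0.55, 0.70]
human_ratings = [1, 2, 2, 4, 5]

r = pearson_r(metric_scores, human_ratings)
print(f"Pearson r = {r:.2f}")
```

Values of r near 1 would indicate that the metric ranks samples much as human raters do; the abstract reports roughly 0.7 for METEOR against roughly 0.8 between annotators.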

Pages: 28
