Correlating Automated and Human Evaluation of Code Documentation Generation Quality

Cited by: 10
Authors
Hu, Xing [1 ]
Chen, Qiuyuan [2 ]
Wang, Haoye [2 ]
Xia, Xin [3 ]
Lo, David [4 ]
Zimmermann, Thomas [5 ]
Affiliations
[1] Zhejiang Univ, Sch Software Technol, 1689 Jiangnan Rd, Ningbo 315048, Zhejiang, Peoples R China
[2] Zhejiang Univ, Coll Comp Sci & Technol, Rd 38 West Lake Dist, Hangzhou 310027, Zhejiang, Peoples R China
[3] Monash Univ, Fac Informat Technol, Bldg 6,29 Ancora Imparo Way,Clayton Campus, Clayton, Vic 3800, Australia
[4] Singapore Management Univ, Sch Informat Syst, 80 Stamford Rd, Singapore 178902, Singapore
[5] Microsoft Res, 1 Microsoft Way, Redmond, WA 98052 USA
Funding
National Research Foundation, Singapore; National Science Foundation (USA);
Keywords
Code documentation generation; evaluation metrics; empirical study;
DOI
10.1145/3502853
Chinese Library Classification (CLC)
TP31 [Computer Software];
Subject Classification Codes
081202; 0835;
Abstract
Automatic code documentation generation has been a crucial task in the field of software engineering. It not only relieves developers from writing code documentation but also helps them understand programs better. In particular, deep-learning-based techniques that leverage large-scale source code corpora have been widely used for code documentation generation. These works tend to use automatic metrics (such as BLEU, METEOR, ROUGE, CIDEr, and SPICE) to evaluate different models. These metrics compare generated documentation to reference texts by measuring overlapping words. Unfortunately, there is no evidence demonstrating a correlation between these metrics and human judgment. We conduct experiments on two popular code documentation generation tasks, code comment generation and commit message generation, to investigate whether such correlations exist. For each task, we replicate three state-of-the-art approaches, and the generated documentation is evaluated automatically in terms of BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. We also ask 24 participants to rate the generated documentation on three aspects (i.e., language, content, and effectiveness). Each participant is given Java methods or commit diffs along with the target documentation to be rated. The results show that the ranking of generated documentation produced by automatic metrics differs from the ranking given by human annotators. Thus, these automatic metrics are not reliable enough to replace human evaluation for code documentation generation tasks. In addition, METEOR shows the strongest correlation to the human evaluation metrics (a moderate Pearson correlation of r about 0.7). However, this is still much lower than the correlation observed between different annotators (a high Pearson correlation of r about 0.8) and the correlations reported in the literature for other tasks (e.g., Neural Machine Translation [39]). Our study points to the need to develop specialized automated evaluation metrics that correlate more closely with human evaluation for code generation tasks.
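To make the metric-versus-human correlation analysis concrete, the following minimal Python sketch (not the authors' code; the generated comments, references, and human ratings below are hypothetical) scores generated comments against references with sentence-level BLEU via NLTK and then computes Pearson's r between the metric scores and the human ratings with SciPy:

# Minimal sketch: correlate an overlap-based metric (BLEU) with human ratings.
# All data here is made up for illustration; it is not from the study.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import pearsonr

generated = [
    "returns the sum of two integers",
    "closes the open file handle",
    "parses the configuration file",
]
references = [
    "return the sum of the two given integers",
    "close the file handle if it is open",
    "load settings from the configuration file",
]
human_ratings = [4.0, 3.5, 2.0]  # hypothetical annotator scores (e.g., for "content")

smooth = SmoothingFunction().method1  # smoothing avoids zero BLEU on short texts
bleu_scores = [
    sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
    for hyp, ref in zip(generated, references)
]

r, p_value = pearsonr(bleu_scores, human_ratings)  # Pearson correlation, as in the study
print("BLEU per comment:", [round(s, 3) for s in bleu_scores])
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")

Other overlap-based metrics reported in the paper (METEOR, ROUGE-L, CIDEr, SPICE) slot into the same pattern; the study's finding is that the rankings produced by such scores can differ from the rankings given by human annotators.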
Pages: 28
Related Papers
50 items in total
  • [11] Semantically Aligned Question and Code Generation for Automated Insight Generation
    Singha, Ananya
    Chopra, Bhavya
    Khatry, Anirudh
    Gulwani, Sumit
    Henley, Austin
    Le, Vu
    Parnin, Chris
    Singh, Mukul
    Verbruggen, Gust
    2024 INTERNATIONAL WORKSHOP ON LARGE LANGUAGE MODELS FOR CODE, LLM4CODE 2024, 2024, : 127 - 134
  • [12] Automated Algorithm for Iris Detection and Code Generation
    Mohamed, M. A.
    Abou-El-Soud, M. A.
    Eid, M. M.
    2009 INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING AND SYSTEMS (ICCES 2009), 2009, : 475 - 481
  • [13] BugSpotter: Automated Generation of Code Debugging Exercises
    Padurean, Victor-Alexandru
    Denny, Paul
    Singla, Adish
    PROCEEDINGS OF THE 56TH ACM TECHNICAL SYMPOSIUM ON COMPUTER SCIENCE EDUCATION, SIGCSE TS 2025, VOL 2, 2025, : 896 - 902
  • [14] Automated code generation tools for collaboration systems
    Hartrum, Thomas C.
    CTS 2007: PROCEEDINGS OF THE 2007 INTERNATIONAL SYMPOSIUM ON COLLABORATIVE TECHNOLOGIES AND SYSTEMS, 2007, : 183 - 190
  • [16] Practitioners' Expectations on Automated Code Comment Generation
    Hu, Xing
    Xia, Xin
    Lo, David
    Wan, Zhiyuan
    Chen, Qiuyuan
    Zimmermann, Thomas
    2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2022), 2022, : 1693 - 1705
  • [17] Automated code generation for integrated layer processing
    Braun, T
    Diot, C
    PROTOCOLS FOR HIGH-SPEED NETWORK V, 1997, : 182 - 197
  • [18] Automated Code Generation for Discontinuous Galerkin Methods
    Olgaard, Kristian B.
    Logg, Anders
    Wells, Garth N.
    SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2008, 31 (02): : 849 - 864
  • [19] A Framework for Automated Quality Assurance and Documentation for Pharma 4.0
    Schmidt, Andreas
    Frey, Joshua
    Hillen, Daniel
    Horbelt, Jessica
    Schandar, Markus
    Schneider, Daniel
    Sorokos, Ioannis
    COMPUTER SAFETY, RELIABILITY, AND SECURITY (SAFECOMP 2021), 2021, 12852 : 226 - 239
  • [20] Code Quality: Examining the Efficacy of Automated Tools
    Hooshangi, Sara
    Dasgupta, Subhasish
    AMCIS 2017 PROCEEDINGS, 2017,