CPC: Automatically Classifying and Propagating Natural Language Comments via Program Analysis

Citations: 30
Authors
Zhai, Juan [1 ,2 ]
Xu, Xiangzhe [3 ]
Shi, Yu [1 ]
Tao, Guanhong [1 ]
Pan, Minxue [3 ]
Ma, Shiqing [2 ]
Xu, Lei [3 ]
Zhang, Weifeng [4 ]
Tan, Lin [1 ]
Zhang, Xiangyu [1 ]
Affiliations
[1] Purdue Univ, W Lafayette, IN 47907 USA
[2] Rutgers State Univ, Piscataway, NJ USA
[3] Nanjing Univ, Nanjing, Peoples R China
[4] Nanjing Univ Posts & Telecommun, Nanjing, Peoples R China
DOI
10.1145/3377811.3380427
Chinese Library Classification
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Code comments provide abundant information that has been leveraged to help perform various software engineering tasks, such as bug detection, specification inference, and code synthesis. However, developers are less motivated to write and update comments, which makes leveraging comments for software engineering tasks infeasible and error-prone. In this paper, we propose to leverage program analysis to systematically derive, refine, and propagate comments. For example, by propagation via program analysis, comments can be passed on to code entities that are not commented, so that code bugs can be detected by leveraging the propagated comments. Developers usually comment on different aspects of code elements such as methods, and use comments to describe various contents, such as functionalities and properties. To utilize comments more effectively, a fine-grained and elaborated taxonomy of comments and a reliable classifier to automatically categorize a comment are needed. We therefore build a comprehensive taxonomy and propose using program analysis to propagate comments. We develop a prototype, CPC, and evaluate it on 5 projects. The evaluation results demonstrate that 41573 new comments can be derived by propagation from other code locations with 88% accuracy. Among them, we can derive precise functional comments for 87 native methods that have neither existing comments nor source code. Leveraging the propagated comments, we detect 37 new bugs in large open-source projects, 30 of which have been confirmed and fixed by developers, and 304 defects in existing comments (by looking at inconsistencies between existing and propagated comments), including 12 incomplete comments and 292 wrong comments. This demonstrates the effectiveness of our approach. Our user study confirms that propagated comments align well with existing comments in terms of quality.
Pages: 1359-1371
Page count: 13