On the "Naturalness" of Buggy Code

Cited by: 145
Authors
Ray, Baishakhi [1]
Hellendoorn, Vincent [2]
Godhane, Saheel [2]
Tu, Zhaopeng [3]
Bacchelli, Alberto [4]
Devanbu, Premkumar [2]
Affiliations
[1] Univ Virginia, Charlottesville, VA 22903 USA
[2] Univ Calif Davis, Davis, CA 95616 USA
[3] Huawei Technol Co Ltd, Shenzhen, Guangdong, Peoples R China
[4] Delft Univ Technol, Delft, Netherlands
Funding
U.S. National Science Foundation;
Keywords
PREDICTING FAULTS;
DOI
10.1145/2884781.2884848
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Real software, the kind working programmers produce by the kLOC to solve real-world problems, tends to be "natural", like speech or natural language; it tends to be highly repetitive and predictable. Researchers have captured this naturalness of software through statistical models and used them to good effect in suggestion engines, porting tools, coding standards checkers, and idiom miners. This suggests that code that appears improbable, or surprising, to a good statistical language model is "unnatural" in some sense, and thus possibly suspicious. In this paper, we investigate this hypothesis. We consider a large corpus of bug fix commits (ca. 7,139), from 10 different Java projects, and focus on its language statistics, evaluating the naturalness of buggy code and the corresponding fixes. We find that code with bugs tends to be more entropic (i.e., unnatural), becoming less so as bugs are fixed. Ordering files for inspection by their average entropy yields cost-effectiveness scores comparable to popular defect prediction methods. At a finer granularity, focusing on highly entropic lines is similar in cost-effectiveness to some well-known static bug finders (PMD, FindBugs), and ordering warnings from these bug finders using an entropy measure improves the cost-effectiveness of inspecting code implicated in warnings. This suggests that entropy may be a valid, simple way to complement the effectiveness of PMD or FindBugs, and that search-based bug-fixing methods may benefit from using entropy both for fault-localization and searching for fixes.
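The idea of ranking code by entropy can be illustrated with a minimal sketch. It is not the language model or tooling used in the paper: it assumes whitespace tokenization and a bigram model with add-one smoothing, and the names `BigramModel`, `line_entropy`, `rank_lines`, and `file_entropy` are hypothetical. Each line is scored by its average per-token cross-entropy in bits, so lines the model finds surprising rank first; averaging line scores gives a file-level score.

```python
import math
from collections import Counter


def tokenize(line):
    # Assumption: whitespace tokenization stands in for a real Java lexer.
    return line.split()


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, corpus_lines):
        self.context_counts = Counter()  # how often a token appears as context
        self.bigram_counts = Counter()   # how often (context, token) appears
        self.vocab = {"<unk>"}           # reserve one slot for unseen tokens
        for line in corpus_lines:
            tokens = ["<s>"] + tokenize(line)
            self.vocab.update(tokens[1:])
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigram_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1

    def prob(self, prev, cur):
        # Add-one smoothing keeps unseen bigrams at a small nonzero probability.
        return (self.bigram_counts[(prev, cur)] + 1) / (
            self.context_counts[prev] + len(self.vocab)
        )

    def line_entropy(self, line):
        """Average negative log2 probability per token (bits); 0.0 if empty."""
        tokens = ["<s>"] + tokenize(line)
        if len(tokens) == 1:
            return 0.0
        bits = -sum(
            math.log2(self.prob(prev, cur))
            for prev, cur in zip(tokens, tokens[1:])
        )
        return bits / (len(tokens) - 1)


def rank_lines(model, lines):
    """Return (entropy, line) pairs, most surprising line first."""
    scored = [(model.line_entropy(line), line) for line in lines]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def file_entropy(model, lines):
    """Mean line entropy over non-empty lines, as a file-level score."""
    scores = [model.line_entropy(l) for l in lines if tokenize(l)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy training corpus; a real study would train on a project's history.
    training = ["int i = 0 ;"] * 5 + ["i = i + 1 ;"] * 5
    model = BigramModel(training)
    for entropy, line in rank_lines(model, ["int i = 0 ;", "foo bar baz qux"]):
        print(f"{entropy:.2f}  {line}")
```

In this toy setup, a line made of tokens never seen in training scores higher than a line repeated in the training corpus, which is the sense in which the abstract calls improbable code "unnatural". How scores are normalized and how inspection cost is measured are left out of the sketch.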
Pages: 428-439
Number of pages: 12
Related papers
  • [1] CBCD: Cloned Buggy Code Detector
    Li, Jingyue
    Ernst, Michael D.
    2012 34TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE), 2012, : 310 - 320
  • [2] On the Characteristics of Buggy Code Clones: A Code Quality Perspective
    Islam, Md Rakibul
    Zibran, Minhaz F.
    2018 IEEE 12TH INTERNATIONAL WORKSHOP ON SOFTWARE CLONES (IWSC), 2018, : 23 - 29
  • [3] On the Naturalness of Fuzzer-Generated Code
    Kambhamettu, Rajeswari Hita
    Billos, John
    Oluwaseun-Apo, Tomi
    Gafford, Benjamin
    Padhye, Rohan
    Hellendoorn, Vincent J.
    2022 MINING SOFTWARE REPOSITORIES CONFERENCE (MSR 2022), 2022, : 506 - 510
  • [4] Dependency-Aware Code Naturalness
    Yang, Chen
    Chen, Junjie
    Jiang, Jiajun
    Huang, Yuliang
    Proceedings of the ACM on Programming Languages, 2024, 8 (OOPSLA2)
  • [5] On the Impact of Refactoring Operations on Code Naturalness
    Lin, Bin
    Nagy, Csaba
    Bavota, Gabriele
    Lanza, Michele
    2019 IEEE 26TH INTERNATIONAL CONFERENCE ON SOFTWARE ANALYSIS, EVOLUTION AND REENGINEERING (SANER), 2019, : 594 - 598
  • [6] Scalable and Systematic Detection of Buggy Inconsistencies in Source Code
    Gabel, Mark
    Yang, Junfeng
    Yu, Yuan
    Goldszmidt, Moises
    Su, Zhendong
    ACM SIGPLAN NOTICES, 2010, 45 (10) : 175 - 190
  • [7] Toward Refactoring Evaluation with Code Naturalness
    Arima, Ryo
    Higo, Yoshiki
    Kusumoto, Shinji
    2018 IEEE/ACM 26TH INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION (ICPC 2018), 2018, : 316 - 319
  • [8] Supporting Code Review by Automatic Detection of Potentially Buggy Changes
    Fejzer, Mikolaj
    Wojtyna, Michal
    Burzanska, Marta
    Wisniewski, Piotr
    Stencel, Krzysztof
    BEYOND DATABASES, ARCHITECTURES AND STRUCTURES, BDAS 2015, 2015, 521 : 473 - 482
  • [9] Research Progress of Code Naturalness and Its Application
    Chen Z.-Z.
    Yan M.
    Xia X.
    Liu Z.-X.
    Xu Z.
    Lei Y.
    Ruan Jian Xue Bao/Journal of Software, 2022, 33 (08) : 3015 - 3034
  • [10] A Survey of Machine Learning for Big Code and Naturalness
    Allamanis, Miltiadis
    Barr, Earl T.
    Devanbu, Premkumar
    Sutton, Charles
    ACM COMPUTING SURVEYS, 2018, 51 (04)