Bugram: Bug Detection with N-gram Language Models

Cited by: 78
Authors
Wang, Song [1]
Chollak, Devin [1]
Movshovitz-Attias, Dana [2]
Tan, Lin [1]
Affiliations
[1] Univ Waterloo, Elect & Comp Engn, Waterloo, ON N2L 3G1, Canada
[2] Carnegie Mellon Univ, Dept Comp Sci, Pittsburgh, PA 15213, USA
Keywords
Bug Detection; Static Code Analysis; N-gram Language Model
DOI
10.1145/2970276.2970341
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
To improve software reliability, many rule-based techniques have been proposed to infer programming rules and detect violations of these rules as bugs. These rule-based approaches often rely on certain patterns appearing very frequently in a project in order to infer rules. If a pattern does not appear frequently enough, no rule is learned, and many bugs are missed as a result. In this paper, we propose a new approach, Bugram, that leverages n-gram language models instead of rules to detect bugs. Bugram models program tokens sequentially, using the n-gram language model. Token sequences from the program are then assessed according to their probability in the learned model, and low-probability sequences are marked as potential bugs. The assumption is that low-probability token sequences in a program are unusual, which may indicate bugs, bad practices, or unusual or special uses of code of which developers may want to be aware. We evaluate Bugram in two ways. First, we apply Bugram to the latest versions of 16 open source Java projects. Results show that Bugram detects 59 bugs, 42 of which are manually verified as correct: 25 are true bugs and 17 are code snippets that should be refactored. Among the 25 true bugs, 23 cannot be detected by PR-Miner. We have reported these bugs to developers; 7 have already been confirmed by developers (4 of them have already been fixed), while the rest await confirmation. Second, we further compare Bugram with three additional graph- and rule-based bug detection tools, i.e., JADET, Tikanga, and GrouMiner. We apply Bugram to 14 Java projects evaluated in these three studies. Bugram detects 21 true bugs, at least 10 of which cannot be detected by these three tools. Our results suggest that Bugram is complementary to existing rule-based bug detection approaches.
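The core idea in the abstract (learn an n-gram model over program tokens, score each token sequence, and flag the least probable ones) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the token names, the bigram order, the add-one smoothing, and the helper names `train_ngram_counts`, `sequence_log_prob`, and `rank_by_probability` are all assumptions made for illustration, and the sketch only compares sequences of equal length.

```python
from collections import Counter
from math import log


def train_ngram_counts(sequences, n):
    """Count n-grams and their (n-1)-token contexts over all token sequences."""
    ngram_counts = Counter()
    context_counts = Counter()
    vocab = set()
    for seq in sequences:
        vocab.update(seq)
        for i in range(len(seq) - n + 1):
            gram = tuple(seq[i:i + n])
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts, vocab


def sequence_log_prob(seq, n, ngram_counts, context_counts, vocab_size):
    """Sum add-one-smoothed log conditional probabilities of the n-grams in seq.

    Add-one smoothing is an assumption of this sketch, not a detail from the abstract.
    """
    total = 0.0
    for i in range(len(seq) - n + 1):
        gram = tuple(seq[i:i + n])
        numerator = ngram_counts[gram] + 1
        denominator = context_counts[gram[:-1]] + vocab_size
        total += log(numerator / denominator)
    return total


def rank_by_probability(sequences, n=2):
    """Return (log_prob, sequence) pairs, least probable first."""
    counts, contexts, vocab = train_ngram_counts(sequences, n)
    scored = [
        (sequence_log_prob(seq, n, counts, contexts, len(vocab)), seq)
        for seq in sequences
    ]
    return sorted(scored, key=lambda pair: pair[0])


# Hypothetical token sequences: five follow the common lock/update/unlock
# pattern, and one returns without unlocking.
corpus = [["lock", "update", "unlock"]] * 5 + [["lock", "update", "return"]]
ranked = rank_by_probability(corpus, n=2)
suspicious = ranked[0][1]  # the lowest-probability sequence is the candidate to inspect
```

In this toy corpus the sequence ending in `return` receives the lowest score, because the bigram `("update", "return")` is seen only once while `("update", "unlock")` is seen five times. A realistic setting would also need a tokenization of real source code and a way to compare sequences of different lengths, neither of which this sketch attempts.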
Pages: 708-719
Page count: 12