On Distribution Shift in Learning-based Bug Detectors

Cited by: 0

Authors:

He, Jingxuan [1]

Beurer-Kellner, Luca [1]

Vechev, Martin [1]

Affiliations:

[1] Swiss Fed Inst Technol, Dept Comp Sci, Zurich, Switzerland

Source:

INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162 | 2022

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Deep learning has recently achieved initial success in program analysis tasks such as bug detection. Lacking real bugs, most existing works construct training and test data by injecting synthetic bugs into correct programs. Despite achieving high test accuracy (e.g., >90%), the resulting bug detectors are found to be surprisingly unusable in practice, i.e., <10% precision when used to scan real software repositories. In this work, we argue that this massive performance difference is caused by a distribution shift, i.e., a fundamental mismatch between the real bug distribution and the synthetic bug distribution used to train and evaluate the detectors. To address this key challenge, we propose to train a bug detector in two phases, first on a synthetic bug distribution to adapt the model to the bug detection domain, and then on a real bug distribution to drive the model towards the real distribution. During these two phases, we leverage a multi-task hierarchy, focal loss, and contrastive learning to further boost performance. We evaluate our approach extensively on three widely studied bug types, for which we construct new datasets carefully designed to capture the real bug distribution. The results demonstrate that our approach is practically effective and successfully mitigates the distribution shift: our learned detectors are highly performant on both our test set and the latest version of open source repositories. Our code, datasets, and models are publicly available at https://github.com/eth-sri/learning-real-bug-detector.
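The abstract names focal loss as one ingredient but gives no formula. As a minimal sketch, the standard binary focal loss can be written as below; it down-weights examples the model already classifies confidently, which is useful when real bugs are rare compared with correct code. The function names, the parameters `gamma` and `alpha`, and their default values are assumptions for illustration and are not taken from the paper or its repository.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Plain binary cross-entropy for one example.

    p: predicted probability that the sample is buggy (hypothetical model output)
    y: ground-truth label, 1 = buggy, 0 = correct
    """
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)  # clamp to avoid log(0)
    return -math.log(p_t)


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    gamma and alpha are illustrative defaults, not values from the paper.
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, eps), 1.0)               # clamp to avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` and `alpha=0.5` the function reduces to half the ordinary cross-entropy; raising `gamma` shrinks the contribution of easy examples more strongly than that of hard ones.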


Pages: 22
