On the Impact of Sample Duplication in Machine-Learning-Based Android Malware Detection

Cited by: 42
Authors
Zhao, Yanjie [1]
Li, Li [1]
Wang, Haoyu [2]
Cai, Haipeng [3]
Bissyande, Tegawende F. [4]
Klein, Jacques [4]
Grundy, John [1]
Affiliations
[1] Monash Univ, Wellington Rd, Clayton, Vic 3800, Australia
[2] Beijing Univ Posts & Telecommun, 10 Xitucheng Rd, Beijing 100876, Peoples R China
[3] Washington State Univ, Pullman, WA 99163 USA
[4] Univ Luxembourg, 2 Ave Univ, L-4365 Esch Sur Alzette, Luxembourg
Funding
Australian Research Council; National Natural Science Foundation of China; European Union Horizon 2020;
Keywords
Duplication; dataset; machine learning; android; malware detection;
DOI
10.1145/3446905
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
Malware detection at scale in the Android realm is often carried out using machine learning techniques. State-of-the-art approaches such as DREBIN and MaMaDroid are reported to yield high detection rates when assessed against well-known datasets. Unfortunately, such datasets may include a large portion of duplicated samples, which may bias recorded experimental results and insights. In this article, we perform extensive experiments to measure the performance gap that occurs when datasets are de-duplicated. Our experimental results reveal that duplication in published datasets has a limited impact on supervised malware classification models. This observation contrasts with the finding of Allamanis on the general case of machine learning bias for big code. Our experiments, however, show that sample duplication more substantially affects unsupervised learning models (e.g., malware family clustering). Nevertheless, we argue that our fellow researchers and practitioners should always take sample duplication into consideration when performing machine-learning-based (via either supervised or unsupervised learning) Android malware detections, no matter how significant the impact might be.
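The de-duplication step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how duplicates are identified, so the definition used here (two samples are duplicates when they have the same feature set and the same label) is an assumption, as are the function names and the `(features, label)` sample format.

```python
def deduplicate(samples):
    """Keep the first sample for each distinct (feature set, label) pair.

    `samples` is a list of (features, label) tuples, where `features` is an
    iterable of hashable feature names. This format is hypothetical and is
    not taken from the article.
    """
    seen = set()
    unique = []
    for features, label in samples:
        # Assumption: identical feature sets with the same label are duplicates.
        key = (frozenset(features), label)
        if key not in seen:
            seen.add(key)
            unique.append((features, label))
    return unique


def duplication_rate(samples):
    """Return the fraction of samples removed by deduplicate()."""
    if not samples:
        return 0.0
    return 1 - len(deduplicate(samples)) / len(samples)
```

Training and evaluating the same model once on `samples` and once on `deduplicate(samples)` would then give one possible way to observe the kind of performance gap the abstract refers to.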
Pages: 38