On the Impact of Sample Duplication in Machine-Learning-Based Android Malware Detection

Cited by: 42
Authors
Zhao, Yanjie [1]
Li, Li [1]
Wang, Haoyu [2]
Cai, Haipeng [3]
Bissyande, Tegawende F. [4]
Klein, Jacques [4]
Grundy, John [1]
Affiliations
[1] Monash Univ, Wellington Rd, Clayton, Vic 3800, Australia
[2] Beijing Univ Posts & Telecommun, 10 Xitucheng Rd, Beijing 100876, Peoples R China
[3] Washington State Univ, Pullman, WA 99163 USA
[4] Univ Luxembourg, 2 Ave Univ, L-4365 Esch Sur Alzette, Luxembourg
Funding
Australian Research Council; National Natural Science Foundation of China; European Union Horizon 2020
Keywords
Duplication; dataset; machine learning; Android; malware detection
DOI
10.1145/3446905
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Malware detection at scale in the Android realm is often carried out using machine learning techniques. State-of-the-art approaches such as DREBIN and MaMaDroid are reported to yield high detection rates when assessed against well-known datasets. Unfortunately, such datasets may include a large portion of duplicated samples, which may bias recorded experimental results and insights. In this article, we perform extensive experiments to measure the performance gap that occurs when datasets are de-duplicated. Our experimental results reveal that duplication in published datasets has a limited impact on supervised malware classification models. This observation contrasts with the finding of Allamanis on the general case of machine learning bias for big code. Our experiments, however, show that sample duplication more substantially affects unsupervised learning models (e.g., malware family clustering). Nevertheless, we argue that our fellow researchers and practitioners should always take sample duplication into consideration when performing machine-learning-based (via either supervised or unsupervised learning) Android malware detection, no matter how significant the impact might be.
Pages: 38