Understanding and Mitigating Label Bias in Malware Classification: An Empirical Study

Cited by: 3
Authors
Yan, Jia [1, 2]
Jia, Xiangkun [1]
Ying, Lingyun [3]
Yan, Jia [1, 2]
Su, Purui [1, 4]
Affiliations
[1] Chinese Acad Sci, TCA SKLCS Inst Software, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing, Peoples R China
[3] QI ANXIN Technol Grp Inc, Beijing, Peoples R China
[4] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China
Keywords
Malware Classification; machine learning; annotation bias; FRAMEWORK; WRAPPER;
DOI
10.1109/QRS57517.2022.00057
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Subject classification codes
081202; 0835
Abstract
Machine learning techniques are promising for malware classification, but label bias in the annotation process is a neglected problem that degrades performance in practice. To understand label bias and the existing solutions, we conduct an empirical study based on two Portable Executable (PE) malware datasets (the open-source BODMAS dataset with 52,793 samples and a newly collected MAIN dataset with 153,811 samples) and 67 anti-virus (AV) engines in VirusTotal. We first show two forms of label bias: chaotic naming rules and annotation inconsistency. We then evaluate two existing solutions (selecting one reputable AV engine, and aggregating multiple labels by majority voting) and find that they face problems of feature preference and engine independence. Finally, we propose recommendations for improvement and obtain a 7.79 percentage-point increase in the F1 score (from 84.83% to 92.62%). The dataset will be open-sourced for further study.
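As an illustration of the second solution named in the abstract (aggregating multiple AV labels by majority voting), here is a minimal Python sketch. It is not the authors' implementation: the function name `majority_vote`, the `min_votes` threshold, the tie-handling rule, and the engine names in the usage example are all assumptions made for illustration, and it presumes each engine's raw detection string has already been reduced to a family name.

```python
from collections import Counter
from typing import Mapping, Optional


def majority_vote(
    engine_labels: Mapping[str, Optional[str]],
    min_votes: int = 2,  # hypothetical threshold, not taken from the paper
) -> Optional[str]:
    """Return the family label reported by the most engines.

    Returns None when no engine reports a label, when the top label has
    fewer than `min_votes` votes, or when the top two labels are tied.
    """
    # Engines that did not detect the sample (None or empty string) do not vote.
    votes = Counter(
        label.strip().lower() for label in engine_labels.values() if label
    )
    if not votes:
        return None

    ranked = votes.most_common(2)
    top_label, top_count = ranked[0]
    if top_count < min_votes:
        return None
    # A tie between the two most common labels is treated as "no consensus".
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None
    return top_label


# Usage with hypothetical engine names and already-normalized family labels.
sample = {
    "EngineA": "Emotet",
    "EngineB": "emotet",
    "EngineC": "Trickbot",
    "EngineD": None,
}
print(majority_vote(sample))
```

Each engine counts as one equal vote here, which implicitly treats the engines as independent; the abstract identifies engine independence as one of the problems this kind of aggregation faces.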
Pages: 492-503
Number of pages: 12