Decoding the Secrets of Machine Learning in Windows Malware Classification: A Deep Dive into Datasets, Features, and Model Performance

Cited by: 3

Authors:

Dambra, Savino [1]

Han, Yufei [2]

Aonzo, Simone [3]

Kotzias, Platon [1]

Vitale, Antonino [3]

Caballero, Juan [4]

Balzarotti, Davide [3]

Bilge, Leyla [1]

Affiliations:

[1] Norton Res Grp, Norton, MA 02766 USA

[2] INRIA, Paris, France

[3] Eurecom, Biot, France

[4] IMDEA Software Inst, Madrid, Spain

Source:

PROCEEDINGS OF THE 2023 ACM SIGSAC CONFERENCE ON COMPUTER AND COMMUNICATIONS SECURITY, CCS 2023 | 2023

Funding:

European Research Council;

Keywords:

malware detection; malware family classification; machine learning for malware;

DOI:

10.1145/3576915.3616589

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Many studies have proposed machine-learning (ML) models for malware detection and classification, reporting an almost-perfect performance. However, they assemble ground-truth in different ways, use diverse static- and dynamic-analysis techniques for feature extraction, and even differ on what they consider a malware family. As a consequence, our community still lacks an understanding of malware classification results: whether they are tied to the nature and distribution of the collected dataset, to what extent the number of families and samples in the training dataset influence performance, and how well static and dynamic features complement each other. This work sheds light on those open questions by investigating the impact of datasets, features, and classifiers on ML-based malware detection and classification. For this, we collect the largest balanced malware dataset so far with 67k samples from 670 families (100 samples each), and train state-of-the-art models for malware detection and family classification using our dataset. Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features. We discover no correlation between packing and classification accuracy, and that missing behaviors in dynamically-extracted features highly penalise their performance. We also demonstrate how a larger number of families to classify makes the classification harder, while a higher number of samples per family increases accuracy. Finally, we find that models trained on a uniform distribution of samples per family better generalize on unseen data.


Pages: 60-74

Page count: 15

