SFEM: Structural feature extraction methodology for the detection of malicious office documents using machine learning methods

Cited by: 54
Authors
Cohen, Aviad [1, 2]
Nissim, Nir [1, 2]
Rokach, Lior [1, 2]
Elovici, Yuval [1, 2]
Affiliations
[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel
[2] Ben Gurion Univ Negev, Cyber Secur Res Ctr, Malware Lab, IL-84105 Beer Sheva, Israel
Keywords
Machine learning; Malware detection; Static analysis; Structural features; Microsoft Office Open XML; Document; PDF files; Classification
DOI
10.1016/j.eswa.2016.07.010
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Office documents are used extensively by individuals and organizations. Most users consider these documents safe for use. Unfortunately, Office documents can contain malicious components and perform harmful operations. Attackers increasingly take advantage of naive users and leverage Office documents in order to launch sophisticated advanced persistent threat (APT) and ransomware attacks. Recently, targeted cyber-attacks against organizations have been initiated with emails containing malicious attachments. Since most email servers do not allow executable files to be attached to emails, attackers prefer to use non-executable files (e.g., documents) for malicious purposes. Existing anti-virus engines primarily use signature-based detection methods, and therefore fail to detect new, unknown malicious code that has been embedded in an Office document. Machine learning methods have been shown to be effective at detecting known and unknown malware in various domains; however, to the best of our knowledge, machine learning methods have not been used for the detection of malicious XML-based Office documents (*.docx, *.xlsx, *.pptx, *.odt, *.ods, etc.). In this paper we present a novel structural feature extraction methodology (SFEM) for XML-based Office documents. SFEM extracts discriminative features from documents, based on their structure. We leveraged SFEM's features with machine learning algorithms for effective detection of malicious *.docx documents. We extensively evaluated SFEM with machine learning classifiers using a representative collection (16,938 *.docx documents collected "from the wild") which contains ~4.9% malicious and ~95.1% benign documents. We examined 1,600 unique configurations based on different combinations of feature extraction, feature selection, feature representation, top-feature selection methods, and machine learning classifiers.
The results show that machine learning algorithms trained on features provided by SFEM successfully detect new unknown malicious *.docx documents. The Random Forest classifier achieves the highest detection rates, with an AUC of 99.12% and true positive rate (TPR) of 97% that is accompanied by a false positive rate (FPR) of 4.9%. In comparison, the best anti-virus engine achieves a TPR which is 25% lower. © 2016 Elsevier Ltd. All rights reserved.
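The abstract says only that SFEM extracts features "based on their structure" and does not spell out the procedure. As a rough illustration, and not the authors' implementation, the sketch below assumes that a structural feature is a directory or file path inside the ZIP container of a *.docx file, represented as a binary present/absent vector. The helper names `structural_paths` and `make_sample` and the sample member names are hypothetical.

```python
import io
import zipfile


def structural_paths(docx_bytes: bytes) -> set[str]:
    """Return every directory and file path found inside the archive."""
    features = set()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            parts = [p for p in name.split("/") if p]
            # Add each path prefix so that directories count as features too.
            for depth in range(1, len(parts) + 1):
                features.add("/".join(parts[:depth]))
    return features


def make_sample(member_names: list[str]) -> bytes:
    """Build a tiny in-memory archive standing in for a *.docx file (toy data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in member_names:
            archive.writestr(name, "<xml/>")
    return buffer.getvalue()


# Two toy documents; the second one carries an extra macro container.
benign = make_sample(["[Content_Types].xml", "word/document.xml"])
suspect = make_sample(
    ["[Content_Types].xml", "word/document.xml", "word/vbaProject.bin"]
)

# Shared vocabulary of paths, then one binary vector per document.
vocabulary = sorted(structural_paths(benign) | structural_paths(suspect))
benign_vector = [int(p in structural_paths(benign)) for p in vocabulary]
suspect_vector = [int(p in structural_paths(suspect)) for p in vocabulary]
```

Vectors of this kind could then be passed to a feature selection step and a classifier such as Random Forest, which the abstract reports as the best performer; the selection and representation methods actually evaluated are described in the paper itself.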
Pages: 324-343
Page count: 20
Related papers
50 in total
  • [21] An enhanced mechanism for malicious URL detection using deep learning and DistilBERT-based feature extraction
    Zaimi, Rania
    Eljil, Khouloud Safi
    Hafidi, Mohamed
    Lamia, Mahnane
    Nait-Abdesselam, Farid
    JOURNAL OF SUPERCOMPUTING, 2025, 81 (02)
  • [22] Feature Extraction of EEG Signals for Seizure Detection Using Machine Learning Algorthims
    Alsuwaiket, Mohammed A.
    ENGINEERING TECHNOLOGY & APPLIED SCIENCE RESEARCH, 2022, 12 (05) : 9247 - 9251
  • [23] Bone Cancer Detection Using Feature Extraction Based Machine Learning Model
    Sharma, Ashish
    Yadav, Dhirendra P.
    Garg, Hitendra
    Kumar, Mukesh
    Sharma, Bhisham
    Koundal, Deepika
    COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE, 2021, 2021
  • [24] Machine Learning Approaches to Malicious PowerShell Scripts Detection and Feature Combination Analysis
    Hung, Hsiang-Hua
    Chen, Jiann-Liang
    Ma, Yi-Wei
    JOURNAL OF INTERNET TECHNOLOGY, 2024, 25 (01) : 167 - 173
  • [25] DeMalC: A Feature-rich Machine Learning Framework for Malicious Call Detection
    Li, Yuhong
    Hou, Dongmei
    Pan, Aimin
    Gong, Zhiguo
    CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 2017 : 1559 - 1567
  • [26] Intrusion Detection System Using Feature Extraction with Machine Learning Algorithms in IoT
    Musleh, Dhiaa
    Alotaibi, Meera
    Alhaidari, Fahd
    Rahman, Atta
    Mohammad, Rami M.
    JOURNAL OF SENSOR AND ACTUATOR NETWORKS, 2023, 12 (02)
  • [27] Brake Disc Deformation Detection Using Intuitive Feature Extraction and Machine Learning
    Dozsa, Tamas
    Ori, Peter
    Szabari, Matyas
    Simonyi, Erno
    Soumelidis, Alexandros
    Lakatos, Istvan
    MACHINES, 2024, 12 (04)
  • [28] A drowsiness detection architecture using feature extraction methodology
    Daphne, R. Reena
    Raj, A. Albert
    INTERNATIONAL CONFERENCE ON MODELLING OPTIMIZATION AND COMPUTING, 2012, 38 : 959 - 963
  • [29] GLDOC: detection of implicitly malicious MS-Office documents using graph convolutional networks
    Wang, Wenbo
    Yi, Peng
    Kou, Taotao
    Han, Weitao
    Wang, Chengyu
    CYBERSECURITY, 2024, 7 (01)
  • [30] Feature mining for encrypted malicious traffic detection with deep learning and other machine learning algorithms
    Wang, Zihao
    Thing, Vrizlynn L. L.
    COMPUTERS & SECURITY, 2023, 128