SFEM: Structural feature extraction methodology for the detection of malicious office documents using machine learning methods

Cited by: 54
Authors
Cohen, Aviad [1 ,2 ]
Nissim, Nir [1 ,2 ]
Rokach, Lior [1 ,2 ]
Elovici, Yuval [1 ,2 ]
Affiliations
[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel
[2] Ben Gurion Univ Negev, Cyber Secur Res Ctr, Malware Lab, IL-84105 Beer Sheva, Israel
Keywords
Machine learning; Malware detection; Static analysis; Structural features; Microsoft Office Open XML; Document; MALWARE DETECTION; PDF FILES; CLASSIFICATION
DOI
10.1016/j.eswa.2016.07.010
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Office documents are used extensively by individuals and organizations. Most users consider these documents safe for use. Unfortunately, Office documents can contain malicious components and perform harmful operations. Attackers increasingly take advantage of naive users and leverage Office documents in order to launch sophisticated advanced persistent threat (APT) and ransomware attacks. Recently, targeted cyber-attacks against organizations have been initiated with emails containing malicious attachments. Since most email servers do not allow the attachment of executable files to emails, attackers prefer to use non-executable files (e.g., documents) for malicious purposes. Existing anti-virus engines primarily use signature-based detection methods, and therefore fail to detect new unknown malicious code that has been embedded in an Office document. Machine learning methods have been shown to be effective at detecting known and unknown malware in various domains; however, to the best of our knowledge, machine learning methods have not been used for the detection of malicious XML-based Office documents (*.docx, *.xlsx, *.pptx, *.odt, *.ods, etc.). In this paper we present a novel structural feature extraction methodology (SFEM) for XML-based Office documents. SFEM extracts discriminative features from documents, based on their structure. We leveraged SFEM's features with machine learning algorithms for effective detection of malicious *.docx documents. We extensively evaluated SFEM with machine learning classifiers using a representative collection (16,938 *.docx documents collected "from the wild") which contains 4.9% malicious and approximately 95.1% benign documents. We examined 1,600 unique configurations based on different combinations of feature extraction, feature selection, feature representation, top-feature selection methods, and machine learning classifiers.
The results show that machine learning algorithms trained on features provided by SFEM successfully detect new unknown malicious *.docx documents. The Random Forest classifier achieves the highest detection rates, with an AUC of 99.12% and true positive rate (TPR) of 97% that is accompanied by a false positive rate (FPR) of 4.9%. In comparison, the best anti-virus engine achieves a TPR which is 25% lower. © 2016 Elsevier Ltd. All rights reserved.
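
Illustrative sketch (not taken from the paper): the abstract says that SFEM derives features from a document's structure but does not spell out the procedure. The Python below shows one plausible reading, assuming that a "structural feature" is a path inside the *.docx ZIP container, extended with the element hierarchy of each XML part. The function names `structural_paths` and `make_sample`, the path format, and the sample archive are all hypothetical and serve only to make the idea concrete.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def structural_paths(docx_bytes):
    """Collect structural paths from an OOXML (ZIP) container.

    Assumption: a feature is either a ZIP member name or that name
    extended with the chain of XML element names below it.
    """
    paths = set()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            paths.add(name)
            if not name.endswith((".xml", ".rels")):
                continue
            try:
                # For real untrusted samples a hardened parser
                # (e.g. the defusedxml package) would be safer.
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                continue  # malformed part: keep only the member name
            stack = [(root, name + "/" + local_name(root.tag))]
            while stack:
                node, path = stack.pop()
                paths.add(path)
                for child in node:
                    stack.append((child, path + "/" + local_name(child.tag)))
    return paths


def make_sample():
    """Build a tiny in-memory archive that mimics a .docx layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "word/document.xml",
            "<w:document xmlns:w='urn:example'>"
            "<w:body><w:p/></w:body></w:document>",
        )
        archive.writestr("word/vbaProject.bin", b"\x00")
    return buffer.getvalue()


if __name__ == "__main__":
    for feature in sorted(structural_paths(make_sample())):
        print(feature)
```

Each resulting path could then be treated as one binary feature (present or absent) in a vector passed to a classifier. The feature representations and selection methods actually evaluated by the authors are described in the paper itself, not in this record.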
Pages: 324-343
Page count: 20
Related papers
50 results in total
  • [1] ALDOCX: Detection of Unknown Malicious Microsoft Office Documents Using Designated Active Learning Methods Based on New Structural Feature Extraction Methodology
    Nissim, Nir
    Cohen, Aviad
    Elovici, Yuval
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2017, 12 (03) : 631 - 646
  • [2] UFADF: A Unified Feature Analysis and Detection Framework for Malicious Office Documents
    Hu, Yang
    Chen, Jia
    Luo, Xin
    2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023, 2024, : 881 - 888
  • [3] Boosting the Detection of Malicious Documents Using Designated Active Learning Methods
    Nissim, Nir
    Cohen, Aviad
    Elovici, Yuval
    2015 IEEE 14TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2015, : 760 - 765
  • [4] Feature Selection for Malicious Detection on Industrial IoT Using Machine Learning
    Chuang, Hong-Yu
    Chen, Ruey-Maw
    SENSORS AND MATERIALS, 2024, 36 (03) : 1035 - 1046
  • [5] CADefender: Detection of unknown malicious AutoLISP computer-aided design files using designated feature extraction and machine learning methods
    Yevsikov, Alexander
    Muralidharan, Trivikram
    Panker, Tomer
    Nissim, Nir
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 138
  • [6] Structural Analysis of URL For Malicious URL Detection Using Machine Learning
    Raja, A. Saleem
    Peerbasha, S.
    Iqbal, Y. Mohammed
    Sundarvadivazhagan, B.
    Surputheen, M. Mohamed
    JOURNAL OF ADVANCED APPLIED SCIENTIFIC RESEARCH, 2023, 5 (04) : 28 - 41
  • [7] Neonates Crying Detection Through Feature Extraction and Machine Learning Methods
    Nunez-Calvo, Lucia
    Velasco-Perez, Nuria
    Lozano-Juarez, Samuel
    Herrero, Alvaro
    Arnaez, Juan
    Urda, Daniel
    HYBRID ARTIFICIAL INTELLIGENT SYSTEM, PT I, HAIS 2024, 2025, 14857 : 275 - 285
  • [8] Feature Entropy Estimation (FEE) for Malicious IoT Traffic and Detection Using Machine Learning
    Diwan, Tarun Dhar
    Choubey, Siddartha
    Hota, H. S.
    Goyal, S. B.
    Jamal, Sajjad Shaukat
    Shukla, Piyush Kumar
    Tiwari, Basant
    MOBILE INFORMATION SYSTEMS, 2021, 2021
  • [9] Detection of malicious URLs using machine learning
    Reyes-Dorta, Nuria
    Caballero-Gil, Pino
    Rosa-Remedios, Carlos
    WIRELESS NETWORKS, 2024, 30 (09) : 7543 - 7560
  • [10] Malicious URL Detection Using Machine Learning
    Hani, Dr Raed Bani
    Amoura, Motasem
    Ammourah, Mohammad
    Abu Khalil, Yazeed
    2024 15TH INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION SYSTEMS, ICICS 2024, 2024,