SFEM: Structural feature extraction methodology for the detection of malicious office documents using machine learning methods

Cited by: 54
Authors
Cohen, Aviad [1, 2]
Nissim, Nir [1, 2]
Rokach, Lior [1, 2]
Elovici, Yuval [1, 2]
Affiliations
[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel
[2] Ben Gurion Univ Negev, Cyber Secur Res Ctr, Malware Lab, IL-84105 Beer Sheva, Israel
Keywords
Machine learning; Malware detection; Static analysis; Structural features; Microsoft Office Open XML; Document; MALWARE DETECTION; PDF FILES; CLASSIFICATION
DOI
10.1016/j.eswa.2016.07.010
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Office documents are used extensively by individuals and organizations. Most users consider these documents safe for use. Unfortunately, Office documents can contain malicious components and perform harmful operations. Attackers increasingly take advantage of naive users and leverage Office documents in order to launch sophisticated advanced persistent threat (APT) and ransomware attacks. Recently, targeted cyber-attacks against organizations have been initiated with emails containing malicious attachments. Since most email servers do not allow the attachment of executable files to emails, attackers prefer to use non-executable files (e.g., documents) for malicious purposes. Existing anti-virus engines primarily use signature-based detection methods, and therefore fail to detect new, unknown malicious code that has been embedded in an Office document. Machine learning methods have been shown to be effective at detecting known and unknown malware in various domains; however, to the best of our knowledge, machine learning methods have not been used for the detection of malicious XML-based Office documents (*.docx, *.xlsx, *.pptx, *.odt, *.ods, etc.). In this paper we present a novel structural feature extraction methodology (SFEM) for XML-based Office documents. SFEM extracts discriminative features from documents, based on their structure. We leveraged SFEM's features with machine learning algorithms for effective detection of malicious *.docx documents. We extensively evaluated SFEM with machine learning classifiers using a representative collection (16,938 *.docx documents collected "from the wild") which contains 4.9% malicious and approximately 95.1% benign documents. We examined 1,600 unique configurations based on different combinations of feature extraction, feature selection, feature representation, top-feature selection methods, and machine learning classifiers.
The results show that machine learning algorithms trained on features provided by SFEM successfully detect new unknown malicious *.docx documents. The Random Forest classifier achieves the highest detection rates, with an AUC of 99.12% and true positive rate (TPR) of 97% that is accompanied by a false positive rate (FPR) of 4.9%. In comparison, the best anti-virus engine achieves a TPR which is 25% lower. © 2016 Elsevier Ltd. All rights reserved.
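As a rough illustration of what "features based on structure" could mean for a ZIP-packaged XML document such as *.docx, the sketch below lists every directory and file path inside the archive and encodes their presence as a binary vector. This is an assumption made for illustration, not the SFEM procedure as specified in the paper; the function names (`structural_paths`, `to_binary_vector`, `make_toy_document`) and the toy entry names are hypothetical.

```python
import io
import zipfile


def structural_paths(docx_bytes: bytes) -> set[str]:
    """Return every directory and file path inside a ZIP-based document."""
    paths = set()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            parts = name.strip("/").split("/")
            # Add each ancestor directory as well as the entry itself.
            for depth in range(1, len(parts) + 1):
                paths.add("/".join(parts[:depth]))
    return paths


def to_binary_vector(paths: set[str], vocabulary: list[str]) -> list[int]:
    """Encode presence (1) or absence (0) of each vocabulary path."""
    return [1 if path in paths else 0 for path in vocabulary]


def make_toy_document(entries: list[str]) -> bytes:
    """Build an in-memory ZIP with the given entry names (toy data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for entry in entries:
            archive.writestr(entry, "<xml/>")
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical entry names; a real *.docx contains many more parts.
    toy = make_toy_document(["word/document.xml", "word/vbaProject.bin"])
    found = structural_paths(toy)
    print(sorted(found))
    print(to_binary_vector(found, ["word/document.xml", "word/media"]))
```

Vectors of this kind could then be passed to any off-the-shelf classifier; the feature selection and representation choices evaluated in the paper are not reproduced here.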
Pages: 324-343
Page count: 20
Related papers
50 items in total
  • [41] A Review of Recent Advances, Challenges, and Opportunities in Malicious Insider Threat Detection Using Machine Learning Methods
    Alzaabi, Fatima Rashed
    Mehmood, Abid
    IEEE ACCESS, 2024, 12 : 30907 - 30927
  • [42] Feature Extraction Based on Deep Learning for Some Traditional Machine Learning Methods
    Cayir, Aykut
    Yenidogan, Isil
    Dag, Hasan
    2018 3RD INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND ENGINEERING (UBMK), 2018, : 494 - 497
  • [43] IoT-Enhanced Malicious URL Detection Using Machine Learning
    Weshahi, Aysar
    Dwaik, Feras
    Khouli, Mohammad
    Ashqar, Huthaifa I.
    Shatnawi, Amani
    ElKhodr, Mahmoud
    ADVANCED INFORMATION NETWORKING AND APPLICATIONS, VOL 5, AINA 2024, 2024, 203 : 470 - 482
  • [44] Comparing Methods of Feature Extraction of Brain Activities for Octave Illusion Classification Using Machine Learning
    Pilyugina, Nina
    Tsukahara, Akihiko
    Tanaka, Keita
    SENSORS, 2021, 21 (19)
  • [45] Detection analysis of malicious cyber attacks using machine learning algorithms
    Karthika, R. A.
    Maheswari, M.
    MATERIALS TODAY-PROCEEDINGS, 2022, 68 : 26 - 34
  • [46] Accuracy Improvement Method for Malicious Domain Detection using Machine Learning
    Koga, Toshiki
    Nobayashi, Daiki
    Ikenaga, Takeshi
    2024 IEEE 21ST CONSUMER COMMUNICATIONS & NETWORKING CONFERENCE, CCNC, 2024, : 1108 - 1109
  • [47] Malicious Log Detection Using Machine Learning to Maximize the Partial AUC
    Nishiyama, Taishi
    Kumagai, Atsutoshi
    Fujino, Akinori
    Kamiya, Kazunori
    2024 IEEE 21ST CONSUMER COMMUNICATIONS & NETWORKING CONFERENCE, CCNC, 2024, : 339 - 344
  • [48] Feature Extraction Methods for Binary Code Similarity Detection Using Neural Machine Translation Models
    Ito, Norimitsu
    Hashimoto, Masaki
    Otsuka, Akira
    IEEE ACCESS, 2023, 11 : 102796 - 102805
  • [49] Phishing detection based on machine learning and feature selection methods
    Almseidin M.
    Abu Zuraiq A.M.
    Al-kasassbeh M.
    Alnidami N.
    International Journal of Interactive Mobile Technologies, 2019, 13 (12) : 71 - 183
  • [50] Machine learning-based intrusion detection: feature selection versus feature extraction
    Ngo, Vu-Duc
    Vuong, Tuan-Cuong
    Van Luong, Thien
    Tran, Hung
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2024, 27 (03) : 2365 - 2379