Malicious code classification based on opcode sequences and textCNN network

Cited by: 12

Authors:

Wang, Qianhui [1]

Qian, Quan [1,2,3]

Affiliations:

[1] Shanghai Univ, Sch Engn & Comp Sci, Shanghai 200444, Peoples R China

[2] Zhejiang Lab, Hangzhou 311100, Peoples R China

[3] Shanghai Univ, Ctr Mat Informat & Data Sci, Mat Genome Inst, Shanghai 200444, Peoples R China

Source:

JOURNAL OF INFORMATION SECURITY AND APPLICATIONS | 2022 / Vol. 67

Keywords:

Malicious code classification; Opcode sequences; Word embedding; Text convolutional neural network;

DOI:

10.1016/j.jisa.2022.103151

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Subject classification code:

0812 ;

Abstract:

Malicious code classification is essential for network security. Malicious code is the most common means of network attack and threatens the security of user information and property. An effective classification method can improve the efficiency of malicious code detection and the ability to discover new malicious code families. This study proposes a new malicious code classification method to analyze, classify, and detect malicious code. The semantic features of opcode sequences are extracted by introducing the concept of word vectors. The extracted sequence is then treated as a text sentence and fed to a text convolutional neural network (textCNN) to identify malicious code families. The experimental results show that the model achieves more than 98% accuracy (with macro-average precision above 98.65% and macro-average recall of approximately 98.66%) on the dataset of the Microsoft Malware Challenge conducted in 2015, and 91.93% accuracy on the SOREL-20M dataset. Call instructions are mostly used to call APIs, library functions, and other user-defined functions, through which the behavior of malicious code is generally realized. Thus, selecting the blocks that contain call instructions as key blocks reduces the model training time. After key-block selection, the number of opcodes on the Microsoft Malware Challenge dataset is reduced by 39.07% on average, with 98.18% accuracy, which is slightly lower than the result obtained by using all opcodes. On the SOREL-20M dataset, the number of opcodes is reduced by 30.49% on average, and the accuracy increases to 93.46%. Experimental results show that the proposed algorithm works well and outperforms the results obtained with a byte n-gram representation.
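
The key-block selection described in the abstract can be illustrated with a minimal sketch. It assumes that each disassembled sample is already split into basic blocks, each given as a list of opcode mnemonics; that representation, the function names, and the sample data are hypothetical and are not taken from the paper. The sketch keeps only blocks containing a call instruction and reports the share of opcodes removed.

```python
# Hypothetical sketch of key-block selection. The block representation
# (a list of opcode mnemonics per basic block) and all names below are
# assumptions for illustration, not the authors' implementation.
from typing import List

Block = List[str]  # one basic block = ordered opcode mnemonics


def select_key_blocks(blocks: List[Block]) -> List[Block]:
    """Keep only the blocks that contain at least one call instruction."""
    return [b for b in blocks if any(op.lower() == "call" for op in b)]


def flatten(blocks: List[Block]) -> List[str]:
    """Concatenate blocks into one opcode sequence (the 'sentence')."""
    return [op for b in blocks for op in b]


def reduction_percent(blocks: List[Block], kept: List[Block]) -> float:
    """Percentage of opcodes removed by key-block selection."""
    total = len(flatten(blocks))
    if total == 0:
        return 0.0
    return 100.0 * (total - len(flatten(kept))) / total


# Made-up sample: three basic blocks, two of which contain a call.
sample_blocks = [
    ["push", "mov", "call", "add"],  # contains call, kept
    ["mov", "xor", "jmp"],           # no call, dropped
    ["push", "call", "ret"],         # contains call, kept
]
key_blocks = select_key_blocks(sample_blocks)
key_sequence = flatten(key_blocks)
removed = reduction_percent(sample_blocks, key_blocks)
```

In this toy input, 3 of 10 opcodes fall in the block without a call, so the shortened sequence is what would then be embedded as word vectors and passed to the classifier. How the paper delimits blocks and handles call variants is not stated in the abstract.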

Pages: 12
