Malicious code classification based on opcode sequences and textCNN network

Cited by: 12
Authors
Wang, Qianhui [1]
Qian, Quan [1, 2, 3]
Affiliations
[1] Shanghai Univ, Sch Engn & Comp Sci, Shanghai 200444, Peoples R China
[2] Zhejiang Lab, Hangzhou 311100, Peoples R China
[3] Shanghai Univ, Ctr Mat Informat & Data Sci, Mat Genome Inst, Shanghai 200444, Peoples R China
Keywords
Malicious code classification; Opcode sequences; Word embedding; Text convolutional neural network
DOI
10.1016/j.jisa.2022.103151
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Malicious code classification is essential to network security. Malicious code is the most common means of network attack and threatens the security of user information and property. An effective classification method can improve the efficiency of malicious code detection and the ability to discover new malicious code families. This study proposes a new malicious code classification method to analyze, classify, and detect malicious code. The semantic features of opcode sequences are extracted by introducing the concept of word vectors. The extracted sequence is then treated as a text sentence and fed into a text convolutional neural network (textCNN) to identify malicious code families. Experimental results show that the model achieves more than 98% accuracy (with macro-average precision above 98.65% and macro-average recall of approximately 98.66%) on the 2015 Microsoft Malware Challenge dataset, and 91.93% accuracy on the SOREL-20M dataset. Call instructions are mostly used to invoke APIs, library functions, and other user-defined functions, through which the behavior of malicious code is generally realized. Thus, selecting the blocks that contain call instructions as key blocks reduces the model training time. After key-block selection, the number of opcodes on the Microsoft Malware Challenge dataset is reduced by 39.07% on average, with an accuracy of 98.18%, slightly lower than the result obtained using all opcodes. On the SOREL-20M dataset, the number of opcodes is reduced by 30.49% on average, and the accuracy increases to 93.46%. The results show that the proposed algorithm works well and outperforms byte n-gram representations.
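The key-block idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the disassembly has already been split into blocks of opcode mnemonics (the abstract does not say how blocks are delimited), and the helper names `select_key_blocks`, `flatten`, and `reduction_ratio`, as well as the toy sample, are hypothetical.

```python
from typing import List

# Assumption: one block is a list of opcode mnemonics produced by a disassembler.
Block = List[str]


def select_key_blocks(blocks: List[Block]) -> List[Block]:
    """Keep only the blocks that contain at least one `call` opcode."""
    return [b for b in blocks if any(op.lower() == "call" for op in b)]


def flatten(blocks: List[Block]) -> List[str]:
    """Concatenate blocks into one opcode sequence (the 'sentence' for the model)."""
    return [op for b in blocks for op in b]


def reduction_ratio(blocks: List[Block], key_blocks: List[Block]) -> float:
    """Fraction of opcodes removed by key-block selection."""
    total = sum(len(b) for b in blocks)
    kept = sum(len(b) for b in key_blocks)
    return 0.0 if total == 0 else 1.0 - kept / total


# Hypothetical toy sample, not taken from either dataset.
sample_blocks = [
    ["push", "mov", "call", "add"],
    ["xor", "inc", "jmp"],          # no call instruction: dropped
    ["mov", "call", "ret"],
]

key_blocks = select_key_blocks(sample_blocks)
sequence = flatten(key_blocks)
ratio = reduction_ratio(sample_blocks, key_blocks)
```

In this sketch the resulting `sequence` would then be mapped to word vectors and passed to the classifier; the shorter input is what makes training cheaper.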
Pages: 12