Classification of ransomware families with machine learning based on N-gram of opcodes

Cited by: 140
Authors
Zhang, Hanqi [1, 2]
Xiao, Xi [2 ]
Mercaldo, Francesco [4 ]
Ni, Shiguang [3 ]
Martinelli, Fabio [5 ]
Sangaiah, Arun Kumar [6 ]
Affiliations
[1] Cent China Normal Univ, Coll Phys Sci & Technol, Wuhan, Hubei, Peoples R China
[2] Tsinghua Univ, Grad Sch Shenzhen, Shenzhen, Peoples R China
[3] Tsinghua Univ, Grad Sch Shenzhen, Div Social Sci & Management, Shenzhen, Peoples R China
[4] Natl Res Council Italy, Inst Informat & Telemat, Pisa, Italy
[5] Natl Res Council Italy, Inst Informat & Telemat, Secur Grp, Pisa, Italy
[6] VIT Univ, Sch Comp Sci & Engn, Vellore, Tamil Nadu, India
Keywords
Ransomware classification; Static analysis; Opcode; Machine learning; N-gram;
DOI
10.1016/j.future.2018.07.052
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Ransomware is a type of malware that locks victims' screens and/or encrypts their files to extort a ransom, causing serious damage to users. Mapping ransomware into families helps identify variants of a known sample and reduces analysts' workload. However, ransomware that fingerprints its environment can evade dynamic analysis. To overcome this shortcoming, we propose an approach to classifying ransomware based on static analysis; to the best of our knowledge, it is the first of its kind. First, opcode sequences from ransomware samples are transformed into N-gram sequences. Then, term frequency-inverse document frequency (TF-IDF) is calculated for each N-gram to select the feature N-grams that best discriminate between families. Finally, the vectors composed of the TF values of the feature N-grams are used as feature vectors and fed to five machine-learning methods to perform ransomware classification. Six evaluation criteria are employed to validate the model. Experiments on real datasets show that the approach achieves a best Accuracy of 91.43%. Furthermore, the average F1-measure for the "wannacry" ransomware family reaches up to 99%, and the Accuracy of binary classification reaches up to 99.3%. The proposed method can detect and classify ransomware that fingerprints its environment. In addition, we find that feature N-grams of different lengths require different feature dimensions to achieve similar classifier performance. (C) 2018 Elsevier B.V. All rights reserved.
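The feature-extraction steps described in the abstract (opcode N-grams, TF-IDF-based selection, TF feature vectors) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the toy opcode sequences, and the choice to rank N-grams by TF-IDF summed over all samples are assumptions, and the abstract does not specify the exact selection rule or the five classifiers.

```python
import math
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Slide a window of length n over one sample's opcode sequence."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def term_frequencies(ngrams):
    """Relative frequency (TF) of each N-gram within one sample."""
    counts = Counter(ngrams)
    total = len(ngrams)
    return {g: c / total for g, c in counts.items()} if total else {}


def select_features(samples_tf, k):
    """Keep the k N-grams with the highest TF-IDF summed over all samples.

    Assumption: the summed score is a stand-in for the paper's selection
    rule, which the abstract does not spell out.
    """
    n_docs = len(samples_tf)
    doc_freq = Counter(g for tf in samples_tf for g in tf)
    scores = Counter()
    for tf in samples_tf:
        for g, value in tf.items():
            scores[g] += value * math.log(n_docs / doc_freq[g])
    # Sort by descending score; break ties by the N-gram itself.
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


def feature_vector(tf, features):
    """TF values of the selected N-grams, in a fixed order."""
    return [tf.get(g, 0.0) for g in features]


# Hypothetical toy opcode sequences, one per sample.
samples = [
    ["push", "mov", "call", "push", "mov", "call"],
    ["push", "mov", "xor", "xor", "jmp"],
    ["mov", "xor", "xor", "jmp", "ret"],
]
tfs = [term_frequencies(opcode_ngrams(s, 2)) for s in samples]
features = select_features(tfs, k=3)
vectors = [feature_vector(tf, features) for tf in tfs]
```

The resulting `vectors` would then be passed, together with family labels, to whichever classifiers are being compared; that step is omitted here because the abstract does not name them.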
Pages: 211-221
Page count: 11