Classification of ransomware families with machine learning based on N-gram of opcodes

Cited by: 140

Authors:

Zhang, Hanqi ^{[1,2]}

Xiao, Xi ^{[2]}

Mercaldo, Francesco ^{[4]}

Ni, Shiguang ^{[3]}

Martinelli, Fabio ^{[5]}

Sangaiah, Arun Kumar ^{[6]}

Affiliations:

[1] Cent China Normal Univ, Coll Phys Sci & Technol, Wuhan, Hubei, Peoples R China

[2] Tsinghua Univ, Grad Sch Shenzhen, Shenzhen, Peoples R China

[3] Tsinghua Univ, Grad Sch Shenzhen, Div Social Sci & Management, Shenzhen, Peoples R China

[4] Natl Res Council Italy, Inst Informat & Telemat, Pisa, Italy

[5] Natl Res Council Italy, Inst Informat & Telemat, Secur Grp, Pisa, Italy

[6] VIT Univ, Sch Comp Sci & Engn, Vellore, Tamil Nadu, India

Source:

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2019 / Vol. 90

Keywords:

Ransomware classification; Static analysis; Opcode; Machine learning; N-gram

DOI:

10.1016/j.future.2018.07.052

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

Ransomware is a special type of malware that can lock victims' screens and/or encrypt their files to extort ransoms, causing great damage to users. Mapping ransomware into families is useful for identifying the variants of a known ransomware sample and for reducing analysts' workload. However, ransomware that can fingerprint the environment can evade dynamic analysis. To overcome this shortcoming, we propose an approach to classifying ransomware based on static analysis; to the best of our knowledge, we are the first to do so. First, opcode sequences from ransomware samples are transformed into N-gram sequences. Then, term frequency-inverse document frequency (TF-IDF) is calculated for each N-gram to select the feature N-grams that discriminate best between families. Finally, we treat the vectors composed of the TF values of the feature N-grams as feature vectors and feed them to five machine-learning methods to perform ransomware classification. Six evaluation criteria are employed to validate the model. Thorough experiments on real datasets demonstrate that our approach achieves a best Accuracy of 91.43%. Furthermore, the average F1-measure of the "wannacry" ransomware family is up to 99%, and the Accuracy of binary classification is up to 99.3%. The proposed method can detect and classify ransomware that can fingerprint the environment. In addition, we find that feature N-grams of different lengths require different feature dimensions to achieve similar classifier performance. © 2018 Elsevier B.V. All rights reserved.
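
The feature-extraction steps named in the abstract (opcode N-grams, TF-IDF-based selection, TF feature vectors) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how documents are defined for the IDF term, how many features are kept, or which five classifiers are used, so the sketch assumes each sample is one document, ranks N-grams by their highest TF-IDF score, and stops at the feature vectors. All function names, sample names, and opcode sequences are hypothetical.

```python
import math
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Slide a window of length n over an opcode sequence."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def term_frequencies(grams):
    """Relative frequency (TF) of each N-gram within one sample."""
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def select_feature_ngrams(tf_per_sample, k):
    """Keep the k N-grams with the highest TF-IDF score in any sample.

    Assumption: each sample counts as one document for the IDF term.
    """
    n_docs = len(tf_per_sample)
    doc_freq = Counter(g for tf in tf_per_sample for g in tf)
    best = {}
    for tf in tf_per_sample:
        for g, value in tf.items():
            score = value * math.log(n_docs / doc_freq[g])
            best[g] = max(best.get(g, 0.0), score)
    # Sort by descending score; break ties by the N-gram for determinism.
    return sorted(best, key=lambda g: (-best[g], g))[:k]


def feature_vector(tf, features):
    """TF values of the selected feature N-grams, in a fixed order."""
    return [tf.get(g, 0.0) for g in features]


# Hypothetical toy opcode sequences, not taken from real samples.
samples = {
    "sample_a": ["push", "mov", "call", "push", "mov", "call"],
    "sample_b": ["push", "mov", "call", "xor", "jmp"],
    "sample_c": ["xor", "xor", "jmp", "xor", "jmp"],
}

tfs = {name: term_frequencies(opcode_ngrams(ops, 2)) for name, ops in samples.items()}
features = select_feature_ngrams(list(tfs.values()), k=3)
vectors = {name: feature_vector(tf, features) for name, tf in tfs.items()}
```

The resulting `vectors` could then be passed, together with family labels, to any standard classifier; the choice of N and of the number of selected features `k` are the two knobs the abstract's final sentence refers to.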

Citation

Pages: 211-221

Page count: 11
