Classification of ransomware families with machine learning based on N-gram of opcodes

Cited by: 140

Authors:

Zhang, Hanqi [1,2]

Xiao, Xi [2]

Mercaldo, Francesco [4]

Ni, Shiguang [3]

Martinelli, Fabio [5]

Sangaiah, Arun Kumar [6]

Affiliations:

[1] Cent China Normal Univ, Coll Phys Sci & Technol, Wuhan, Hubei, Peoples R China

[2] Tsinghua Univ, Grad Sch Shenzhen, Shenzhen, Peoples R China

[3] Tsinghua Univ, Grad Sch Shenzhen, Div Social Sci & Management, Shenzhen, Peoples R China

[4] Natl Res Council Italy, Inst Informat & Telemat, Pisa, Italy

[5] Natl Res Council Italy, Inst Informat & Telemat, Secur Grp, Pisa, Italy

[6] VIT Univ, Sch Comp Sci & Engn, Vellore, Tamil Nadu, India

Source:

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2019 / Volume 90

Keywords:

Ransomware classification; Static analysis; Opcode; Machine learning; N-gram;

DOI:

10.1016/j.future.2018.07.052

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

Ransomware is a type of malware that locks victims' screens and/or encrypts their files to extort a ransom, causing great damage to users. Mapping ransomware into families is useful for identifying variants of a known ransomware sample and for reducing analysts' workload. However, ransomware that can fingerprint its environment can evade previous approaches based on dynamic analysis. To overcome this shortcoming, we propose an approach that classifies ransomware by static analysis; to the best of our knowledge, we are the first to do so. First, opcode sequences from ransomware samples are transformed into N-gram sequences. Then, the term frequency-inverse document frequency (TF-IDF) is calculated for each N-gram to select feature N-grams that discriminate better between families. Finally, we treat the vectors composed of the TF values of the feature N-grams as feature vectors and feed them to five machine-learning methods to perform ransomware classification. Six evaluation criteria are employed to validate the model. Thorough experiments on real datasets demonstrate that our approach achieves a best Accuracy of 91.43%. Furthermore, the average F1-measure of the "wannacry" ransomware family is up to 99%, and the Accuracy of binary classification is up to 99.3%. The proposed method can detect and classify ransomware that can fingerprint the environment. In addition, we find that feature N-grams of different lengths require different feature dimensions to achieve similar classifier performance. © 2018 Elsevier B.V. All rights reserved.
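The feature-extraction steps described in the abstract (opcode N-grams, TF-IDF-based selection, TF feature vectors) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy opcode sequences, and the choice to treat each family as one "document" when computing TF-IDF are all assumptions made for the example. The resulting vectors could then be passed to any classifier; the abstract does not name the five methods used, so none is shown here.

```python
from collections import Counter
from math import log


def ngrams(opcodes, n):
    """Slide a window of length n over an opcode sequence."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def term_frequencies(opcodes, n):
    """Relative frequency (TF) of each N-gram within one sample."""
    grams = ngrams(opcodes, n)
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()} if total else {}


def select_feature_ngrams(samples, n, k):
    """Return the k N-grams with the highest TF-IDF score.

    Assumption: every family is treated as a single document, so an
    N-gram that occurs in few families receives a high IDF.
    """
    family_counts = {}
    for family, opcodes in samples:
        family_counts.setdefault(family, Counter()).update(ngrams(opcodes, n))

    # Document frequency: in how many families does each N-gram occur?
    doc_freq = Counter()
    for counts in family_counts.values():
        doc_freq.update(counts.keys())

    num_families = len(family_counts)
    scores = {}
    for counts in family_counts.values():
        total = sum(counts.values())
        for gram, count in counts.items():
            tfidf = (count / total) * log(num_families / doc_freq[gram])
            scores[gram] = max(scores.get(gram, 0.0), tfidf)

    # Sort by descending score; break ties by the N-gram itself for determinism.
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


def feature_vector(opcodes, n, features):
    """TF values of the selected feature N-grams for one sample."""
    tf = term_frequencies(opcodes, n)
    return [tf.get(g, 0.0) for g in features]


# Toy data (hypothetical opcode sequences, not taken from real samples).
samples = [
    ("family_a", ["push", "mov", "call", "xor", "mov", "call"]),
    ("family_a", ["push", "mov", "call", "xor"]),
    ("family_b", ["mov", "add", "jmp", "mov", "add", "jmp"]),
]
features = select_feature_ngrams(samples, n=2, k=3)
vector = feature_vector(samples[1][1], 2, features)
```

Treating a whole family as one document is only one plausible reading of "better discrimination between families"; a per-sample IDF would be an equally reasonable variant to try.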


Pages: 211-221

Number of pages: 11

