N-gram analysis for computer virus detection

Cited by: 85
Authors
Reddy, D. Krishna Sandeep [1 ]
Pujari, Arun K. [1 ]
Affiliations
[1] Univ Hyderabad, Artificial Intelligence Lab, Hyderabad 500046, Andhra Pradesh, India
DOI
10.1007/s11416-006-0027-8
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Generic computer virus detection is the need of the hour, as most commercial antivirus software fails to detect unknown and new viruses. Motivated by the success of data mining and machine learning techniques in intrusion detection systems, recent research on detecting malicious executables is directed towards devising efficient non-signature-based techniques that can profile program characteristics from a set of training examples. Byte sequences and byte n-grams are considered to be the basis of feature extraction. But as the number of n-grams is very large, several methods of feature selection have been proposed in the literature. A recent report on the use of information-gain-based feature selection has yielded the best-known result in classifying malicious executables from benign ones. We observe that information gain models the presence of an n-gram in one class and its absence in the other. Through a simple example we show that this may lead to erroneous results. In this paper, we describe a new feature selection measure, the class-wise document frequency of byte n-grams. We empirically demonstrate that the proposed method is a better method for feature selection. For detection, we combine several classifiers using Dempster-Shafer theory for better classification accuracy instead of using any single classifier. Our experimental results show that such a scheme detects virus programs far more efficiently than the earlier known methods.
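The feature selection measure named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact selection rule, so this sketch assumes one plausible reading: count, separately for each class, how many files contain each byte n-gram, then keep the k most frequent n-grams of each class and take their union. The function names, the toy byte strings, and the values of n and k are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Iterable, List, Set


def byte_ngrams(data: bytes, n: int) -> Set[bytes]:
    """Return the distinct byte n-grams occurring in one file's contents."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def classwise_document_frequency(samples: Iterable[bytes], n: int) -> Counter:
    """For ONE class, count how many samples contain each n-gram."""
    df: Counter = Counter()
    for data in samples:
        # byte_ngrams returns a set, so an n-gram counts once per sample.
        df.update(byte_ngrams(data, n))
    return df


def top_k(df: Counter, k: int) -> List[bytes]:
    """Top-k n-grams by frequency; ties broken by byte order for determinism."""
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:k]]


def select_features(virus: Iterable[bytes], benign: Iterable[bytes],
                    n: int, k: int) -> Set[bytes]:
    """Assumed rule (not confirmed by the abstract): union of per-class top-k."""
    virus_df = classwise_document_frequency(virus, n)
    benign_df = classwise_document_frequency(benign, n)
    return set(top_k(virus_df, k)) | set(top_k(benign_df, k))


if __name__ == "__main__":
    # Toy stand-ins for file contents; real inputs would be executable bytes.
    virus_samples = [b"\x90\x90\xeb", b"\x90\x90\xcc", b"\x90\x90\x00"]
    benign_samples = [b"\x55\x8b\xec", b"\x55\x8b\x00", b"\x55\x8b\xec"]
    print(sorted(select_features(virus_samples, benign_samples, n=2, k=1)))
```

Unlike information gain, this ranking does not ask whether an n-gram is absent from the other class; it only asks how widespread the n-gram is within its own class, which is the distinction the abstract draws.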
Pages: 231-239
Number of pages: 9