Partitioning Based N-Gram Feature Selection for Malware Classification

Cited by: 4
Authors
Hu, Weiwei [1]
Tan, Ying [1]
Affiliations
[1] Peking Univ, Sch Elect Engn & Comp Sci, Dept Machine Intelligence, Key Lab Machine Percept MOE, Beijing 100871, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Malware classification; Feature selection; Data partitioning; Apache Spark; VIRUS DETECTION APPROACH; MALICIOUS EXECUTABLES; INFORMATION; DETECT;
DOI
10.1007/978-3-319-40973-3_18
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Byte-level N-Gram is one of the most widely used feature extraction algorithms for malware classification because of its good performance and robustness. However, N-Gram feature selection on a large dataset consumes a great deal of time and space because of the large number of distinct N-Grams. This paper proposes a partitioning-based algorithm for large-scale feature selection that efficiently breaks the original problem down into subproblems solvable in memory, without a heavy IO load. The partitioning process uses an efficient implementation to convert the original interdependent dataset into independent data partitions. This data independence makes the in-memory solutions effective and allows different partitions to be processed in parallel. The proposed algorithm was implemented on Apache Spark, and experimental results show that it selects features in a very short time, nearly three times faster than the MapReduce approach used for comparison.
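The abstract does not give the partitioning function or the selection criterion, so the following is only a minimal single-machine sketch of the general idea, not the authors' implementation. It assumes that N-Grams are routed to partitions by a CRC32 hash and that features are ranked by information gain; both choices, and every function name below, are hypothetical. Because each N-Gram lands in exactly one partition, each partition can be counted and scored in memory without consulting the others.

```python
import heapq
import math
import zlib
from collections import defaultdict


def byte_ngrams(data: bytes, n: int) -> set:
    """Return the set of distinct byte n-grams in one file."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def partition_of(gram: bytes, num_partitions: int) -> int:
    """Stable hash (assumed here: CRC32) so an n-gram always maps to one partition."""
    return zlib.crc32(gram) % num_partitions


def entropy(pos: int, neg: int) -> float:
    """Binary entropy in bits of a (pos, neg) count pair; 0.0 for an empty pair."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h


def information_gain(df_pos: int, df_neg: int, n_pos: int, n_neg: int) -> float:
    """Information gain of the class label given presence/absence of one n-gram."""
    total = n_pos + n_neg
    present = df_pos + df_neg
    absent = total - present
    conditional = (present / total) * entropy(df_pos, df_neg) \
        + (absent / total) * entropy(n_pos - df_pos, n_neg - df_neg)
    return entropy(n_pos, n_neg) - conditional


def select_features(samples, n: int, k: int, num_partitions: int) -> list:
    """samples: list of (file_bytes, label) pairs, label 1 = malware, 0 = benign."""
    n_pos = sum(1 for _, label in samples if label == 1)
    n_neg = len(samples) - n_pos

    # Step 1: one pass over the data routes each (n-gram, label) record
    # to the partition chosen by its hash.
    partitions = [[] for _ in range(num_partitions)]
    for data, label in samples:
        for gram in byte_ngrams(data, n):
            partitions[partition_of(gram, num_partitions)].append((gram, label))

    # Step 2: partitions share no n-grams, so each one is counted and scored
    # independently; this loop is what a cluster could run in parallel.
    best = []
    for records in partitions:
        counts = defaultdict(lambda: [0, 0])  # gram -> [df_malware, df_benign]
        for gram, label in records:
            counts[gram][0 if label == 1 else 1] += 1
        scored = [(information_gain(c[0], c[1], n_pos, n_neg), g)
                  for g, c in counts.items()]
        # Merge this partition's candidates into the running global top-k.
        best = heapq.nlargest(k, best + scored)
    return [g for _, g in best]
```

In this sketch the partitions are plain Python lists held in memory; on a cluster they would instead be distributed partitions, and the loop in step 2 would run as independent tasks. The selected features do not depend on the number of partitions, since each N-Gram's counts are complete within its own partition.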
Pages: 187-195
Page count: 9
Related papers
50 in total
  • [1] Proposal of n-gram Based Algorithm for Malware Classification
    Pektas, Abdurrahman
    Eris, Mehmet
    Acarman, Tankut
    [J]. PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON EMERGING SECURITY INFORMATION, SYSTEMS AND TECHNOLOGIES (SECURWARE 2011), 2011, : 14 - 18
  • [2] Opcode n-gram based Malware Classification in Android
    Sihag, Vikas
    Mitharwal, Anita
    Vardhan, Manu
    Singh, Pradeep
    [J]. PROCEEDINGS OF THE 2020 FOURTH WORLD CONFERENCE ON SMART TRENDS IN SYSTEMS, SECURITY AND SUSTAINABILITY (WORLDS4 2020), 2020, : 645 - 650
  • [3] An investigation of byte n-gram features for malware classification
    Raff, Edward
    Zak, Richard
    Cox, Russell
    Sylvester, Jared
    Yacci, Paul
    Ward, Rebecca
    Tracy, Anna
    McLean, Mark
    Nicholas, Charles
    [J]. JOURNAL OF COMPUTER VIROLOGY AND HACKING TECHNIQUES, 2018, 14 (01): : 1 - 20
  • [4] N-gram Density based Malware Detection
    O'Kane, Philip
    Sezer, Sakir
    McLaughlin, Kieran
    [J]. 2014 WORLD SYMPOSIUM ON COMPUTER APPLICATIONS & RESEARCH (WSCAR), 2014,
  • [5] N-gram feature selection for authorship identification
    Houvardas, John
    Stamatatos, Efstathios
    [J]. ARTIFICIAL INTELLIGENCE: METHODOLOGY, SYSTEMS, AND APPLICATIONS, PROCEEDINGS, 2006, 4183 : 77 - 86
  • [6] BHMDC: A byte and hex n-gram based malware detection and classification method
    Tang, Yonghe
    Qi, Xuyan
    Jing, Jing
    Liu, Chunling
    Dong, Weiyu
    [J]. COMPUTERS & SECURITY, 2023, 128
  • [7] Clustering botnet communication traffic based on n-gram feature selection
    Lu, Wei
    Rammidi, Goaletsa
    Ghorbani, Ali A.
    [J]. COMPUTER COMMUNICATIONS, 2011, 34 (03) : 502 - 514
  • [8] Malware Visualization Methods Based on N-gram Features
    Ren, Zhuo-Jun
    Chen, Guang
    Lu, Wen-Ke
    [J]. Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2019, 47 (10): : 2108 - 2115
  • [9] Automatic malware mutant detection and group classification based on the n-gram and clustering coefficient
    Lee, Taejin
    Choi, Bomin
    Shin, Youngsang
    Kwak, Jin
    [J]. JOURNAL OF SUPERCOMPUTING, 2018, 74 (08): : 3489 - 3503
  • [10] Automatic malware mutant detection and group classification based on the n-gram and clustering coefficient
    Taejin Lee
    Bomin Choi
    Youngsang Shin
    Jin Kwak
    [J]. The Journal of Supercomputing, 2018, 74 : 3489 - 3503