What Can N-Grams Learn for Malware Detection?

Cited by: 0
Authors
Zak, Richard [1 ]
Raff, Edward [1 ]
Nicholas, Charles [2 ]
Affiliations
[1] Booz Allen Hamilton, Lab Phys Sci, Mclean, VA 22102 USA
[2] Univ Maryland Baltimore Cty, Baltimore, MD 21228 USA
Keywords
SELECTION;
DOI
Not available
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Recent work has shown that byte n-grams learn mostly low-entropy features, such as function imports and strings, which has called into question whether byte n-grams can learn information corresponding to higher entropy levels, such as binary code. We investigate that hypothesis in this work by performing byte n-gram analysis on only specific sub-sections of the binary file, and we compare the results with those obtained by n-gram analysis of assembly code generated from disassembled binaries. We do this by using changes in model performance, together with ensembles, to glean insights about the data. In doing so, we discover that byte n-grams can learn from the code regions but do not necessarily learn any new information. We also discover that assembly n-grams may not be as effective as previously thought, and that disambiguating instructions by their binary opcode, an approach not previously used for malware detection, is critical for model generalization.
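To make the idea of byte n-gram analysis restricted to one part of a file concrete, here is a minimal Python sketch. It is not the authors' pipeline: the function name `byte_ngrams`, the `start`/`end` offsets standing in for a section boundary, and the toy byte string are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Optional


def byte_ngrams(data: bytes, n: int, start: int = 0,
                end: Optional[int] = None) -> Counter:
    """Count overlapping byte n-grams inside data[start:end].

    `start` and `end` are hypothetical offsets marking the sub-section
    of interest (for example, a code region); how such offsets are found
    in a real binary is outside the scope of this sketch.
    """
    region = data[start:end]
    # A sliding window of width n; an empty Counter results if the
    # region is shorter than n bytes.
    return Counter(region[i:i + n] for i in range(len(region) - n + 1))


# Toy input: a 3-byte pattern repeated twice, followed by one extra byte.
sample = b"\x55\x8b\xec\x55\x8b\xec\x90"

whole_file = byte_ngrams(sample, n=3)
sub_section = byte_ngrams(sample, n=3, start=3)
```

Counting over the whole buffer and over a sub-range yields two feature sets that could be fed to separate models and compared, which is the kind of comparison the abstract describes at a high level.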
Pages: 109 - 118
Page count: 10
Related papers
50 in total
  • [41] An effective combination of different order N-grams
    Zhang, S
    Dong, N
    [J]. PACLIC 17: Language, Information and Computation, Proceedings, 2003, : 251 - 256
  • [42] Protein classification using modified n-grams and skip-grams
    Islam, S. M. Ashiqul
    Heil, Benjamin J.
    Kearney, Christopher Michel
    Baker, Erich J.
    [J]. BIOINFORMATICS, 2018, 34 (09) : 1481 - 1487
  • [43] Automatic statistical translation based on n-grams
    Oliver, Antonio
    Badia, Toni
    Boleda, Gemma
    Melero, Maite
    [J]. PROCESAMIENTO DEL LENGUAJE NATURAL, 2005, (35) : 77 - 84
  • [44] Reconstructing Textual Documents from n-grams
    Galle, Matthias
    Tealdi, Matias
    [J]. KDD'15: PROCEEDINGS OF THE 21ST ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2015, : 329 - 338
  • [45] Detection and explanation of anomalous activities: Representing activities as bags of event n-grams
    Hamid, R
    Johnson, A
    Batta, S
    Bobick, A
    Isbell, C
    Coleman, G
    [J]. 2005 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2005, : 1031 - 1038
  • [46] Applications of N-grams in textual information systems
    Robertson, AM
    Willett, P
    [J]. JOURNAL OF DOCUMENTATION, 1998, 54 (01) : 48 - 69
  • [47] Detection of algorithmically generated malicious domain names using masked N-grams
    Selvi, Jose
    Rodriguez, Ricardo J.
    Soria-Olivas, Emilio
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2019, 124 : 156 - 163
  • [48] Fake News detection using n-grams for PAN@CLEF competition
    Damian, Sergio
    Calvo, Hiram
    Gelbukh, Alexander
    [J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2022, 42 (05) : 4633 - 4640
  • [49] Building Wikipedia N-grams with Apache Spark
    Esmaeilzadeh, Armin
    Cacho, Jorge Ramon Fonseca
    Taghva, Kazem
    Kambar, Mina Esmail Zadeh Nojoo
    Hajiali, Mahdi
    [J]. INTELLIGENT COMPUTING, VOL 2, 2022, 507 : 672 - 684
  • [50] Interpolated N-Grams for Model Based Testing
    Tonella, Paolo
    Tiella, Roberto
    Cu Duy Nguyen
    [J]. 36TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2014), 2014, : 562 - 572