An improved text classification modelling approach to identify security messages in heterogeneous projects

Cited by: 4
Authors
Oyetoyan, Tosin Daniel [1, 2]
Morrison, Patrick [3]
Affiliations
[1] SINTEF Digital, Dept Software Engn Safety & Secur, Trondheim, Norway
[2] Western Norway Univ Appl Sci, Dept Comp Math & Phys, Bergen, Norway
[3] North Carolina State Univ, Dept Comp Sci, Raleigh, NC USA
Keywords
Security; Classification model; Text classification; Software repository; Machine learning
DOI
10.1007/s11219-020-09546-7
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Security remains under-addressed in many organisations, as illustrated by the number of large-scale software security breaches. Preventing breaches can begin during software development if attention is paid to security during the software's design and implementation. One approach to security assurance during software development is to examine communications between developers as a means of studying the security concerns of the project. Prior research has investigated models for classifying project communication messages (e.g., issues or commits) as security related or not. A known problem is that these models are project-specific, limiting their use by other projects or organisations. We investigate whether we can build a generic classification model that generalises across projects. We define a set of security keywords by extracting them from relevant security sources and dividing them into four categories: asset, attack/threat, control/mitigation, and implicit. Using different combinations of these categories and including them in the training dataset, we built a classification model and evaluated it on industrial, open-source, and research-based datasets containing over 45 different products. Our model, based on harvested security keywords as a feature set, shows average recall from 55 to 86%, minimum recall from 43 to 71%, and maximum recall from 60 to 100%. It also shows an average f-score between 3.4 and 88%, an average g-measure of at least 66% across all the datasets, and an average AUC of ROC from 69 to 89%. In addition, models that use externally sourced features outperformed models that use project-specific features on average by a margin of 26-44% in recall, 22-50% in g-measure, 0.4-28% in f-score, and 15-19% in AUC of ROC. Further, our results outperform a state-of-the-art prediction model for security bug reports in all cases.
Using sound statistical and effect size tests, we find that (1) using harvested security keywords as features to train a text classification model significantly improves classification models and their generalisation to other projects; (2) including the features in the training dataset before model construction significantly improves classification models; and (3) different security categories serve as predictors for different projects. Finally, we introduce new and promising approaches to constructing models that can generalise across different independent projects.
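The keyword-category features and the evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The keyword lists are hypothetical placeholders (the harvested lists are not given in this record), the function names keyword_features and evaluate are invented for illustration, and the g-measure is assumed to be the harmonic mean of recall and one minus the false positive rate, a definition common in security bug report prediction work that the abstract itself does not state. The trained classifier and the AUC of ROC are omitted.

```python
import re
from collections import Counter

# Hypothetical keyword lists, one per category named in the abstract.
# The real lists are harvested from security sources and are not shown here.
SECURITY_KEYWORDS = {
    "asset": {"password", "token", "credential"},
    "attack_threat": {"injection", "xss", "overflow"},
    "control_mitigation": {"encrypt", "sanitize", "authenticate"},
    "implicit": {"crash", "leak", "unsafe"},
}


def keyword_features(message, categories=SECURITY_KEYWORDS):
    """Count keyword hits per category in one message (issue or commit text)."""
    tokens = Counter(re.findall(r"[a-z]+", message.lower()))
    return {
        name: sum(tokens[word] for word in words)
        for name, words in categories.items()
    }


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred):
    """Recall, f-score and g-measure for binary labels (1 = security related)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)

    recall = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    false_positive_rate = _ratio(fp, fp + tn)
    specificity = 1.0 - false_positive_rate

    return {
        "recall": recall,
        "f_score": _ratio(2 * precision * recall, precision + recall),
        # Assumed definition: harmonic mean of recall and (1 - false positive rate).
        "g_measure": _ratio(2 * recall * specificity, recall + specificity),
    }
```

In a full pipeline, the per-category counts would be passed to a trained classifier as input features. Restricting the categories argument to a subset of the four categories corresponds to the different category combinations mentioned in the abstract.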
Pages: 509-553
Number of pages: 45
Source
Software Quality Journal, 2021, 29: 509-553