Multi-Level Two-Stream Fusion-Based Spatio-Temporal Attention Model for Violence Detection and Localization

Cited by: 7
Authors
Asad, Mujtaba [1]
Jiang, He [1]
Yang, Jie [1]
Tu, Enmei [1]
Malik, Aftab A. [2]
Affiliations
[1] Shanghai Jiao Tong Univ, Inst Image Proc & Pattern Recognit, Shanghai 200240, Peoples R China
[2] Lahore Garrison Univ, Dept Software Engn, Lahore 54810, Pakistan
Keywords
Violence detection; autonomous video surveillance; multi-layer feature fusion; spatio-temporal attention; RECOGNITION; NETWORKS;
DOI
10.1142/S0218001422550023
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Detection of violent human behavior is necessary for public safety and monitoring. However, in human-based surveillance systems it demands constant human observation and attention, which is a challenging task. Autonomous detection of violent human behavior is therefore essential for continuous, uninterrupted video surveillance. In this paper, we propose a novel method for violence detection and localization in videos using the fusion of spatio-temporal features and an attention model. The model consists of a Fusion Convolutional Neural Network (Fusion-CNN), spatio-temporal attention modules, and Bi-directional Convolutional LSTMs (BiConvLSTM). The Fusion-CNN learns both spatial and temporal features by combining multi-level inter-layer features from both RGB and optical flow input frames. The spatial attention module generates an importance mask to focus on the most important areas of the image frame. The temporal attention part, which is based on BiConvLSTM, identifies the video frames most strongly related to violent activity. The proposed model can also localize and discriminate prominent regions in both the spatial and temporal domains, even though training is weakly supervised with only video-level classification labels. Experimental results on several publicly available benchmark datasets show the superior performance of the proposed model in comparison with existing methods. Our model achieves accuracies (ACC) of 89.1%, 99.1% and 98.15% on the RWF-2000, HockeyFight and Crowd-Violence datasets, respectively. For the CCTV-FIGHTS dataset, we use the mean average precision (mAP) metric, on which our model obtains 80.7%.
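The two attention steps named in the abstract can be illustrated with a generic soft-attention sketch. This is not the authors' Fusion-CNN or BiConvLSTM: the function names, the mean-over-channels scoring, and the softmax normalization are all assumptions chosen for illustration, whereas in the paper the scores come from learned modules. The sketch only shows how a spatial importance mask and per-frame weights might be derived from scores and applied.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention_mask(feature_map):
    """Return an H x W importance mask (entries sum to 1) for an H x W x C map.

    Assumption: each location is scored by its mean channel activation.
    A trained model would learn this scoring instead.
    """
    height, width = len(feature_map), len(feature_map[0])
    scores = [sum(cell) / len(cell) for row in feature_map for cell in row]
    weights = softmax(scores)
    return [weights[r * width:(r + 1) * width] for r in range(height)]


def temporal_attention_pool(frame_features, frame_scores):
    """Weight per-frame feature vectors by softmax(frame_scores) and sum them.

    Assumption: frame_scores are given; in the paper they would come from
    the BiConvLSTM-based temporal attention part.
    """
    weights = softmax(frame_scores)
    dim = len(frame_features[0])
    pooled = [
        sum(w * feats[d] for w, feats in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, pooled


if __name__ == "__main__":
    # Hypothetical 2 x 2 feature map with 2 channels; one location is "active".
    fmap = [[[0.0, 0.0], [2.0, 2.0]], [[0.0, 0.0], [0.0, 0.0]]]
    print(spatial_attention_mask(fmap))
    # Hypothetical per-frame features and scores for a 3-frame clip.
    print(temporal_attention_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.1, 2.0, 0.3]))
```

Because the mask and the frame weights are plain probability distributions, the locations and frames with the largest weights can be read off directly, which is the intuition behind localizing violent regions from video-level labels alone.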
Pages: 25