Multi-Level Two-Stream Fusion-Based Spatio-Temporal Attention Model for Violence Detection and Localization

Authors
Asad, Mujtaba [1]
Jiang, He [1]
Yang, Jie [1]
Tu, Enmei [1]
Malik, Aftab A. [2]
Affiliations
[1] Shanghai Jiao Tong Univ, Inst Image Proc & Pattern Recognit, Shanghai 200240, Peoples R China
[2] Lahore Garrison Univ, Dept Software Engn, Lahore 54810, Pakistan
Keywords
Violence detection; autonomous video surveillance; multi-layer feature fusion; spatio-temporal attention; RECOGNITION; NETWORKS
DOI
10.1142/S0218001422550023
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Detecting violent human behavior is necessary for public safety and monitoring. In human-operated surveillance systems, however, it demands constant observation and attention, which is difficult to sustain. Autonomous detection of violent behavior is therefore essential for continuous, uninterrupted video surveillance. In this paper, we propose a novel method for violence detection and localization in videos that fuses spatio-temporal features with an attention model. The model consists of a Fusion Convolutional Neural Network (Fusion-CNN), spatio-temporal attention modules and Bi-directional Convolutional LSTMs (BiConvLSTM). The Fusion-CNN learns both spatial and temporal features by combining multi-level inter-layer features from RGB and optical flow input frames. The spatial attention module generates an importance mask that focuses on the most important areas of each image frame. The temporal attention module, which is based on BiConvLSTM, identifies the video frames most strongly related to violent activity. Although it is trained with weak supervision, using only video-level classification labels, the proposed model can also localize and discriminate prominent regions in both the spatial and temporal domains. Experimental results on several publicly available benchmark datasets show that the proposed model outperforms existing methods. Our model achieves accuracies (ACC) of 89.1%, 99.1% and 98.15% on the RWF-2000, HockeyFight and Crowd-Violence datasets, respectively. For the CCTV-FIGHTS dataset, we use mean average precision (mAP) as the performance metric, and our model obtains 80.7% mAP.
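The attention steps named in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the Fusion-CNN and BiConvLSTM are learned networks, whereas here the two streams are simply concatenated, the spatial mask is scored by a channel mean, and the temporal weights come from a random linear scoring vector. All function names, shapes and scoring rules below are assumptions chosen only to show how a per-frame importance mask and per-frame temporal weights could be formed and applied.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def fuse_streams(rgb, flow):
    """Assumed fusion rule: concatenate RGB and optical-flow features on the channel axis."""
    return np.concatenate([rgb, flow], axis=-1)

def spatial_attention(features):
    """features: (T, H, W, C). Returns a per-frame mask (T, H, W) and mask-pooled features (T, C)."""
    T, H, W, _ = features.shape
    scores = features.mean(axis=-1)  # hypothetical scoring: channel mean per location
    mask = softmax(scores.reshape(T, -1), axis=1).reshape(T, H, W)
    pooled = np.einsum("thw,thwc->tc", mask, features)  # mask-weighted sum over locations
    return mask, pooled

def temporal_attention(frame_features, w):
    """frame_features: (T, C); w: (C,) hypothetical scoring vector. Returns weights (T,) and a video vector (C,)."""
    alpha = softmax(frame_features @ w, axis=0)
    return alpha, alpha @ frame_features

# Toy inputs: 8 frames of 7x7 feature maps with 16 channels per stream.
rng = np.random.default_rng(0)
rgb = rng.normal(size=(8, 7, 7, 16))
flow = rng.normal(size=(8, 7, 7, 16))

fused = fuse_streams(rgb, flow)
mask, pooled = spatial_attention(fused)
alpha, video = temporal_attention(pooled, rng.normal(size=fused.shape[-1]))
```

In this sketch, `mask` plays the role of a spatial importance map for each frame and `alpha` ranks frames by relevance; in a weakly supervised setting such weights are what would be inspected for localization, since only a video-level label is available for training.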
Pages: 25