Spatio-Temporal Deep Residual Network with Hierarchical Attentions for Video Event Recognition

Cited by: 14
Authors
Li, Yonggang [1]
Liu, Chunping [2]
Ji, Yi [2]
Gong, Shengrong [2, 3, 4]
Xu, Haibao [5]
Affiliations
[1] Jiaxing Univ, Coll Math Phys & Informat Engn, 118 Pahang Rd, Jiaxing 314001, Peoples R China
[2] Soochow Univ, Sch Comp Sci & Technol, 1 Shizi St, Suzhou 215009, Peoples R China
[3] Changshu Inst Sci & Technol, Sch Comp Sci & Engn, 99 Hushan Rd, Suzhou 215500, Peoples R China
[4] Beijing Jiaotong Univ, Sch Comp & Informat Technol, Beijing, Peoples R China
[5] Zhejiang Univ, 269 Shixiang Rd, Hangzhou 310015, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Event recognition; hierarchical attention; surveillance video; deep residual recurrent network; spatio-temporal; REPRESENTATION;
DOI
10.1145/3378026
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Event recognition in surveillance video has attracted extensive attention from the computer vision community. The task remains highly challenging because inter-class variations are tiny, owing to factors such as severe occlusion and cluttered backgrounds. To address these issues, we propose a spatio-temporal deep residual network with hierarchical attentions (STDRN-HA) for video event recognition. In the first attention layer, the ResNet fully connected feature guides the Faster R-CNN feature to generate object-based attention (O-attention) for target objects. In the second attention layer, the O-attention guides the ResNet convolutional feature to yield holistic attention (H-attention), which perceives more details of the occluded objects and the global background. In the third attention layer, the attention maps are combined with the deep features to obtain attention-enhanced features. These attention-enhanced features are then fed into a deep residual recurrent network, which mines more event clues from the videos. Furthermore, we design an optimized loss function named softmax-RC, which embeds residual block regularization and center loss to alleviate the vanishing gradient problem in deep networks and to enlarge the distance between classes. We also build a temporal branch to exploit long- and short-term motion information. The final results are obtained by fusing the outputs of the spatial and temporal streams. Experiments on four realistic video datasets, CCV, VIRAT 1.0, VIRAT 2.0, and HMDB51, demonstrate that the proposed method achieves state-of-the-art results.
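The abstract names two ingredients that a small sketch can make concrete: a softmax loss augmented with a center loss term, and late fusion of the spatial and temporal stream outputs. The Python below is only an illustration under stated assumptions, not the authors' implementation. The abstract gives no formulas, so the center term follows the commonly used form (half the squared distance between a feature and its class center), the residual block regularization term of softmax-RC is omitted entirely, and the weight `lam`, the fusion weight `w_spatial`, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample with an integer label."""
    return -math.log(softmax(logits)[label])


def center_loss(feature, center):
    """Half the squared Euclidean distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))


def softmax_center_loss(logits, feature, label, centers, lam=0.01):
    """Cross-entropy plus a weighted center term.

    Assumption: `lam` and the additive combination are illustrative;
    the residual block regularization mentioned in the abstract is not modeled.
    """
    return cross_entropy(logits, label) + lam * center_loss(feature, centers[label])


def fuse_streams(spatial_probs, temporal_probs, w_spatial=0.5):
    """Late fusion as a weighted average of per-class probabilities.

    Assumption: the abstract only says the outputs are fused; a weighted
    average is one common choice, used here for illustration.
    """
    w_temporal = 1.0 - w_spatial
    return [w_spatial * s + w_temporal * t
            for s, t in zip(spatial_probs, temporal_probs)]


if __name__ == "__main__":
    centers = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical class centers
    loss = softmax_center_loss([2.0, 0.5], [0.2, -0.1], 0, centers, lam=0.1)
    fused = fuse_streams(softmax([2.0, 0.5]), softmax([0.3, 1.2]))
    print(loss, fused)
```

In this sketch the center term pulls each feature toward the center of its own class, which tightens the classes and so makes them easier to separate; how the paper weights and updates the centers is not stated in the abstract.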
Pages: 21