Three-stream spatio-temporal attention network for first-person action and interaction recognition

Cited: 0
Authors
Javed Imran
Balasubramanian Raman
Affiliations
[1] University of Petroleum and Energy Studies, Department of Informatics, School of Computer Science
[2] Indian Institute of Technology Roorkee, Department of Computer Science and Engineering
Keywords
First-person action recognition; 3D convolutional neural network; Recurrent neural network; Feature fusion; Soft attention;
DOI
Not available
Abstract
The recognition of human actions and interactions from a first-person viewpoint is an interesting area of research in the field of human action recognition (HAR). This paper presents a data-driven spatio-temporal network that combines different modalities computed from first-person videos through a temporal attention mechanism. First, the proposed approach uses a three-stream Inflated 3D ConvNet (I3D) to extract low-level features from the RGB frame-difference (FD), optical-flow (OF) and magnitude-orientation (MO) streams. An I3D network has the advantage of directly learning spatio-temporal features over short video snippets (e.g., 16 frames). Second, the extracted features are fused and fed to a bidirectional long short-term memory (BiLSTM) network to model high-level temporal feature sequences. Third, we incorporate an attention mechanism into the BiLSTM network to automatically select the most relevant temporal snippets in a given video sequence. Finally, extensive experiments achieve state-of-the-art results on the JPL (98.5%), NUS (84.1%), UTK (91.5%) and DogCentric (83.3%) datasets. These results show that the features extracted from the three streams are complementary, and that the attention mechanism further improves the results by a large margin over previous approaches based on handcrafted and deep features.
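Below is a minimal PyTorch sketch (not the authors' released code) of the fusion, BiLSTM and soft-attention stage outlined in the abstract. It assumes per-snippet I3D features have already been extracted from the FD, OF and MO streams; the feature dimension, hidden size, class count and the concatenation-based fusion are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn


class AttentiveBiLSTMHead(nn.Module):
    # Hypothetical head: fuses three I3D feature streams, runs a BiLSTM,
    # and applies soft attention over the temporal snippets.
    def __init__(self, feat_dim=1024, hidden=256, num_classes=7):
        super().__init__()
        # Fuse the three streams by concatenation followed by a linear projection.
        self.fuse = nn.Linear(3 * feat_dim, feat_dim)
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Soft attention: one relevance score per temporal snippet.
        self.attn = nn.Linear(2 * hidden, 1)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, fd, of, mo):
        # fd, of, mo: (batch, snippets, feat_dim) I3D features, one tensor per stream.
        x = self.fuse(torch.cat([fd, of, mo], dim=-1))   # (B, T, feat_dim)
        h, _ = self.bilstm(x)                            # (B, T, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)           # (B, T, 1) snippet weights
        ctx = (w * h).sum(dim=1)                         # attention-weighted video summary
        return self.classifier(ctx), w.squeeze(-1)


# Usage with random tensors standing in for real I3D features.
if __name__ == "__main__":
    B, T, D = 2, 10, 1024
    model = AttentiveBiLSTMHead(feat_dim=D, num_classes=7)
    logits, weights = model(torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, D))
    print(logits.shape, weights.shape)  # torch.Size([2, 7]) torch.Size([2, 10])

Concatenation followed by a linear projection is only one plausible fusion choice; the paper's exact fusion and attention formulations may differ.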
Pages: 1137-1152
Page count: 15
Related papers
50 records
  • [21] Human Action Recognition via Spatio-temporal Dual Network Flow and Visual Attention Fusion
    Liu Tianliang
    Qiao Qingwei
    Wan Junwei
    Dai Xiubin
    Luo Jiebo
    [J]. JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, 2018, 40 (10) : 2395 - 2401
  • [22] Spatio-temporal segments attention for skeleton-based action recognition
    Qiu, Helei
    Hou, Biao
    Ren, Bo
    Zhang, Xiaohua
    [J]. NEUROCOMPUTING, 2023, 518 : 30 - 38
  • [23] Action Recognition With Spatio-Temporal Visual Attention on Skeleton Image Sequences
    Yang, Zhengyuan
    Li, Yuncheng
    Yang, Jianchao
    Luo, Jiebo
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2019, 29 (08) : 2405 - 2415
  • [24] Dynamic Gesture Recognition Based on Three-Stream Coordinate Attention Network and Knowledge Distillation
    Wan, Shanshan
    Yang, Lan
    Ding, Keliang
    Qiu, Dongwei
    [J]. IEEE ACCESS, 2023, 11 : 50547 - 50559
  • [26] A Spatio-Temporal Convolutional Neural Network for Skeletal Action Recognition
    Hu, Lizhang
    Xu, Jinhua
    [J]. NEURAL INFORMATION PROCESSING (ICONIP 2017), PT III, 2017, 10636 : 377 - 385
  • [27] ESTI: an action recognition network with enhanced spatio-temporal information
    Jiang, ZhiYu
    Zhang, Yi
    Hu, Shu
    [J]. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2023, 14 (09) : 3059 - 3070
  • [28] Semantic three-stream network for social relation recognition
    Yan, Haibin
    Song, Chaohui
    [J]. PATTERN RECOGNITION LETTERS, 2019, 128 : 78 - 84
  • [29] First-Person Action Recognition With Temporal Pooling and Hilbert-Huang Transform
    Purwanto, Didik
    Chen, Yie-Tarng
    Fang, Wen-Hsien
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2019, 21 (12) : 3122 - 3135
  • [30] Skeleton-based human action recognition by fusing attention based three-stream convolutional neural network and SVM
    Fang Ren
    Chao Tang
    Anyang Tong
    Wenjian Wang
    [J]. Multimedia Tools and Applications, 2024, 83 : 6273 - 6295