Compressed Video Action Recognition With Dual-Stream and Dual-Modal Transformer

Cited by: 4
Authors
Mou, Yuting [1]
Jiang, Xinghao [1]
Xu, Ke [1]
Sun, Tanfeng [1]
Wang, Zepeng [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Natl Engn Lab Informat Content Anal Tech, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Compressed video; action recognition; network; efficiency
DOI
10.1109/TCSVT.2023.3319140
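Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809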
Abstract
Compressed video action recognition offers the advantage of reducing decoding and inference time compared to the RGB domain. However, the compressed domain poses unique challenges because it contains different types of frames (I-frames and P-frames). I-frames, which are consistent with RGB frames, are rich in frame information, but their redundant information may interfere with the recognition task. P-frames contain two modalities, residual (R) and motion vector (MV); although they carry less information, they reflect motion cues. To address these challenges and leverage the independent information from different frames and modalities, we propose a novel approach called Dual-Stream and Dual-Modal Transformer (DSDMT). Our approach consists of two streams: 1) The short-span P-frames stream contains temporal information. We propose the Dual-Modal Attention Module (DAM) to mine the variability between the modalities in P-frames and complement the orthogonal feature vector. In addition, considering the sparsity of P-frames, we extract action features with Frame-level Patch Embedding (FPE) to avoid redundant computation. 2) The long-span I-frames stream extracts the global context feature of the entire video, including content and scene information. By fusing the global video context and local key-frame features, our model represents the action feature at both fine-grained and coarse-grained levels. We evaluated DSDMT on three public benchmarks of different scales: HMDB-51, UCF-101, and Kinetics-400. It achieves better performance with fewer FLOPs and lower latency. Our analysis shows that the independence and complementarity of the I-frames and P-frames extracted from the compressed video stream play a crucial role in action recognition.
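The two-stream split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_streams` and the frame-type string are hypothetical, and the code only routes frame indices. In the actual method each P-frame would carry residual and motion-vector data, which is omitted here.

```python
from typing import Dict, List


def split_streams(frame_types: str) -> Dict[str, List]:
    """Route frame indices of a compressed clip into two streams.

    frame_types is a hypothetical string such as "IPPPIPP",
    one character per frame ("I" or "P").
    """
    i_stream: List[int] = []        # long-span stream: one index per I-frame
    p_stream: List[List[int]] = []  # short-span stream: P-frames grouped per I-frame
    for idx, kind in enumerate(frame_types):
        if kind == "I":
            i_stream.append(idx)
            p_stream.append([])     # an I-frame starts a new group of pictures
        elif kind == "P":
            if not p_stream:
                raise ValueError("P-frame appears before any I-frame")
            p_stream[-1].append(idx)
        else:
            raise ValueError(f"unsupported frame type: {kind!r}")
    return {"i_frames": i_stream, "p_frames": p_stream}


streams = split_streams("IPPPIPP")
print(streams["i_frames"])  # [0, 4]
print(streams["p_frames"])  # [[1, 2, 3], [5, 6]]
```

Under these assumptions, the I-frame indices would feed the long-span stream and each P-frame group would feed the short-span stream before the two feature sets are fused.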
Pages: 3299-3312
Page count: 14