STAR: Efficient SpatioTemporal Modeling for Action Recognition

Cited by: 2
Authors
Kumar, Abhijeet [1]
Abrams, Samuel [1]
Kumar, Abhishek [1]
Narayanan, Vijaykrishnan [1]
Affiliations
[1] Penn State Univ, EECS Dept, State Coll, PA 16802 USA
Keywords
Action recognition; Compressed domain; I-frames; Spatial-temporal 2D convolutional networks; DOMAIN;
DOI
10.1007/s00034-022-02160-x
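Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract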
Action recognition in video has gained significant attention over the past several years. While conventional 2D CNNs have found great success in understanding images, they are less effective at capturing the temporal relationships present in video. By contrast, 3D CNNs capture spatiotemporal information well, but they incur a high computational cost, which makes deployment challenging. In video, key information is typically confined to a small number of frames, yet many current approaches decompress and process all frames, which wastes resources. Others work directly in the compressed domain but require multiple input streams to understand the data. In our work, we operate directly on compressed video and extract information solely from intra-coded frames (I-frames). Because motion vectors and residuals are not used for motion information, the result is a single-stream network. This reduces processing time and energy consumption and, by extension, makes the approach accessible to a wider range of machines and uses. We evaluate our framework extensively on the UCF101 (Soomro et al. in UCF101: a dataset of 101 human actions classes from videos in the Wild, 2012) and HMDB51 (Kuehne et al., in: Jhuang, Garrote, Poggio, Serre (eds) Proceedings of the international conference on computer vision (ICCV), 2011) datasets and show that computational complexity is reduced significantly while accuracy remains competitive with existing compressed-domain efforts: 92.6% top-1 accuracy on UCF101 and 62.9% on HMDB51, with 24.3M parameters, 4 GFLOPs, and energy savings of over 11x on the two datasets versus CoViAR (Wu et al. in Compressed video action recognition, 2018).
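The idea of reading only I-frames from a compressed stream can be illustrated with a small sketch. This is not the authors' pipeline: the frame-type string, the group-of-pictures length of 12, and the helper name `iframe_indices` are all assumptions chosen for illustration. The sketch only shows how few frames a single-stream, I-frame-only model would need to read compared with decoding every frame.

```python
from typing import List


def iframe_indices(frame_types: str) -> List[int]:
    """Return the positions of intra-coded frames in a frame-type string.

    `frame_types` is a hypothetical per-frame label sequence such as
    "IPPPIPPP", where "I" marks an I-frame and "P" a predicted frame.
    """
    return [i for i, t in enumerate(frame_types) if t == "I"]


# Assumed stream layout: 3 groups of pictures, each with one I-frame
# followed by 11 P-frames (the GOP length of 12 is an assumption).
gop = "I" + "P" * 11
stream = gop * 3

# Frames that an I-frame-only model would read.
keep = iframe_indices(stream)

# Share of the stream that is actually read.
fraction_read = len(keep) / len(stream)

print(keep)
print(round(fraction_read, 4))
```

Under these assumed settings only one frame in twelve is read. The real ratio depends on how the video was encoded, and the abstract does not state the GOP structure used.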
Pages: 705-723
Number of pages: 19
Related papers
50 in total
  • [1] STAR: Efficient SpatioTemporal Modeling for Action Recognition
    Abhijeet Kumar
    Samuel Abrams
    Abhishek Kumar
    Vijaykrishnan Narayanan
    Circuits, Systems, and Signal Processing, 2023, 42 : 705 - 723
  • [2] Efficient spatiotemporal context modeling for action recognition
    Cao, Congqi
    Lu, Yue
    Zhang, Yifan
    Jiang, Dongmei
    Zhang, Yanning
    NEUROCOMPUTING, 2023, 545
  • [3] Efficient local filter bank with over complete spatiotemporal pooling in action recognition
    Li, Yawei
    Jin, Lizuo
    Jie, Feiran
    Sun, Changyin
    2013 32ND CHINESE CONTROL CONFERENCE (CCC), 2013, : 3750 - 3755
  • [4] Efficient Human Vision Inspired Action Recognition Using Adaptive Spatiotemporal Sampling
    Mac, Khoi-Nguyen C.
    Do, Minh N.
    Vo, Minh P.
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 5245 - 5256
  • [5] An efficient human action recognition framework with pose-based spatiotemporal features
    Agahian, Saeid
    Negin, Farhood
    Kose, Cemal
    ENGINEERING SCIENCE AND TECHNOLOGY-AN INTERNATIONAL JOURNAL-JESTECH, 2020, 23 (01): : 196 - 203
  • [6] Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition
    Xiang, Wangmeng
    Li, Chao
    Wang, Biao
    Wei, Xihan
    Hua, Xian-Sheng
    Zhang, Lei
    COMPUTER VISION - ECCV 2022, PT III, 2022, 13663 : 627 - 644
  • [7] Temporal Modeling on Multi-Temporal-Scale Spatiotemporal Atoms for Action Recognition
    Yao, Guangle
    Lei, Tao
    Liu, Xianyuan
    Jiang, Ping
    APPLIED SCIENCES-BASEL, 2018, 8 (10):
  • [8] LONG-SHORT TEMPORAL MODELING FOR EFFICIENT ACTION RECOGNITION
    Wu, Liyu
    Zou, Yuexian
    Zhang, Can
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 2435 - 2439
  • [9] Long-Short Temporal Modeling for Efficient Action Recognition
    Wu, Liyu
    Zou, Yuexian
    Zhang, Can
    ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2021, 2021-June : 2435 - 2439
  • [10] Exploiting Spatiotemporal Features for Action Recognition
    Bin Muslim, Usairam
    Khan, Muhammad Hassan
    Farid, Muhammad Shahid
    PROCEEDINGS OF 2021 INTERNATIONAL BHURBAN CONFERENCE ON APPLIED SCIENCES AND TECHNOLOGIES (IBCAST), 2021, : 613 - 619