Masked Autoencoders for Spatial-Temporal Relationship in Video-Based Group Activity Recognition

Cited: 0
|
Authors
Yadav, Rajeshwar [1 ]
Halder, Raju [1 ]
Banda, Gourinath [2 ]
Affiliations
[1] Indian Inst Technol Patna, Dept Comp Sci & Engn, Bihta 801106, India
[2] Indian Inst Technol Indore, Dept Comp Sci & Engn, Indore 453552, Madhya Pradesh, India
Source
IEEE ACCESS | 2024 / Vol. 12
Keywords
Group activity recognition (GAR); hostage crime; IITP hostage dataset; spatial and temporal interaction; vision transformer; masked autoencoder
DOI
10.1109/ACCESS.2024.3457024
CLC number
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
Group Activity Recognition (GAR) is a challenging problem involving several intricacies. The core of GAR lies in delving into spatiotemporal features to generate appropriate scene representations. Previous methods, however, either rely on a complex framework requiring individual action labels or lack adequate modelling of spatial and temporal features. To address these concerns, we propose a masking strategy for learning task-specific GAR scene representations through reconstruction, and we elucidate how this methodology effectively captures task-specific spatiotemporal features. In particular, three notable findings emerge from our framework: 1) GAR is simplified, eliminating the need for individual action labels; 2) the generation of target-specific spatiotemporal features yields favourable outcomes across various datasets; and 3) the method remains effective even on datasets with a small number of videos, highlighting its capability with limited training data. Furthermore, existing GAR datasets have few videos per class and consider only a few actors, which restricts how well existing models generalise. To this end, we introduce 923 videos for a crime activity named IITP Hostage, which contains two categories, hostage and non-hostage. To our knowledge, this is the first attempt to recognize crime-based activities in GAR. Our framework achieves an MCA of 96.8%, 97.0%, and 97.0% on the Collective Activity Dataset (CAD), new CAD, and extended CAD datasets, and 84.3%, 95.6%, and 96.78% on IITP Hostage, hostage+CAD, and a subset of the UCF Crime dataset, respectively. The hostage and non-hostage scenarios introduce additional complexity, making it more challenging for the model to accurately recognize the activities compared to hostage+CAD and the other datasets. This observation underscores the necessity of delving deeper into the complexity of GAR activities.
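The masking-through-reconstruction idea described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden toy (plain NumPy, random stand-ins for the encoder/decoder; none of the function names come from the paper): a large fraction of space-time patch tokens is hidden, and a reconstruction loss is computed only on the hidden tokens, which is what forces the learned representation to capture spatiotemporal structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches, mask_ratio=0.75, rng=rng):
    """Randomly hide `mask_ratio` of the space-time patch tokens.

    patches: (N, D) array of flattened patch embeddings for one clip.
    Returns (visible, mask) where mask[i] is True for hidden patches.
    """
    n = patches.shape[0]
    n_masked = int(n * mask_ratio)
    idx = rng.permutation(n)
    mask = np.zeros(n, dtype=bool)
    mask[idx[:n_masked]] = True
    return patches[~mask], mask

def reconstruction_loss(pred, target, mask):
    """MSE measured only on the masked positions (MAE-style objective)."""
    diff = (pred - target) ** 2
    return diff[mask].mean()

# Toy clip tokenised into 196 space-time patches of dimension 64.
patches = rng.standard_normal((196, 64))
visible, mask = mask_patches(patches, mask_ratio=0.75)

# Stand-in for a decoder's output; a real model would predict from `visible`.
pred = rng.standard_normal(patches.shape)
loss = reconstruction_loss(pred, patches, mask)
print(visible.shape, int(mask.sum()), float(loss))
```

With a 75% mask ratio, only 49 of 196 tokens reach the encoder, which is also why MAE-style pretraining is comparatively cheap on small video datasets.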
Pages: 132084-132095
Page count: 12
Related papers
50 records in total
  • [1] Video-based Driver Action Recognition via Spatial-Temporal and Motion Deep Learning
    Ma, Fangzhi
    Xing, Guanyu
    Liu, Yanli
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [3] Video-based driver action recognition via hybrid spatial-temporal deep learning framework
    Hu, Yaocong
    Lu, Mingqi
    Xie, Chao
    Lu, Xiaobo
    MULTIMEDIA SYSTEMS, 2021, 27 (03) : 483 - 501
  • [4] Complex interactive activity recognition with spatial-temporal relationship
    Liu, Y.-T., 1600, Editorial Board of Jilin University (44)
  • [5] Spatial-temporal attention for video-based assessment of intraoperative surgical skill
    Wan, Bohua
    Peven, Michael
    Hager, Gregory
    Sikder, Shameema
    Vedula, S. Swaroop
    SCIENTIFIC REPORTS, 14 (1)
  • [6] Fusing HOG and convolutional neural network spatial-temporal features for video-based facial expression recognition
    Pan, Xianzhang
    IET IMAGE PROCESSING, 2020, 14 (01) : 176 - 182
  • [7] Spatial-Temporal Masked Autoencoder for Multi-Device Wearable Human Activity Recognition
    Miao, Shenghuan
    Chen, Ling
    Hu, Rong
    PROCEEDINGS OF THE ACM ON INTERACTIVE MOBILE WEARABLE AND UBIQUITOUS TECHNOLOGIES-IMWUT, 2023, 7 (04):
  • [8] GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer
    Li, Shuaicheng
    Cao, Qianggang
    Liu, Lingbo
    Yang, Kunlin
    Liu, Shinan
    Hou, Jun
    Yi, Shuai
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 13648 - 13657
  • [9] Activity Recognition Based on Spatial-Temporal Attention LSTM
    Xie, Zhao
    Zhou, Yi
    Wu, Ke-Wei
    Zhang, Shun-Ran
    Jisuanji Xuebao/Chinese Journal of Computers, 2021, 44 (02): : 261 - 274
  • [10] Spatial-temporal aware network for video-based person re-identification
    Wang, Jun
    Zhao, Qi
    Jia, Di
    Huang, Ziqing
    Zhang, Miaohui
    Ren, Xing
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (12) : 36355 - 36373