Masked Autoencoders for Spatial-Temporal Relationship in Video-Based Group Activity Recognition

Cited by: 0
Authors
Yadav, Rajeshwar [1 ]
Halder, Raju [1 ]
Banda, Gourinath [2 ]
Affiliations
[1] Indian Inst Technol Patna, Dept Comp Sci & Engn, Bihta 801106, India
[2] Indian Inst Technol Indore, Dept Comp Sci & Engn, Indore 453552, Madhya Pradesh, India
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Group activity recognition (GAR); hostage crime; IITP hostage dataset; spatial and temporal interaction; vision transformer; masked autoencoder
DOI
10.1109/ACCESS.2024.3457024
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Group Activity Recognition (GAR) is a challenging problem with several intricacies. At its core, GAR requires mining spatiotemporal features to build appropriate scene representations. Previous methods, however, either rely on complex frameworks that require individual action labels or lack adequate modelling of spatial and temporal features. To address these concerns, we propose a masking strategy that learns task-specific GAR scene representations through reconstruction, and we show how this methodology effectively captures task-specific spatiotemporal features. Three notable findings emerge from our framework: 1) GAR is simplified, eliminating the need for individual action labels; 2) the generated target-specific spatiotemporal features yield favourable results across various datasets; and 3) the method remains effective even on datasets with few videos, highlighting its capability under limited training data. Moreover, existing GAR datasets contain few videos per class and consider only a few actors, which restricts how well existing models generalise. To this end, we introduce IITP Hostage, a crime-activity dataset of 923 videos spanning two categories, hostage and non-hostage. To our knowledge, this is the first attempt to recognize crime-based activities in GAR. Our framework achieves MCA of 96.8%, 97.0%, and 97.0% on the Collective Activity Dataset (CAD), new CAD, and extended CAD, respectively, and 84.3%, 95.6%, and 96.78% on IITP Hostage, hostage+CAD, and a subset of the UCF Crime dataset. The hostage and non-hostage scenarios introduce additional complexity, making accurate recognition harder than on hostage+CAD and the other datasets; this observation underscores the need to delve deeper into the complexity of GAR activities.
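The mask-and-reconstruct idea described in the abstract can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the tube-masking scheme, the 0.75 mask ratio, and the token shapes are assumptions borrowed from common video masked-autoencoder practice, and the encoder/decoder themselves are elided.

```python
import numpy as np

def random_tube_mask(num_frames, num_patches, mask_ratio, rng):
    """Sample a boolean mask over spatiotemporal tokens.

    'Tube' masking hides the same spatial patches in every frame, a
    common choice in video masked autoencoders (an assumption here;
    the paper's exact masking strategy may differ). True = masked.
    """
    num_masked = int(num_patches * mask_ratio)
    order = rng.permutation(num_patches)
    spatial_mask = np.zeros(num_patches, dtype=bool)
    spatial_mask[order[:num_masked]] = True
    # Broadcast the per-patch mask across the temporal axis.
    return np.broadcast_to(spatial_mask, (num_frames, num_patches))

def masked_reconstruction_loss(tokens, reconstruction, mask):
    """Mean-squared error computed only on the masked tokens, so the
    model is trained to infer hidden content from visible context."""
    diff = (reconstruction - tokens) ** 2
    return diff[mask].mean()

# Example: 8 frames, 196 patches per frame, 64-dim tokens.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 196, 64))
mask = random_tube_mask(num_frames=8, num_patches=196,
                        mask_ratio=0.75, rng=rng)
```

Computing the loss only on masked positions is what forces the representation to encode scene-level spatiotemporal structure rather than copying visible pixels.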
Pages: 132084-132095 (12 pages)
Related Papers
(50 in total)
  • [21] Yang, Jinrui; Zheng, Wei-Shi; Yang, Qize; Chen, Ying-Cong; Tian, Qi. Spatial-Temporal Graph Convolutional Network for Video-based Person Re-identification. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020: 3286-3296.
  • [22] Tang, X; Gao, XB; Liu, JZ; Zhang, HJ. A spatial-temporal approach for video caption detection and recognition. IEEE TRANSACTIONS ON NEURAL NETWORKS, 2002, 13(4): 961-971.
  • [23] Pan, Xianzhang; Ying, Guoliang; Chen, Guodong; Li, Hongming; Li, Wenshu. A Deep Spatial and Temporal Aggregation Framework for Video-Based Facial Expression Recognition. IEEE ACCESS, 2019, 7: 48807-48815.
  • [24] Pan, Xianzhang; Guo, Wenping; Guo, Xiaoying; Li, Wenshu; Xu, Junjie; Wu, Jinzhao. Deep Temporal-Spatial Aggregation for Video-Based Facial Expression Recognition. SYMMETRY-BASEL, 2019, 11(1).
  • [25] Zhang, H.; Fu, D.; Zhou, K. Video-Based Temporal Enhanced Action Recognition. Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, 2020, 33(10): 951-958.
  • [26] Wang, Lukun; Feng, Wancheng; Tian, Chunpeng; Chen, Liquan; Pei, Jiaming. 3D-unified spatial-temporal graph for group activity recognition. NEUROCOMPUTING, 2023, 556.
  • [27] Chen, Lin; Yang, Hua; Gao, Zhiyong. Joint Attentive Spatial-Temporal Feature Aggregation for Video-Based Person Re-Identification. IEEE ACCESS, 2019, 7: 41230-41240.
  • [28] Chen, Guangyi; Lu, Jiwen; Yang, Ming; Zhou, Jie. Spatial-Temporal Attention-Aware Learning for Video-Based Person Re-Identification. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2019, 28(9): 4192-4205.
  • [29] Xu, Shuangjie; Cheng, Yu; Gu, Kang; Yang, Yang; Chang, Shiyu; Zhou, Pan. Jointly Attentive Spatial-Temporal Pooling Networks for Video-based Person Re-Identification. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017: 4743-4752.
  • [30] Pan, Xianzhang; Zhang, Shiqing; Guo, WenPing; Zhao, Xiaoming; Chuang, Yuelong; Chen, Ying; Zhang, Haibo. Video-Based Facial Expression Recognition using Deep Temporal-Spatial Networks. IETE TECHNICAL REVIEW, 2020, 37(4): 402-409.