Masked Autoencoders for Spatial-Temporal Relationship in Video-Based Group Activity Recognition

Cited by: 0
Authors
Yadav, Rajeshwar [1 ]
Halder, Raju [1 ]
Banda, Gourinath [2 ]
Affiliations
[1] Indian Inst Technol Patna, Dept Comp Sci & Engn, Bihta 801106, India
[2] Indian Inst Technol Indore, Dept Comp Sci & Engn, Indore 453552, Madhya Pradesh, India
Source
IEEE ACCESS | 2024, Volume 12
Keywords
Group activity recognition (GAR); hostage crime; IITP hostage dataset; spatial and temporal interaction; vision transformer; masked autoencoder
DOI
10.1109/ACCESS.2024.3457024
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Group Activity Recognition (GAR) is a challenging problem involving several intricacies. The core of GAR lies in delving into spatiotemporal features to generate appropriate scene representations. Previous methods, however, either rely on complex frameworks that require individual action labels or fail to model spatial and temporal features adequately. To address these concerns, we propose a masking strategy for learning task-specific GAR scene representations through reconstruction, and we show how this methodology effectively captures task-specific spatiotemporal features. Three notable findings emerge from our framework: 1) GAR is simplified, eliminating the need for individual action labels; 2) generating target-specific spatiotemporal features yields favourable outcomes across various datasets; and 3) the method remains effective even on datasets with a small number of videos, highlighting its capability with limited training data. Furthermore, existing GAR datasets contain few videos per class and consider only a few actors, which restricts how well existing models generalise. To this end, we introduce 923 videos for a crime activity named IITP Hostage, comprising two categories, hostage and non-hostage. To our knowledge, this is the first attempt to recognize crime-based activities in GAR. Our framework achieves an MCA of 96.8%, 97.0%, and 97.0% on the Collective Activity Dataset (CAD), new CAD, and extended CAD datasets, respectively, and 84.3%, 95.6%, and 96.78% on the IITP Hostage, hostage+CAD, and a subset of the UCF Crime datasets. The hostage and non-hostage scenarios introduce additional complexity, making them harder for the model to recognize accurately than hostage+CAD and the other datasets. This observation underscores the need to delve deeper into the complexity of GAR activities.
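The masking-and-reconstruction idea in the abstract follows the general masked autoencoder (MAE) recipe: split each clip into spatiotemporal patch tokens, hide a high fraction of them, encode only the visible tokens, and train the network to reconstruct the hidden pixels. The following PyTorch sketch illustrates that general recipe only, not the authors' exact architecture; the class name MaskedVideoAutoencoder, all layer sizes, the 75% mask ratio, and the token counts are illustrative assumptions.

```python
# Minimal sketch of masked-autoencoder pretraining on video patch tokens.
# Assumed, illustrative hyperparameters throughout; not the paper's model.
import torch
import torch.nn as nn

class MaskedVideoAutoencoder(nn.Module):
    def __init__(self, patch_dim=768, embed_dim=256, num_tokens=392, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, embed_dim)            # patch pixels -> token
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))
        enc = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)  # sees visible tokens only
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        dec = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, num_layers=2)  # lightweight decoder
        self.head = nn.Linear(embed_dim, patch_dim)               # reconstruct pixels

    def random_mask(self, tokens):
        # Keep a random subset of tokens per sample; return kept/masked index sets.
        B, N, D = tokens.shape
        keep = int(N * (1 - self.mask_ratio))
        ids = torch.rand(B, N, device=tokens.device).argsort(dim=1)  # random permutation
        ids_keep, ids_mask = ids[:, :keep], ids[:, keep:]
        visible = tokens.gather(1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        return visible, ids, ids_mask

    def forward(self, patches):
        # patches: (B, N, patch_dim) flattened spatiotemporal patch pixels
        B, N, _ = patches.shape
        tokens = self.embed(patches) + self.pos
        visible, ids, ids_mask = self.random_mask(tokens)
        latent = self.encoder(visible)
        # Append mask tokens, undo the shuffle, decode, and reconstruct pixels.
        mask_tokens = self.mask_token.expand(B, ids_mask.size(1), -1)
        full = torch.cat([latent, mask_tokens], dim=1)
        restore = ids.argsort(dim=1)                              # inverse permutation
        full = full.gather(1, restore.unsqueeze(-1).expand(-1, -1, full.size(-1)))
        recon = self.head(self.decoder(full))
        # MSE loss computed on masked patches only.
        per_token = ((recon - patches) ** 2).mean(dim=-1)
        mask = torch.zeros(B, N, device=patches.device)
        mask.scatter_(1, ids_mask, 1.0)
        return (per_token * mask).sum() / mask.sum()

model = MaskedVideoAutoencoder()
clips = torch.randn(2, 392, 768)   # e.g. 8 frames x 49 patches, 16x16x3 pixels each
loss = model(clips)
loss.backward()
```

Two design points carry the weight here: the loss is computed only on masked tokens, so the encoder cannot trivially copy inputs, and the high mask ratio forces it to infer hidden patches from spatiotemporal context, which is the property the abstract credits for learning task-specific scene representations without individual action labels.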
Pages: 132084-132095
Page count: 12