Masked Autoencoders for Spatial-Temporal Relationship in Video-Based Group Activity Recognition

Cited by: 0
Authors
Yadav, Rajeshwar [1 ]
Halder, Raju [1 ]
Banda, Gourinath [2 ]
Affiliations
[1] Indian Inst Technol Patna, Dept Comp Sci & Engn, Bihta 801106, India
[2] Indian Inst Technol Indore, Dept Comp Sci & Engn, Indore 453552, Madhya Pradesh, India
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Group activity recognition (GAR); hostage crime; IITP hostage dataset; spatial and temporal interaction; vision transformer; masked autoencoder
DOI
10.1109/ACCESS.2024.3457024
CLC number
TP [automation technology, computer technology]
Subject classification code
0812
Abstract
Group Activity Recognition (GAR) is a challenging problem involving several intricacies. The core of GAR lies in delving into spatiotemporal features to generate appropriate scene representations. Previous methods, however, either rely on a complex framework requiring individual action labels or lack adequate modelling of spatial and temporal features. To address these concerns, we propose a masking strategy for learning task-specific GAR scene representations through reconstruction. Furthermore, we elucidate how this methodology can effectively capture task-specific spatiotemporal features. In particular, three notable findings emerge from our framework: 1) GAR is simplified, eliminating the need for individual action labels; 2) the generation of target-specific spatiotemporal features yields favourable outcomes across various datasets; and 3) the method remains effective even on datasets with a small number of videos, highlighting its capability with limited training data. Moreover, existing GAR datasets contain few videos per class and consider only a few actors, which restricts how well existing models generalise. To this end, we introduce IITP Hostage, a crime-activity dataset of 923 videos spanning two categories, hostage and non-hostage. To our knowledge, this is the first attempt to recognize crime-based activities in GAR. Our framework achieves an MCA of 96.8%, 97.0%, and 97.0% on the Collective Activity Dataset (CAD), new CAD, and extended CAD datasets, and 84.3%, 95.6%, and 96.78% on IITP Hostage, hostage+CAD, and a subset of the UCF Crime dataset, respectively. The hostage and non-hostage scenarios introduce additional complexity, making it more challenging for the model to accurately recognize the activities compared to hostage+CAD and the other datasets. This observation underscores the necessity to delve deeper into the complexity of GAR activities.
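The masking-and-reconstruction objective the abstract describes follows the standard masked-autoencoder recipe: hide most patch tokens, then train by reconstructing only the hidden ones. A minimal NumPy sketch of that objective (the function names, the 75% mask ratio, and the zero-prediction "decoder" are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, rng=None):
    """Randomly hide a fraction of patch tokens, MAE-style.

    tokens: (N, D) array of patch embeddings.
    Returns the visible tokens and a boolean mask (True = masked).
    """
    rng = rng or np.random.default_rng(0)
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])       # indices of visible tokens
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False                  # False = visible, True = masked
    return tokens[keep_idx], mask

def reconstruction_loss(pred, target, mask):
    """Mean squared error computed only on the masked patches."""
    diff = (pred - target) ** 2
    return diff[mask].mean()

# Toy example: 8 spatiotemporal patches flattened to 48-dim tokens.
tokens = np.random.default_rng(1).normal(size=(8, 48))
visible, mask = random_masking(tokens, mask_ratio=0.75)
# A real decoder would predict the masked patches from the visible ones;
# a zero prediction here just exercises the loss.
loss = reconstruction_loss(np.zeros_like(tokens), tokens, mask)
```

Because the loss is evaluated only where `mask` is True, the encoder must infer the hidden spatiotemporal content from the sparse visible patches, which is what drives the learned scene representation.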
Pages: 132084-132095
Number of pages: 12