Dynamic Spatio-Temporal Graph Reasoning for VideoQA With Self-Supervised Event Recognition

Cited by: 0
Authors
Nie, Jie [1]
Wang, Xin [2]
Hou, Runze [2]
Li, Guohao [2]
Chen, Hong [2]
Zhu, Wenwu [2]
Affiliations
[1] Ocean Univ China, Coll Informat Sci & Engn, Qingdao 266005, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, BNRist, Beijing 100084, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Videos; Visualization; Cognition; Task analysis; Semantics; Question answering (information retrieval); Feature extraction; Vision and language model; video question answering; video understanding; spatio-temporal graph; NETWORKS;
DOI
10.1109/TIP.2024.3411448
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video question answering (VideoQA) requires the ability to comprehensively understand the visual content of videos. Existing VideoQA models mainly focus on scenarios involving a single event with simple object interactions, leaving event-centric scenarios that involve multiple events with dynamically complex object interactions largely unexplored. These conventional VideoQA models are usually based on features extracted from global visual signals, which makes it difficult to capture object-level and event-level semantics. Although a recent work uses a static spatio-temporal graph to explicitly model object interactions in videos, it ignores the dynamic impact of questions on graph construction and fails to exploit the implicit event-level semantic clues in questions. To overcome these limitations, we propose a Self-supervised Dynamic Graph Reasoning (SDGraphR) model for VideoQA. Our SDGraphR model learns a question-guided spatio-temporal graph that dynamically encodes intra-frame spatial correlations and inter-frame correspondences between objects in the videos. Furthermore, the proposed SDGraphR model discovers event-level cues from questions to conduct self-supervised learning with an auxiliary event recognition task, which in turn helps to improve its VideoQA performance without using any extra annotations. We carry out extensive experiments to validate the substantial improvements of our proposed SDGraphR model over existing baselines.
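The idea of a question-guided spatio-temporal graph can be illustrated with a toy sketch. This is not the SDGraphR implementation, whose details the abstract does not give. It assumes, purely for illustration, that each detected object is a small feature vector, that the question is a vector of the same size, and that edge weights come from a question-weighted dot product followed by a softmax. All function names and inputs are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gated_dot(u, v, q):
    # Similarity of two object features, with each dimension
    # weighted by the question vector (an assumed scoring rule).
    return sum(ui * vi * qi for ui, vi, qi in zip(u, v, q))

def question_guided_edges(src_objs, dst_objs, question):
    # Row i holds normalized edge weights from source object i
    # to every destination object.
    return [softmax([gated_dot(u, v, question) for v in dst_objs])
            for u in src_objs]

def build_graph(frames, question):
    # frames: list of frames, each a list of object feature vectors.
    # spatial[t] links objects within frame t (intra-frame);
    # temporal[t] links objects in frame t to those in frame t + 1.
    spatial = [question_guided_edges(f, f, question) for f in frames]
    temporal = [question_guided_edges(a, b, question)
                for a, b in zip(frames, frames[1:])]
    return spatial, temporal

# Two frames with two objects each, and two different "questions".
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
]
q_a = [1.0, 0.0]
q_b = [0.0, 1.0]
spatial_a, temporal_a = build_graph(frames, q_a)
spatial_b, temporal_b = build_graph(frames, q_b)
```

With the same video features, the two questions produce different edge weights, which is the sense in which such a graph is dynamic rather than static.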
Pages: 4145-4158
Page count: 14