Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

Cited by: 3
Authors
Zhao, Yu [1]
Fei, Hao [2]
Cao, Yixin [3]
Li, Bobo [4]
Zhang, Meishan [5]
Wei, Jianguo [1]
Zhang, Min [5]
Chua, Tat-Seng [2]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[2] Natl Univ Singapore, NExT Res Ctr, Singapore, Singapore
[3] Singapore Management Univ, Singapore, Singapore
[4] Wuhan Univ, Wuhan, Peoples R China
[5] Harbin Inst Technol Shenzhen, Harbin, Peoples R China
Keywords
Video Understanding; Semantic Role Labeling; Event Extraction; Scene Graph
DOI
10.1145/3581783.3612096
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video Semantic Role Labeling (VidSRL) aims to detect the salient events in given videos by recognizing the predicate-argument event structures and the interrelationships between events. While recent work has put forth methods for VidSRL, these methods mostly suffer from two key drawbacks: the lack of fine-grained spatial scene perception and the insufficient modeling of video temporality. To this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation, based on existing dynamic scene graph structures, which models both the fine-grained spatial semantics and the temporal dynamics of videos for VidSRL. Built upon the HostSG, we present a niche-targeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event (termed ICE) graph structure. We further perform iterative structure refinement to optimize the ICE graph, so that the overall structure representation best matches the demands of the end task. Finally, the three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework improves significantly over the current best-performing model. Further analyses are provided for a better understanding of the advances of our method.
Pages: 5281-5291
Page count: 11