Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

Cited by: 3
Authors
Zhao, Yu [1]
Fei, Hao [2]
Cao, Yixin [3]
Li, Bobo [4]
Zhang, Meishan [5]
Wei, Jianguo [1]
Zhang, Min [5]
Chua, Tat-Seng [2]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[2] Natl Univ Singapore, NExT Res Ctr, Singapore, Singapore
[3] Singapore Management Univ, Singapore, Singapore
[4] Wuhan Univ, Wuhan, Peoples R China
[5] Harbin Inst Technol Shenzhen, Harbin, Peoples R China
Keywords
Video Understanding; Semantic Role Labeling; Event Extraction; Scene Graph
DOI
10.1145/3581783.3612096
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video Semantic Role Labeling (VidSRL) aims to detect the salient events in given videos by recognizing the predicate-argument event structures and the interrelationships between events. While recent endeavors have put forth methods for VidSRL, they are mostly subject to two key drawbacks: the lack of fine-grained spatial scene perception and the insufficient modeling of video temporality. To this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation based on existing dynamic scene graph structures, which models both the fine-grained spatial semantics and the temporal dynamics of videos for VidSRL. Built upon the HostSG, we present a niche-targeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event (termed ICE) graph structure. We further perform iterative structure refinement to optimize the ICE graph, such that the overall structure representation best coincides with the end task demand. Finally, the three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework improves significantly over the current best-performing model. Further analyses are shown for a better understanding of the advances of our methods.
Pages: 5281-5291
Page count: 11
Related papers
50 in total
  • [22] Semantic Scene Mapping with Spatio-temporal Deep Neural Network for Robotic Applications
    Li, Ruihao
    Gu, Dongbing
    Liu, Qiang
    Long, Zhiqiang
    Hu, Huosheng
    COGNITIVE COMPUTATION, 2018, 10 (02) : 260 - 271
  • [23] A hierarchical spatio-temporal object knowledge graph model for dynamic scene representation
    Zhao, Xinke
    Cao, Yibing
    Wang, Jiahe
    Fan, Xinhua
    Chen, Minjie
    TRANSACTIONS IN GIS, 2023, 27 (07) : 1992 - 2016
  • [24] Spatio-temporal Sampling for Video
    Shankar, Mohan
    Pitsianis, Nikos P.
    Brady, David
    IMAGE RECONSTRUCTION FROM INCOMPLETE DATA V, 2008, 7076
  • [25] Video2Vec: Learning Semantic Spatio-Temporal Embeddings for Video Representation
    Hu, Sheng-Hung
    Li, Yikang
    Li, Baoxin
    2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2016, : 811 - 816
  • [26] Video Relationship Reasoning using Gated Spatio-Temporal Energy Graph
    Tsai, Yao-Hung Hubert
    Divvala, Santosh
    Morency, Louis-Philippe
    Salakhutdinov, Ruslan
    Farhadi, Ali
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 10416 - 10425
  • [27] Learning Social Spatio-Temporal Relation Graph in the Wild and a Video Benchmark
    Wang, Haoran
    Jiao, Licheng
    Liu, Fang
    Li, Lingling
    Liu, Xu
    Ji, Deyi
    Gan, Weihao
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (06) : 2951 - 2964
  • [28] VIDEO ACTION RECOGNITION WITH SPATIO-TEMPORAL GRAPH EMBEDDING AND SPLINE MODELING
    Yuan, Yin
    Zheng, Haomian
    Li, Zhu
    Zhang, David
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 2422 - 2425
  • [29] Holistic Spatio-Temporal Graph Attention for Trajectory Prediction in Vehicle-Pedestrian Interactions
    Alghodhaifi, Hesham
    Lakshmanan, Sridhar
    SENSORS, 2023, 23 (17)
  • [30] Holistic OR domain modeling: a semantic scene graph approach
    Oezsoy, Ege
    Czempiel, Tobias
    Oernek, Evin Pinar
    Eck, Ulrich
    Tombari, Federico
    Navab, Nassir
    INTERNATIONAL JOURNAL OF COMPUTER ASSISTED RADIOLOGY AND SURGERY, 2024, 19 (05) : 791 - 799