Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

Cited by: 3
Authors
Zhao, Yu [1]
Fei, Hao [2]
Cao, Yixin [3]
Li, Bobo [4]
Zhang, Meishan [5]
Wei, Jianguo [1]
Zhang, Min [5]
Chua, Tat-Seng [2]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[2] Natl Univ Singapore, NExT Res Ctr, Singapore, Singapore
[3] Singapore Management Univ, Singapore, Singapore
[4] Wuhan Univ, Wuhan, Peoples R China
[5] Harbin Inst Technol Shenzhen, Harbin, Peoples R China
Keywords
Video Understanding; Semantic Role Labeling; Event Extraction; Scene Graph
DOI
10.1145/3581783.3612096
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video Semantic Role Labeling (VidSRL) aims to detect the salient events in given videos by recognizing the predicate-argument event structures and the interrelationships between events. While recent work has put forth methods for VidSRL, most are subject to two key drawbacks: the lack of fine-grained spatial scene perception and the insufficient modeling of video temporality. To this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation, based on existing dynamic scene graph structures, which models both the fine-grained spatial semantics and the temporal dynamics of videos for VidSRL. Built upon the HostSG, we present a niche-targeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event (termed ICE) graph structure. We further perform iterative structure refinement to optimize the ICE graph, such that the overall structure representation best coincides with the demands of the end task. Finally, the three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework improves significantly over the current best-performing model. Further analyses are provided for a better understanding of the advances of our methods.
Pages: 5281 - 5291
Number of pages: 11