Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

Cited by: 3

Authors:

Zhao, Yu [1]

Fei, Hao [2]

Cao, Yixin [3]

Li, Bobo [4]

Zhang, Meishan [5]

Wei, Jianguo [1]

Zhang, Min [5]

Chua, Tat-Seng [2]

Affiliations:

[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China

[2] Natl Univ Singapore, NExT Res Ctr, Singapore, Singapore

[3] Singapore Management Univ, Singapore, Singapore

[4] Wuhan Univ, Wuhan, Peoples R China

[5] Harbin Inst Technol Shenzhen, Harbin, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Keywords:

Video Understanding; Semantics Role Labeling; Event Extraction; Scene Graph;

DOI:

10.1145/3581783.3612096

CLC number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video Semantic Role Labeling (VidSRL) aims to detect the salient events in given videos by recognizing the predicate-argument event structures and the interrelationships between events. While recent work has put forth methods for VidSRL, these methods are mostly subject to two key drawbacks: the lack of fine-grained spatial scene perception and the insufficient modeling of video temporality. To this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation based on existing dynamic scene graph structures, which models well both the fine-grained spatial semantics and the temporal dynamics of videos for VidSRL. Built upon the HostSG, we present a niche-targeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event (termed ICE) graph structure. We further perform iterative structure refinement to optimize the ICE graph, such that the overall structure representation best coincides with the demands of the end task. Finally, the three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework improves significantly over the current best-performing model. Further analyses are shown for a better understanding of the advances of our methods.

Pages: 5281-5291

Page count: 11
