VIDEO CAPTIONING WITH TEMPORAL AND REGION GRAPH CONVOLUTION NETWORK

Cited by: 3

Authors:

Xiao, Xinlong [1]

Zhang, Yuejie [1]

Feng, Rui [1]

Zhang, Tao [2]

Gao, Shang [3]

Fan, Weiguo [4]

Affiliations:

[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Intelligent Informat Proc, Shanghai, Peoples R China

[2] Shanghai Univ Finance & Econ, Sch Informat Management & Engn, Shanghai, Peoples R China

[3] Deakin Univ, Sch Informat Technol, Geelong, Vic, Australia

[4] Univ Iowa, Tippie Coll Business, Dept Business Analyt, Iowa City, IA 52242 USA

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME) | 2020

Funding:

National Natural Science Foundation of China;

Keywords:

Video Captioning; Graph Convolution Network; Temporal Graph Network; Region Graph Network; Language Generation Model;

DOI:

10.1109/icme46284.2020.9102967

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Video captioning aims to generate a natural language description for a given video clip, which carries both spatial and temporal information. To better exploit this spatial-temporal information, we propose a novel video captioning framework with a Temporal Graph Network (TGN) and a Region Graph Network (RGN). TGN focuses on the sequential information of frames, which most existing methods ignore. RGN is designed to explore the relationships among salient objects. Unlike previous work, we introduce a Graph Convolution Network (GCN) to encode frames together with their sequential information and build a region graph to exploit object information. We also adopt a stacked GRU decoder with a coarse-to-fine structure for caption generation. Experimental results on two benchmark datasets (MSVD and MSR-VTT) show the effectiveness of our model.
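The abstract does not say how the temporal graph is built or which propagation rule is used, so the following is only a minimal sketch of the general idea of encoding frames with a graph convolution. It assumes a chain graph in which each frame is linked to its immediate neighbours, and the widely used symmetric-normalized propagation rule ReLU(D^{-1/2} A D^{-1/2} H W). The function names, the toy frame features, and the weight matrix are all hypothetical and are not taken from the paper.

```python
import math

def chain_adjacency(num_frames):
    """Assumed temporal graph: frame t is linked to t-1 and t+1, plus a self-loop."""
    adj = [[0.0] * num_frames for _ in range(num_frames)]
    for t in range(num_frames):
        adj[t][t] = 1.0  # self-loop keeps the frame's own feature
        if t + 1 < num_frames:
            adj[t][t + 1] = 1.0
            adj[t + 1][t] = 1.0
    return adj

def normalize(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}, with D the degree matrix of A."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(features, adj_norm, weight):
    """One graph convolution: ReLU(adj_norm @ features @ weight)."""
    out = matmul(matmul(adj_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Toy input: 4 frames with 2-dimensional features (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
# Hypothetical weight matrix mapping 2 input dims to 3 output dims.
weight = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]

encoded = gcn_layer(frames, normalize(chain_adjacency(len(frames))), weight)
```

Each row of `encoded` mixes a frame's feature with those of its temporal neighbours, which is one simple way sequential information can enter a frame encoding. A region graph over detected objects could reuse `gcn_layer` with a different adjacency matrix.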

Pages: 6

50 items in total

[1] Using Spatial Temporal Graph Convolutional Network Dynamic Scene Graph for Video Captioning of Pedestrians Intention
Cao, Dong
Zhao, Qunhe
Fu, Yunbin
[J]. 2020 4TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL, NLPIR 2020, 2020, : 179 - 183
[2] Spatio-Temporal Graph-based Semantic Compositional Network for Video Captioning
Li, Shun
Zhang, Ze-Fan
Ji, Yi
Li, Ying
Liu, Chun-Ping
[J]. 2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
[3] Multimodal graph neural network for video procedural captioning
Ji, Lei
Tu, Rongcheng
Lin, Kevin
Wang, Lijuan
Duan, Nan
[J]. NEUROCOMPUTING, 2022, 488 : 88 - 96
[4] Exploring the Spatio-Temporal Aware Graph for video captioning
Xue, Ping
Zhou, Bing
[J]. IET COMPUTER VISION, 2022, 16 (05) : 456 - 467
[5] Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning
Zhang, Junchao
Peng, Yuxin
[J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 8319 - 8328
[6] CMGNet: Collaborative multi-modal graph network for video captioning
Rao, Qi
Yu, Xin
Li, Guang
Zhu, Linchao
[J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 238
[7] Incorporating the Graph Representation of Video and Text into Video Captioning
Lu, Min
Li, Yuan
[J]. 2022 IEEE 34TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI, 2022, : 396 - 401
[8] MIVCN: Multimodal interaction video captioning network based on semantic association graph
Wang, Ying
Huang, Guoheng
Lin, Yuming
Yuan, Haoliang
Pun, Chi-Man
Ling, Wing-Kuen
Cheng, Lianglun
[J]. APPLIED INTELLIGENCE, 2022, 52 (05) : 5241 - 5260
[9] Reconstruction Network for Video Captioning
Wang, Bairui
Ma, Lin
Zhang, Wei
Liu, Wei
[J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 7622 - 7631
