Multimodal graph neural network for video procedural captioning

Cited by: 6

Authors:

Ji, Lei [1,2,3]

Tu, Rongcheng [4]

Lin, Kevin [5]

Wang, Lijuan [5]

Duan, Nan [3]

Affiliations:

[1] Chinese Academy of Sciences, Institute of Computing Technology, Beijing, People's Republic of China

[2] University of Chinese Academy of Sciences, Beijing, People's Republic of China

[3] Microsoft Research Asia, Beijing, People's Republic of China

[4] Beijing Institute of Technology, Beijing, People's Republic of China

[5] Microsoft, Redmond, WA, USA

Source:

NEUROCOMPUTING | 2022 / Vol. 488

Keywords:

Multimodal video captioning; Graph neural network

DOI:

10.1016/j.neucom.2022.02.062

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Video procedural captioning aims to generate detailed descriptive captions for all steps in a long instructional video. What distinguishes this problem is the procedural dependency between events, which must be respected to generate consistent captions across the video. However, existing (dense) video captioning methods consider only intra-event or sequential inter-event context and struggle to model non-sequential context dependencies between events. In this paper, inspired by the recent success of graph neural networks in capturing relations in structured data, we propose a novel Multimodal Graph Neural Network (MGNN) for dense video procedural captioning that captures the procedural structure between events. Specifically, we construct a temporal sequential graph and a semantic non-sequential graph to form a multimodal heterogeneous graph. Moreover, we adopt the graph neural network to enhance the visual and text features, and fuse both features for caption generation. Extensive experiments demonstrate that the proposed MGNN is effective in generating coherent captions on both the YouCook2 and ActivityNet Captions benchmarks. © 2022 Elsevier B.V. All rights reserved.
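The abstract does not say how the two graphs are built or which graph neural network is used, so the following is only a toy sketch of the general idea, not the authors' MGNN. Every concrete choice in it is an assumption made for illustration: linking consecutive events for the temporal graph, linking non-adjacent events by cosine similarity of text features above a hypothetical `threshold`, one round of mean aggregation in place of a learned GNN layer, and concatenation as the fusion step.

```python
import math

def temporal_edges(n):
    """Sequential graph: link each event to its immediate successor (assumed rule)."""
    return {(i, i + 1) for i in range(n - 1)}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_edges(text_feats, threshold=0.8):
    """Non-sequential graph: link non-adjacent events with similar text (assumed rule)."""
    n = len(text_feats)
    return {(i, j) for i in range(n) for j in range(i + 2, n)
            if cosine(text_feats[i], text_feats[j]) >= threshold}

def propagate(feats, edges):
    """One round of mean aggregation over undirected edges, with self-loops.

    This stands in for a learned GNN layer; it has no trainable weights.
    """
    neighbours = [{i} for i in range(len(feats))]
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    dim = len(feats[0])
    return [[sum(feats[k][d] for k in nbrs) / len(nbrs) for d in range(dim)]
            for nbrs in neighbours]

def fuse(visual, text):
    """Fuse the two modalities per event by concatenation (assumed fusion)."""
    return [v + t for v, t in zip(visual, text)]

# Four events with made-up 2-d features; events 0 and 3 share the same text feature.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
text = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

seq = temporal_edges(len(visual))
sem = semantic_edges(text)
edges = seq | sem

enhanced_visual = propagate(visual, edges)
enhanced_text = propagate(text, edges)
fused = fuse(enhanced_visual, enhanced_text)
```

In this toy input the semantic graph adds a single edge between events 0 and 3, which the sequential graph alone would not connect. That is the kind of non-sequential dependency the abstract says earlier methods struggle to model. The fused per-event vectors would then feed a caption decoder, which is omitted here.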


Pages: 88-96

Number of pages: 9
