Multimodal graph neural network for video procedural captioning

Cited by: 6

Authors:

Ji, Lei [1,2,3]

Tu, Rongcheng [4]

Lin, Kevin [5]

Wang, Lijuan [5]

Duan, Nan [3]

Affiliations:

[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Beijing, Peoples R China

[3] Microsoft Res Asia, Beijing, Peoples R China

[4] Beijing Inst Technol, Beijing, Peoples R China

[5] Microsoft, Redmond, WA USA

Source:

NEUROCOMPUTING | 2022 / Vol. 488

Keywords:

Multimodal video captioning; Graph neural network;

DOI:

10.1016/j.neucom.2022.02.062

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video procedural captioning aims to generate detailed descriptive captions for all steps in a long instructional video. What distinguishes this problem is the procedural dependency between events, which must be modeled to generate consistent captions across the video. However, existing (dense) video captioning methods consider only intra-event or sequential inter-event context and struggle to model non-sequential context dependencies between events. In this paper, inspired by the recent success of graph neural networks in capturing relations in structured data, we propose a novel Multimodal Graph Neural Network (MGNN) for dense video procedural captioning that captures the procedural structure between events. Specifically, we construct a temporal sequential graph and a semantic non-sequential graph to form a multimodal heterogeneous graph. Moreover, we adopt the graph neural network to enhance the visual and text features, and fuse both features for caption generation. Extensive experiments demonstrate that the proposed MGNN is effective in generating coherent captions on both the YouCook2 and ActivityNet Captions benchmarks. (c) 2022 Elsevier B.V. All rights reserved.


Pages: 88-96

Page count: 9
