Multimodal graph neural network for video procedural captioning

Cited by: 6
Authors
Ji, Lei [1 ,2 ,3 ]
Tu, Rongcheng [4 ]
Lin, Kevin [5 ]
Wang, Lijuan [5 ]
Duan, Nan [3 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Microsoft Res Asia, Beijing, Peoples R China
[4] Beijing Inst Technol, Beijing, Peoples R China
[5] Microsoft, Redmond, WA USA
Keywords
Multimodal video captioning; Graph neural network
DOI
10.1016/j.neucom.2022.02.062
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Video procedural captioning aims to generate detailed descriptive captions for all steps in a long instructional video. The distinctive challenge of this task is the procedural dependency between events, which must be modeled to produce captions that remain consistent across the video. However, existing (dense) video captioning methods consider only intra-event or sequential inter-event context and struggle to model the non-sequential context dependency between events. In this paper, inspired by the recent success of graph neural networks in capturing relations within structured data, we propose a novel Multimodal Graph Neural Network (MGNN) for dense video procedural captioning that captures the procedural structure between events. Specifically, we construct a temporal sequential graph and a semantic non-sequential graph to form a multimodal heterogeneous graph. We then adopt a graph neural network to enhance the visual and text features, and fuse both features for caption generation. Extensive experiments demonstrate that the proposed MGNN is effective at generating coherent captions on both the YouCook2 and ActivityNet Captions benchmarks. (c) 2022 Elsevier B.V. All rights reserved.
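The abstract outlines the architecture: link events through a temporal (sequential) graph and a semantic (non-sequential) graph, run graph message passing to enhance each modality's features, then fuse the enhanced visual and text features for the caption decoder. Below is a minimal NumPy sketch of that idea under stated assumptions; it is not the paper's implementation, and `gnn_enhance`, the toy adjacency matrices, and the shared weight `w` are all illustrative.

```python
import numpy as np

def gnn_enhance(x, adjs, w):
    # One round of message passing: mean-aggregate neighbor features over
    # each graph (temporal + semantic), then a shared linear map + ReLU.
    agg = sum((a / np.maximum(a.sum(axis=1, keepdims=True), 1)) @ x
              for a in adjs)
    return np.maximum(agg @ w, 0.0)

# Toy setup: 4 events, 8-dim visual and text features per event.
rng = np.random.default_rng(0)
n, d = 4, 8
visual = rng.normal(size=(n, d))
text = rng.normal(size=(n, d))

# Temporal sequential graph: chain linking consecutive events.
a_temp = np.eye(n, k=1) + np.eye(n, k=-1)
# Semantic non-sequential graph: a hypothetical similarity edge
# between two non-adjacent events (e.g. events 0 and 2).
a_sem = np.zeros((n, n))
a_sem[0, 2] = a_sem[2, 0] = 1.0

w = rng.normal(size=(d, d)) * 0.1
v_enh = gnn_enhance(visual, [a_temp, a_sem], w)  # enhanced visual features
t_enh = gnn_enhance(text, [a_temp, a_sem], w)    # enhanced text features

# Fuse both modalities (here by concatenation) for the caption decoder.
fused = np.concatenate([v_enh, t_enh], axis=1)
print(fused.shape)  # (4, 16)
```

Mean aggregation and concatenation-based fusion are simple stand-ins; the actual model's aggregation and fusion operators are described in the paper itself.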
Pages: 88-96
Number of pages: 9
Related papers
50 records in total
  • [31] Li, Shun; Zhang, Ze-Fan; Ji, Yi; Li, Ying; Liu, Chun-Ping. Spatio-Temporal Graph-based Semantic Compositional Network for Video Captioning. 2022 International Joint Conference on Neural Networks (IJCNN), 2022.
  • [32] Munusamy, Hemalatha; Sekhar, C. Chandra. Multimodal attention-based transformer for video captioning. Applied Intelligence, 2023, 53(20): 23349-23368.
  • [33] Xu, Jun; Yao, Ting; Zhang, Yongdong; Mei, Tao. Learning Multimodal Attention LSTM Networks for Video Captioning. Proceedings of the 2017 ACM Multimedia Conference (MM'17), 2017: 537-545.
  • [34] Lee, Sujin; Kim, Incheol. DVC-Net: A deep neural network model for dense video captioning. IET Computer Vision, 2021, 15(01): 12-23.
  • [35] Ye, Hanhua; Li, Guorong; Qi, Yuankai; Wang, Shuhui; Huang, Qingming; Yang, Ming-Hsuan. Hierarchical Modular Network for Video Captioning. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022: 17918-17927.
  • [36] Ryu, Hobin; Kang, Sunghun; Kang, Haeyong; Yoo, Chang D. Semantic Grouping Network for Video Captioning. Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), 2021, 35: 2514-2522.
  • [37] Zhu, Mingjian; Duan, Chenrui; Yu, Changbin. Rethinking Network for Classroom Video Captioning. Twelfth International Conference on Signal Processing Systems, 2021, 11719.
  • [38] Wang, Weixuan; Hu, Haifeng. Multimodal object description network for dense captioning. Electronics Letters, 2017, 53(15): 1041+.
  • [39] Guo, Lan; Zhao, Hong; Chen, ZhiWen; Han, ZeYu. Semantic guidance network for video captioning. Scientific Reports, 13.
  • [40] Zhang, Xiao; Liu, Chunsheng; Chang, Faliang. Guidance Module Network for Video Captioning. 2021 Proceedings of the 40th Chinese Control Conference (CCC), 2021: 7955-7959.