Multimodal graph neural network for video procedural captioning

Cited by: 6
Authors
Ji, Lei [1, 2, 3]
Tu, Rongcheng [4]
Lin, Kevin [5]
Wang, Lijuan [5]
Duan, Nan [3]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Microsoft Res Asia, Beijing, Peoples R China
[4] Beijing Inst Technol, Beijing, Peoples R China
[5] Microsoft, Redmond, WA USA
Keywords
Multimodal video captioning; Graph neural network
DOI
10.1016/j.neucom.2022.02.062
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video procedural captioning aims to generate detailed descriptive captions for all steps in a long instructional video. What sets this problem apart is the procedural dependency between events, which must be respected to generate consistent captions throughout the video. However, existing (dense) video captioning methods consider only intra-event or sequential inter-event context and struggle to model non-sequential context dependencies between events. In this paper, inspired by the recent success of graph neural networks in capturing relations in structured data, we propose a novel Multimodal Graph Neural Network (MGNN) for dense video procedural captioning that captures the procedural structure between events. Specifically, we construct a temporal sequential graph and a semantic non-sequential graph, which together form a multimodal heterogeneous graph. We then apply a graph neural network to enhance the visual and text features, and fuse both sets of features for caption generation. Extensive experiments demonstrate that the proposed MGNN is effective in generating coherent captions on both the YouCook2 and ActivityNet Captions benchmarks. © 2022 Elsevier B.V. All rights reserved.
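The pipeline outlined in the abstract (two kinds of edges between events, graph-based feature enhancement, then fusion) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how edges are built, which graph neural network is used, or how features are fused. The cosine-similarity threshold, the mean-aggregation step, the fusion by concatenation, and all function names below are assumptions chosen only to make the idea concrete.

```python
import math

# Illustrative sketch only; every name and design choice here is hypothetical.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def temporal_edges(n):
    """Sequential graph: link each event to its immediate successor."""
    return {(i, i + 1) for i in range(n - 1)}

def semantic_edges(feats, threshold=0.8):
    """Non-sequential graph: link non-adjacent events with similar features.

    Assumption: similarity is cosine similarity above a fixed threshold.
    """
    n = len(feats)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 2, n)
        if cosine(feats[i], feats[j]) >= threshold
    }

def propagate(feats, edges):
    """One round of mean aggregation over each node and its neighbours."""
    n = len(feats)
    neigh = {i: [i] for i in range(n)}  # include a self-loop
    for i, j in edges:  # treat edges as undirected
        neigh[i].append(j)
        neigh[j].append(i)
    dim = len(feats[0])
    return [
        [sum(feats[k][d] for k in neigh[i]) / len(neigh[i]) for d in range(dim)]
        for i in range(n)
    ]

def enhance_and_fuse(visual, text, threshold=0.8):
    """Enhance both modalities over the combined graph, then concatenate."""
    edges = temporal_edges(len(visual)) | semantic_edges(text, threshold)
    enhanced_visual = propagate(visual, edges)
    enhanced_text = propagate(text, edges)
    return [v + t for v, t in zip(enhanced_visual, enhanced_text)]

# Four events with toy 2-d features; events 0 and 2 (and 1 and 3) are
# semantically close even though they are not adjacent in time.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
fused = enhance_and_fuse(visual, text)
```

In this toy input the semantic graph adds the edges (0, 2) and (1, 3) on top of the three temporal edges, so each event receives context from a related but non-neighbouring step, which is the kind of non-sequential dependency the abstract refers to.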
Pages: 88-96
Number of pages: 9