CMGNet: Collaborative multi-modal graph network for video captioning

Cited by: 1
Authors
Rao, Qi [1 ]
Yu, Xin [1 ]
Li, Guang [1 ]
Zhu, Linchao [1 ]
Affiliations
[1] Univ Technol Sydney, Fac Engn & Informat Technol, Australian Artificial Intelligence Inst AAII, Ultimo, NSW 2007, Australia
Keywords
Video Captioning; Multiple Modality Learning; Graph Neural Networks;
DOI
10.1016/j.cviu.2023.103864
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In video captioning, it is very challenging to comprehensively describe the multi-modal content of a video, such as appearance, motion, and objects. Prior works often neglect interactions among multiple modalities, and thus their video representations may not fully depict scene contents. In this paper, we propose a collaborative multi-modal graph network (CMGNet) to explore the interactions among multi-modal features in video captioning. Our CMGNet adopts an encoder-decoder structure: a Compression-driven Intra-inter Attentive Graph (CIAG) encoder and an Adaptive Multi-modal Selection (AMS) decoder. Specifically, in our CIAG encoder, we first design a Basis Vector Compression (BVC) module to reduce redundant nodes in the graphs, thus improving efficiency when coping with a large number of nodes. Then we propose an Intra-inter Attentive Graph (IAG) to improve the graph representation by sharing information across intra- and inter-modality nodes. Afterwards, we present an AMS decoder to generate video captions from the encoded video representations. In particular, the proposed AMS decoder learns to produce words by adaptively focusing on different modality information, leading to comprehensive and accurate captions. Extensive experiments on large-scale benchmarks, i.e., MSR-VTT and TGIF, demonstrate that our proposed CMGNet achieves state-of-the-art performance.
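The abstract names two mechanisms that lend themselves to a short sketch: Basis Vector Compression (summarizing many graph nodes into a small set of learned basis nodes) and Adaptive Multi-modal Selection (gating per-modality contexts by the decoder state). Below is a minimal PyTorch sketch of both ideas as we read them from the abstract; the class names, the attention-based compression, the softmax gate, and all dimensions are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of two CMGNet ideas described in the abstract.
# All module names, shapes, and details are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasisVectorCompression(nn.Module):
    """Compress N graph nodes into K learned basis nodes (K << N).

    Each basis vector attends over the input nodes, so downstream graph
    attention operates on K nodes instead of N.
    """

    def __init__(self, dim: int, num_bases: int = 16):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_bases, dim) * dim ** -0.5)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (batch, N, dim) -> compressed: (batch, K, dim)
        scores = torch.matmul(self.bases, nodes.transpose(1, 2))      # (batch, K, N)
        attn = F.softmax(scores / self.bases.size(-1) ** 0.5, dim=-1)
        return torch.matmul(attn, nodes)


class AdaptiveModalitySelection(nn.Module):
    """Fuse per-modality contexts with a gate conditioned on the decoder state."""

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        self.gate = nn.Linear(dim, num_modalities)

    def forward(self, hidden: torch.Tensor, contexts: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, dim); contexts: (batch, M, dim), one per modality
        weights = F.softmax(self.gate(hidden), dim=-1)                # (batch, M)
        return (weights.unsqueeze(-1) * contexts).sum(dim=1)          # (batch, dim)


if __name__ == "__main__":
    batch, n_nodes, dim = 2, 100, 512
    nodes = torch.randn(batch, n_nodes, dim)        # e.g. object-level nodes
    compressed = BasisVectorCompression(dim, 16)(nodes)               # (2, 16, 512)

    hidden = torch.randn(batch, dim)                # decoder hidden state
    contexts = torch.randn(batch, 3, dim)           # appearance / motion / object
    fused = AdaptiveModalitySelection(dim, 3)(hidden, contexts)       # (2, 512)
    print(compressed.shape, fused.shape)
```

At each decoding step, a gate of this kind lets the model weight appearance, motion, and object contexts differently per word, which is one plausible reading of "adaptively focusing on different modality information."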
Pages: 10
Related Papers (50 total)
  • [1] Multi-modal Dense Video Captioning
    Iashin, Vladimir
    Rahtu, Esa
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 4117 - 4126
  • [2] Multi-modal Dependency Tree for Video Captioning
    Zhao, Wentian
    Wu, Xinxiao
    Luo, Jiebo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [3] MULTI-MODAL HIERARCHICAL ATTENTION-BASED DENSE VIDEO CAPTIONING
    Munusamy, Hemalatha
    Sekhar, Chandra C.
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 475 - 479
  • [4] Towards Video Captioning with Naming: A Novel Dataset and a Multi-modal Approach
    Pini, Stefano
    Cornia, Marcella
    Baraldi, Lorenzo
    Cucchiara, Rita
    IMAGE ANALYSIS AND PROCESSING (ICIAP 2017), PT II, 2017, 10485 : 384 - 395
  • [5] Boosting Entity-Aware Image Captioning With Multi-Modal Knowledge Graph
    Zhao, Wentian
    Wu, Xinxiao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 2659 - 2670
  • [6] MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video
    Wei, Yinwei
    Wang, Xiang
    Nie, Liqiang
    He, Xiangnan
    Hong, Richang
    Chua, Tat-Seng
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1437 - 1445
  • [7] Event-centric multi-modal fusion method for dense video captioning
    Chang, Zhi
    Zhao, Dexin
    Chen, Huilin
    Li, Jingdan
    Liu, Pengfei
    NEURAL NETWORKS, 2022, 146 : 120 - 129
  • [8] Collaborative denoised graph contrastive learning for multi-modal recommendation
    Xu, Fuyong
    Zhu, Zhenfang
    Fu, Yixin
    Wang, Ru
    Liu, Peiyu
    INFORMATION SCIENCES, 2024, 679
  • [9] Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval
    Zeng, Yawen
    Cao, Da
    Wei, Xiaochi
    Liu, Meng
    Zhao, Zhou
    Qin, Zheng
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 2215 - 2224
  • [10] Multi-modal graph reasoning for structured video text extraction
    Shi, Weitao
    Wang, Han
    Lou, Xin
    COMPUTERS & ELECTRICAL ENGINEERING, 2023, 107