Multimodal-enhanced hierarchical attention network for video captioning

Cited by: 0
Authors
Maosheng Zhong
Youde Chen
Hao Zhang
Hao Xiong
Zhixiang Wang
Affiliation
[1] Jiangxi Normal University
Source
Multimedia Systems | 2023 / Vol. 29
Keywords
Video captioning; Bidirectional decoding transformer; Multimodal enhancement; Hierarchical attention network
DOI
Not available
Abstract
In video captioning, many pioneering approaches have been developed to generate higher-quality captions by exploring and adding new video feature modalities. However, as the number of modalities increases, the negative interaction between them gradually reduces the gain in caption quality. To address this problem, we propose a three-layer hierarchical attention network based on a bidirectional decoding transformer that enhances multimodal features. In the first layer, we apply a different encoder to each modality, chosen according to its characteristics, to enhance its vector representation. Then, in the second layer, we select keyframes from all sampled frames of each modality by calculating the attention value between the generated words and each frame of that modality. Finally, in the third layer, we allocate weights to the different modalities to reduce redundancy between them before generating the current word. Additionally, we use a bidirectional decoder to consider the context of the ground-truth caption when generating captions. Experiments on two mainstream benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of our proposed model. The model achieves state-of-the-art performance in significant metrics, and the generated sentences are more in line with human language habits. Overall, our three-layer hierarchical attention network based on a bidirectional decoding transformer effectively enhances multimodal features and generates high-quality video captions. Code is available at https://github.com/nickchen121/MHAN.
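The second and third layers described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): it assumes plain scaled dot-product attention at both levels, omits the first-layer modality encoders and the bidirectional decoder, and every name in it (`attend`, `hierarchical_attention`, the modality labels, the toy vectors) is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, vectors):
    """Scaled dot-product attention of one query over a list of vectors.

    Returns the weighted sum of the vectors and the attention weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in vectors])
    context = [
        sum(w * v[i] for w, v in zip(weights, vectors))
        for i in range(len(query))
    ]
    return context, weights


def hierarchical_attention(query, modalities):
    """Frame-level attention per modality, then modality-level attention.

    query      -- vector standing in for the words generated so far (assumed)
    modalities -- dict mapping a modality name to its list of frame vectors
    """
    names = list(modalities)
    frame_weights = {}
    contexts = []
    for name in names:
        # "Second layer": weight the frames of this modality against the query.
        context, weights = attend(query, modalities[name])
        contexts.append(context)
        frame_weights[name] = weights
    # "Third layer": weight the per-modality summaries against the same query.
    fused, modality_weights = attend(query, contexts)
    return fused, frame_weights, dict(zip(names, modality_weights))


# Toy 2-dimensional features; real features would come from video encoders.
query = [1.0, 0.0]
modalities = {
    "appearance": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "motion": [[0.0, 1.0], [0.0, 0.5]],
}
fused, frame_weights, modality_weights = hierarchical_attention(query, modalities)
```

In this toy input the first appearance frame aligns best with the query, so it receives the largest frame weight, and the appearance modality as a whole receives a larger weight than motion. A trained model would use learned projections for queries, keys, and values rather than raw dot products.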
Pages: 2469-2482
Page count: 13