Multimodal-enhanced hierarchical attention network for video captioning

被引:0
|
作者
Maosheng Zhong
Youde Chen
Hao Zhang
Hao Xiong
Zhixiang Wang
机构
[1] Jiangxi Normal University,
来源
Multimedia Systems | 2023年 / 29卷
关键词
Video captioning; Bidirectional decoding transformer; Multimodal enhancement; Hierarchical attention network;
D O I
暂无
中图分类号
学科分类号
摘要
In video captioning, many pioneering approaches have been developed to generate higher-quality captions by exploring and adding new video feature modalities. However, as the number of modalities increases, the negative interaction between them gradually reduces the gain of caption generation. To address this problem, we propose a three-layer hierarchical attention network based on a bidirectional decoding transformer that enhances multimodal features. In the first layer, we execute different encoders according to the characteristics of each modality to enhance the vector representation of each modality. Then, in the second layer, we select keyframes from all sampled frames of the modality by calculating the attention value between the generated words and each frame of the modality. Finally, in the third layer, we allocate weights to different modalities to reduce redundancy between them before generating the current word. Additionally, we use a bidirectional decoder to consider the context of the ground-truth caption when generating captions. Experiments on two mainstream benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of our proposed model. The model achieves state-of-the-art performance in significant metrics, and the generated sentences are more in line with human language habits. Overall, our three-layer hierarchical attention network based on a bidirectional decoding transformer effectively enhances multimodal features and generates high-quality video captions. Codes are available on https://github.com/nickchen121/MHAN.
引用
收藏
页码:2469 / 2482
页数:13
相关论文
共 50 条
  • [1] Multimodal-enhanced hierarchical attention network for video captioning
    Zhong, Maosheng
    Chen, Youde
    Zhang, Hao
    Xiong, Hao
    Wang, Zhixiang
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (05) : 2469 - 2482
  • [2] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
    Sun, Liang
    Li, Bing
    Yuan, Chunfeng
    Zha, Zhengjun
    Hu, Weiming
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1300 - 1305
  • [3] Hierarchical attention-based multimodal fusion for video captioning
    Wu, Chunlei
    Wei, Yiwei
    Chu, Xiaoliang
    Weichen, Sun
    Su, Fei
    Wang, Leiquan
    [J]. NEUROCOMPUTING, 2018, 315 : 362 - 370
  • [4] Syntax-Guided Hierarchical Attention Network for Video Captioning
    Deng, Jincan
    Li, Liang
    Zhang, Beichen
    Wang, Shuhui
    Zha, Zhengjun
    Huang, Qingming
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (02) : 880 - 892
  • [5] A Hierarchical Multimodal Attention-based Neural Network for Image Captioning
    Cheng, Yong
    Huang, Fei
    Zhou, Lian
    Jin, Cheng
    Zhang, Yuejie
    Zhang, Tao
    [J]. SIGIR'17: PROCEEDINGS OF THE 40TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2017, : 889 - 892
  • [6] Stacked Multimodal Attention Network for Context-Aware Video Captioning
    Zheng, Yi
    Zhang, Yuejie
    Feng, Rui
    Zhang, Tao
    Fan, Weiguo
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (01) : 31 - 42
  • [7] Hierarchical Attention Network for Image Captioning
    Wang, Weixuan
    Chen, Zhihong
    Hu, Haifeng
    [J]. THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 8957 - 8964
  • [8] Hierarchical Modular Network for Video Captioning
    Ye, Hanhua
    Li, Guorong
    Qi, Yuankai
    Wang, Shuhui
    Huang, Qingming
    Yang, Ming-Hsuan
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17918 - 17927
  • [9] Multimodal attention-based transformer for video captioning
    Hemalatha Munusamy
    Chandra Sekhar C
    [J]. Applied Intelligence, 2023, 53 : 23349 - 23368
  • [10] Multimodal attention-based transformer for video captioning
    Munusamy, Hemalatha
    Sekhar, C. Chandra
    [J]. APPLIED INTELLIGENCE, 2023, 53 (20) : 23349 - 23368