Multimodal-enhanced hierarchical attention network for video captioning

Cited by: 0
Authors
Maosheng Zhong
Youde Chen
Hao Zhang
Hao Xiong
Zhixiang Wang
Affiliation
[1] Jiangxi Normal University
Source
Multimedia Systems | 2023, Vol. 29
Keywords
Video captioning; Bidirectional decoding transformer; Multimodal enhancement; Hierarchical attention network
DOI
Not available
Abstract
In video captioning, many pioneering approaches generate higher-quality captions by exploring and adding new video feature modalities. However, as the number of modalities increases, negative interaction between them gradually erodes the gain in caption quality. To address this problem, we propose a three-layer hierarchical attention network, based on a bidirectional decoding transformer, that enhances multimodal features. In the first layer, we apply a separate encoder to each modality, tailored to its characteristics, to enhance that modality's vector representation. In the second layer, we select keyframes from all sampled frames of a modality by computing attention values between the generated words and each frame. In the third layer, we assign weights to the different modalities to reduce redundancy between them before generating the current word. Additionally, we use a bidirectional decoder so that the context of the ground-truth caption is taken into account when generating captions. Experiments on two mainstream benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of the proposed model: it achieves state-of-the-art performance on the major metrics, and the generated sentences align more closely with human language usage. Overall, our three-layer hierarchical attention network based on a bidirectional decoding transformer effectively enhances multimodal features and generates high-quality video captions. Code is available at https://github.com/nickchen121/MHAN.
Pages: 2469–2482
Page count: 13
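
To make the architecture described in the abstract concrete, below is a minimal, illustrative PyTorch sketch of the three-layer hierarchical attention: per-modality encoding (layer 1), word-to-frame attention for keyframe selection (layer 2), and modality weighting (layer 3). All module choices, names, dimensions, and the two-modality setup are assumptions made for illustration, not details taken from the paper; the authors' actual implementation is at https://github.com/nickchen121/MHAN.

```python
# Hypothetical sketch of the three-layer hierarchical attention; the real
# model (MHAN) differs in detail. See https://github.com/nickchen121/MHAN.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalAttention(nn.Module):
    def __init__(self, d_model=512, n_modalities=2):
        super().__init__()
        # Layer 1: one encoder per modality to enhance its vector representation.
        self.encoders = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
                num_layers=1,
            )
            for _ in range(n_modalities)
        )
        # Layer 2: attention between the word being generated and each frame,
        # which emphasizes keyframes within a modality.
        self.frame_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Layer 3: scalar scores that weight modalities to reduce redundancy.
        self.modality_score = nn.Linear(d_model, 1)

    def forward(self, modality_feats, word_state):
        """modality_feats: list of (batch, frames, d_model) tensors, one per modality.
        word_state: (batch, d_model) decoder state of the word being generated."""
        query = word_state.unsqueeze(1)                          # (batch, 1, d_model)
        summaries = []
        for enc, feats in zip(self.encoders, modality_feats):
            enhanced = enc(feats)                                # layer 1: per-modality encoding
            ctx, _ = self.frame_attn(query, enhanced, enhanced)  # layer 2: keyframe attention
            summaries.append(ctx.squeeze(1))                     # (batch, d_model)
        stacked = torch.stack(summaries, dim=1)                  # (batch, n_modalities, d_model)
        weights = F.softmax(self.modality_score(stacked), dim=1) # layer 3: modality weights
        return (weights * stacked).sum(dim=1)                    # fused context for the decoder


# Usage: fuse (hypothetical) appearance and motion features for one decoding step.
if __name__ == "__main__":
    appearance = torch.randn(4, 20, 512)   # 4 videos, 20 sampled frames each
    motion = torch.randn(4, 20, 512)
    word = torch.randn(4, 512)             # current decoder hidden state
    fused = HierarchicalAttention()([appearance, motion], word)
    print(fused.shape)                     # torch.Size([4, 512])
```

The sketch omits the bidirectional decoding transformer itself; in the paper, the fused context would feed a decoder that also exploits the ground-truth caption's right-to-left context during training.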