Multimodal-enhanced hierarchical attention network for video captioning

Cited by: 0
Authors
Maosheng Zhong
Youde Chen
Hao Zhang
Hao Xiong
Zhixiang Wang
Affiliation
[1] Jiangxi Normal University
Source
Multimedia Systems | 2023, Vol. 29
Keywords
Video captioning; Bidirectional decoding transformer; Multimodal enhancement; Hierarchical attention network
DOI
Not available
Abstract
In video captioning, many pioneering approaches generate higher-quality captions by exploring and adding new video feature modalities. However, as the number of modalities increases, negative interaction between them gradually erodes the gain in caption quality. To address this problem, we propose a three-layer hierarchical attention network, based on a bidirectional decoding transformer, that enhances multimodal features. In the first layer, we apply a separate encoder to each modality, tailored to its characteristics, to enhance that modality's vector representation. In the second layer, we select keyframes from all sampled frames of a modality by computing attention values between the generated words and each frame. In the third layer, we assign weights to the different modalities to reduce redundancy between them before generating the current word. Additionally, we use a bidirectional decoder so that the context of the ground-truth caption is taken into account when generating captions. Experiments on two mainstream benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of the proposed model: it achieves state-of-the-art performance on the major metrics, and the generated sentences align more closely with human language usage. Overall, our three-layer hierarchical attention network based on a bidirectional decoding transformer effectively enhances multimodal features and generates high-quality video captions. Code is available at https://github.com/nickchen121/MHAN.
Pages: 2469–2482
Page count: 13
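
To make the architecture described in the abstract concrete, below is a minimal, illustrative PyTorch sketch of the three-layer hierarchical attention: per-modality encoding (layer 1), word-to-frame attention for keyframe selection (layer 2), and modality weighting (layer 3). All module choices, names, dimensions, and the two-modality setup are assumptions made for illustration, not details taken from the paper; the authors' actual implementation is at https://github.com/nickchen121/MHAN.

```python
# Hypothetical sketch of the three-layer hierarchical attention; the real
# model (MHAN) differs in detail. See https://github.com/nickchen121/MHAN.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalAttention(nn.Module):
    def __init__(self, d_model=512, n_modalities=2):
        super().__init__()
        # Layer 1: one encoder per modality to enhance its vector representation.
        self.encoders = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
                num_layers=1,
            )
            for _ in range(n_modalities)
        )
        # Layer 2: attention between the word being generated and each frame,
        # which emphasizes keyframes within a modality.
        self.frame_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Layer 3: scalar scores that weight modalities to reduce redundancy.
        self.modality_score = nn.Linear(d_model, 1)

    def forward(self, modality_feats, word_state):
        """modality_feats: list of (batch, frames, d_model) tensors, one per modality.
        word_state: (batch, d_model) decoder state of the word being generated."""
        query = word_state.unsqueeze(1)                          # (batch, 1, d_model)
        summaries = []
        for enc, feats in zip(self.encoders, modality_feats):
            enhanced = enc(feats)                                # layer 1: per-modality encoding
            ctx, _ = self.frame_attn(query, enhanced, enhanced)  # layer 2: keyframe attention
            summaries.append(ctx.squeeze(1))                     # (batch, d_model)
        stacked = torch.stack(summaries, dim=1)                  # (batch, n_modalities, d_model)
        weights = F.softmax(self.modality_score(stacked), dim=1) # layer 3: modality weights
        return (weights * stacked).sum(dim=1)                    # fused context for the decoder


# Usage: fuse (hypothetical) appearance and motion features for one decoding step.
if __name__ == "__main__":
    appearance = torch.randn(4, 20, 512)   # 4 videos, 20 sampled frames each
    motion = torch.randn(4, 20, 512)
    word = torch.randn(4, 512)             # current decoder hidden state
    fused = HierarchicalAttention()([appearance, motion], word)
    print(fused.shape)                     # torch.Size([4, 512])
```

The sketch omits the bidirectional decoding transformer itself; in the paper, the fused context would feed a decoder that also exploits the ground-truth caption's right-to-left context during training.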