Multimodal attention-based transformer for video captioning

Cited by: 2

Authors:

Munusamy, Hemalatha [1, 2]

Sekhar, C. Chandra [1]

Affiliations:

[1] IIT Madras, Dept Comp Sci & Engn, Chennai 600036, Tamilnadu, India

[2] Anna Univ, Dept Informat Technol, MIT Campus, Chennai 600044, Tamilnadu, India

Source:

APPLIED INTELLIGENCE | 2023 / Vol. 53 / Issue 20

Keywords:

Video captioning; Transformer; Multimodal attention; Semantic keywords; NETWORK;

DOI:

10.1007/s10489-023-04597-2

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video captioning is a computer vision task that generates a natural language description for a video. In this paper, we propose a multimodal attention-based transformer using the keyframe features, object features, and semantic keyword embedding features of a video. The Structural Similarity Index Measure (SSIM) is used to extract keyframes from a video. We also detect the unique objects from the extracted keyframes. The features from the keyframes and the objects detected in the keyframes are extracted using a pretrained Convolutional Neural Network (CNN). In the encoder, we use a bimodal attention block to apply two-way cross-attention between the keyframe features and the object features. In the decoder, we combine the features of the words generated up to the previous time step, the semantic keyword embedding features, and the encoder features using a tri-modal attention block. This allows the decoder to choose the multimodal features dynamically to generate the next word in the description. We evaluated the proposed approach using the MSVD, MSR-VTT, and Charades datasets and observed that the proposed model provides better performance than other state-of-the-art models.
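The SSIM-based keyframe step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how SSIM scores are turned into keyframes, so the rule used here (keep a frame when its SSIM against the last kept keyframe drops below a threshold) is an assumption, as are the function names `global_ssim` and `select_keyframes` and the threshold value `0.8`. For brevity, SSIM is computed over the whole frame as a single window rather than with the usual sliding local windows, and frames are flat lists of grayscale intensities.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equally sized grayscale frames (flat lists)."""
    if len(x) != len(y) or not x:
        raise ValueError("frames must be non-empty and the same size")
    n = len(x)
    # Standard SSIM stabilising constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def select_keyframes(frames, threshold=0.8):
    """Return indices of frames kept as keyframes.

    Assumed rule (hypothetical, not taken from the paper): the first frame is
    always kept, and a later frame is kept when its SSIM against the most
    recently kept keyframe falls below `threshold`.
    """
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if global_ssim(frames[kept[-1]], frames[i]) < threshold:
            kept.append(i)
    return kept


# Toy usage: two identical frames followed by two copies of a very different one.
ramp = [10 * k for k in range(16)]   # intensities 0, 10, ..., 150
flipped = list(reversed(ramp))       # same statistics, opposite structure
print(select_keyframes([ramp, ramp, flipped, flipped]))  # prints [0, 2]
```

In practice a windowed SSIM from an image-processing library would be the more usual choice; the single-window form is used here only to keep the sketch dependency-free.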


Pages: 23349-23368

Page count: 20

