Multimodal attention-based transformer for video captioning

Cited by: 2
Authors
Munusamy, Hemalatha [1 ,2 ]
Sekhar, C. Chandra [1 ]
Affiliations
[1] IIT Madras, Dept Comp Sci & Engn, Chennai 600036, Tamilnadu, India
[2] Anna Univ, Dept Informat Technol, MIT Campus, Chennai 600044, Tamilnadu, India
Keywords
Video captioning; Transformer; Multimodal attention; Semantic keywords; Network
DOI
10.1007/s10489-023-04597-2
CLC classification number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Video captioning is a computer vision task that generates a natural language description for a video. In this paper, we propose a multimodal attention-based transformer using the keyframe features, object features, and semantic keyword embedding features of a video. The Structural Similarity Index Measure (SSIM) is used to extract keyframes from a video. We also detect the unique objects from the extracted keyframes. The features from the keyframes and the objects detected in the keyframes are extracted using a pretrained Convolutional Neural Network (CNN). In the encoder, we use a bimodal attention block to apply two-way cross-attention between the keyframe features and the object features. In the decoder, we combine the features of the words generated up to the previous time step, the semantic keyword embedding features, and the encoder features using a tri-modal attention block. This allows the decoder to choose the multimodal features dynamically to generate the next word in the description. We evaluated the proposed approach using the MSVD, MSR-VTT, and Charades datasets and observed that the proposed model provides better performance than other state-of-the-art models.
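As a rough illustration of the two-way cross-attention between keyframe features and object features described in the abstract, the sketch below implements a bimodal attention block in PyTorch. The dimensions, residual connections, layer normalization, and the names BimodalAttentionBlock, keyframe_feats, and object_feats are assumptions made for a self-contained example; they are not taken from the paper.

```python
# Minimal sketch of a two-way (bimodal) cross-attention block, assuming
# PyTorch with batch-first tensors. Layer sizes and the residual + norm
# fusion are illustrative choices, not the authors' exact architecture.
import torch
import torch.nn as nn


class BimodalAttentionBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        # Keyframe features attend to object features, and vice versa.
        self.key2obj = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.obj2key = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm_k = nn.LayerNorm(d_model)
        self.norm_o = nn.LayerNorm(d_model)

    def forward(self, keyframe_feats, object_feats):
        # keyframe_feats: (batch, num_keyframes, d_model)
        # object_feats:   (batch, num_objects,   d_model)
        k_attended, _ = self.key2obj(keyframe_feats, object_feats, object_feats)
        o_attended, _ = self.obj2key(object_feats, keyframe_feats, keyframe_feats)
        # Residual connection followed by layer normalization for each stream.
        keyframe_out = self.norm_k(keyframe_feats + k_attended)
        object_out = self.norm_o(object_feats + o_attended)
        return keyframe_out, object_out


if __name__ == "__main__":
    # Random tensors stand in for CNN features of keyframes and detected objects.
    block = BimodalAttentionBlock()
    kf = torch.randn(2, 10, 512)   # 10 keyframes per video
    ob = torch.randn(2, 20, 512)   # 20 detected objects per video
    kf_enc, ob_enc = block(kf, ob)
    print(kf_enc.shape, ob_enc.shape)
```

Per the abstract, the outputs of such a block would feed the encoder, while a tri-modal attention block in the decoder would additionally attend over the embeddings of previously generated words and the semantic keyword embeddings when predicting the next word.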
Pages: 23349-23368
Page count: 20