Multimodal attention-based transformer for video captioning

Cited by: 2
Authors
Munusamy, Hemalatha [1 ,2 ]
Sekhar, C. Chandra [1 ]
Affiliations
[1] IIT Madras, Dept Comp Sci & Engn, Chennai 600036, Tamilnadu, India
[2] Anna Univ, Dept Informat Technol, MIT Campus, Chennai 600044, Tamilnadu, India
Keywords
Video captioning; Transformer; Multimodal attention; Semantic keywords; NETWORK;
DOI
10.1007/s10489-023-04597-2
CLC number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Video captioning is a computer vision task that generates a natural language description for a video. In this paper, we propose a multimodal attention-based transformer using the keyframe features, object features, and semantic keyword embedding features of a video. The Structural Similarity Index Measure (SSIM) is used to extract keyframes from a video. We also detect the unique objects from the extracted keyframes. The features from the keyframes and the objects detected in the keyframes are extracted using a pretrained Convolutional Neural Network (CNN). In the encoder, we use a bimodal attention block to apply two-way cross-attention between the keyframe features and the object features. In the decoder, we combine the features of the words generated up to the previous time step, the semantic keyword embedding features, and the encoder features using a tri-modal attention block. This allows the decoder to choose the multimodal features dynamically to generate the next word in the description. We evaluated the proposed approach using the MSVD, MSR-VTT, and Charades datasets and observed that the proposed model provides better performance than other state-of-the-art models.
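The keyframe-extraction step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a single-window (global) SSIM over grayscale frames and an assumed threshold of 0.8, whereas the paper does not state the window scheme or threshold it uses.

```python
import numpy as np

def ssim_global(x, y, L=255.0):
    """Global (single-window) SSIM between two grayscale frames.

    A simplified variant of the standard SSIM: means, variances, and
    covariance are computed over the whole frame rather than in sliding
    windows. c1 and c2 are the usual stabilizing constants.
    """
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

def select_keyframes(frames, threshold=0.8):
    """Keep the first frame as a keyframe; add a new keyframe whenever
    SSIM with the most recent keyframe drops below the threshold
    (i.e., the content has changed substantially). The threshold value
    is an assumption for illustration."""
    keyframes = [0]
    for i in range(1, len(frames)):
        if ssim_global(frames[keyframes[-1]], frames[i]) < threshold:
            keyframes.append(i)
    return keyframes
```

For example, a clip of three near-identical frames followed by two frames of very different brightness yields two keyframes: the first frame and the first frame after the change.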
Pages: 23349-23368 (20 pages)
Related papers (50 total)
  • [22] Multimodal-enhanced hierarchical attention network for video captioning
    Zhong, Maosheng
    Chen, Youde
    Zhang, Hao
    Xiong, Hao
    Wang, Zhixiang
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (05) : 2469 - 2482
  • [23] Multimodal architecture for video captioning with memory networks and an attention mechanism
    Li, Wei
    Guo, Dashan
    Fang, Xiangzhong
    [J]. PATTERN RECOGNITION LETTERS, 2018, 105 : 23 - 29
  • [24] Attention-based LSTM with Semantic Consistency for Videos Captioning
    Guo, Zhao
    Gao, Lianli
    Song, Jingkuan
    Xu, Xing
    Shao, Jie
    Shen, Heng Tao
    [J]. MM'16: PROCEEDINGS OF THE 2016 ACM MULTIMEDIA CONFERENCE, 2016, : 357 - 361
  • [25] Attention-based video streaming
    Dikici, Cagatay
    Bozma, H. Isil
    [J]. SIGNAL PROCESSING-IMAGE COMMUNICATION, 2010, 25 (10) : 745 - 760
  • [26] A Visual Attention-Based Model for Bengali Image Captioning
    Das, B.
    Pal, R.
    Majumder, M.
    Phadikar, S.
    Sekh, A. A.
    [J]. SN Computer Science, 4 (2)
  • [27] k-NN attention-based video vision transformer for action recognition
    Sun, Weirong
    Ma, Yujun
    Wang, Ruili
    [J]. NEUROCOMPUTING, 2024, 574
  • [28] Attention-Based Image Captioning Using DenseNet Features
    Hossain, Md Zakir
    Sohel, Ferdous
    Shiratuddin, Mohd Fairuz
    Laga, Hamid
    Bennamoun, Mohammed
    [J]. NEURAL INFORMATION PROCESSING, ICONIP 2019, PT V, 2019, 1143 : 109 - 117
  • [29] Attention-based multimodal image matching
    Moreshet, Aviad
    Keller, Yosi
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 241
  • [30] Stacked Multimodal Attention Network for Context-Aware Video Captioning
    Zheng, Yi
    Zhang, Yuejie
    Feng, Rui
    Zhang, Tao
    Fan, Weiguo
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (01) : 31 - 42