Video captioning based on vision transformer and reinforcement learning

Cited by: 9
Authors
Zhao, Hong [1 ]
Chen, Zhiwen [1 ]
Guo, Lan [1 ]
Han, Zeyu [2 ]
Affiliations
[1] Lanzhou Univ Technol, Sch Comp & Commun, Lanzhou, Gansu, Peoples R China
[2] Lanzhou Univ Technol, Network & Informat Ctr, Lanzhou, Gansu, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Video captioning; Vision transformer; Reinforcement learning; Long short-term memory network; Computer vision; Natural language processing; Attention mechanism; Encoder-decoder; Deep learning;
DOI
10.7717/peerj-cs.916
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Global encoding of visual features in video captioning is important for improving description accuracy. In this paper, we propose a video captioning method that combines a Vision Transformer (ViT) with reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoding block of the ViT network encodes the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the description is further improved by reinforcement-learning fine-tuning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, our model improves by 2.9%, 1.4%, 0.9%, and 4.8% on the four evaluation metrics BLEU-4, METEOR, ROUGE-L, and CIDEr-D, respectively.
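The abstract does not specify the reinforcement-learning scheme. Captioning systems of this kind commonly use self-critical sequence training (SCST), where the reward is the score of a sampled caption under a caption metric (typically CIDEr-D) minus the score of the greedy-decoded baseline caption. The sketch below illustrates only that reward shaping; it is a hedged assumption about the method, not code from the paper, and a toy unigram-overlap metric stands in for CIDEr-D. All names are illustrative.

```python
def unigram_overlap(candidate: str, reference: str) -> float:
    """Toy sentence-level reward: fraction of reference tokens that also
    appear in the candidate. Stands in for CIDEr-D, which real systems use."""
    cand_words = set(candidate.split())
    ref_words = reference.split()
    return sum(w in cand_words for w in ref_words) / len(ref_words)

def scst_loss(sampled: str, greedy: str, reference: str,
              sampled_logprob: float) -> float:
    """Self-critical policy-gradient surrogate loss for one caption.
    The advantage is metric(sampled) - metric(greedy baseline); minimizing
    -advantage * logprob pushes up the probability of sampled captions
    that beat the model's own greedy decode."""
    advantage = (unigram_overlap(sampled, reference)
                 - unigram_overlap(greedy, reference))
    return -advantage * sampled_logprob

# Example: the sampled caption covers more of the reference than the
# greedy decode, so the advantage is positive.
ref = "a man is playing a guitar on stage"
loss = scst_loss(sampled="a man playing a guitar",
                 greedy="a man is singing",
                 reference=ref,
                 sampled_logprob=-4.0)  # log-prob from the caption decoder
```

Because the greedy decode serves as its own baseline, no learned value network is needed, which is why this style of fine-tuning is popular for optimizing non-differentiable caption metrics directly.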
Pages: 16
Related Papers
50 records in total
  • [1] Video captioning based on vision transformer and reinforcement learning
    Zhao, Hong
    Chen, Zhiwen
    Guo, Lan
    Han, Zeyu
    [J]. PeerJ Computer Science, 2022, 8
  • [2] Video Captioning via Hierarchical Reinforcement Learning
    Wang, Xin
    Chen, Wenhu
    Wu, Jiawei
    Wang, Yuan-Fang
    Wang, William Yang
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 4213 - 4222
  • [3] Reinforcement Learning Transformer for Image Captioning Generation Model
    Yan, Zhaojie
    [J]. FIFTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION, ICMV 2022, 2023, 12701
  • [4] Efficient Image Captioning Based on Vision Transformer Models
    Elbedwehy, Samar
    Medhat, T.
    Hamza, Taher
    Alrahmawy, Mohammed F.
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 73 (01): : 1483 - 1500
  • [5] Multimodal attention-based transformer for video captioning
    Munusamy, Hemalatha
    Sekhar, C. Chandra
    [J]. APPLIED INTELLIGENCE, 2023, 53 (20) : 23349 - 23368
  • [7] Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning
    Zhang, Wei
    Wang, Bairui
    Ma, Lin
    Liu, Wei
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2020, 42 (12) : 3088 - 3101
  • [8] End-to-End Video Captioning with Multitask Reinforcement Learning
    Li, Lijun
    Gong, Boqing
    [J]. 2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 339 - 348
  • [9] Engagement Recognition in Online Learning Based on an Improved Video Vision Transformer
    Guo, Zijian
    Zhou, Zhuoyi
    Pan, Jiahui
    Liang, Yan
    [J]. 2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [10] UAT: Universal Attention Transformer for Video Captioning
    Im, Heeju
    Choi, Yong-Suk
    [J]. SENSORS, 2022, 22 (13)