Video captioning based on vision transformer and reinforcement learning

Cited by: 9
Authors
Zhao, Hong [1 ]
Chen, Zhiwen [1 ]
Guo, Lan [1 ]
Han, Zeyu [2 ]
Affiliations
[1] Lanzhou Univ Technol, Sch Comp & Commun, Lanzhou, Gansu, Peoples R China
[2] Lanzhou Univ Technol, Network & Informat Ctr, Lanzhou, Gansu, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Video captioning; Vision transformer; Reinforcement learning; Long short-term memory network; Computer vision; Natural language processing; Attention mechanism; Encoder-decoder; Deep learning;
DOI
10.7717/peerj-cs.916
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Global encoding of visual features in video captioning is important for improving description accuracy. In this paper, we propose a video captioning method that combines a Vision Transformer (ViT) with reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoding block of the ViT network encodes the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the description is further improved by reinforcement-learning fine-tuning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, our model improves by 2.9%, 1.4%, 0.9%, and 4.8% on the four evaluation metrics BLEU-4, METEOR, ROUGE-L, and CIDEr-D, respectively.
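The abstract does not specify the reinforcement-learning scheme. Captioning systems of this kind commonly use self-critical sequence training (SCST), where the reward is the score of a sampled caption under a caption metric (typically CIDEr-D) minus the score of the greedy-decoded baseline caption. The sketch below illustrates only that reward shaping; it is a hedged assumption about the method, not code from the paper, and a toy unigram-overlap metric stands in for CIDEr-D. All names are illustrative.

```python
def unigram_overlap(candidate: str, reference: str) -> float:
    """Toy sentence-level reward: fraction of reference tokens that also
    appear in the candidate. Stands in for CIDEr-D, which real systems use."""
    cand_words = set(candidate.split())
    ref_words = reference.split()
    return sum(w in cand_words for w in ref_words) / len(ref_words)

def scst_loss(sampled: str, greedy: str, reference: str,
              sampled_logprob: float) -> float:
    """Self-critical policy-gradient surrogate loss for one caption.
    The advantage is metric(sampled) - metric(greedy baseline); minimizing
    -advantage * logprob pushes up the probability of sampled captions
    that beat the model's own greedy decode."""
    advantage = (unigram_overlap(sampled, reference)
                 - unigram_overlap(greedy, reference))
    return -advantage * sampled_logprob

# Example: the sampled caption covers more of the reference than the
# greedy decode, so the advantage is positive.
ref = "a man is playing a guitar on stage"
loss = scst_loss(sampled="a man playing a guitar",
                 greedy="a man is singing",
                 reference=ref,
                 sampled_logprob=-4.0)  # log-prob from the caption decoder
```

Because the greedy decode serves as its own baseline, no learned value network is needed, which is why this style of fine-tuning is popular for optimizing non-differentiable caption metrics directly.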
Pages: 16
Related Papers
50 records in total
  • [1] Video captioning based on vision transformer and reinforcement learning
    Zhao, Hong
    Chen, Zhiwen
    Guo, Lan
    Han, Zeyu
    [J]. PeerJ Computer Science, 2022, 8
  • [2] Video Captioning via Hierarchical Reinforcement Learning
    Wang, Xin
    Chen, Wenhu
    Wu, Jiawei
    Wang, Yuan-Fang
    Wang, William Yang
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 4213 - 4222
  • [3] Reinforcement Learning Transformer for Image Captioning Generation Model
    Yan, Zhaojie
    [J]. FIFTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION, ICMV 2022, 2023, 12701
  • [4] Efficient Image Captioning Based on Vision Transformer Models
    Elbedwehy, Samar
    Medhat, T.
    Hamza, Taher
    Alrahmawy, Mohammed F.
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 73 (01): : 1483 - 1500
  • [5] Multimodal attention-based transformer for video captioning
    Munusamy, Hemalatha
    Sekhar, C. Chandra
    [J]. APPLIED INTELLIGENCE, 2023, 53 (20) : 23349 - 23368
  • [7] Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning
    Zhang, Wei
    Wang, Bairui
    Ma, Lin
    Liu, Wei
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2020, 42 (12) : 3088 - 3101
  • [8] End-to-End Video Captioning with Multitask Reinforcement Learning
    Li, Lijun
    Gong, Boqing
    [J]. 2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 339 - 348
  • [9] Engagement Recognition in Online Learning Based on an Improved Video Vision Transformer
    Guo, Zijian
    Zhou, Zhuoyi
    Pan, Jiahui
    Liang, Yan
    [J]. 2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [10] UAT: Universal Attention Transformer for Video Captioning
    Im, Heeju
    Choi, Yong-Suk
    [J]. SENSORS, 2022, 22 (13)