Video captioning algorithm based on mixed training and semantic association

Cited by: 0
Authors:
Chen, Shuqin [1,2]
Zhong, Xian [1,3]
Huang, Wenxin [4]
Lu, Yansheng [5]
Affiliations:
[1] School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, China
[2] School of Computer Science, Hubei University of Education, Wuhan 430205, China
[3] School of Information Science and Technology, Peking University, Beijing 100091, China
[4] School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China
[5] School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
Keywords:
Associative storage; Electric transformer testing; Image coding; Long short-term memory; Semantics
DOI: 10.13245/j.hust.230101
Abstract:
Mainstream approaches model the dependencies among words in a sentence with Transformer self-attention units or long short-term memory (LSTM) units, which ignores the semantic associations between those words and suffers from exposure bias between the training and testing phases. To address these two problems, a video captioning algorithm based on mixed training and semantic association (DC-RL) was proposed. In the encoder, a bidirectional LSTM recurrent neural network (LSTM1) fuses the appearance features and action features obtained from pre-trained models. In the decoding stage, an attention mechanism dynamically extracts the visual features corresponding to the word currently being generated, for both a global semantic decoder and a self-learning decoder; this alleviates the exposure bias caused by the discrepancy between training and testing in a conventional global semantic decoder. The global semantic decoder is driven by the previous word of the ground-truth description, and a global semantic extractor additionally supplies the global semantic information corresponding to the current word to assist its generation. The self-learning decoder, in contrast, is driven by the semantic information of the word it generated at the previous time step, so no ground-truth input is required. Mixed training fuses the two decoders, and reinforcement learning then directly optimizes the fused model using the semantic information of the previous word, which yields more accurate video captions. Experiments on the MSR-VTT dataset show that the fused model improves over the baseline by 2.3%, 0.3%, 1.0%, and 1.9% on the four metrics B4 (BLEU-4), M (METEOR), R (ROUGE-L), and C (CIDEr), respectively, and that the fused model optimized with reinforcement learning improves by 2.0%, 0.5%, 1.9%, and 6.1%, respectively. © 2023 Huazhong University of Science and Technology. All rights reserved.
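To make the data flow concrete, the following is a minimal PyTorch sketch of the encoder and dual-decoder scheme summarized above. All class and function names (VideoEncoder, AttentionDecoder, mixed_training_step), the shape conventions, and the equally weighted loss are illustrative assumptions, not the authors' published implementation; only the structure follows the abstract: a bidirectional LSTM fuses appearance and action features, and two attention decoders are trained side by side, one driven by ground-truth words and one by its own predictions.

```python
# Hedged sketch of the mixed-training idea; names and hyper-parameters are
# assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoEncoder(nn.Module):
    """Fuse per-frame appearance and action features with a BiLSTM (LSTM1)."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.bilstm = nn.LSTM(2 * feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, appearance, action):
        # appearance, action: (batch, frames, feat_dim) from pre-trained models
        fused, _ = self.bilstm(torch.cat([appearance, action], dim=-1))
        return fused  # (batch, frames, 2 * hidden_dim)


class AttentionDecoder(nn.Module):
    """One-step LSTM decoder with additive attention over frame features."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, enc_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.score = nn.Linear(enc_dim + hidden_dim, 1)
        self.cell = nn.LSTMCell(embed_dim + enc_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_word, h, c, enc_feats):
        # Attend to the frames most relevant to the word being generated.
        h_exp = h.unsqueeze(1).expand(-1, enc_feats.size(1), -1)
        alpha = F.softmax(self.score(torch.cat([enc_feats, h_exp], -1)), dim=1)
        context = (alpha * enc_feats).sum(dim=1)       # (batch, enc_dim)
        h, c = self.cell(torch.cat([self.embed(prev_word), context], dim=-1),
                         (h, c))
        return self.out(h), h, c                       # logits over the vocabulary


def mixed_training_step(encoder, gs_dec, sl_dec, appearance, action, captions):
    """One cross-entropy step mixing a teacher-forced (global semantic) decoder
    with a self-learning decoder fed its own previous predictions."""
    enc_feats = encoder(appearance, action)
    batch, steps = captions.size(0), captions.size(1) - 1
    device = captions.device
    h1 = c1 = torch.zeros(batch, gs_dec.cell.hidden_size, device=device)
    h2 = c2 = torch.zeros(batch, sl_dec.cell.hidden_size, device=device)
    prev_sl = captions[:, 0]          # both decoders start from <bos>
    loss = 0.0
    for t in range(steps):
        # Global semantic decoder: driven by the ground-truth previous word.
        logits_gs, h1, c1 = gs_dec.step(captions[:, t], h1, c1, enc_feats)
        # Self-learning decoder: driven by its own previous prediction, which
        # narrows the train/test discrepancy behind exposure bias.
        logits_sl, h2, c2 = sl_dec.step(prev_sl, h2, c2, enc_feats)
        prev_sl = logits_sl.argmax(dim=-1).detach()
        target = captions[:, t + 1]
        loss = loss + F.cross_entropy(logits_gs, target) \
                    + F.cross_entropy(logits_sl, target)
    return loss / steps
```

Under these assumptions the two decoders share the encoder output and the vocabulary; the paper's global semantic extractor, which feeds sentence-level semantics into the global semantic decoder, is omitted here for brevity.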
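The abstract states only that reinforcement learning directly optimizes the fused model. A common instantiation in video captioning is self-critical sequence training (SCST) with CIDEr as the reward; the sketch below assumes that setup, and the log-probability and reward tensors would come from a hypothetical sampling-and-scoring step that is not shown.

```python
import torch


def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical policy-gradient loss (an assumed instantiation; the paper
    only says reinforcement learning optimizes the fused model directly).

    sample_logprobs: (batch,) summed log-probabilities of sampled captions
    sample_reward:   (batch,) e.g. CIDEr scores of the sampled captions
    greedy_reward:   (batch,) CIDEr scores of greedy captions (the baseline)
    """
    # REINFORCE with a greedy baseline: raise the probability of sampled
    # captions that beat greedy decoding, lower those that fall short.
    advantage = (sample_reward - greedy_reward).detach()
    return -(advantage * sample_logprobs).mean()
```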
Pages: 67-74
Related papers (50 records):
  • [1] MIVCN: Multimodal interaction video captioning network based on semantic association graph
    Wang, Ying
    Huang, Guoheng
    Lin, Yuming
    Yuan, Haoliang
    Pun, Chi-Man
    Ling, Wing-Kuen
    Cheng, Lianglun
    APPLIED INTELLIGENCE, 2022, 52 (05) : 5241 - 5260
  • [2] Video Captioning with Semantic Guiding
    Yuan, Jin
    Tian, Chunna
    Zhang, Xiangnan
    Ding, Yuxuan
    Wei, Wei
    2018 IEEE FOURTH INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM), 2018,
  • [3] Structured Encoding Based on Semantic Disambiguation for Video Captioning
    Sun, Bo
    Tian, Jinyu
    Wu, Yong
    Yu, Lunjun
    Tang, Yuanyan
    COGNITIVE COMPUTATION, 2024, 16 (03) : 1032 - 1048
  • [4] Video Captioning Based on Channel Soft Attention and Semantic Reconstructor
    Lei, Zhou
    Huang, Yiyong
    FUTURE INTERNET, 2021, 13 (02) : 1 - 18
  • [5] Video Captioning With Attention-Based LSTM and Semantic Consistency
    Gao, Lianli
    Guo, Zhao
    Zhang, Hanwang
    Xu, Xing
    Shen, Heng Tao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2017, 19 (09) : 2045 - 2055
  • [6] Semantic Grouping Network for Video Captioning
    Ryu, Hobin
    Kang, Sunghun
    Kang, Haeyong
    Yoo, Chang D.
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2514 - 2522
  • [7] Semantic guidance network for video captioning
    Guo, Lan
    Zhao, Hong
    Chen, Zhiwen
    Han, Zeyu
    SCIENTIFIC REPORTS, 2023, 13 (01)
  • [8] Video Captioning with Transferred Semantic Attributes
    Pan, Yingwei
    Yao, Ting
    Li, Houqiang
    Mei, Tao
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 984 - 992