Video captioning algorithm based on mixed training and semantic association

Authors
Chen, Shuqin [1, 2]
Zhong, Xian [1, 3]
Huang, Wenxin [4]
Lu, Yansheng [5]
Affiliations
[1] School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, 430070, China
[2] School of Computer Science, Hubei University of Education, Wuhan, 430205, China
[3] School of Information Science and Technology, Peking University, Beijing, 100091, China
[4] School of Computer Science and Information Engineering, Hubei University, Wuhan, 430062, China
[5] School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
Keywords
Associative storage - Electric transformer testing - Image coding - Long short-term memory - Semantics
DOI
10.13245/j.hust.230101
Abstract
Mainstream video captioning methods model the dependencies among the words of a sentence with Transformer self-attention units or long short-term memory (LSTM) units. This ignores the semantic relationships between the words and leaves the exposure bias between the training and testing phases unaddressed. To tackle both problems, a video captioning algorithm based on mixed training and semantic association (DC-RL) was proposed. In the encoder, a bi-directional long short-term memory recurrent neural network (LSTM1) fused the appearance features and action features obtained from the pre-trained model. In the decoder, an attention mechanism dynamically extracted the visual features corresponding to the word currently being generated for both a global semantic decoder and a self-learning decoder, which alleviated the exposure bias caused by the discrepancy between training and testing in the traditional global semantic decoder. The global semantic decoder used the previous-time-step word of the ground-truth description to drive the generation of the current word, and a global semantic extractor additionally supplied the global semantic information corresponding to the current word. The self-learning decoder, by contrast, used the semantic information of the word it had generated at the previous time step to drive the generation of the current word. The fusion network obtained by this mixed training was then optimized directly with reinforcement learning, using the semantic information of the previous word, to generate more accurate video captions. Results on the MSR-VTT dataset show that the fusion network improves over the baseline by 2.3%, 0.3%, 1.0% and 1.9% on the B4, M, R and C metrics, respectively, and that the fusion network optimized with reinforcement learning improves by 2.0%, 0.5%, 1.9% and 6.1%, respectively.
© 2023 Huazhong University of Science and Technology. All rights reserved.
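
The difference between the two decoders described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_step` is a hypothetical stand-in for a trained decoder step, and `decode` and `reinforce_loss` are illustrative names. The abstract does not say which reinforcement learning algorithm was used, so the loss below assumes REINFORCE with a baseline reward, a common choice for caption models.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_step(prev_token, vocab_size):
    """Hypothetical decoder step: favours the token after prev_token.
    A real model would also attend over the encoded video features."""
    logits = [1.0 if t == (prev_token + 1) % vocab_size else 0.0
              for t in range(vocab_size)]
    return softmax(logits)

def decode(reference, vocab_size, teacher_forcing, bos=0):
    """Greedy decoding over len(reference) steps.
    teacher_forcing=True  -> next input is the ground-truth word
                             (global-semantic-decoder style training).
    teacher_forcing=False -> next input is the model's own prediction
                             (self-learning-decoder style, as at test time).
    Returns (predicted tokens, inputs that drove each step)."""
    prev, outputs, inputs = bos, [], []
    for t in range(len(reference)):
        inputs.append(prev)
        probs = toy_step(prev, vocab_size)
        pred = max(range(vocab_size), key=probs.__getitem__)
        outputs.append(pred)
        prev = reference[t] if teacher_forcing else pred
    return outputs, inputs

def reinforce_loss(sample_log_probs, sample_reward, baseline_reward):
    """REINFORCE-with-baseline loss (an assumption, see the text above):
    captions that beat the baseline reward get their log-probability raised."""
    advantage = sample_reward - baseline_reward
    return -advantage * sum(sample_log_probs)

if __name__ == "__main__":
    reference = [1, 3, 4]  # toy ground-truth caption as token ids
    print(decode(reference, vocab_size=5, teacher_forcing=True))
    print(decode(reference, vocab_size=5, teacher_forcing=False))
```

With teacher forcing the decoder only ever sees ground-truth words as input, whereas at test time it sees its own predictions; that mismatch is the exposure bias the abstract refers to. The reward passed to `reinforce_loss` would typically be a sentence-level caption metric, which the sketch leaves abstract.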
Pages: 67-74