Video Captioning Based on Cascaded Attention-Guided Visual Feature Fusion

Cited by: 0
Authors
Shuqin Chen
Li Yang
Yikang Hu
Affiliations
[1] Hubei University of Education, School of Computer Science
[2] Hubei Normal University, College of Computer and Information Engineering
Keywords
Video captioning; Cascaded attention; Visual correlation; Feature fusion;
DOI: not available
Abstract
Video captioning has become a research hotspot in recent years because of its wide range of potential applications. However, insufficient interaction between visual and textual features during encoding can introduce recognition errors into the generated descriptions, and standard attention mechanisms struggle to explicitly model the coherence between vision and language. In this paper, we propose CAVF (Cascaded Attention-guided Visual Feature Fusion for Video Captioning), a video captioning algorithm based on cascaded attention-guided visual feature fusion. In the encoding stage, a cascaded attention mechanism models the visual content correlation between different frames, so that global semantic information can better guide visual feature fusion; this further strengthens the correlation between the encoder's visual features and the decoder. In the decoding stage, the fused visual features and the word vectors produced by a multilayer long short-term memory (LSTM) network are combined to generate the current word. Experiments on the public MSVD and MSR-VTT datasets validate the effectiveness of the proposed model; on MSR-VTT, it improves the BLEU-4, ROUGE, and CIDEr metrics by 5.6%, 1.3%, and 4.3%, respectively, over the baseline method.
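The abstract describes a two-stage encoder design: frame-to-frame attention to model visual correlation, followed by fusion guided by a global semantic representation, with the fused features then driving an LSTM decoder. As a rough illustration only, the sketch below shows one way such a cascaded, globally guided fusion could be wired up in PyTorch; the module name CascadedAttentionFusion, the use of multi-head attention, and the mean-pooled global query are assumptions for exposition and are not taken from the paper.

```python
# Illustrative sketch, not the CAVF implementation: a generic cascaded-attention
# fusion over per-frame features, guided by a pooled global semantic vector.
import torch
import torch.nn as nn


class CascadedAttentionFusion(nn.Module):
    """Stage 1: self-attention models correlation between frame features.
    Stage 2: a global semantic query attends over the correlated features
    to fuse them into a single visual representation for the decoder."""

    def __init__(self, feat_dim: int, num_heads: int = 4):
        super().__init__()
        # Stage 1: inter-frame correlation (self-attention).
        self.self_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Stage 2: fusion guided by a global semantic query (cross-attention).
        self.guided_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.global_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim) per-frame features.
        corr_feats, _ = self.self_attn(frame_feats, frame_feats, frame_feats)
        # Global semantic query as a mean-pooled summary (an assumption here).
        global_query = self.global_proj(corr_feats.mean(dim=1, keepdim=True))
        # The global query attends over the correlated frame features.
        fused, _ = self.guided_attn(global_query, corr_feats, corr_feats)
        return fused.squeeze(1)  # (batch, feat_dim) fused visual representation


if __name__ == "__main__":
    feats = torch.randn(2, 20, 512)            # 2 clips, 20 frames, 512-d features
    fusion = CascadedAttentionFusion(512)
    print(fusion(feats).shape)                 # torch.Size([2, 512])
```

In a full captioning model, the fused vector would be concatenated with the previous word embedding and fed to a multilayer LSTM decoder at each time step; that decoding stage is omitted here for brevity.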
Pages: 11509-11526
Page count: 17
Related papers
50 records in total
  • [1] Video Captioning Based on Cascaded Attention-Guided Visual Feature Fusion
    Chen, Shuqin
    Yang, Li
    Hu, Yikang
    [J]. NEURAL PROCESSING LETTERS, 2023, 55 (08) : 11509 - 11526
  • [2] Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning
    Dong, Shanshan
    Niu, Tianzi
    Luo, Xin
    Liu, Wu
    Xu, Xinshun
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [3] Attention-guided image captioning with adaptive global and local feature fusion
    Zhong, Xian
    Nie, Guozhang
    Huang, Wenxin
    Liu, Wenxuan
    Ma, Bo
    Lin, Chia-Wen
    [J]. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2021, 78
  • [4] Object Detection by Attention-Guided Feature Fusion Network
    Shi, Yuxuan
    Fan, Yue
    Xu, Siqi
    Gao, Yue
    Gao, Ran
    [J]. SYMMETRY-BASEL, 2022, 14 (05):
  • [5] Attention-guided Feature Fusion for Small Object Detection
    Yang, Jiaxiong
    Liu, Xianhui
    Liu, Zhuang
    [J]. IST 2023 - IEEE International Conference on Imaging Systems and Techniques, Proceedings, 2023,
  • [6] Attention-Guided Disentangled Feature Aggregation for Video Object Detection
    Muralidhara, Shishir
    Hashmi, Khurram Azeem
    Pagani, Alain
    Liwicki, Marcus
    Stricker, Didier
    Afzal, Muhammad Zeshan
    [J]. SENSORS, 2022, 22 (21)
  • [7] Joint Scence Network and Attention-Guided for Image Captioning
    Zhou, Dongming
    Yang, Jing
    Zhang, Canlong
    Tang, Yanping
    [J]. 2021 21ST IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2021), 2021, : 1535 - 1540
  • [8] Attention-Guided Image Captioning through Word Information
    Tang, Ziwei
    Yi, Yaohua
    Sheng, Hao
    [J]. SENSORS, 2021, 21 (23)
  • [9] Attention-guided multi-granularity fusion model for video summarization
    Zhang, Yunzuo
    Liu, Yameng
    Wu, Cunyu
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2024, 249
  • [10] RadioTransformer: A Cascaded Global-Focal Transformer for Visual Attention-Guided Disease Classification
    Bhattacharya, Moinak
    Jain, Shubham
    Prasanna, Prateek
    [J]. COMPUTER VISION, ECCV 2022, PT XXI, 2022, 13681 : 679 - 698