Bidirectional transformer with knowledge graph for video captioning

Cited by: 0
Authors
Zhong, Maosheng [1 ]
Chen, Youde [1 ]
Zhang, Hao [1 ]
Xiong, Hao [1 ]
Wang, Zhixiang [1 ]
Affiliations
[1] Jiangxi Normal Univ, Nanchang, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video captioning; Bidirectional transformer; Knowledge graph; Multimodal of video;
DOI
10.1007/s11042-023-17822-4
CLC classification number
TP [automation technology, computer technology];
Discipline code
0812;
Abstract
Models based on the transformer architecture have risen to prominence in video captioning. However, most models improve only the encoder or only the decoder, because improving both simultaneously may amplify the shortcomings of either side. Based on the transformer architecture, we connect a bidirectional decoder with an encoder that integrates fine-grained spatio-temporal features, objects, and the relationships between objects in the video. Experiments show that improvements in the encoder amplify the information leakage of the bidirectional decoder and thus produce worse results. To tackle this problem, we generate pseudo reverse captions and propose a Bidirectional Transformer with Knowledge Graph (BTKG), which feeds the outputs of two encoders into the forward and backward decoders of the bidirectional decoder, respectively. In addition, we make fine-grained improvements inside each encoder according to the four modal features of the video. Experiments on two mainstream benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of BTKG, which achieves state-of-the-art performance on key metrics. Moreover, the sentences generated by BTKG contain scene words and modifiers that better match human language habits. Code is available at https://github.com/nickchen121/BTKG.
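To make the decoding scheme described in the abstract concrete, the sketch below shows two encoders (one for spatio-temporal video features, one for knowledge-graph object features) feeding a forward and a backward transformer decoder, with the backward stream trained on a pseudo reverse caption. This is a minimal PyTorch sketch based only on the abstract: all names (BidirectionalCaptioner, make_pseudo_reverse, appearance_enc, knowledge_enc) and the way the pseudo reverse caption is built are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


def make_pseudo_reverse(tokens: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Reverse the non-padding tokens of each caption; one naive reading of
    'pseudo reverse captions' (the paper's generation scheme may differ)."""
    reversed_tokens = tokens.clone()
    for i, row in enumerate(tokens):
        valid = row[row != pad_id]
        reversed_tokens[i, : valid.numel()] = valid.flip(0)
    return reversed_tokens


class BidirectionalCaptioner(nn.Module):
    """Two encoders (spatio-temporal features vs. knowledge-graph object
    features) whose outputs feed a forward and a backward decoder."""

    def __init__(self, vocab_size: int, d_model: int = 512, nhead: int = 8):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.appearance_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.knowledge_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.fwd_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.bwd_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, object_feats, captions, pad_id=0):
        # Each encoder sees a different modality of the video.
        mem_fwd = self.appearance_enc(video_feats)
        mem_bwd = self.knowledge_enc(object_feats)
        # Forward decoder reads the caption left-to-right; the backward
        # decoder reads a pseudo reverse caption (here simply the
        # reversed token sequence).
        fwd_in = self.embed(captions)
        bwd_in = self.embed(make_pseudo_reverse(captions, pad_id))
        seq_len = captions.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        fwd_out = self.fwd_decoder(fwd_in, mem_fwd, tgt_mask=causal)
        bwd_out = self.bwd_decoder(bwd_in, mem_bwd, tgt_mask=causal)
        return self.out(fwd_out), self.out(bwd_out)


if __name__ == "__main__":
    model = BidirectionalCaptioner(vocab_size=1000)
    video = torch.randn(2, 20, 512)    # (batch, frames, d_model) frame features
    objects = torch.randn(2, 10, 512)  # (batch, objects, d_model) KG features
    caps = torch.randint(1, 1000, (2, 12))
    fwd_logits, bwd_logits = model(video, objects, caps)
    print(fwd_logits.shape, bwd_logits.shape)  # torch.Size([2, 12, 1000]) each
```

Routing each encoder to a separate decoder, as described in the abstract, is what lets the two caption directions be supervised on different targets; the losses of the forward and backward streams would then be combined during training.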
Pages: 58309-58328
Page count: 20
Related papers
50 in total
  • [1] Text with Knowledge Graph Augmented Transformer for Video Captioning
    Gu, Xin
    Chen, Guang
    Wang, Yufei
    Zhang, Libo
    Luo, Tiejian
    Wen, Longyin
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 18941 - 18951
  • [2] Image captioning with transformer and knowledge graph
    Zhang, Yu
    Shi, Xinyu
    Mi, Siya
    Yang, Xu
[J]. PATTERN RECOGNITION LETTERS, 2021, 143 : 43 - 49
  • [3] Action knowledge for video captioning with graph neural networks
    Hendria, Willy Fitra
    Velda, Vania
    Putra, Bahy Helmi Hartoyo
    Adzaka, Fikriansyah
    Jeong, Cheol
    [J]. JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2023, 35 (04) : 50 - 62
  • [4] Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning
    Zhang, Junchao
    Peng, Yuxin
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 8319 - 8328
  • [5] CAPTIONING TRANSFORMER WITH SCENE GRAPH GUIDING
    Chen, Haishun
    Wang, Ying
    Yang, Xin
    Li, Jie
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 2538 - 2542
  • [6] Video captioning via a symmetric bidirectional decoder
    Qi, Shanshan
    Yang, Luxi
    [J]. IET COMPUTER VISION, 2021, 15 (04) : 283 - 296
  • [7] Incorporating the Graph Representation of Video and Text into Video Captioning
    Lu, Min
    Li, Yuan
    [J]. 2022 IEEE 34TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI, 2022, : 396 - 401
  • [8] UAT: Universal Attention Transformer for Video Captioning
    Im, Heeju
    Choi, Yong-Suk
    [J]. SENSORS, 2022, 22 (13)
  • [9] Accelerated masked transformer for dense video captioning
    Yu, Zhou
    Han, Nanjia
    [J]. NEUROCOMPUTING, 2021, 445 : 72 - 80
  • [10] Bidirectional Transformer for Video Deblurring
    Xu, Qian
    Qian, Yuntao
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (12) : 8450 - 8461