Context Visual Information-based Deliberation Network for Video Captioning

Cited by: 0
Authors
Lu, Min [1 ]
Li, Xueyong [1 ]
Liu, Caihua [1 ]
Affiliations
[1] Civil Aviat Univ China, Coll Comp Sci & Technol, Tianjin 300300, Peoples R China
Keywords
ATTENTION; IMAGE;
DOI
10.1109/ICPR48806.2021.9413314
CLC number
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video captioning automatically generates an accurate textual description for a video. Typical methods following the encoder-decoder architecture use hidden states directly to predict words; they do not amend inaccurate hidden states before feeding them into word prediction, which leads to cascading errors during word-by-word generation. In this paper, a context visual information-based deliberation network, abbreviated CVI-DelNet, is proposed. Its key idea is to introduce a deliberator into the encoder-decoder framework. The encoder-decoder first generates a raw hidden-state sequence. Unlike existing methods, each raw hidden state is not used directly for word prediction but is fed into the deliberator to produce a refined hidden state. Words are then predicted from the refined hidden states and the contextual visual features. Results on two datasets show that the proposed method significantly outperforms state-of-the-art methods.
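The deliberation step described in the abstract can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: all dimensions, weight matrices (`W_att`, `W_del`, `W_out`), and the attention/refinement formulas are assumptions chosen only to show the data flow from a raw hidden state, through attended contextual visual features, to a refined hidden state and word logits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper)
d_h, d_v, vocab = 8, 6, 20   # hidden size, visual-feature size, vocabulary size
T, N = 4, 5                  # caption length, number of video frames

V = rng.standard_normal((N, d_v))      # frame-level visual features
H_raw = rng.standard_normal((T, d_h))  # raw hidden states from the encoder-decoder

# Hypothetical parameters of the deliberator and word predictor
W_att = rng.standard_normal((d_h, d_v))          # attention scoring
W_del = rng.standard_normal((d_h, d_h + d_v))    # deliberation (refinement)
W_out = rng.standard_normal((vocab, d_h + d_v))  # word prediction

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def deliberate_step(h_raw):
    """Refine one raw hidden state using attended contextual visual features."""
    alpha = softmax(h_raw @ W_att @ V.T)                 # attention over frames
    c = alpha @ V                                        # contextual visual feature
    h_ref = np.tanh(W_del @ np.concatenate([h_raw, c]))  # refined hidden state
    logits = W_out @ np.concatenate([h_ref, c])          # word scores from refined state + context
    return h_ref, logits

H_ref = np.stack([deliberate_step(h)[0] for h in H_raw])
print(H_ref.shape)  # (4, 8)
```

The point of the sketch is the ordering: word prediction reads the *refined* hidden state together with the contextual visual feature, never the raw hidden state directly.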
Pages: 9812-9818
Page count: 7