Refocused Attention: Long Short-Term Rewards Guided Video Captioning

Cited by: 1
Authors
Dong, Jiarong [1, 2]
Gao, Ke [1]
Chen, Xiaokai [1, 2]
Cao, Juan [1]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Keywords
Video captioning; Hierarchical attention; Reinforcement learning; Reward;
DOI
10.1007/s11063-019-10030-y
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Adaptive cooperation between the visual model and the language model is essential for video captioning. However, because end-to-end training lacks proper guidance at each time step, over-dependence on the language model often renders the attention-based visual model ineffective; this paper calls this the 'Attention Defocus' problem. Based on the observation that the recognition precision of entity words can reflect the effectiveness of the visual model, we propose a novel strategy called refocused attention, which optimizes the training and cooperation of the visual and language models by providing guidance at the appropriate time steps. The strategy consists of a short-term-reward guided local entity recognition and a long-term-reward guided global relation understanding, neither of which requires any external training data. Moreover, a framework with hierarchical visual representations and hierarchical attention is established to fully exploit the potential of the proposed learning strategy. Extensive experiments demonstrate that the guidance strategy, together with the optimized structure, outperforms state-of-the-art video captioning methods, with relative improvements of 7.7% in BLEU-4 and 5.0% in CIDEr-D on the MSVD dataset, even without multi-modal features.
Pages: 935 - 948
Page count: 14
Related papers
50 in total
  • [1] Refocused Attention: Long Short-Term Rewards Guided Video Captioning
    Dong, Jiarong
    Gao, Ke
    Chen, Xiaokai
    Cao, Juan
    Neural Processing Letters, 2020, 52 : 935 - 948
  • [2] Multi-guiding long short-term memory for video captioning
    Xu, Ning
    Liu, An-An
    Nie, Weizhi
    Su, Yuting
    Multimedia Systems, 2019, 25 (06) : 663 - 672
  • [3] Long Short-Term Relation Transformer With Global Gating for Video Captioning
    Li, Liang
    Gao, Xingyu
    Deng, Jincan
    Tu, Yunbin
    Zha, Zheng-Jun
    Huang, Qingming
    IEEE Transactions on Image Processing, 2022, 31 : 2726 - 2738
  • [4] Video captioning using boosted and parallel Long Short-Term Memory networks
    Nabati, Masoomeh
    Behrad, Alireza
    Computer Vision and Image Understanding, 2020, 190
  • [5] Short-term anchor linking and long-term self-guided attention for video object detection
    Cores, Daniel
    Brea, Victor M.
    Mucientes, Manuel
    Image and Vision Computing, 2021, 110
  • [6] Long Short-Term Attention
    Zhong, Guoqiang
    Lin, Xin
    Chen, Kang
    Li, Qingyang
    Huang, Kaizhu
    Advances in Brain Inspired Cognitive Systems, 2020, 11691 : 45 - 54
  • [7] Image Captioning with Bidirectional Semantic Attention-Based Guiding of Long Short-Term Memory
    Cao, Pengfei
    Yang, Zhongyi
    Sun, Liang
    Liang, Yanchun
    Yang, Mary Qu
    Guan, Renchu
    Neural Processing Letters, 2019, 50 (01) : 103 - 119
  • [8] Motion Guided Spatial Attention for Video Captioning
    Chen, Shaoxiang
    Jiang, Yu-Gang
    Thirty-Third AAAI Conference on Artificial Intelligence / Thirty-First Innovative Applications of Artificial Intelligence Conference / Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, 2019 : 8191 - 8198