Refocused Attention: Long Short-Term Rewards Guided Video Captioning

Cited by: 1
Authors
Dong, Jiarong [1 ,2 ]
Gao, Ke [1 ]
Chen, Xiaokai [1 ,2 ]
Cao, Juan [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Keywords
Video captioning; Hierarchical attention; Reinforcement learning; Reward
DOI
10.1007/s11063-019-10030-y
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The adaptive cooperation of the visual model and the language model is essential for video captioning. However, because end-to-end training provides no proper guidance at each time step, over-dependence on the language model often invalidates the attention-based visual model, a phenomenon this paper calls the 'Attention Defocus' problem. Based on the key observation that the recognition precision of entity words reflects the effectiveness of the visual model, we propose a novel strategy, refocused attention, which optimizes the training and cooperation of the visual and language models by applying targeted guidance at the appropriate time steps. The strategy consists of short-term-reward guided local entity recognition and long-term-reward guided global relation understanding, neither of which requires any external training data. Moreover, a framework with hierarchical visual representations and hierarchical attention is established to fully exploit the strength of the proposed learning strategy. Extensive experiments demonstrate that the guidance strategy, together with the optimized structure, outperforms state-of-the-art video captioning methods, with relative improvements of 7.7% in BLEU-4 and 5.0% in CIDEr-D on the MSVD dataset, even without multi-modal features.
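The record does not include the paper's reward formulas, so as a rough illustration only, the sketch below shows one common way a per-step (short-term) entity reward and a sequence-level (long-term) reward such as CIDEr-D could be combined in a REINFORCE-style captioning loss. All names (refocused_pg_loss, entity_mask, etc.), the reward definitions, and the 50/50 weighting are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch, NOT the authors' implementation: mixing a short-term
# per-step entity reward with a long-term sequence reward in a
# REINFORCE-style policy-gradient loss for caption sampling.
import torch

def refocused_pg_loss(log_probs, entity_mask, entity_reward, seq_reward,
                      baseline=0.0, w_short=0.5, w_long=0.5):
    """log_probs:     (B, T) log-probabilities of the sampled words
    entity_mask:   (B, T) 1.0 at steps that emit entity words, else 0.0
    entity_reward: (B, T) short-term reward, e.g. 1.0 if the sampled
                   entity word is correct (read only where mask == 1)
    seq_reward:    (B,)  long-term reward for the whole caption,
                   e.g. CIDEr-D against the references
    baseline:      scalar or (B,) baseline to reduce gradient variance
    """
    # Short-term term: reinforce only the entity time steps.
    short = (entity_mask * entity_reward * log_probs).sum(dim=1)
    # Long-term term: broadcast the sequence-level reward over all steps.
    long_ = ((seq_reward - baseline).unsqueeze(1) * log_probs).sum(dim=1)
    # REINFORCE maximizes expected reward, so minimize the negative.
    return -(w_short * short + w_long * long_).mean()

# Toy usage with random tensors (shapes only; no real model or metric).
torch.manual_seed(0)
B, T = 2, 6
log_probs = torch.log(torch.rand(B, T).clamp(min=1e-6))
entity_mask = (torch.rand(B, T) > 0.5).float()
entity_reward = torch.randint(0, 2, (B, T)).float()
seq_reward = torch.rand(B)
loss = refocused_pg_loss(log_probs, entity_mask, entity_reward, seq_reward)
```

The design point such a formulation captures is the one the abstract argues for: the entity-masked term gives the visual pathway a dense, step-local signal exactly where visual recognition matters, while the sequence-level term still rewards globally coherent relations.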
Pages: 935-948
Page count: 14