Refocused Attention: Long Short-Term Rewards Guided Video Captioning

Cited by: 1
Authors
Dong, Jiarong [1 ,2 ]
Gao, Ke [1 ]
Chen, Xiaokai [1 ,2 ]
Cao, Juan [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Keywords
Video captioning; Hierarchical attention; Reinforcement learning; Reward;
DOI
10.1007/s11063-019-10030-y
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The adaptive cooperation of the visual model and the language model is essential for video captioning. However, for lack of proper guidance at each time step during end-to-end training, over-dependence on the language model often invalidates the attention-based visual model, a phenomenon we call the 'Attention Defocus' problem in this paper. Based on the important observation that the recognition precision of entity words reflects the effectiveness of the visual model, we propose a novel strategy called refocused attention, which optimizes the training and cooperation of the visual and language models by applying targeted guidance at the appropriate time steps. The strategy combines short-term-reward guided local entity recognition with long-term-reward guided global relation understanding, neither of which requires any external training data. Moreover, a framework with hierarchical visual representations and hierarchical attention is established to fully exploit the potential of the proposed learning strategy. Extensive experiments demonstrate that this guidance strategy, together with the optimized structure, outperforms state-of-the-art video captioning methods, with relative improvements of 7.7% in BLEU-4 and 5.0% in CIDEr-D on the MSVD dataset, even without multi-modal features.
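The abstract does not give the paper's exact reward definitions, but the core idea of mixing a per-step (short-term) entity-recognition signal with a sentence-level (long-term) score can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `short_term_reward`, `long_term_reward`, and the token-overlap F1 used as a stand-in for CIDEr-D are all assumptions introduced here.

```python
def short_term_reward(pred_word, ref_entities):
    """Local, per-time-step reward: 1.0 if the generated word is a
    reference entity word (hypothetical stand-in for the paper's
    entity-recognition reward)."""
    return 1.0 if pred_word in ref_entities else 0.0


def long_term_reward(pred_caption, ref_caption):
    """Global, sentence-level reward: token-overlap F1 between the
    generated and reference captions (a simple stand-in for a metric
    such as CIDEr-D)."""
    pred, ref = pred_caption.split(), ref_caption.split()
    overlap = len(set(pred) & set(ref))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mixed_rewards(pred_caption, ref_caption, ref_entities, lam=0.5):
    """Per-time-step rewards blending the local entity signal with the
    global sentence score, as would feed a REINFORCE-style policy
    gradient; lam weights short-term vs. long-term guidance."""
    global_r = long_term_reward(pred_caption, ref_caption)
    return [lam * short_term_reward(w, ref_entities) + (1 - lam) * global_r
            for w in pred_caption.split()]
```

In such a scheme, entity words receive an extra reward spike at the step where they are emitted, giving the attention-based visual model direct per-step credit rather than only a delayed sentence-level signal.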
Pages: 935-948
Number of pages: 14
Related Papers
50 items in total
  • [21] Research on Volleyball Video Intelligent Description Technology Combining the Long-Term and Short-Term Memory Network and Attention Mechanism
    Gao, Yuhua
    Mo, Yong
    Zhang, Heng
    Huang, Ruiyin
    Chen, Zilong
    COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2021, 2021
  • [22] LSCAformer: Long and short-term cross-attention-aware transformer for depression recognition from video sequences
    He, Lang
    Li, Zheng
    Tiwari, Prayag
    Zhu, Feng
    Wu, Di
    BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2024, 98
  • [24] Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning
    Dong, Shanshan
    Niu, Tianzi
    Luo, Xin
    Liu, Wu
    Xu, Xinshun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [25] Video Captioning Based on Cascaded Attention-Guided Visual Feature Fusion
    Chen, Shuqin
    Yang, Li
    Hu, Yikang
    NEURAL PROCESSING LETTERS, 2023, 55 (08) : 11509 - 11526
  • [26] Long Short-Term Memory and Attention Models for Simulating Urban Densification
    El Hajjar, S.
    Abdallah, F.
    Kassem, H.
    Omrani, H.
    SUSTAINABLE CITIES AND SOCIETY, 2023, 98
  • [27] Dilated Long Short-Term Attention For Chaotic Time Series Applications
    Alyousif, Fatimah J.
    Alkhaldi, Nora A.
    2022 IEEE CONFERENCE ON EVOLVING AND ADAPTIVE INTELLIGENT SYSTEMS (IEEE EAIS 2022), 2022,
  • [29] Research on Attention Classification Based on Long Short-term Memory Network
    Wang Pai
    Wu Fan
    Wang Mei
    Qin Xue-Bin
    2020 5TH INTERNATIONAL CONFERENCE ON MECHANICAL, CONTROL AND COMPUTER ENGINEERING (ICMCCE 2020), 2020, : 1148 - 1151
  • [30] Long- and short-term collaborative attention networks for sequential recommendation
    Dong, Yumin
    Zha, Yongfu
    Zhang, Yongjian
    Zha, Xinji
    JOURNAL OF SUPERCOMPUTING, 2023, 79 (16): : 18375 - 18393