Spatio-Temporal Memory Attention for Image Captioning

Cited by: 56
Authors
Ji, Junzhong [1 ,2 ]
Xu, Cheng [1 ,2 ]
Zhang, Xiaodan [1 ,2 ]
Wang, Boyue [1 ,2 ]
Song, Xinhang [3 ]
Affiliations
[1] Beijing Univ Technol, Beijing Artificial Intelligence Inst, Beijing 100124, Peoples R China
[2] Beijing Univ Technol, Fac Informat Technol, Beijing Municipal Key Lab Multimedia & Intelligen, Beijing 100124, Peoples R China
[3] Chinese Acad Sci, Inst Comp Technol, CAS, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; spatio-temporal relationship; attention transmission; memory attention; LSTM; VISUAL-ATTENTION; MECHANISMS; NETWORKS; MODEL;
DOI
10.1109/TIP.2020.3004729
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Visual attention has been successfully applied in image captioning to selectively incorporate the most relevant areas into the language generation procedure. However, the attention in current image captioning methods is guided only indirectly and implicitly by the hidden state of the language model, e.g., an LSTM (Long Short-Term Memory) network, so the attended areas are only weakly related across time steps. Besides the spatial relationship of attention areas, the temporal relationship in attention is crucial for image captioning, according to the attention transmission mechanism of human vision. In this paper, we propose a new spatio-temporal memory attention (STMA) model to learn the spatio-temporal relationship in attention for image captioning. The STMA introduces a memory mechanism into the attention model through a tailored LSTM, where the new cell is used to memorize and propagate the attention information, and the output gate is used to generate the attention weights. The attention in STMA transmits with memory adaptively and dependently, which builds strong temporal connections between attentions and simultaneously learns the spatio-temporal relationship of attended areas. Moreover, the proposed STMA can be flexibly combined with attention-based image captioning frameworks. Experiments on the MS COCO dataset demonstrate the superiority of the proposed STMA model in exploring the spatio-temporal relationship in attention and improving current attention-based image captioning methods.
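The abstract's core idea can be illustrated with a minimal sketch: an LSTM-style attention cell whose internal cell state carries attention information from step to step, and whose output gate produces the per-region attention weights. This is a hedged, hypothetical reconstruction from the abstract alone, not the authors' implementation; all layer sizes, the input layout, and the exact gating arrangement are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class STMAttentionSketch:
    """Hypothetical sketch of spatio-temporal memory attention (STMA).

    Per the abstract: a tailored LSTM whose cell state memorizes and
    propagates attention information over K image regions, and whose
    output gate generates the attention weights. Details are assumed.
    """

    def __init__(self, k_regions, d_feat, d_hidden, seed=0):
        rng = np.random.default_rng(seed)
        d_in = d_feat + d_hidden           # [mean region feature; LM hidden state]
        # one weight matrix per gate, each producing K values (one per region)
        self.W_i = rng.normal(0, 0.1, (k_regions, d_in))
        self.W_f = rng.normal(0, 0.1, (k_regions, d_in))
        self.W_g = rng.normal(0, 0.1, (k_regions, d_in))
        self.W_o = rng.normal(0, 0.1, (k_regions, d_in))
        self.c = np.zeros(k_regions)       # attention memory cell

    def step(self, regions, h_lang):
        """regions: (K, d_feat) region features; h_lang: (d_hidden,) language
        LSTM hidden state. Returns (attention weights, attended context)."""
        x = np.concatenate([regions.mean(axis=0), h_lang])
        i = sigmoid(self.W_i @ x)          # input gate: admit new attention evidence
        f = sigmoid(self.W_f @ x)          # forget gate: decay old attention memory
        g = np.tanh(self.W_g @ x)          # candidate attention information
        self.c = f * self.c + i * g        # memorize & propagate attention over time
        o = sigmoid(self.W_o @ x)          # output gate
        logits = o * np.tanh(self.c)       # gate the memory into per-region scores
        alpha = np.exp(logits - logits.max())
        alpha /= alpha.sum()               # softmax -> attention weights
        context = alpha @ regions          # attended visual context vector
        return alpha, context
```

Because the weights at step t are computed from a cell state accumulated over earlier steps, the attention at each step depends explicitly on where the model attended before; this is the temporal linkage the abstract contrasts with attention driven only by the language model's hidden state.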
Pages: 7615 - 7628
Page count: 14
Related Papers
(showing 10 of 50)
  • [1] Diverse Video Captioning by Adaptive Spatio-temporal Attention
    Ghaderi, Zohreh
    Salewski, Leonard
    Lensch, Hendrik P. A.
    [J]. PATTERN RECOGNITION, DAGM GCPR 2022, 2022, 13485 : 409 - 425
  • [2] Spatio-Temporal Attention Models for Grounded Video Captioning
    Zanfir, Mihai
    Marinoiu, Elisabeta
    Sminchisescu, Cristian
    [J]. COMPUTER VISION - ACCV 2016, PT IV, 2017, 10114 : 104 - 119
  • [3] Video Captioning via Sentence Augmentation and Spatio-Temporal Attention
    Chen, Tseng-Hung
    Zeng, Kuo-Hao
    Hsu, Wan-Ting
    Sun, Min
    [J]. COMPUTER VISION - ACCV 2016 WORKSHOPS, PT I, 2017, 10116 : 269 - 286
  • [4] Spatio-Temporal Ranked-Attention Networks for Video Captioning
    Cherian, Anoop
    Wang, Jue
    Hori, Chiori
    Marks, Tim K.
    [J]. 2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020, : 1606 - 1615
  • [5] Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism
    Guo, Dashan
    Li, Wei
    Fang, Xiangzhong
    [J]. NEURAL PROCESSING LETTERS, 2017, 46 (01) : 313 - 328
  • [7] Spatio-temporal ontologies and attention
    University of Freiburg, Freiburg, Germany
    [J]. SPATIAL COGNITION AND COMPUTATION, 2007, (1) : 13 - 32
  • [8] Exploring the Spatio-Temporal Aware Graph for video captioning
    Xue, Ping
    Zhou, Bing
    [J]. IET COMPUTER VISION, 2022, 16 (05) : 456 - 467
  • [9] Spatio-Temporal Memory Streaming
    Somogyi, Stephen
    Wenisch, Thomas F.
    Ailamaki, Anastasia
    Falsafi, Babak
    [J]. ISCA 2009: 36TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, 2009, : 69 - 80
  • [10] Action Recognition With Spatio-Temporal Visual Attention on Skeleton Image Sequences
    Yang, Zhengyuan
    Li, Yuncheng
    Yang, Jianchao
    Luo, Jiebo
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2019, 29 (08) : 2405 - 2415