Image Caption Generation with Hierarchical Contextual Visual Spatial Attention

Cited by: 10
Authors:
Khademi, Mahmoud [1]
Schulte, Oliver [1]
Affiliation:
[1] Simon Fraser Univ, Burnaby, BC, Canada
DOI:
10.1109/CVPRW.2018.00260
CLC number:
TP18 [Artificial Intelligence Theory]
Discipline codes:
081104; 0812; 0835; 1405
Abstract
We present a novel context-aware attention-based deep architecture for image caption generation. Our architecture employs a Bidirectional Grid LSTM, which takes visual features of an image as input and learns complex spatial patterns based on two-dimensional context, by selecting or ignoring its input. The Grid LSTM has not previously been applied to the image caption generation task. Another novel aspect is that we leverage a set of local region-grounded texts obtained by transfer learning. The region-grounded texts often describe the properties of the objects and their relationships in an image. To generate a global caption for the image, we integrate the spatial features from the Grid LSTM with the local region-grounded texts, using a two-layer Bidirectional LSTM. The first layer models the global scene context, such as object presence. The second layer utilizes a novel dynamic spatial attention mechanism, based on another Grid LSTM, to generate the global caption word-by-word, while considering the caption context around a word in both directions. Unlike recent models that use a soft attention mechanism, our dynamic spatial attention mechanism considers the spatial context of the image regions. Experimental results on the MS-COCO dataset show that our architecture outperforms the state-of-the-art.
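The abstract contrasts the proposed dynamic spatial attention with the standard soft attention used by earlier captioning models. As a point of reference only (this is the baseline mechanism the paper improves on, not the paper's own Grid-LSTM-based method), here is a minimal NumPy sketch of one soft spatial attention step over an H x W grid of visual features; all weight names and shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def spatial_attention(grid_feats, hidden, W_f, W_h, v):
    """Generic soft spatial attention over an H x W feature grid.

    grid_feats: (H, W, D) visual features, one vector per image region
    hidden:     (K,) decoder hidden state at the current word
    W_f, W_h, v: learned projections (illustrative shapes (D, A), (K, A), (A,))
    Returns the attended context vector (D,) and the weight map (H, W).
    """
    H, W, D = grid_feats.shape
    flat = grid_feats.reshape(H * W, D)               # one row per region
    scores = np.tanh(flat @ W_f + hidden @ W_h) @ v   # (H*W,) relevance scores
    alpha = softmax(scores)                           # attention weights, sum to 1
    context = alpha @ flat                            # weighted sum of region features
    return context, alpha.reshape(H, W)
```

Note that each region is scored independently of its neighbors here; the paper's dynamic spatial attention instead runs the scoring through a Grid LSTM so that a region's weight depends on the surrounding spatial context.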
Pages: 2024 - 2032
Page count: 9