Image Caption Generation with Hierarchical Contextual Visual Spatial Attention

Cited by: 10
Authors
Khademi, Mahmoud [1 ]
Schulte, Oliver [1 ]
Affiliations
[1] Simon Fraser Univ, Burnaby, BC, Canada
DOI: 10.1109/CVPRW.2018.00260
CLC Classification: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
We present a novel context-aware, attention-based deep architecture for image caption generation. Our architecture employs a Bidirectional Grid LSTM, which takes visual features of an image as input and learns complex spatial patterns based on two-dimensional context, by selecting or ignoring its input. The Grid LSTM has not previously been applied to the image caption generation task. Another novel aspect is that we leverage a set of local region-grounded texts obtained by transfer learning. The region-grounded texts often describe the properties of the objects and their relationships in an image. To generate a global caption for the image, we integrate the spatial features from the Grid LSTM with the local region-grounded texts, using a two-layer Bidirectional LSTM. The first layer models the global scene context, such as object presence. The second layer utilizes a novel dynamic spatial attention mechanism, based on another Grid LSTM, to generate the global caption word by word, while considering the caption context around a word in both directions. Unlike recent models that use a soft attention mechanism, our dynamic spatial attention mechanism considers the spatial context of the image regions. Experimental results on the MS-COCO dataset show that our architecture outperforms the state of the art.
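The attention mechanisms discussed above all score a 2-D grid of region features against the decoder's current state to produce a weighted context vector. As a point of reference for that idea (not the paper's Grid-LSTM-based dynamic mechanism, whose exact formulation is not given here), the following is a minimal sketch of plain soft spatial attention over a feature grid; the scoring matrix `W` and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def spatial_attention(grid_feats, query, W):
    """Soft attention over an H x W grid of D-dim region features.

    grid_feats: (H, W, D) visual features, e.g. a CNN feature map.
    query:      (Q,) decoder state at the current word step.
    W:          (D, Q) bilinear scoring matrix (illustrative assumption).
    Returns the attended context vector (D,) and the (H, W) weight map.
    """
    H, Wd, D = grid_feats.shape
    flat = grid_feats.reshape(-1, D)      # (H*W, D): one row per region
    scores = flat @ W @ query             # (H*W,): relevance of each region
    weights = softmax(scores)             # normalized attention weights
    context = weights @ flat              # (D,): weighted sum of regions
    return context, weights.reshape(H, Wd)

# Toy usage with random features and a random decoder state.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))
q = rng.normal(size=(8,))
Wm = rng.normal(size=(8, 8))
ctx, attn = spatial_attention(feats, q, Wm)
```

The paper's contribution is precisely that this flat weighting ignores spatial context: each region is scored independently, whereas the proposed dynamic spatial attention runs a Grid LSTM over the regions so that a region's weight depends on its two-dimensional neighborhood.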
Pages: 2024-2032 (9 pages)
Related Papers (50 total; first 10 shown)
  • [1] Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction
    Sasibhooshan, Reshmi
    Kumaraswamy, Suresh
    Sasidharan, Santhoshkumar
    JOURNAL OF BIG DATA, 2023, 10 (01)
  • [3] Clothes image caption generation with attribute detection and visual attention model
    Li, Xianrui
    Ye, Zhiling
    Zhang, Zhao
    Zhao, Mingbo
    PATTERN RECOGNITION LETTERS, 2021, 141 : 68 - 74
  • [4] Chinese Image Caption Generation via Visual Attention and Topic Modeling
    Liu, Maofu
    Hu, Huijun
    Li, Lingjun
    Yu, Yan
    Guan, Weili
    IEEE TRANSACTIONS ON CYBERNETICS, 2022, 52 (02) : 1247 - 1257
  • [5] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
    Xu, Kelvin
    Ba, Jimmy Lei
    Kiros, Ryan
    Cho, Kyunghyun
    Courville, Aaron
    Salakhutdinov, Ruslan
    Zemel, Richard S.
    Bengio, Yoshua
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 37, 2015, 37 : 2048 - 2057
  • [6] GVA: guided visual attention approach for automatic image caption generation
    Hossen, Md. Bipul
    Ye, Zhongfu
    Abdussalam, Amr
    Hossain, Md. Imran
    MULTIMEDIA SYSTEMS, 2024, 30 (01)
  • [8] Image caption based on Visual Attention Mechanism
    Zhou, Jinfei
    Zhu, Yaping
    Pan, Hong
    PROCEEDINGS OF 2019 INTERNATIONAL CONFERENCE ON IMAGE, VIDEO AND SIGNAL PROCESSING (IVSP 2019), 2019, : 28 - 32
  • [9] Cross-Lingual Image Caption Generation Based on Visual Attention Model
    Wang, Bin
    Wang, Cungang
    Zhang, Qian
    Su, Ying
    Wang, Yang
    Xu, Yanyan
    IEEE ACCESS, 2020, 8 : 104543 - 104554
  • [10] Spatial Relational Attention Using Fully Convolutional Networks for Image Caption Generation
    Jiang, Teng
    Gong, Liang
    Yang, Yupu
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS, 2020, 19 (02)