Image Caption Generation with Hierarchical Contextual Visual Spatial Attention

Cited by: 10
Authors: Khademi, Mahmoud [1]; Schulte, Oliver [1]
Affiliation: [1] Simon Fraser Univ, Burnaby, BC, Canada
DOI: 10.1109/CVPRW.2018.00260
CLC number: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
We present a novel context-aware attention-based deep architecture for image caption generation. Our architecture employs a Bidirectional Grid LSTM, which takes visual features of an image as input and learns complex spatial patterns based on two-dimensional context, by selecting or ignoring its input. The Grid LSTM has not been applied to the image caption generation task before. Another novel aspect is that we leverage a set of local region-grounded texts obtained by transfer learning. The region-grounded texts often describe the properties of the objects and their relationships in an image. To generate a global caption for the image, we integrate the spatial features from the Grid LSTM with the local region-grounded texts, using a two-layer Bidirectional LSTM. The first layer models the global scene context, such as object presence. The second layer utilizes a novel dynamic spatial attention mechanism, based on another Grid LSTM, to generate the global caption word by word, while considering the caption context around a word in both directions. Unlike recent models that use a soft attention mechanism, our dynamic spatial attention mechanism considers the spatial context of the image regions. Experimental results on the MS-COCO dataset show that our architecture outperforms the state-of-the-art.
Pages: 2024 - 2032
Page count: 9
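To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch. It is not the authors' implementation: the Bidirectional Grid LSTM is approximated by bidirectional LSTM passes along the rows and columns of the CNN feature grid, the region-grounded-text branch is omitted, and the dynamic spatial attention is replaced by standard additive attention over the contextualized grid cells. All module names, sizes, and the toy data are illustrative assumptions.

```python
# Minimal sketch (NOT the paper's code): a spatially contextualized grid encoder
# plus an attention-based caption decoder, in the spirit of the abstract above.
import torch
import torch.nn as nn


class GridContextEncoder(nn.Module):
    """Contextualize an H x W grid of visual features with row- and column-wise BiLSTMs
    (a simple stand-in for the paper's Bidirectional Grid LSTM)."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.row_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.col_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, grid: torch.Tensor) -> torch.Tensor:
        # grid: (B, H, W, D) -> contextual features: (B, H, W, 4 * hidden_dim)
        B, H, W, D = grid.shape
        rows, _ = self.row_lstm(grid.reshape(B * H, W, D))                   # left/right context
        cols, _ = self.col_lstm(grid.transpose(1, 2).reshape(B * W, H, D))   # up/down context
        rows = rows.reshape(B, H, W, -1)
        cols = cols.reshape(B, W, H, -1).transpose(1, 2)
        return torch.cat([rows, cols], dim=-1)


class SpatialAttentionDecoder(nn.Module):
    """Generate a caption word by word, attending over the contextualized grid
    (standard additive attention, used here in place of the paper's dynamic spatial attention)."""

    def __init__(self, ctx_dim: int, embed_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim + ctx_dim, hidden_dim)
        self.att_ctx = nn.Linear(ctx_dim, hidden_dim, bias=False)
        self.att_h = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.att_v = nn.Linear(hidden_dim, 1, bias=False)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, ctx: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # ctx: (B, H, W, C) grid context; tokens: (B, T) ground-truth caption (teacher forcing)
        B, H, W, C = ctx.shape
        regions = ctx.reshape(B, H * W, C)
        h = ctx.new_zeros(B, self.cell.hidden_size)
        c = ctx.new_zeros(B, self.cell.hidden_size)
        logits = []
        for t in range(tokens.size(1)):
            # Additive attention: score each grid cell against the current decoder state.
            scores = self.att_v(torch.tanh(self.att_ctx(regions) + self.att_h(h).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)                 # (B, H*W, 1)
            attended = (alpha * regions).sum(dim=1)              # (B, C)
            h, c = self.cell(torch.cat([self.embed(tokens[:, t]), attended], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                         # (B, T, vocab_size)


if __name__ == "__main__":
    # Toy forward pass: a 7x7 grid of 512-d CNN features and a 10-token caption.
    enc = GridContextEncoder(feat_dim=512, hidden_dim=128)
    dec = SpatialAttentionDecoder(ctx_dim=4 * 128, embed_dim=64, hidden_dim=256, vocab_size=1000)
    grid = torch.randn(2, 7, 7, 512)
    caption = torch.randint(0, 1000, (2, 10))
    print(dec(enc(grid), caption).shape)  # torch.Size([2, 10, 1000])
```

The full model described in the abstract differs in that the Grid LSTM couples both spatial dimensions inside a single cell, a second Bidirectional LSTM layer fuses the grid features with region-grounded texts, and the decoder's attention is itself driven by another Grid LSTM rather than by the additive scoring used in this sketch.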