Image Caption Generation with Hierarchical Contextual Visual Spatial Attention

Cited by: 10
Authors
Khademi, Mahmoud [1]
Schulte, Oliver [1]
Affiliations
[1] Simon Fraser Univ, Burnaby, BC, Canada
DOI
10.1109/CVPRW.2018.00260
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We present a novel context-aware attention-based deep architecture for image caption generation. Our architecture employs a Bidirectional Grid LSTM, which takes visual features of an image as input and learns complex spatial patterns based on two-dimensional context, by selecting or ignoring its input. The Grid LSTM has not been applied to the image caption generation task before. Another novel aspect is that we leverage a set of local region-grounded texts obtained by transfer learning. The region-grounded texts often describe the properties of the objects and their relationships in an image. To generate a global caption for the image, we integrate the spatial features from the Grid LSTM with the local region-grounded texts, using a two-layer Bidirectional LSTM. The first layer models the global scene context, such as object presence. The second layer utilizes a novel dynamic spatial attention mechanism, based on another Grid LSTM, to generate the global caption word by word, while considering the caption context around a word in both directions. Unlike recent models that use a soft attention mechanism, our dynamic spatial attention mechanism considers the spatial context of the image regions. Experimental results on the MS-COCO dataset show that our architecture outperforms the state of the art.
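The abstract contrasts the proposed mechanism with soft attention. As an illustration only, and not the authors' model, the following minimal NumPy sketch of a generic dot-product soft attention over a grid of region features suggests why such a baseline ignores spatial context: each region is scored independently of its neighbours, so shuffling the grid cells leaves the attended context vector unchanged. The names `soft_attention`, `regions`, and `query`, the dot-product scoring, and the 4 x 4 x 8 shapes are assumptions made for this example.

```python
import numpy as np


def soft_attention(regions, query):
    """Generic soft attention over a grid of region features (illustrative).

    regions: (H, W, D) array, one D-dimensional feature per grid cell.
    query:   (D,) vector, e.g. a decoder state (hypothetical name).
    Returns the (H, W) attention weights and the (D,) context vector.
    """
    h, w, d = regions.shape
    flat = regions.reshape(h * w, d)
    # Each region is scored on its own; neighbouring cells play no role.
    scores = flat @ query / np.sqrt(d)
    scores = scores - scores.max()  # numerical stability for the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()
    context = weights @ flat  # weighted average of region features
    return weights.reshape(h, w), context


# Assumed toy input: a 4 x 4 grid of 8-dimensional features.
rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 4, 8))
query = rng.normal(size=8)
weights, context = soft_attention(regions, query)

# Shuffle the grid cells: the layout changes, the context vector does not.
perm = rng.permutation(16)
shuffled = regions.reshape(16, 8)[perm].reshape(4, 4, 8)
_, shuffled_context = soft_attention(shuffled, query)
```

Under these assumptions, `context` and `shuffled_context` agree up to floating-point error, which is the layout insensitivity that a spatially aware attention mechanism, such as the Grid LSTM based one described in the abstract, is meant to address.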
Pages: 2024-2032
Number of pages: 9
Related papers
50 in total
  • [21] Visual Attention Based on Long-Short Term Memory Model for Image Caption Generation
    Qu, Shiru
    Xi, Yuling
    Ding, Songtao
    2017 29TH CHINESE CONTROL AND DECISION CONFERENCE (CCDC), 2017, : 4789 - 4794
  • [22] Automatic Generation of Image Caption Based on Semantic Relation using Deep Visual Attention Prediction
    El-gayar, M. M.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2023, 14 (09) : 105 - 114
  • [23] VD-SAN: Visual-Densely Semantic Attention Network for Image Caption Generation
    He, Xinwei
    Yang, Yang
    Shi, Baoguang
    Bai, Xiang
    NEUROCOMPUTING, 2019, 328 : 48 - 55
  • [24] Scene Attention Mechanism for Remote Sensing Image Caption Generation
    Wu, Shiqi
    Zhang, Xiangrong
    Wang, Xin
    Li, Chen
    Jiao, Licheng
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [25] Image caption generation method based on adaptive attention mechanism
    Jin, Huazhong
    Wu, Yu
    Wan, Fang
    Hu, Man
    Li, Qingqing
    MIPPR 2019: PATTERN RECOGNITION AND COMPUTER VISION, 2020, 11430
  • [26] Assamese news image caption generation using attention mechanism
    Das, Ringki
    Singh, Thoudam Doren
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (07) : 10051 - 10069
  • [27] Recurrent Attention LSTM Model for Image Chinese Caption Generation
    Zhang, Chaoying
    Dai, Yaping
    Cheng, Yanyan
    Jia, Zhiyang
    Hirota, Kaoru
    2018 JOINT 10TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING AND INTELLIGENT SYSTEMS (SCIS) AND 19TH INTERNATIONAL SYMPOSIUM ON ADVANCED INTELLIGENT SYSTEMS (ISIS), 2018, : 808 - 813
  • [28] Assamese news image caption generation using attention mechanism
    Ringki Das
    Thoudam Doren Singh
    Multimedia Tools and Applications, 2022, 81 : 10051 - 10069
  • [29] A Hierarchical Attention Model for Social Contextual Image Recommendation
    Wu, Le
    Chen, Lei
    Hong, Richang
    Fu, Yanjie
    Xie, Xing
    Wang, Meng
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2020, 32 (10) : 1854 - 1867
  • [30] Visual Image Caption Generation for Service Robotics and Industrial Applications
    Luo, Ren C.
    Hsu, Yu-Ting
    Wen, Yu-Cheng
    Ye, Huan-Jun
    2019 IEEE INTERNATIONAL CONFERENCE ON INDUSTRIAL CYBER PHYSICAL SYSTEMS (ICPS 2019), 2019, : 827 - 832