Exploring Visual Relationship for Image Captioning

Cited by: 582
Authors
Yao, Ting [1 ]
Pan, Yingwei [1 ]
Li, Yehao [2 ]
Mei, Tao [1 ]
Affiliations
[1] JD AI Res, Beijing, Peoples R China
[2] Sun Yat Sen Univ, Guangzhou, Peoples R China
Keywords
Image captioning; Graph convolutional networks; Visual relationship; Long short-term memory;
DOI
10.1007/978-3-030-01264-9_42
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
It is widely believed that modeling the relationships between objects helps in representing, and ultimately describing, an image. Nevertheless, there has been little evidence supporting this idea in image description generation. In this paper, we introduce a new design that explores the connections between objects for image captioning under the umbrella of an attention-based encoder-decoder framework. Specifically, we present a Graph Convolutional Networks plus Long Short-Term Memory (dubbed GCN-LSTM) architecture that integrates both semantic and spatial object relationships into the image encoder. Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections. The representation of each object region proposal is then refined by leveraging the graph structure through a GCN. With the learnt region-level features, our GCN-LSTM capitalizes on an LSTM-based captioning framework with an attention mechanism for sentence generation. Extensive experiments are conducted on the COCO image captioning dataset, and superior results are reported in comparison with state-of-the-art approaches. Most remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on the COCO testing set.
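The core encoding step the abstract describes, refining detected-region features over a relationship graph with a graph convolution, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the dimensions, weights, and random relation graph are made up, the paper's relation-specific (directed, labeled) transforms are collapsed into a single projection, and the attention-LSTM decoder is omitted.

```python
import numpy as np

def gcn_refine(regions, adj, W, b):
    """One simplified GCN layer: each region's feature is updated by
    aggregating its neighbors in the semantic/spatial relationship
    graph, projecting with a learned weight, and applying ReLU."""
    # Add self-loops so each region keeps its own appearance feature,
    # then row-normalize the adjacency matrix.
    A = adj + np.eye(adj.shape[0])
    A = A / A.sum(axis=1, keepdims=True)
    # Aggregate neighbor features and apply the learned projection.
    return np.maximum(A @ regions @ W + b, 0.0)

rng = np.random.default_rng(0)
n_regions, d_in, d_out = 5, 8, 8                     # toy sizes (hypothetical)
regions = rng.normal(size=(n_regions, d_in))         # detector region features
adj = (rng.random((n_regions, n_regions)) > 0.6).astype(float)  # relation graph
W = rng.normal(size=(d_in, d_out)) * 0.1             # learned projection (random here)
b = np.zeros(d_out)

refined = gcn_refine(regions, adj, W, b)
print(refined.shape)  # one refined feature per region
```

In the paper these refined region features, rather than the raw detector features, are what the attention mechanism of the LSTM decoder attends over at each decoding step.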
Pages: 711-727
Page count: 17
Related Papers
50 records
  • [1] Exploring region relationships implicitly: Image captioning with visual relationship attention
    Zhang, Zongjian
    Wu, Qiang
    Wang, Yang
    Chen, Fang
    IMAGE AND VISION COMPUTING, 2021, 109
  • [2] Visual Relationship Attention for Image Captioning
    Zhang, Zongjian
    Wu, Qiang
    Wang, Yang
    Chen, Fang
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [3] Visual contextual relationship augmented transformer for image captioning
    Su, Qiang
    Hu, Junbo
    Li, Zhixin
    APPLIED INTELLIGENCE, 2024, 54 (06) : 4794 - 4813
  • [5] Social Image Captioning: Exploring Visual Attention and User Attention
    Wang, Leiquan
    Chu, Xiaoliang
    Zhang, Weishan
    Wei, Yiwei
    Sun, Weichen
    Wu, Chunlei
    SENSORS, 2018, 18 (02)
  • [6] Boosting convolutional image captioning with semantic content and visual relationship
    Bai, Cong
    Zheng, Anqi
    Huang, Yuan
    Pan, Xiang
    Chen, Nan
    DISPLAYS, 2021, 70
  • [7] Learning visual relationship and context-aware attention for image captioning
    Wang, Junbo
    Wang, Wei
    Wang, Liang
    Wang, Zhiyong
    Feng, David Dagan
    Tan, Tieniu
    PATTERN RECOGNITION, 2020, 98
  • [8] Exploring refined dual visual features cross-combination for image captioning
    Hu, Junbo
    Li, Zhixin
    Su, Qiang
    Tang, Zhenjun
    Ma, Huifang
    NEURAL NETWORKS, 2024, 180
  • [9] Visual Cluster Grounding for Image Captioning
    Jiang, Wenhui
    Zhu, Minwei
    Fang, Yuming
    Shi, Guangming
    Zhao, Xiaowei
    Liu, Yang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 3920 - 3934
  • [10] Bengali Image Captioning with Visual Attention
    Ami, Amit Saha
    Humaira, Mayeesha
    Jim, Md Abidur Rahman Khan
    Paul, Shimul
    Shah, Faisal Muhammad
    2020 23RD INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY (ICCIT 2020), 2020,