Exploring Visual Relationship for Image Captioning

Cited by: 582
Authors
Yao, Ting [1 ]
Pan, Yingwei [1 ]
Li, Yehao [2 ]
Mei, Tao [1 ]
Affiliations
[1] JD AI Res, Beijing, Peoples R China
[2] Sun Yat Sen Univ, Guangzhou, Peoples R China
Keywords
Image captioning; Graph convolutional networks; Visual relationship; Long short-term memory;
DOI
10.1007/978-3-030-01264-9_42
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
It is widely believed that modeling the relationships between objects helps in representing and, ultimately, describing an image. Nevertheless, there has been little evidence supporting this idea for image description generation. In this paper, we introduce a new design that explores the connections between objects for image captioning under the umbrella of an attention-based encoder-decoder framework. Specifically, we present a Graph Convolutional Networks plus Long Short-Term Memory (dubbed GCN-LSTM) architecture that integrates both semantic and spatial object relationships into the image encoder. Technically, we build graphs over the objects detected in an image based on their spatial and semantic connections. The representation of each object region proposal is then refined by leveraging the graph structure through a GCN. With the learnt region-level features, GCN-LSTM capitalizes on an LSTM-based captioning framework with an attention mechanism for sentence generation. Extensive experiments are conducted on the COCO image captioning dataset, and superior results are reported compared to state-of-the-art approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on the COCO testing set.
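The refinement step described in the abstract, where each detected region's feature is updated using its neighbours in the spatial/semantic graph, can be illustrated with a minimal sketch. This is a simplified mean-aggregation pass for intuition only; the function name, the toy features, and the use of an unweighted mean (rather than the paper's learned, relation-aware transformations) are assumptions, not the authors' implementation.

```python
def gcn_refine(features, edges):
    """One simplified graph-convolution pass: each region's new feature
    is the mean of its own feature and those of its graph neighbours.
    (Illustrative only; the paper uses learned, edge-typed transforms.)"""
    n = len(features)
    dim = len(features[0])
    # Build undirected adjacency sets with self-loops, so a region
    # always keeps part of its own representation.
    neigh = {i: {i} for i in range(n)}
    for u, v in edges:
        neigh[u].add(v)
        neigh[v].add(u)
    refined = []
    for i in range(n):
        acc = [0.0] * dim
        for j in neigh[i]:
            for d in range(dim):
                acc[d] += features[j][d]
        refined.append([x / len(neigh[i]) for x in acc])
    return refined

# Three toy region features; region 0 is linked to regions 1 and 2
# (e.g. a "man" region connected to "bike" and "helmet" regions).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined = gcn_refine(feats, [(0, 1), (0, 2)])
```

After the pass, each region's feature mixes in context from its related objects, which is the property the encoder exploits before attention-guided LSTM decoding.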
Pages: 711 - 727 (17 pages)
Related Papers (50 in total)
  • [41] Aligned visual semantic scene graph for image captioning
    Zhao, Shanshan
    Li, Lixiang
    Peng, Haipeng
    DISPLAYS, 2022, 74
  • [42] Visual News: Benchmark and Challenges in News Image Captioning
    Liu, Fuxiao
    Wang, Yinghan
    Wang, Tianlu
    Ordonez, Vicente
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 6761 - 6771
  • [43] Combine Visual Features and Scene Semantics for Image Captioning
    Li Z.-X.
    Wei H.-Y.
    Huang F.-C.
    Zhang C.-L.
    Ma H.-F.
    Shi Z.-Z.
Li, Zhi-Xin (lizx@gxnu.edu.cn), Science Press, 43: 1624 - 1640
  • [44] Learning joint relationship attention network for image captioning
    Wang, Changzhi
    Gu, Xiaodong
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 211
  • [45] Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning
    Kandala, Hitesh
    Saha, Sudipan
    Banerjee, Biplab
    Zhu, Xiao Xiang
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2022, 19
  • [46] Exploring Spatial-Based Position Encoding for Image Captioning
    Yang, Xiaobao
    He, Shuai
    Wu, Junsheng
    Yang, Yang
    Hou, Zhiqiang
    Ma, Sugang
    MATHEMATICS, 2023, 11 (21)
  • [47] Exploring and Distilling Cross-Modal Information for Image Captioning
    Liu, Fenglin
    Ren, Xuancheng
    Liu, Yuanxin
    Lei, Kai
    Sun, Xu
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 5095 - 5101
  • [48] Exploring Semantic Relationships for Image Captioning without Parallel Data
    Liu, Fenglin
    Gao, Meng
    Zhang, Tianhao
    Zou, Yuexian
    2019 19TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2019), 2019, : 439 - 448
  • [49] Exploring coherence from heterogeneous representations for OCR image captioning
    Zhang, Yao
    Song, Zijie
    Hu, Zhenzhen
    MULTIMEDIA SYSTEMS, 2024, 30 (05)
  • [50] Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning
    Li, Yiran
    Wang, Junpeng
    Aboagye, Prince
    Yeh, Chin-Chia Michael
    Zheng, Yan
    Wang, Liang
    Zhang, Wei
    Ma, Kwan-Liu
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2024, 30 (06) : 2875 - 2887