Exploring Visual Relationship for Image Captioning

Cited by: 582

Authors
Yao, Ting [1 ]
Pan, Yingwei [1 ]
Li, Yehao [2 ]
Mei, Tao [1 ]
Affiliations
[1] JD AI Res, Beijing, Peoples R China
[2] Sun Yat Sen Univ, Guangzhou, Peoples R China
Keywords
Image captioning; Graph convolutional networks; Visual relationship; Long short-term memory;
DOI: 10.1007/978-3-030-01264-9_42
CLC number: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
It has long been believed that modeling relationships between objects would help in representing, and ultimately describing, an image. Nevertheless, there has been little evidence supporting this idea for image description generation. In this paper, we introduce a new design that exploits the connections between objects for image captioning under the umbrella of the attention-based encoder-decoder framework. Specifically, we present a Graph Convolutional Networks plus Long Short-Term Memory (dubbed GCN-LSTM) architecture that integrates both semantic and spatial object relationships into the image encoder. Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections. The representation of each object region proposal is then refined by leveraging the graph structure through a GCN. With the learnt region-level features, GCN-LSTM capitalizes on an LSTM-based captioning framework with an attention mechanism for sentence generation. Extensive experiments are conducted on the COCO image captioning dataset, and superior results are reported compared to state-of-the-art approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on the COCO testing set.
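The abstract's core encoding step, refining detected region features by message passing over a relationship graph, can be sketched as a single graph-convolution layer. This is a minimal illustration, not the paper's implementation: the symmetric normalization, the toy adjacency matrix, and all names here are assumptions, and the paper's actual encoder handles directed semantic/spatial edge types separately.

```python
import numpy as np

def gcn_refine(regions, adj, weight):
    """One GCN layer: refine region features by aggregating neighbors
    over the object-relationship graph (simplified, undirected sketch)."""
    # Add self-loops and symmetrically normalize: A_hat = D^-1/2 (A + I) D^-1/2
    a = adj + np.eye(adj.shape[0])
    d = np.diag(1.0 / np.sqrt(a.sum(axis=1)))
    a_hat = d @ a @ d
    # Propagate, project, and apply ReLU
    return np.maximum(a_hat @ regions @ weight, 0.0)

# Toy example: 4 detected regions with 8-dim features and a
# hypothetical relationship graph (1 = regions are related)
rng = np.random.default_rng(0)
regions = rng.standard_normal((4, 8))
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 0],
                [0, 1, 0, 0]], dtype=float)
weight = rng.standard_normal((8, 8))
refined = gcn_refine(regions, adj, weight)
print(refined.shape)  # (4, 8): one refined feature per region
```

In the full model, the refined region features would replace the raw detector features as the inputs attended over by the LSTM decoder at each generation step.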
Pages: 711 - 727
Page count: 17
Related Papers (50 total)
  • [21] Towards local visual modeling for image captioning
    Ma, Yiwei
    Ji, Jiayi
    Sun, Xiaoshuai
    Zhou, Yiyi
    Ji, Rongrong
    PATTERN RECOGNITION, 2023, 138
  • [22] Image Captioning Based on Visual and Semantic Attention
    Wei, Haiyang
    Li, Zhixin
    Zhang, Canlong
    MULTIMEDIA MODELING (MMM 2020), PT I, 2020, 11961 : 151 - 162
  • [23] Image captioning improved visual question answering
    Sharma, Himanshu
    Jalal, Anand Singh
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (24) : 34775 - 34796
  • [24] Improving Visual Question Answering by Image Captioning
    Shao, Xiangjun
    Dong, Hongsong
    Wu, Guangsheng
    IEEE ACCESS, 2025, 13 : 46299 - 46311
  • [25] Dual Position Relationship Transformer for Image Captioning
    Wang, Yaohan
    Qian, Wenhua
    Nie, Rencan
    Xu, Dan
    Cao, Jinde
    Kim, Pyoungwon
    BIG DATA, 2022, 10 (06) : 515 - 527
  • [26] Exploring Diverse In-Context Configurations for Image Captioning
    Yang, Xu
    Wu, Yongliang
    Yang, Mingzhuo
    Chen, Haokun
    Geng, Xin
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [27] EXPLORING DUAL STREAM GLOBAL INFORMATION FOR IMAGE CAPTIONING
    Xian, Tiantao
    Li, Zhixin
    Chen, Tianyu
    Ma, Huifang
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4458 - 4462
  • [28] Exploring region features in remote sensing image captioning
    Zhao, Kai
    Xiong, Wei
    INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2024, 127
  • [29] Exploring the Impact of Vision Features in News Image Captioning
    Zhang, Junzhe
    Wan, Xiaojun
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 12923 - 12936
  • [30] Exploring Data and Models in SAR Ship Image Captioning
    Zhao, Kai
    Xiong, Wei
    IEEE ACCESS, 2022, 10 : 91150 - 91159