Exploring Visual Relationship for Image Captioning

Cited by: 582
Authors
Yao, Ting [1 ]
Pan, Yingwei [1 ]
Li, Yehao [2 ]
Mei, Tao [1 ]
Affiliations
[1] JD AI Res, Beijing, Peoples R China
[2] Sun Yat Sen Univ, Guangzhou, Peoples R China
Source
Keywords
Image captioning; Graph convolutional networks; Visual relationship; Long short-term memory
DOI
10.1007/978-3-030-01264-9_42
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
It is widely believed that modeling the relationships between objects helps in representing and ultimately describing an image. Nevertheless, there has been little evidence supporting this idea in image description generation. In this paper, we introduce a new design that explores the connections between objects for image captioning within an attention-based encoder-decoder framework. Specifically, we present a Graph Convolutional Networks plus Long Short-Term Memory (GCN-LSTM) architecture that integrates both semantic and spatial object relationships into the image encoder. Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections. The representation of each object region proposal is then refined by leveraging the graph structure through GCN. With the learnt region-level features, our GCN-LSTM capitalizes on an LSTM-based captioning framework with an attention mechanism for sentence generation. Extensive experiments are conducted on the COCO image captioning dataset, and superior results are reported in comparison with state-of-the-art approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on the COCO testing set.
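The core encoding step described in the abstract, refining each detected region's feature by passing messages along the edges of a relationship graph, can be sketched as a single GCN layer. This is a minimal illustrative sketch, not the authors' exact formulation: the function name `gcn_refine`, the use of a single shared neighbor weight matrix, and all dimensions are assumptions for demonstration.

```python
import numpy as np

def gcn_refine(regions, adjacency, W, W_nbr, b):
    """One relation-aware GCN layer (illustrative sketch).

    regions   : (N, d) array of detected-region features
    adjacency : (N, N) 0/1 matrix of semantic/spatial relationship edges
    W, W_nbr  : (d, d) weight matrices for self and neighbor transforms
    b         : (d,) bias
    Returns a (N, d) array of refined region features.
    """
    self_msg = regions @ W                           # transform each region itself
    nbr_msg = adjacency @ (regions @ W_nbr)          # sum messages from graph neighbors
    return np.maximum(0.0, self_msg + nbr_msg + b)   # ReLU non-linearity

# Toy example: 5 detected regions with 8-dim features and a random sparse graph.
rng = np.random.default_rng(0)
N, d = 5, 8
regions = rng.standard_normal((N, d))
adjacency = (rng.random((N, N)) < 0.3).astype(float)
np.fill_diagonal(adjacency, 0.0)                     # self term is handled by W
W = rng.standard_normal((d, d)) * 0.1
W_nbr = rng.standard_normal((d, d)) * 0.1
b = np.zeros(d)

refined = gcn_refine(regions, adjacency, W, W_nbr, b)
print(refined.shape)  # (5, 8): one refined feature vector per region
```

The refined region features would then feed an attention-weighted LSTM decoder at each generation step, per the framework outlined in the abstract.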
Pages: 711-727
Page count: 17
Related Papers
50 records total
  • [31] RVAIC: Refined visual attention for improved image captioning
    Al-Qatf, Majjed
    Hawbani, Ammar
    Wang, XingFu
    Abdusallam, Amr
    Alsamhi, Saeed
    Alhabib, Mohammed
    Curry, Edward
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2024, 46 (02) : 3447 - 3459
  • [34] Image captioning in Bengali language using visual attention
    Masud, Adiba
    Hosen, Md. Biplob
    Habibullah, Md.
    Anannya, Mehrin
    Kaiser, M. Shamim
    PLOS ONE, 2025, 20 (02)
  • [35] Image Captioning With Visual-Semantic Double Attention
    He, Chen
    Hu, Haifeng
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2019, 15 (01)
  • [36] Image Captioning with Text-Based Visual Attention
    He, Chen
    Hu, Haifeng
    NEURAL PROCESSING LETTERS, 2019, 49 (01) : 177 - 185
  • [37] VISUAL SALIENCY FOR IMAGE CAPTIONING IN NEW MULTIMEDIA SERVICES
    Cornia, Marcella
    Baraldi, Lorenzo
    Serra, Giuseppe
    Cucchiara, Rita
    2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2017
  • [38] A visual question answering model based on image captioning
    Zhou, Kun
    Liu, Qiongjie
    Zhao, Dexin
    MULTIMEDIA SYSTEMS, 2024, 30 (06)
  • [40] DIFNet: Boosting Visual Information Flow for Image Captioning
    Wu, Mingrui
    Zhang, Xuying
    Sun, Xiaoshuai
    Zhou, Yiyi
    Chen, Chao
    Gu, Jiaxin
    Sun, Xing
    Ji, Rongrong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17999 - 18008