High-Order Interaction Learning for Image Captioning

Cited by: 68
|
Authors
Wang, Yanhui [1 ]
Xu, Ning [1 ]
Liu, An-An [1 ]
Li, Wenhui [1 ]
Zhang, Yongdong [2 ]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[2] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230052, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Semantics; Feature extraction; Decoding; Task analysis; Ions; Encoding; Image captioning; high-order interaction; encoder-decoder framework;
DOI
10.1109/TCSVT.2021.3121062
Chinese Library Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Codes
0808 ; 0809 ;
Abstract
Image captioning aims to understand various semantic concepts (e.g., objects and relationships) in an image and integrate them into a sentence-level description, so it is necessary to learn the interactions among these concepts. If we define the context of an interaction as a subject-predicate-object triplet, most current methods focus only on a single triplet, i.e., the first-order interaction, to generate sentences. Intuitively, humans can perceive high-order interactions among concepts across two or more triplets when describing an image. For example, upon seeing the triplets man-cutting-sandwich and man-with-knife, it is natural to integrate them and predict the sentence "man cutting sandwich with knife". This depends on the high-order interaction between cutting and knife across different triplets. Exploiting high-order interactions is therefore expected to benefit image captioning and visual reasoning. In this paper, we introduce a novel high-order interaction learning method over detected objects and relationships for image captioning under the umbrella of the encoder-decoder framework. We first extract a set of object and relationship features from an image. During the encoding stage, the proposed interactive refining network learns high-order representations by modeling intra- and inter-object feature interactions in a self-attention fashion. During the decoding stage, the proposed interactive fusion network integrates object and relationship information by strengthening their high-order interactions based on the language context for sentence generation. In this way, we learn object-relationship dependencies at different stages, which provides abundant cues for both visual understanding and caption generation. Extensive experiments show that the proposed method achieves competitive performance against state-of-the-art methods on the MSCOCO dataset, and additional ablation studies further validate its effectiveness.
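The abstract's encoding stage refines object features by letting every feature attend to all the others via self-attention. As a rough illustration only, the sketch below implements generic scaled dot-product self-attention over a set of feature vectors with NumPy; the projection matrices `Wq`, `Wk`, `Wv` are hypothetical stand-ins for learned parameters, and this does not reproduce the paper's actual interactive refining network.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_refine(feats, Wq, Wk, Wv):
    """Refine a set of feature vectors (rows of `feats`, shape [n, d])
    by one round of scaled dot-product self-attention, so that each
    feature is updated from its interactions with all the others.
    Wq/Wk/Wv are hypothetical learned projection matrices."""
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # [n, n] pairwise interaction weights
    return attn @ V                       # each row is a weighted mix of all features
```

In the paper's setting, the rows would be detected object (or relationship) features, and stacking such layers lets information from several triplets mix, which is one way to realize the high-order interactions the abstract describes.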
Pages: 4417 - 4430
Page count: 14
Related Papers
50 records
  • [1] High-Order and Interactive Perceptual Feature Learning for Medical Image Retargeting
    Ma, Mingjuan
    Zhang, Yuehong
    IEEE ACCESS, 2025, 13 : 55358 - 55369
  • [2] Deep Learning for High-Order Drug-Drug Interaction Prediction
    Peng, Bo
    Ning, Xia
    ACM-BCB'19: PROCEEDINGS OF THE 10TH ACM INTERNATIONAL CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND HEALTH INFORMATICS, 2019, : 197 - 206
  • [3] Region-Aware Image Captioning via Interaction Learning
    Liu, An-An
    Zhai, Yingchen
    Xu, Ning
    Nie, Weizhi
    Li, Wenhui
    Zhang, Yongdong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (06) : 3685 - 3696
  • [4] Probing Synergistic High-Order Interaction for Multi-Modal Image Fusion
    Zhou, Man
    Zheng, Naishan
    He, Xuanhua
    Hong, Danfeng
    Chanussot, Jocelyn
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2025, 47 (02) : 840 - 857
  • [5] HIGH-ORDER CORRECTIONS TO THE IMAGE POTENTIAL
    ZHENG, XY
    RITCHIE, RH
    MANSON, JR
    PHYSICAL REVIEW B, 1989, 39 (18): 13510 - 13513
  • [6] Optimal learning high-order Markov random fields priors of colour image
    Zhang, Ke
    Jin, Huidong
    Fu, Zhouyu
    Liu, Nianjun
    COMPUTER VISION - ACCV 2007, PT I, PROCEEDINGS, 2007, 4843 : 482 - 491
  • [7] High-Order Distance-Based Multiview Stochastic Learning in Image Classification
    Yu, Jun
    Rui, Yong
    Tang, Yuan Yan
    Tao, Dacheng
    IEEE TRANSACTIONS ON CYBERNETICS, 2014, 44 (12) : 2431 - 2442
  • [8] Contrastive Learning for Image Captioning
    Dai, Bo
    Lin, Dahua
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [9] Learning to Evaluate Image Captioning
    Cui, Yin
    Yang, Guandao
    Veit, Andreas
    Huang, Xun
    Belongie, Serge
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 5804 - 5812
  • [10] Meta Learning for Image Captioning
    Li, Nannan
    Chen, Zhenzhong
    Liu, Shan
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 8626 - 8633