Image-Text Surgery: Efficient Concept Learning in Image Captioning by Generating Pseudopairs

Cited by: 16

Authors:

Fu, Kun [1,2,3]

Li, Jin [1,2,3]

Jin, Junqi [1,2,3]

Zhang, Changshui [1,2,3]

Affiliations:

[1] Tsinghua Univ, Dept Automat, Beijing 100084, Peoples R China

[2] Tsinghua Univ, State Key Lab Intelligent Technol & Syst, Beijing 100084, Peoples R China

[3] Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China

Source:

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS | 2018, Vol. 29, No. 12

Funding:

Beijing Natural Science Foundation;

Keywords:

Image captioning; novel concept; pseudodata; visual attention;

DOI:

10.1109/TNNLS.2018.2813306

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Image captioning aims to generate natural language sentences to describe the salient parts of a given image. Although neural networks have recently achieved promising results, a key problem is that they can only describe concepts seen in the training image-sentence pairs. Efficient learning of novel concepts has thus been a topic of recent interest to alleviate the expensive manpower of labeling data. In this paper, we propose a novel method, Image-Text Surgery, to synthesize pseudoimage-sentence pairs. The pseudopairs are generated under the guidance of a knowledge base, with syntax from a seed data set (i.e., MSCOCO) and visual information from an existing large-scale image base (i.e., ImageNet). Via pseudodata, the captioning model learns novel concepts without any corresponding human-labeled pairs. We further introduce adaptive visual replacement, which adaptively filters unnecessary visual features in pseudodata with an attention mechanism. We evaluate our approach on a held-out subset of the MSCOCO data set. The experimental results demonstrate that the proposed approach provides significant performance improvements over state-of-the-art methods in terms of F1 score and sentence quality. An ablation study and the qualitative results further validate the effectiveness of our approach.
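The abstract reports an F1 score without defining it. As a minimal sketch, assuming the usual convention in novel-concept captioning (a caption counts as a positive prediction when it mentions the concept word, and the ground truth is whether the image actually contains that concept), the metric could be computed as follows. The function name `concept_f1` and the whitespace tokenization are illustrative assumptions, not taken from the paper.

```python
def concept_f1(captions, has_concept, word):
    """F1 for mentioning `word` in generated captions (hypothetical helper).

    captions:    list of generated caption strings
    has_concept: list of bools, True if the image really contains the concept
    word:        the novel concept word, lowercase
    """
    tp = fp = fn = 0
    for caption, present in zip(captions, has_concept):
        # Assumption: simple lowercase whitespace tokenization.
        mentioned = word in caption.lower().split()
        if mentioned and present:
            tp += 1  # concept present and described
        elif mentioned and not present:
            fp += 1  # concept described but absent from the image
        elif present:
            fn += 1  # concept present but not described
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


captions = [
    "a zebra stands in a field",
    "a horse in a field",
    "a zebra near a tree",
    "a dog on grass",
]
has_concept = [True, True, False, False]
score = concept_f1(captions, has_concept, "zebra")
```

In this toy input there is one true positive, one false positive, and one false negative, so precision and recall are both 0.5.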


Pages: 5910-5921

Page count: 12

