Image-Text Surgery: Efficient Concept Learning in Image Captioning by Generating Pseudopairs

Cited by: 16
Authors
Fu, Kun [1, 2, 3]
Li, Jin [1, 2, 3]
Jin, Junqi [1, 2, 3]
Zhang, Changshui [1, 2, 3]
Affiliations
[1] Tsinghua University, Department of Automation, Beijing 100084, People's Republic of China
[2] Tsinghua University, State Key Laboratory of Intelligent Technology and Systems, Beijing 100084, People's Republic of China
[3] Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, People's Republic of China
Funding
Beijing Natural Science Foundation
Keywords
Image captioning; novel concept; pseudodata; visual attention
DOI
10.1109/TNNLS.2018.2813306
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Image captioning aims to generate natural language sentences to describe the salient parts of a given image. Although neural networks have recently achieved promising results, a key problem is that they can only describe concepts seen in the training image-sentence pairs. Efficient learning of novel concepts has thus been a topic of recent interest to alleviate the expensive manpower of labeling data. In this paper, we propose a novel method, Image-Text Surgery, to synthesize pseudoimage-sentence pairs. The pseudopairs are generated under the guidance of a knowledge base, with syntax from a seed data set (i.e., MSCOCO) and visual information from an existing large-scale image base (i.e., ImageNet). Via pseudodata, the captioning model learns novel concepts without any corresponding human-labeled pairs. We further introduce adaptive visual replacement, which adaptively filters unnecessary visual features in pseudodata with an attention mechanism. We evaluate our approach on a held-out subset of the MSCOCO data set. The experimental results demonstrate that the proposed approach provides significant performance improvements over state-of-the-art methods in terms of F1 score and sentence quality. An ablation study and the qualitative results further validate the effectiveness of our approach.
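The abstract gives no implementation detail, so the following is only a toy sketch of one plausible reading of the text side of pseudopair generation: a knowledge base links a novel concept to a related seen concept, seed captions mentioning the seen concept are rewritten, and each rewritten caption is paired with an image of the novel concept. The names `KNOWLEDGE_BASE`, `SEED_CAPTIONS`, `IMAGE_BASE`, and `make_pseudopairs`, along with all data values, are hypothetical and not taken from the paper; the image-side processing and adaptive visual replacement are not modeled.

```python
import re

# Hypothetical toy data; none of these names or values come from the paper.
KNOWLEDGE_BASE = {"zebra": "horse"}  # novel concept -> related seen concept
SEED_CAPTIONS = [
    "a horse standing in a grassy field",
    "two people riding a horse on the beach",
    "a dog lying on a couch",
]
IMAGE_BASE = {"zebra": ["zebra_0001.jpg", "zebra_0002.jpg"]}


def make_pseudopairs(novel, knowledge_base, seed_captions, image_base):
    """Return (image_id, caption) pseudopairs for one novel concept."""
    seen = knowledge_base[novel]
    # Match the seen concept as a whole word only.
    pattern = re.compile(r"\b" + re.escape(seen) + r"\b")
    # Keep only captions that mention the seen concept, then swap the word.
    captions = [pattern.sub(novel, c) for c in seed_captions if pattern.search(c)]
    images = image_base[novel]
    # Cycle through the available images so every caption gets one.
    return [(images[i % len(images)], c) for i, c in enumerate(captions)]


pairs = make_pseudopairs("zebra", KNOWLEDGE_BASE, SEED_CAPTIONS, IMAGE_BASE)
for image_id, caption in pairs:
    print(image_id, "|", caption)
```

In this sketch the caption about the dog is skipped because it does not mention the seen concept, so only captions whose syntax can be reused contribute pseudopairs.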
Pages: 5910-5921
Number of pages: 12