Towards Unsupervised Image Captioning with Shared Multimodal Embeddings

Cited by: 62
Authors:
Laina, Iro [1]
Rupprecht, Christian [2]
Navab, Nassir [1]
Affiliations:
[1] Tech Univ Munich, Munich, Germany
[2] Univ Oxford, Oxford, England
DOI: 10.1109/ICCV.2019.00751
CLC classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract:
Understanding images without explicit supervision has become an important problem in computer vision. In this paper, we address image captioning by generating language descriptions of scenes without learning from annotated pairs of images and their captions. The core component of our approach is a shared latent space that is structured by visual concepts. In this space, the two modalities should be indistinguishable. A language model is first trained to encode sentences into semantically structured embeddings. Image features translated into this embedding space can then be decoded into descriptions through the same language model, just as sentence embeddings are. This translation is learned from weakly paired images and text using a loss that is robust to noisy assignments, together with a conditional adversarial component. Our approach allows us to exploit large text corpora outside the annotated distributions of image/caption data. Our experiments show that the proposed domain alignment learns a semantically meaningful representation that outperforms previous work.
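The core idea in the abstract — mapping image features into a shared sentence-embedding space where they can be treated like sentence embeddings — can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the linear map `W`, `b` stands in for the learned translation, and decoding is simplified here to nearest-neighbour retrieval over a small candidate set, whereas the paper decodes embeddings with the shared language model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def translate_image_feature(feat, W, b):
    """Hypothetical learned linear map taking an image feature
    into the shared sentence-embedding space."""
    return [sum(w * x for w, x in zip(row, feat)) + bj
            for row, bj in zip(W, b)]

def retrieve_caption(feat, W, b, sentence_embeddings, captions):
    """Project the image feature into the shared space and return the
    caption whose sentence embedding is closest by cosine similarity.
    (Stand-in for decoding the embedding with the language model.)"""
    z = translate_image_feature(feat, W, b)
    best = max(range(len(captions)),
               key=lambda i: cosine(z, sentence_embeddings[i]))
    return captions[best]

# Toy 2-D example: identity map, two candidate sentence embeddings.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
embs = [[1.0, 0.1], [0.1, 1.0]]
caps = ["a dog on grass", "a city skyline"]
print(retrieve_caption([0.9, 0.2], W, b, embs, caps))  # closest to the first embedding
```

In the paper the alignment is learned (robust loss plus a conditional adversarial term) rather than fixed, but the inference-time flow — image feature, translation into the shared space, decoding as if it were a sentence embedding — follows this shape.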
Pages: 7413-7423 (11 pages)
Related papers (50 in total):
  • [1] Unsupervised Image Captioning
    Feng, Yang
    Ma, Lin
    Liu, Wei
    Luo, Jiebo
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 4120 - 4129
  • [2] SUNIT: multimodal unsupervised image-to-image translation with shared encoder
    Lin, Liyuan
    Ji, Shulin
    Zhou, Yuan
    Zhang, Shun
    [J]. JOURNAL OF ELECTRONIC IMAGING, 2022, 31 (01)
  • [3] Towards Personalized Image Captioning via Multimodal Memory Networks
    Park, Cesc Chunseong
    Kim, Byeongchang
    Kim, Gunhee
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2019, 41 (04) : 999 - 1012
  • [4] Unsupervised Style Control for Image Captioning
    Tian, Junyu
    Yang, Zhikun
    Shi, Shumin
    [J]. DATA SCIENCE (ICPCSEE 2022), PT I, 2022, 1628 : 413 - 424
  • [5] Multimodal Image Captioning for Marketing Analysis
    Harzig, Philipp
    Brehm, Stephan
    Lienhart, Rainer
    Kaiser, Carolin
    Schallner, Rene
    [J]. IEEE 1ST CONFERENCE ON MULTIMEDIA INFORMATION PROCESSING AND RETRIEVAL (MIPR 2018), 2018, : 158 - 161
  • [6] MMT: A Multimodal Translator for Image Captioning
    Liu, Chang
    Sun, Fuchun
    Wang, Changhu
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, PT II, 2017, 10614 : 784 - 784
  • [7] Improving multimodal datasets with image captioning
    Thao Nguyen
    Gadre, Samir Yitzhak
    Ilharco, Gabriel
    Oh, Sewoong
    Schmidt, Ludwig
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [8] A multimodal fusion approach for image captioning
    Zhao, Dexin
    Chang, Zhi
    Guo, Shutao
    [J]. NEUROCOMPUTING, 2019, 329 : 476 - 485
  • [9] Object-Centric Unsupervised Image Captioning
    Meng, Zihang
    Yang, David
    Cao, Xuefei
    Shah, Ashish
    Lim, Ser-Nam
    [J]. COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 219 - 235
  • [10] Removing Partial Mismatches in Unsupervised Image Captioning
    Honda, Ukyo
    Hashimoto, Atsushi
    Watanabe, Taro
    Matsumoto, Yuji
    [J]. Transactions of the Japanese Society for Artificial Intelligence, 2022, 37 (02)