Cross-Domain Image Captioning via Cross-Modal Retrieval and Model Adaptation

Cited by: 34
Authors
Zhao, Wentian [1 ]
Wu, Xinxiao [1 ]
Luo, Jiebo [2 ]
Affiliations
[1] Beijing Inst Technol, Media Comp & Intelligent Syst Lab, Beijing 100081, Peoples R China
[2] Univ Rochester, Dept Comp Sci, Rochester, NY 14627 USA
Keywords
Adaptation models; Task analysis; Visualization; Computational modeling; Linguistics; Semantics; Image segmentation; Cross-domain image captioning; cross-modal retrieval; model adaptation;
DOI
10.1109/TIP.2020.3042086
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, large-scale datasets of paired images and sentences have enabled remarkable success in automatically generating descriptions for images, namely image captioning. However, it is labour-intensive and time-consuming to collect a sufficient number of paired images and sentences in each domain. It may therefore be beneficial to transfer an image captioning model trained in an existing domain with pairs of images and sentences (i.e., the source domain) to a new domain with only unpaired data (i.e., the target domain). In this paper, we propose a cross-modal retrieval aided approach to cross-domain image captioning that leverages a cross-modal retrieval model to generate pseudo pairs of images and sentences in the target domain, facilitating the adaptation of the captioning model. To learn the correlation between images and sentences in the target domain, we propose an iterative cross-modal retrieval process: a cross-modal retrieval model is first pre-trained on the source domain data and then applied to the target domain data to acquire an initial set of pseudo image-sentence pairs. These pseudo pairs are further refined by iteratively fine-tuning the retrieval model on the pseudo pairs and updating the pseudo pairs with the refined retrieval model. To make the linguistic patterns learned in the source domain adapt well to the target domain, we propose an adaptive image captioning model with a self-attention mechanism, fine-tuned on the refined pseudo image-sentence pairs. Experimental results on several settings where MSCOCO is used as the source domain and five different datasets (Flickr30k, TGIF, CUB-200, Oxford-102 and Conceptual) are used as the target domains demonstrate that our method achieves better or comparable performance compared with state-of-the-art methods.
We also extend our method to cross-domain video captioning where MSR-VTT is used as the source domain and two other datasets (MSVD and Charades Captions) are used as the target domains to further demonstrate the effectiveness of our method.
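The iterative refinement procedure described in the abstract (pre-train retrieval, build pseudo pairs, fine-tune on them, re-retrieve) can be sketched as a minimal loop. The toy embeddings, cosine-similarity retrieval, and the gradient-free "fine-tuning" update below are illustrative assumptions for exposition, not the paper's actual models:

```python
import numpy as np

def retrieve_pseudo_pairs(img_emb, sent_emb):
    """Pair each image with its highest-scoring sentence by cosine similarity."""
    img_n = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    sent_n = sent_emb / np.linalg.norm(sent_emb, axis=1, keepdims=True)
    sim = img_n @ sent_n.T          # (num_images, num_sentences) similarity matrix
    return sim.argmax(axis=1)       # index of the best sentence per image

def iterative_refinement(img_emb, sent_emb, steps=5, lr=0.5):
    """Alternate between updating pseudo pairs and 'fine-tuning' the retrieval model.

    The fine-tuning step is simulated here by pulling each image embedding
    toward its currently paired sentence embedding; in the paper this role is
    played by fine-tuning a learned cross-modal retrieval network.
    """
    pairs = retrieve_pseudo_pairs(img_emb, sent_emb)
    for _ in range(steps):
        # stand-in for fine-tuning on the current pseudo pairs
        img_emb = img_emb + lr * (sent_emb[pairs] - img_emb)
        # update the pseudo pairs with the refined retrieval model
        pairs = retrieve_pseudo_pairs(img_emb, sent_emb)
    return pairs

# toy target-domain data: 4 image embeddings, 6 sentence embeddings
rng = np.random.default_rng(0)
pairs = iterative_refinement(rng.normal(size=(4, 8)), rng.normal(size=(6, 8)))
```

The resulting pseudo pairs would then serve as supervision for fine-tuning the captioning model in the target domain.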
Pages: 1180 - 1192
Page count: 13
Related Papers
(50 total)
  • [21] Applying an Embodied Cognition Perspective to Cross-Modal and Cross-Domain Color Associations
    Loeffler, Diana
    INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2016, 51 : 1135 - 1135
  • [22] Domain Invariant Subspace Learning for Cross-Modal Retrieval
    Liu, Chenlu
    Xu, Xing
    Yang, Yang
    Lu, Huimin
    Shen, Fumin
    Ji, Yanli
    MULTIMEDIA MODELING, MMM 2018, PT II, 2018, 10705 : 94 - 105
  • [23] Cross-domain image retrieval: methods and applications
    Zhou, Xiaoping
    Han, Xiangyu
    Li, Haoran
    Wang, Jia
    Liang, Xun
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2022, 11 (03) : 199 - 218
  • [24] Survey on clothing image retrieval with cross-domain
    Chen Ning
    Yang Di
    Li Menglu
    COMPLEX & INTELLIGENT SYSTEMS, 2022, 8 (06) : 5531 - 5544
  • [25] Semi-supervised cross-modal learning for cross modal retrieval and image annotation
    Zou, Fuhao
    Bai, Xingqiang
    Luan, Chaoyang
    Li, Kai
    Wang, Yunfei
    Ling, Hefei
    WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2019, 22 (02) : 825 - 841
  • [29] Cross-Domain Image Retrieval with Attention Modeling
    Ji, Xin
    Wang, Wei
    Zhang, Meihui
    Yang, Yang
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1654 - 1662
  • [30] DADRnet: Cross-domain image dehazing via domain adaptation and disentangled representation
    Li, Xiaopeng
    Yu, Hu
    Zhao, Chen
    Fan, Cien
    Zou, Lian
    NEUROCOMPUTING, 2023, 544