Cross-Domain Image Captioning via Cross-Modal Retrieval and Model Adaptation

Cited by: 34
Authors
Zhao, Wentian [1]
Wu, Xinxiao [1]
Luo, Jiebo [2]
Affiliations
[1] Beijing Inst Technol, Media Comp & Intelligent Syst Lab, Beijing 100081, Peoples R China
[2] Univ Rochester, Dept Comp Sci, Rochester, NY 14627 USA
Keywords
Adaptation models; Task analysis; Visualization; Computational modeling; Linguistics; Semantics; Image segmentation; Cross-domain image captioning; cross-modal retrieval; model adaptation
DOI
10.1109/TIP.2020.3042086
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, large-scale datasets of paired images and sentences have enabled remarkable success in automatically generating descriptions for images, a task known as image captioning. However, collecting a sufficient number of paired images and sentences in each domain is labour-intensive and time-consuming. It can therefore be beneficial to transfer an image captioning model trained in an existing domain with paired images and sentences (i.e., the source domain) to a new domain with only unpaired data (i.e., the target domain). In this paper, we propose a cross-modal retrieval aided approach to cross-domain image captioning that leverages a cross-modal retrieval model to generate pseudo pairs of images and sentences in the target domain, which facilitate the adaptation of the captioning model. To learn the correlation between images and sentences in the target domain, we propose an iterative cross-modal retrieval process in which a cross-modal retrieval model is first pre-trained on the source domain data and then applied to the target domain data to acquire an initial set of pseudo image-sentence pairs. These pairs are then refined by alternately fine-tuning the retrieval model on the current pseudo pairs and updating the pseudo pairs with the fine-tuned retrieval model. To make the linguistic patterns of sentences learned in the source domain adapt well to the target domain, we propose an adaptive image captioning model with a self-attention mechanism that is fine-tuned on the refined pseudo image-sentence pairs. Experimental results in several settings, where MSCOCO is used as the source domain and five different datasets (Flickr30k, TGIF, CUB-200, Oxford-102 and Conceptual) are used as the target domains, demonstrate that our method achieves performance that is mostly better than, or comparable to, that of state-of-the-art methods. We also extend our method to cross-domain video captioning, where MSR-VTT is used as the source domain and two other datasets (MSVD and Charades Captions) are used as the target domains, to further demonstrate its effectiveness.
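The iterative retrieval loop described in the abstract (pre-train on source pairs, retrieve initial pseudo pairs in the target domain, then alternate between fine-tuning and re-retrieving) can be illustrated with a toy sketch. This is not the authors' model: the "retrieval model" here is assumed to be a simple ridge-regression map between pre-computed image and sentence feature vectors, "fine-tuning" is approximated by refitting on source pairs plus the current pseudo pairs, and all function names (`fit_linear_map`, `retrieve`, `iterative_pseudo_pairs`) are hypothetical.

```python
import numpy as np


def fit_linear_map(img_feats, sent_feats, ridge=1e-3):
    """Toy stand-in for a retrieval model: a ridge-regression map
    from image feature space to sentence feature space."""
    d = img_feats.shape[1]
    gram = img_feats.T @ img_feats + ridge * np.eye(d)
    return np.linalg.solve(gram, img_feats.T @ sent_feats)


def retrieve(W, img_feats, sent_feats):
    """For each image, return the index of the sentence with the
    highest cosine similarity to the projected image feature."""
    proj = img_feats @ W
    proj = proj / (np.linalg.norm(proj, axis=1, keepdims=True) + 1e-12)
    sents = sent_feats / (np.linalg.norm(sent_feats, axis=1, keepdims=True) + 1e-12)
    return np.argmax(proj @ sents.T, axis=1)


def iterative_pseudo_pairs(src_imgs, src_sents, tgt_imgs, tgt_sents, n_rounds=3):
    """Alternate between updating the retrieval model and the pseudo pairs.

    Assumption: refitting on source pairs plus current pseudo pairs
    plays the role of fine-tuning; the paper's procedure may differ.
    """
    # Step 1: pre-train on paired source-domain data.
    W = fit_linear_map(src_imgs, src_sents)
    # Step 2: initial pseudo pairs for the unpaired target domain.
    pairs = retrieve(W, tgt_imgs, tgt_sents)
    for _ in range(n_rounds):
        # Step 3: update the model using the current pseudo pairs.
        imgs = np.vstack([src_imgs, tgt_imgs])
        sents = np.vstack([src_sents, tgt_sents[pairs]])
        W = fit_linear_map(imgs, sents)
        # Step 4: update the pseudo pairs using the updated model.
        pairs = retrieve(W, tgt_imgs, tgt_sents)
    return W, pairs


# Usage with random features (purely illustrative data).
rng = np.random.default_rng(0)
src_imgs, src_sents = rng.normal(size=(50, 8)), rng.normal(size=(50, 6))
tgt_imgs, tgt_sents = rng.normal(size=(20, 8)), rng.normal(size=(30, 6))
W, pairs = iterative_pseudo_pairs(src_imgs, src_sents, tgt_imgs, tgt_sents)
```

In this sketch `pairs[i]` is the index of the target-domain sentence currently assigned to target image `i`; a captioning model could then be fine-tuned on those pseudo pairs, which is the second stage the abstract describes.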
Pages: 1180-1192
Number of pages: 13