Cross-Domain Image Captioning via Cross-Modal Retrieval and Model Adaptation

Cited by: 34

Authors:

Zhao, Wentian [1]

Wu, Xinxiao [1]

Luo, Jiebo [2]

Affiliations:

[1] Beijing Inst Technol, Media Comp & Intelligent Syst Lab, Beijing 100081, Peoples R China

[2] Univ Rochester, Dept Comp Sci, Rochester, NY 14627 USA

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2021 / Vol. 30

Keywords:

Adaptation models; Task analysis; Visualization; Computational modeling; Linguistics; Semantics; Image segmentation; Cross-domain image captioning; cross-modal retrieval; model adaptation;

DOI:

10.1109/TIP.2020.3042086

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In recent years, large-scale datasets of paired images and sentences have enabled remarkable success in automatically generating descriptions for images, namely image captioning. However, it is labour-intensive and time-consuming to collect a sufficient number of paired images and sentences in each domain. It may be beneficial to transfer the image captioning model trained in an existing domain with pairs of images and sentences (i.e., source domain) to a new domain with only unpaired data (i.e., target domain). In this paper, we propose a cross-modal retrieval-aided approach to cross-domain image captioning that leverages a cross-modal retrieval model to generate pseudo pairs of images and sentences in the target domain to facilitate the adaptation of the captioning model. To learn the correlation between images and sentences in the target domain, we propose an iterative cross-modal retrieval process where a cross-modal retrieval model is first pre-trained using the source domain data and then applied to the target domain data to acquire an initial set of pseudo image-sentence pairs. The pseudo image-sentence pairs are further refined by iteratively fine-tuning the retrieval model with the pseudo image-sentence pairs and updating the pseudo image-sentence pairs using the retrieval model. To make the linguistic patterns of the sentences learned in the source domain adapt well to the target domain, we propose an adaptive image captioning model with a self-attention mechanism fine-tuned using the refined pseudo image-sentence pairs. Experimental results on several settings where MSCOCO is used as the source domain and five different datasets (Flickr30k, TGIF, CUB-200, Oxford-102 and Conceptual) are used as the target domains demonstrate that our method achieves performance that is mostly better than or comparable to that of the state-of-the-art methods. We also extend our method to cross-domain video captioning where MSR-VTT is used as the source domain and two other datasets (MSVD and Charades Captions) are used as the target domains to further demonstrate the effectiveness of our method.
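
The iterative retrieval process described in the abstract (retrieve pseudo pairs, fine-tune on them, retrieve again) can be illustrated with a toy loop. This is only a sketch under strong assumptions, not the authors' implementation: images and sentences are taken to be precomputed embedding vectors, the "retrieval model" is reduced to a single hypothetical offset vector added to image embeddings, and `fine_tune` is a stand-in that nudges this offset toward the currently paired sentences. The names `retrieve_pairs`, `fine_tune`, and `iterative_retrieval` are illustrative and do not come from the paper.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_pairs(image_vecs, sentence_vecs, offset):
    """For each image, return the index of its most similar sentence.

    The toy 'retrieval model' is just `offset`, added to every image vector.
    """
    pairs = []
    for img in image_vecs:
        shifted = [a + b for a, b in zip(img, offset)]
        best = max(range(len(sentence_vecs)),
                   key=lambda j: cosine(shifted, sentence_vecs[j]))
        pairs.append(best)
    return pairs


def fine_tune(image_vecs, sentence_vecs, pairs, offset, lr=0.5):
    """Toy stand-in for fine-tuning: move the offset toward the mean residual
    between each paired sentence vector and its shifted image vector."""
    dim = len(offset)
    step = [0.0] * dim
    for img, j in zip(image_vecs, pairs):
        for d in range(dim):
            step[d] += sentence_vecs[j][d] - (img[d] + offset[d])
    return [offset[d] + lr * step[d] / len(image_vecs) for d in range(dim)]


def iterative_retrieval(image_vecs, sentence_vecs, offset, max_rounds=10):
    """Alternate between fine-tuning on pseudo pairs and re-retrieving them,
    stopping early once the pseudo pairs no longer change."""
    # `offset` plays the role of a model pre-trained on the source domain.
    pairs = retrieve_pairs(image_vecs, sentence_vecs, offset)
    for _ in range(max_rounds):
        offset = fine_tune(image_vecs, sentence_vecs, pairs, offset)
        new_pairs = retrieve_pairs(image_vecs, sentence_vecs, offset)
        if new_pairs == pairs:
            break
        pairs = new_pairs
    return pairs, offset
```

In a real system the offset would be replaced by a learned image-sentence embedding model, and the refined pairs would then be used to fine-tune the captioning model; the stopping rule used here (pairs unchanged) is likewise an assumption made for the sketch.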

Pages: 1180-1192

Number of pages: 13

50 items in total

[1] Cross-Domain Transfer Hashing for Efficient Cross-Modal Retrieval
Li, Fengling
Wang, Bowen
Zhu, Lei
Li, Jingjing
Zhang, Zheng
Chang, Xiaojun
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (10) : 9664 - 9677
[2] Domain Adaptive Cross-Modal Image Retrieval via Modality and Domain Translations
Yanagi, Rintaro
Togo, Ren
Ogawa, Takahiro
Haseyama, Miki
IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES, 2021, E104A (06) : 866 - 875
[3] Cross-domain Cross-modal Food Transfer
Zhu, Bin
Ngo, Chong-Wah
Chen, Jing-jing
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 3762 - 3770
[4] Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval
Liu, Yang
Chen, Qingchao
Albanie, Samuel
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 14949 - 14959
[5] Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning
Li, Zhengxin
Zhao, Wenzhe
Du, Xuanyi
Zhou, Guangyao
Zhang, Songlin
REMOTE SENSING, 2024, 16 (01)
[6] Cross-domain personalized image captioning
Cuirong Long
Xiaoshan Yang
Changsheng Xu
Multimedia Tools and Applications, 2020, 79 : 33333 - 33348
[7] Cross-domain personalized image captioning
Long, Cuirong
Yang, Xiaoshan
Xu, Changsheng
MULTIMEDIA TOOLS AND APPLICATIONS, 2020, 79 (45-46) : 33333 - 33348
[8] Cross-Domain and Cross-Modal Knowledge Distillation in Domain Adaptation for 3D Semantic Segmentation
Li, Miaoyu
Zhang, Yachao
Xie, Yuan
Gao, Zuodong
Li, Cuihua
Zhang, Zhizhong
Qu, Yanyun
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 3829 - 3837
[9] Cross-modal domain adaptation for text-based regularization of image semantics in image retrieval systems
Pereira, Jose Costa
Vasconcelos, Nuno
COMPUTER VISION AND IMAGE UNDERSTANDING, 2014, 124 : 123 - 135
[10] Multitask Learning for Cross-Domain Image Captioning
Yang, Min
Zhao, Wei
Xu, Wei
Feng, Yabing
Zhao, Zhou
Chen, Xiaojun
Lei, Kai
IEEE TRANSACTIONS ON MULTIMEDIA, 2019, 21 (04) : 1047 - 1061
