Multitask Learning for Cross-Domain Image Captioning

Cited: 90
Authors
Yang, Min [1 ]
Zhao, Wei [1 ]
Xu, Wei [2 ]
Feng, Yabing [2 ]
Zhao, Zhou [3 ]
Chen, Xiaojun [4 ]
Lei, Kai [5 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen 518055, Peoples R China
[2] Tencent, Shenzhen 518057, Peoples R China
[3] Zhejiang Univ, Sch Comp Sci, Hangzhou 310058, Zhejiang, Peoples R China
[4] Shenzhen Univ, Sch Comp Sci, Shenzhen 518060, Peoples R China
[5] Peking Univ, Sch Elect & Comp Engn, Shenzhen Key Lab Informat Centr Networking & Bloc, Shenzhen 518055, Peoples R China
Keywords
Multitask learning; image captioning; image synthesis; dual learning; reinforcement learning; representation
DOI
10.1109/TMM.2018.2869276
CLC Classification Number
TP [automation and computer technology]
Discipline Classification Code
0812
Abstract
Recent artificial intelligence research has shown great interest in automatically generating text descriptions of images, a task known as image captioning. Remarkable success has been achieved in domains where large amounts of paired image-text data are available. Nevertheless, annotating sufficient data is labor-intensive and time-consuming, creating significant barriers to adapting image captioning systems to new domains. In this study, we introduce a novel Multitask Learning Algorithm for cross-Domain Image Captioning (MLADIC). MLADIC is a multitask system that simultaneously optimizes two coupled objectives via a dual learning mechanism: image captioning and text-to-image synthesis, with the hope that leveraging the correlation between the two dual tasks will enhance image captioning performance in the target domain. Concretely, the image captioning task trains an encoder-decoder model (i.e., CNN-LSTM) to generate textual descriptions of input images. The image synthesis task employs a conditional generative adversarial network (C-GAN) to synthesize plausible images from text descriptions. In C-GAN, a generative model G synthesizes plausible images given text descriptions, and a discriminative model D tries to distinguish the images in the training data from those generated by G. This adversarial process eventually guides G to generate plausible, high-quality images. To bridge the gap between domains, a two-step strategy is adopted to transfer knowledge from the source domain to the target domain. First, we pre-train the model on the abundant labeled source-domain data to learn the alignment between the neural representations of images and those of text. Second, we fine-tune the learned model by leveraging the limited image-text pairs and unpaired data in the target domain.
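The coupled objectives described above can be sketched numerically. This is a minimal sketch, assuming the standard binary cross-entropy GAN formulation and an illustrative weighting factor `lam`; the function names and the exact weighting scheme are assumptions for illustration, not the paper's specification (the paper's reinforcement-learning reward terms are also omitted here):

```python
import math

def discriminator_loss(d_real: float, d_fake: float) -> float:
    """C-GAN discriminator objective (binary cross-entropy form):
    D is pushed to score a real (image, text) pair toward 1 and a
    generated pair toward 0. Inputs are D's scores in (0, 1)."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake: float) -> float:
    """Non-saturating generator objective: G is pushed to make D
    score its synthesized images as real."""
    return -math.log(d_fake)

def multitask_loss(caption_nll: float, d_fake: float, lam: float = 0.5) -> float:
    """Illustrative joint objective coupling the two dual tasks:
    captioning negative log-likelihood plus a weighted image-synthesis
    (generator) term. `lam` is a hypothetical trade-off weight."""
    return caption_nll + lam * generator_loss(d_fake)

# Example: a confident discriminator (0.9 on real, 0.1 on fake)
# incurs a small discriminator loss.
print(discriminator_loss(0.9, 0.1))
print(multitask_loss(caption_nll=2.0, d_fake=0.5, lam=0.5))
```

In practice both terms would be computed over mini-batches and backpropagated through the shared image/text encoders, which is what lets the synthesis task regularize the captioner when target-domain pairs are scarce.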
We conduct extensive experiments to evaluate the performance of MLADIC, using MSCOCO as the source domain and Flickr30k and Oxford-102 as the target domains. The results demonstrate that MLADIC substantially outperforms strong competitors on the cross-domain image captioning task.
Pages: 1047-1061
Page count: 15
Related Papers
50 records
  • [1] Dual Learning for Cross-domain Image Captioning
    Zhao, Wei
    Xu, Wei
    Yang, Min
    Ye, Jianbo
    Zhao, Zhou
    Feng, Yabing
    Qiao, Yu
    [J]. CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 2017, : 29 - 38
  • [2] Discriminative Style Learning for Cross-Domain Image Captioning
    Yuan, Jin
    Zhu, Shuai
    Huang, Shuyin
    Zhang, Hanwang
    Xiao, Yaoqiang
    Li, Zhiyong
    Wang, Meng
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 1723 - 1736
  • [3] Cross-domain personalized image captioning
    Cuirong Long
    Xiaoshan Yang
    Changsheng Xu
    [J]. Multimedia Tools and Applications, 2020, 79 : 33333 - 33348
  • [4] Cross-domain personalized image captioning
    Long, Cuirong
    Yang, Xiaoshan
    Xu, Changsheng
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2020, 79 (45-46) : 33333 - 33348
  • [5] Learning Scene Graph for Better Cross-Domain Image Captioning
    Jia, Junhua
    Xin, Xiaowei
    Gao, Xiaoyan
    Ding, Xiangqian
    Pang, Shunpeng
    [J]. PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT III, 2024, 14427 : 121 - 137
  • [6] Cross-Domain Image Captioning with Discriminative Finetuning
    Dessi, Roberto
    Bevilacqua, Michele
    Gualdoni, Eleonora
    Carraz Rakotonirina, Nathanael
    Franzon, Francesca
    Baroni, Marco
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6935 - 6944
  • [7] Cross-domain multi-style merge for image captioning
    Duan, Yiqun
    Wang, Zhen
    Li, Yi
    Wang, Jingya
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 228
  • [8] Cross-Domain Image Captioning via Cross-Modal Retrieval and Model Adaptation
    Zhao, Wentian
    Wu, Xinxiao
    Luo, Jiebo
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 1180 - 1192
  • [9] Cross-domain learning for underwater image enhancement
    Li, Fei
    Zheng, Jiangbin
    Zhang, Yuan-fang
    Jia, Wenjing
    Wei, Qianru
    He, Xiangjian
    [J]. SIGNAL PROCESSING-IMAGE COMMUNICATION, 2023, 110
  • [10] Cross-domain collaborative learning for single image deraining
    Pan, Zaiyu
    Wang, Jun
    Shen, Zhengwen
    Han, Shuyu
    Zhu, Jihong
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2023, 211