Multimodal Data Augmentation for Image Captioning using Diffusion Models

Authors
Xiao, Changrong [1]
Xu, Sean Xin [1]
Zhang, Kunpeng [2]
Affiliations
[1] Tsinghua Univ, Sch Econ & Management, Ctr AI & Management AIM, Beijing, Peoples R China
[2] Univ Maryland, Dept Decis Operat & Informat Technol, College Pk, MD 20742 USA
Keywords
data synthesis; image captioning; multimodal applications
DOI
10.1145/3607827.3616839
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Image captioning, an important vision-language task, often requires a large number of finely labeled image-caption pairs to learn the underlying alignment between images and texts. In this paper, we propose a multimodal data augmentation method that leverages a recent text-to-image model, Stable Diffusion, to expand the training set by generating high-quality image-caption pairs. Extensive experiments on the MS COCO dataset demonstrate the advantages of our approach over several benchmark methods, with a particularly significant boost when fewer training instances are available. In addition, models trained on our augmented datasets outperform prior unpaired image captioning methods by a large margin. Finally, training efficiency and effectiveness improve further when the generated data are deliberately filtered based on quality assessment.
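The augment-then-filter idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `augment_pairs`, `fake_generate`, `fake_score`, and the `min_score` threshold are all hypothetical names introduced here. In practice the generator would be a text-to-image model such as Stable Diffusion and the scorer some image-text quality measure; the abstract does not say which measure the paper uses.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (image identifier, caption)


def augment_pairs(
    captions: List[str],
    generate_image: Callable[[str], str],
    score_pair: Callable[[str, str], float],
    min_score: float,
) -> List[Pair]:
    """Generate one synthetic image per caption; keep pairs scoring >= min_score."""
    kept: List[Pair] = []
    for caption in captions:
        image = generate_image(caption)  # assumed: a text-to-image model call
        if score_pair(image, caption) >= min_score:  # assumed quality filter
            kept.append((image, caption))
    return kept


# Stand-ins so the sketch runs without any model (purely illustrative).
def fake_generate(caption: str) -> str:
    return "img:" + caption


def fake_score(image: str, caption: str) -> float:
    # Toy heuristic: longer captions score higher, capped at 1.0.
    return min(1.0, len(caption.split()) / 5)


real_pairs: List[Pair] = [("coco_0001.jpg", "a dog runs on the beach")]
synthetic = augment_pairs(
    ["a cat sleeps on a red sofa", "a bus"],
    fake_generate,
    fake_score,
    min_score=0.5,
)
training_set = real_pairs + synthetic  # original pairs plus filtered synthetic pairs
```

Passing the generator and scorer as callables keeps the filtering logic independent of any particular model, so the stand-ins can be swapped for real components without changing `augment_pairs`.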
Pages: 23-33
Page count: 11
Related papers
50 in total
  • [1] Text Augmentation for Compressed Image Captioning Models
    Atliha, Viktar
    Sesok, Dmitrij
    [J]. 2022 IEEE OPEN CONFERENCE OF ELECTRICAL, ELECTRONIC AND INFORMATION SCIENCES (ESTREAM), 2022,
  • [2] Image captioning with data augmentation using cropping and mask based on attention image
    Iwamura, Kiyohiko
    Louhi Kasahara, Jun Younes
    Moro, Alessandro
    Yamashita, Atsushi
    Asama, Hajime
    [J]. Seimitsu Kogaku Kaishi/Journal of the Japan Society for Precision Engineering, 2020, 86 (11): 904 - 910
  • [3] Image captioning by diffusion models: A survey
    Daneshfar, Fatemeh
    Bartani, Ako
    Lotfi, Pardis
    [J]. Engineering Applications of Artificial Intelligence, 2024, 138
  • [4] Text Augmentation Using BERT for Image Captioning
    Atliha, Viktar
    Sesok, Dmitrij
    [J]. APPLIED SCIENCES-BASEL, 2020, 10 (17):
  • [5] Image Captioning Using Multimodal Deep Learning Approach
    Farkh, Rihem
    Oudinet, Ghislain
    Foued, Yasser
    [J]. Computers, Materials and Continua, 2024, 81 (03): 3951 - 3968
  • [6] Multimodal Image Captioning for Marketing Analysis
    Harzig, Philipp
    Brehm, Stephan
    Lienhart, Rainer
    Kaiser, Carolin
    Schallner, Rene
    [J]. IEEE 1ST CONFERENCE ON MULTIMEDIA INFORMATION PROCESSING AND RETRIEVAL (MIPR 2018), 2018, : 158 - 161
  • [7] MMT: A Multimodal Translator for Image Captioning
    Liu, Chang
    Sun, Fuchun
    Wang, Changhu
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, PT II, 2017, 10614 : 784 - 784
  • [8] Improving multimodal datasets with image captioning
    Thao Nguyen
    Gadre, Samir Yitzhak
    Ilharco, Gabriel
    Oh, Sewoong
    Schmidt, Ludwig
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [9] Exploring Data and Models in SAR Ship Image Captioning
    Zhao, Kai
    Xiong, Wei
    [J]. IEEE ACCESS, 2022, 10 : 91150 - 91159
  • [10] A multimodal fusion approach for image captioning
    Zhao, Dexin
    Chang, Zhi
    Guo, Shutao
    [J]. NEUROCOMPUTING, 2019, 329 : 476 - 485