Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

Cited by: 0
Authors
Marcella Cornia
Lorenzo Baraldi
Giuseppe Fiameni
Rita Cucchiara
Affiliations
[1] University of Modena and Reggio Emilia
[2] NVIDIA AI Technology Centre
[3] IIT-CNR
Keywords
Image captioning; Vision and language; Multimodal learning
DOI
Not available
Abstract
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. The proposed model avoids the need for object detectors, is trained with a single objective of prompt language modeling, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability of recognizing real-world concepts and producing high-quality captions. Extensive experiments are performed on different image captioning datasets, including CC3M, nocaps, and the competitive COCO dataset, where our model consistently outperforms baselines and state-of-the-art approaches.
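As a rough illustration of the idea described in the abstract, the sketch below shows one way a style token and retrieved keywords could be combined into a prompt for a caption language model. It is not the authors' implementation: the function names, the `<style:...>` and `<sep>` token formats, the ordering of the prompt, the cosine-similarity retrieval, and the toy embeddings are all assumptions made for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_keywords(image_vec, keyword_vecs, k=2):
    """Return the k keywords whose embeddings are closest to the image embedding.

    Hypothetical stand-in for the retrieval component mentioned in the abstract.
    """
    ranked = sorted(
        keyword_vecs,
        key=lambda word: cosine(image_vec, keyword_vecs[word]),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(style, keywords, caption_tokens=()):
    """Assemble [style token] + keywords + separator + caption tokens.

    The token format and ordering are assumptions, not taken from the paper.
    """
    return [f"<style:{style}>"] + list(keywords) + ["<sep>"] + list(caption_tokens)

# Toy embeddings, invented for illustration only.
keyword_vecs = {
    "dog": [1.0, 0.0, 0.0],
    "frisbee": [0.8, 0.6, 0.0],
    "car": [0.0, 0.0, 1.0],
}
image_vec = [1.0, 0.2, 0.0]

keywords = retrieve_keywords(image_vec, keyword_vecs, k=2)
# At inference time, the style token selects the human-annotated style.
prompt = build_prompt("human", keywords)
print(prompt)
```

During training, the same `build_prompt` helper could be called with a style label matching each caption's data source and with the ground-truth caption tokens appended, so that a single language-modeling objective covers every source.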
Pages: 1701-1720
Number of pages: 19
Source
International Journal of Computer Vision, 2024, 132 (05): 1701-1720