On Diversity in Image Captioning: Metrics and Methods

Cited by: 13
Authors
Wang, Qingzhong [1 ]
Wan, Jia [1 ]
Chan, Antoni B. [1 ]
Affiliations
[1] City Univ Hong Kong, Dept Comp Sci, Kowloon, Hong Kong, Peoples R China
Keywords
Measurement; Semantics; Learning (artificial intelligence); Vegetation; Legged locomotion; Training; Computational modeling; Image captioning; diverse captions; reinforcement learning; policy gradient; adversarial training; diversity metric;
DOI
10.1109/TPAMI.2020.3013834
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Diversity is one of the most important properties of image captioning, as it reflects the variety of expressions for the important concepts presented in an image. However, the most popular metrics cannot properly evaluate the diversity of multiple captions. In this paper, we first propose a metric to measure the diversity of a set of captions, which is derived from latent semantic analysis (LSA), and we then kernelize LSA using CIDEr (R. Vedantam et al., 2015) similarity. Compared with mBLEU (R. Shetty et al., 2017), our proposed diversity metrics show a relatively strong correlation with human evaluation. We conduct extensive experiments and find a large gap between the performance of current state-of-the-art models and human annotations when both diversity and accuracy are considered: models that aim to generate captions with higher CIDEr scores normally obtain lower diversity scores, because they learn to describe images using common words. To bridge this "diversity" gap, we consider several methods for training caption models to generate diverse captions. First, we show that balancing the cross-entropy loss and the CIDEr reward in reinforcement learning during training can effectively control the tradeoff between the diversity and accuracy of the generated captions. Second, we develop approaches that directly optimize our diversity metric and the CIDEr score using reinforcement learning. These reinforcement learning (RL) approaches can be unified into a self-critical (S. J. Rennie et al., 2017) framework with new RL baselines. Third, we combine accuracy and diversity into a single measure using an ensemble matrix, and then maximize the determinant of the ensemble matrix via reinforcement learning to boost both diversity and accuracy; this approach outperforms its counterparts on the oracle test. Finally, inspired by determinantal point processes (DPPs), we develop a DPP selection algorithm to select a subset of captions from a large set of candidate captions.
The experimental results show that maximizing the determinant of the ensemble matrix outperforms the other methods, considerably improving both diversity and accuracy.
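The LSA-based diversity idea in the abstract can be illustrated with a toy sketch: stack bag-of-words vectors of the captions into a term-caption matrix, take its SVD, and use how concentrated the singular values are as a proxy for diversity (near-duplicate captions concentrate energy in the top singular value). The function name `lsa_diversity`, the bag-of-words construction, and the log normalization here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def lsa_diversity(captions):
    """Toy LSA-style diversity score for a set of captions (sketch only)."""
    # Build a term-caption count matrix M (rows: words, columns: captions).
    vocab = sorted({w for c in captions for w in c.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(captions)))
    for j, c in enumerate(captions):
        for w in c.lower().split():
            M[index[w], j] += 1.0
    # Singular values of M; identical captions give a rank-1 matrix.
    s = np.linalg.svd(M, compute_uv=False)
    # Ratio of the top singular value to the total: closer to 1 means the
    # captions share one dominant "topic", i.e., low diversity.
    r = s[0] / s.sum()
    return -np.log(r) / np.log(len(captions))  # roughly normalized to [0, 1]

identical = ["a dog runs"] * 4
varied = ["a dog runs", "a puppy sprints fast",
          "the hound is racing", "canine moving quickly"]
print(lsa_diversity(identical))  # ≈ 0 (no diversity)
print(lsa_diversity(varied) > lsa_diversity(identical))  # True
```

Note the intuition this captures: a caption set that reuses the same words is (numerically) low-rank, so one singular value dominates and the score collapses toward zero.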
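The DPP-based subset selection mentioned in the abstract can be sketched with the standard greedy MAP approximation for DPPs: given a positive semidefinite kernel `L` over candidate captions (e.g., built from pairwise similarities weighted by caption quality), repeatedly add the candidate that most increases the log-determinant of the selected submatrix. The function name and the toy kernel below are illustrative, not the paper's implementation.

```python
import numpy as np

def greedy_dpp_select(L, k):
    """Greedy MAP for a DPP: pick k indices approximately maximizing det(L[S, S])."""
    selected, remaining = [], list(range(L.shape[0]))
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in remaining:
            S = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(S, S)])
            gain = logdet if sign > 0 else -np.inf  # reject degenerate subsets
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy kernel: candidates 0 and 1 are near-duplicates, 2 is distinct.
L = np.array([[1.0, 0.95, 0.1],
              [0.95, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
print(greedy_dpp_select(L, 2))  # → [0, 2]: picks one of the duplicates, not both
```

Because the determinant of a submatrix shrinks when its rows are nearly linearly dependent, the selection naturally penalizes redundant captions, which is exactly the behavior a diversity-seeking selection step needs.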
Pages: 1035-1049
Page count: 15