BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Cited by: 1
|
Authors
Sarto, Sara [1 ]
Cornia, Marcella [1 ]
Baraldi, Lorenzo [1 ]
Cucchiara, Rita [1 ,2 ]
Affiliations
[1] Univ Modena & Reggio Emilia, Modena, Italy
[2] IIT CNR, Pisa, Italy
Source
Keywords
Captioning Evaluation; Vision-and-Language;
DOI: 10.1007/978-3-031-73229-4_5
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard, as they either do not take the corresponding image into account or lack the capability to encode fine-grained details and penalize hallucinations. To overcome these issues, in this paper we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multimodal pseudo-captions built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.
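BRIDGE's own mapping module and pseudo-caption construction are not reproduced here; as a minimal sketch of the reference-free image-text scoring the abstract contrasts against, the snippet below implements the published CLIP-Score formula, `w * max(cos(image, caption), 0)` with `w = 2.5`. The random vectors are stand-ins for real CLIP embeddings, and the function name is illustrative, not from the BRIDGE codebase.

```python
import numpy as np

def clip_style_score(image_emb: np.ndarray, caption_emb: np.ndarray, w: float = 2.5) -> float:
    """Reference-free score in the style of CLIP-Score:
    w * max(cosine_similarity(image, caption), 0)."""
    img = image_emb / np.linalg.norm(image_emb)
    cap = caption_emb / np.linalg.norm(caption_emb)
    return w * max(float(img @ cap), 0.0)

rng = np.random.default_rng(0)
image = rng.standard_normal(512)                 # stand-in for a CLIP image embedding
good = image + 0.1 * rng.standard_normal(512)    # caption embedding aligned with the image
bad = rng.standard_normal(512)                   # unrelated caption embedding

print(clip_style_score(image, good) > clip_style_score(image, bad))
```

Because the score drops the image-only or text-only view and compares the two embeddings directly, it is reference-free; the abstract's point is that such a score still misses fine-grained detail, which BRIDGE addresses by injecting dense visual vectors into pseudo-captions.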
Pages: 70-87
Page count: 18