BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Cited by: 1
Authors
Sarto, Sara [1 ]
Cornia, Marcella [1 ]
Baraldi, Lorenzo [1 ]
Cucchiara, Rita [1 ,2 ]
Affiliations
[1] Univ Modena & Reggio Emilia, Modena, Italy
[2] IIT CNR, Pisa, Italy
Keywords
Captioning Evaluation; Vision-and-Language;
DOI
10.1007/978-3-031-73229-4_5
CLC (Chinese Library Classification)
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard, as they either do not take the corresponding image into account or lack the capability to encode fine-grained details and penalize hallucinations. To overcome these issues, in this paper we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.
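For context on the CLIP-Score baseline the abstract contrasts against, that reference-free formulation reduces to a rescaled, clipped cosine similarity between CLIP image and caption embeddings (with rescaling weight w = 2.5). A minimal NumPy sketch, using random stand-in vectors where a real CLIP encoder would supply the embeddings:

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray, w: float = 2.5) -> float:
    """Reference-free CLIP-Score: w * max(cosine(image, caption), 0).

    The embeddings here are stand-ins; in practice they come from a CLIP
    image encoder (applied to the image) and text encoder (applied to the
    candidate caption).
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    # Clip negative similarities to zero, then rescale by w.
    return w * max(float(image_emb @ text_emb), 0.0)

# Toy example with random stand-in embeddings.
rng = np.random.default_rng(0)
img, txt = rng.standard_normal(512), rng.standard_normal(512)
print(0.0 <= clip_score(img, txt) <= 2.5)  # prints True: score is bounded in [0, w]
```

Because the score depends only on the image and the candidate caption, no reference captions are needed; BRIDGE targets the weaknesses of this style of metric (coarse visual encoding, limited hallucination penalties) rather than its reference-free design.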
Pages: 70-87
Number of pages: 18
Related Papers
50 records in total
  • [1] Quantifying the Impact of Complementary Visual and Textual Cues Under Image Captioning
    Akilan, Thangarajah
    Thiagarajan, Amitha
    Venkatesan, Bharathwaaj
    Thirumeni, Sowmiya
    Chandrasekaran, Sanjana Gurusamy
    2020 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2020, : 389 - 394
  • [2] Bridging by Word: Image-Grounded Vocabulary Construction for Visual Captioning
    Fan, Zhihao
    Wei, Zhongyu
    Wang, Siyuan
    Huang, Xuanjing
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 6514 - 6524
  • [3] Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning
    Xie, Yujia
    Zhou, Luowei
    Dai, Xiyang
    Yuan, Lu
    Bach, Nguyen
    Liu, Ce
    Zeng, Michael
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [4] Visual Relationship Attention for Image Captioning
    Zhang, Zongjian
    Wu, Qiang
    Wang, Yang
    Chen, Fang
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [5] Visual Cluster Grounding for Image Captioning
    Jiang, Wenhui
    Zhu, Minwei
    Fang, Yuming
    Shi, Guangming
    Zhao, Xiaowei
    Liu, Yang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 3920 - 3934
  • [6] Bengali Image Captioning with Visual Attention
    Ami, Amit Saha
    Humaira, Mayeesha
    Jim, Md Abidur Rahman Khan
    Paul, Shimul
    Shah, Faisal Muhammad
    2020 23RD INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY (ICCIT 2020), 2020,
  • [7] A visual persistence model for image captioning
    Wang, Yiyu
    Xu, Jungang
    Sun, Yingfei
    NEUROCOMPUTING, 2022, 468 : 48 - 59
  • [8] Incorporating Unlikely Negative Cues for Distinctive Image Captioning
    Fei, Zhengcong
    Huang, Junshi
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 745 - 753
  • [9] Visual enhanced gLSTM for image captioning
    Zhang, Jing
    Li, Kangkang
    Wang, Zhenkun
    Zhao, Xianwen
    Wang, Zhe
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 184
  • [10] Exploring Visual Relationship for Image Captioning
    Yao, Ting
    Pan, Yingwei
    Li, Yehao
    Mei, Tao
    COMPUTER VISION - ECCV 2018, PT XIV, 2018, 11218 : 711 - 727