BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Cited by: 1
Authors
Sarto, Sara [1 ]
Cornia, Marcella [1 ]
Baraldi, Lorenzo [1 ]
Cucchiara, Rita [1 ,2 ]
Affiliations
[1] Univ Modena & Reggio Emilia, Modena, Italy
[2] IIT CNR, Pisa, Italy
Source
Computer Vision – ECCV 2024, Lecture Notes in Computer Science, Springer
Keywords
Captioning Evaluation; Vision-and-Language
DOI
10.1007/978-3-031-73229-4_5
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification
081104; 0812; 0835; 1405
Abstract
Effectively aligning with human judgment when evaluating machine-generated image captions is a complex yet intriguing challenge. Existing evaluation metrics such as CIDEr or CLIP-Score fall short in this regard: they either do not take the corresponding image into account or cannot encode fine-grained details and penalize hallucinations. To overcome these issues, in this paper we propose BRIDGE, a new learnable, reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multimodal pseudo-captions built during the evaluation process. This results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.
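The abstract only outlines the mechanism, so below is a minimal PyTorch sketch of the core idea as stated: a mapping module turns image features into a few dense "pseudo-word" vectors, those vectors are spliced into a templated pseudo-caption, and the candidate caption is scored against this multimodal sequence with no reference captions involved. All class names, dimensions, the template tokens, and the cosine-similarity scoring head are hypothetical placeholders for illustration, not the actual BRIDGE architecture; see the linked repository for the real implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions; BRIDGE itself builds on pretrained vision-and-language encoders.
D_VIS, D_TXT, N_SLOTS, VOCAB = 768, 512, 4, 1000


class MappingModule(nn.Module):
    """Maps a global visual feature into N_SLOTS dense 'pseudo-word' vectors."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_VIS, N_SLOTS * D_TXT)

    def forward(self, vis_feat):  # vis_feat: (B, D_VIS)
        b = vis_feat.size(0)
        return self.proj(vis_feat).view(b, N_SLOTS, D_TXT)  # (B, N_SLOTS, D_TXT)


class BridgeLikeScorer(nn.Module):
    """Toy reference-free scorer: compares a candidate caption against a
    multimodal pseudo-caption made of template token embeddings plus the
    mapped visual vectors."""

    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, D_TXT)
        self.mapper = MappingModule()
        layer = nn.TransformerEncoderLayer(D_TXT, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def encode(self, seq):  # (B, L, D_TXT) -> (B, D_TXT), mean-pooled
        return self.encoder(seq).mean(dim=1)

    def forward(self, vis_feat, template_ids, candidate_ids):
        # Build the pseudo-caption: template embeddings with visual slots appended.
        tmpl = self.tok_emb(template_ids)         # (B, T, D_TXT)
        slots = self.mapper(vis_feat)             # (B, N_SLOTS, D_TXT)
        pseudo = torch.cat([tmpl, slots], dim=1)  # multimodal pseudo-caption
        cand = self.tok_emb(candidate_ids)
        # One reference-free score per image-caption pair.
        return F.cosine_similarity(self.encode(pseudo), self.encode(cand))


# Toy usage with random stand-ins for CLIP-style image features and token ids.
scorer = BridgeLikeScorer()
score = scorer(torch.randn(2, D_VIS),
               torch.randint(0, VOCAB, (2, 8)),
               torch.randint(0, VOCAB, (2, 12)))
print(score)
```

In the actual metric the text and image encoders are pretrained vision-and-language models and the mapping module is learned against human judgments; the random tensors above merely stand in for real image features and tokenized captions.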
Pages: 70-87 (18 pages)
Related Papers (50 total)
  • [41] Li, Yiran; Wang, Junpeng; Aboagye, Prince; Yeh, Chin-Chia Michael; Zheng, Yan; Wang, Liang; Zhang, Wei; Ma, Kwan-Liu. Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning. IEEE Transactions on Visualization and Computer Graphics, 2024, 30(6): 2875-2887.
  • [42] Das, B.; Pal, R.; Majumder, M.; Phadikar, S.; Sekh, A. A. A Visual Attention-Based Model for Bengali Image Captioning. SN Computer Science, 4(2).
  • [43] Bai, Cong; Zheng, Anqi; Huang, Yuan; Pan, Xiang; Chen, Nan. Boosting Convolutional Image Captioning with Semantic Content and Visual Relationship. Displays, 2021, 70.
  • [44] Zhou, Dongming; Zhang, Canlong; Li, Zhixin; Wang, Zhiwen. Multi-level Visual Fusion Networks for Image Captioning. 2020 International Joint Conference on Neural Networks (IJCNN), 2020.
  • [45] Sharma, H.; Padha, D. NeuralTalk+: Neural Image Captioning with Visual Assistance Capabilities. Multimedia Tools and Applications, 2025, 84(10): 6843-6871.
  • [46] Peng, Jiajia; Tang, Tianbing. A Unified Visual and Linguistic Semantics Method for Enhanced Image Captioning. Applied Sciences-Basel, 2024, 14(6).
  • [47] Guo, Longteng; Liu, Jing; Tang, Jinhui; Li, Jiangwei; Luo, Wei; Lu, Hanqing. Aligning Linguistic Words and Visual Semantic Units for Image Captioning. Proceedings of the 27th ACM International Conference on Multimedia (MM '19), 2019: 765-773.
  • [48] Wang, Leiquan; Chu, Xiaoliang; Zhang, Weishan; Wei, Yiwei; Sun, Weichen; Wu, Chunlei. Social Image Captioning: Exploring Visual Attention and User Attention. Sensors, 2018, 18(2).
  • [49] Wang, Changzhi; Gu, Xiaodong. Local-Global Visual Interaction Attention for Image Captioning. Digital Signal Processing, 2022, 130.
  • [50] Black, Alexander; Shi, Jing; Fan, Yifei; Bui, Tu; Collomosse, John. VIXEN: Visual Text Comparison Network for Image Difference Captioning. Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24), Vol. 38, No. 2, 2024: 846-854.