BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Cited by: 1
Authors
Sarto, Sara [1 ]
Cornia, Marcella [1 ]
Baraldi, Lorenzo [1 ]
Cucchiara, Rita [1 ,2 ]
Affiliations
[1] Univ Modena & Reggio Emilia, Modena, Italy
[2] IIT CNR, Pisa, Italy
Source
Keywords
Captioning Evaluation; Vision-and-Language
DOI
10.1007/978-3-031-73229-4_5
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Effectively aligning with human judgment when evaluating machine-generated image captions is a complex yet intriguing challenge. Existing evaluation metrics such as CIDEr or CLIP-Score fall short in this regard, as they either do not take the corresponding image into account or lack the capability to encode fine-grained details and penalize hallucinations. To overcome these issues, in this paper we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.
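To make the reference-free idea concrete, the sketch below shows how the simpler CLIP-Score baseline, which the abstract contrasts against, scores a candidate caption directly against the image embedding with no reference captions involved. It is a minimal, hypothetical illustration only, assuming the Hugging Face transformers CLIP API and the standard 2.5 * max(cos, 0) CLIP-Score formulation; it is not the BRIDGE metric itself, whose official implementation is in the repository linked above.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical sketch of a reference-free caption score in the style of CLIP-Score;
# this is NOT the BRIDGE metric (see https://github.com/aimagelab/bridge-score).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def reference_free_score(image: Image.Image, candidate_caption: str) -> float:
    """Score a candidate caption against the image itself, without reference captions."""
    inputs = processor(text=[candidate_caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the projected image and text embeddings,
    # clipped at zero and rescaled by 2.5 as in the CLIP-Score definition.
    cos = torch.nn.functional.cosine_similarity(out.image_embeds, out.text_embeds)
    return float(2.5 * cos.clamp(min=0.0).item())

# Example usage (hypothetical file name):
# print(reference_free_score(Image.open("example.jpg"), "a dog chasing a ball on the grass"))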
Pages: 70-87
Number of pages: 18
Related papers
50 items in total
  • [21] Ontological Approach to Image Captioning Evaluation
    Shunkevich, D.
    Iskra, N.
    Pattern Recognition and Image Analysis, 2020, 30: 288-294
  • [22] RVAIC: Refined visual attention for improved image captioning
    Al-Qatf, Majjed
    Hawbani, Ammar
    Wang, XingFu
    Abdusallam, Amr
    Alsamhi, Saeed
    Alhabib, Mohammed
    Curry, Edward
    Journal of Intelligent & Fuzzy Systems, 2024, 46(02): 3447-3459
  • [25] Image captioning in Bengali language using visual attention
    Masud, Adiba
    Hosen, Md. Biplob
    Habibullah, Md.
    Anannya, Mehrin
    Kaiser, M. Shamim
    PLOS ONE, 2025, 20(02)
  • [26] Image Captioning With Visual-Semantic Double Attention
    He, Chen
    Hu, Haifeng
    ACM Transactions on Multimedia Computing, Communications, and Applications, 2019, 15(01)
  • [27] Visual contextual relationship augmented transformer for image captioning
    Su, Qiang
    Hu, Junbo
    Li, Zhixin
    Applied Intelligence, 2024, 54(06): 4794-4813
  • [28] Image Captioning with Text-Based Visual Attention
    He, Chen
    Hu, Haifeng
    Neural Processing Letters, 2019, 49: 177-185
  • [29] Visual saliency for image captioning in new multimedia services
    Cornia, Marcella
    Baraldi, Lorenzo
    Serra, Giuseppe
    Cucchiara, Rita
    2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2017
  • [30] A visual question answering model based on image captioning
    Zhou, Kun
    Liu, Qiongjie
    Zhao, Dexin
    Multimedia Systems, 2024, 30(06)