BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Cited by: 1

Authors:

Sarto, Sara ^{[1]}

Cornia, Marcella ^{[1]}

Baraldi, Lorenzo ^{[1]}

Cucchiara, Rita ^{[1,2]}

Affiliations:

[1] Univ Modena & Reggio Emilia, Modena, Italy

[2] IIT CNR, Pisa, Italy

Source:

COMPUTER VISION - ECCV 2024, PT LXXVIII | 2025 / Vol. 15136

Keywords:

Captioning Evaluation; Vision-and-Language;

DOI:

10.1007/978-3-031-73229-4_5

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.

Pages: 70-87

Page count: 18

50 items in total

[41] Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning
Li, Yiran
Wang, Junpeng
Aboagye, Prince
Yeh, Chin-Chia Michael
Zheng, Yan
Wang, Liang
Zhang, Wei
Ma, Kwan-Liu
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2024, 30 (06) : 2875 - 2887
[42] A Visual Attention-Based Model for Bengali Image Captioning
Das B.
Pal R.
Majumder M.
Phadikar S.
Sekh A.A.
SN Computer Science, 4 (2)
[43] Boosting convolutional image captioning with semantic content and visual relationship
Bai, Cong
Zheng, Anqi
Huang, Yuan
Pan, Xiang
Chen, Nan
DISPLAYS, 2021, 70
[44] Multi-level Visual Fusion Networks for Image Captioning
Zhou, Dongming
Zhang, Canlong
Li, Zhixin
Wang, Zhiwen
2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
[45] Neuraltalk+: neural image captioning with visual assistance capabilities
Sharma H.
Padha D.
Multimedia Tools and Applications, 2025, 84 (10) : 6843 - 6871
[46] A Unified Visual and Linguistic Semantics Method for Enhanced Image Captioning
Peng, Jiajia
Tang, Tianbing
APPLIED SCIENCES-BASEL, 2024, 14 (06):
[47] Aligning Linguistic Words and Visual Semantic Units for Image Captioning
Guo, Longteng
Liu, Jing
Tang, Jinhui
Li, Jiangwei
Luo, Wei
Lu, Hanqing
PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 765 - 773
[48] Social Image Captioning: Exploring Visual Attention and User Attention
Wang, Leiquan
Chu, Xiaoliang
Zhang, Weishan
Wei, Yiwei
Sun, Weichen
Wu, Chunlei
SENSORS, 2018, 18 (02)
[49] Local-global visual interaction attention for image captioning
Wang, Changzhi
Gu, Xiaodong
DIGITAL SIGNAL PROCESSING, 2022, 130
[50] VIXEN: Visual Text Comparison Network for Image Difference Captioning
Black, Alexander
Shi, Jing
Fan, Yifei
Bui, Tu
Collomosse, John
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 2, 2024, : 846 - 854