BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Cited by: 1
|
Authors
Sarto, Sara [1 ]
Cornia, Marcella [1 ]
Baraldi, Lorenzo [1 ]
Cucchiara, Rita [1 ,2 ]
Affiliations
[1] Univ Modena & Reggio Emilia, Modena, Italy
[2] IIT CNR, Pisa, Italy
Source
Keywords
Captioning Evaluation; Vision-and-Language;
DOI: 10.1007/978-3-031-73229-4_5
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard, as they either do not take the corresponding image into account or lack the capability to encode fine-grained details and penalize hallucinations. To overcome these issues, in this paper we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multimodal pseudo-captions built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.
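BRIDGE's own mapping module and pseudo-caption construction are not reproduced here; as a minimal sketch of the reference-free image-text scoring the abstract contrasts against, the snippet below implements the published CLIP-Score formula, `w * max(cos(image, caption), 0)` with `w = 2.5`. The random vectors are stand-ins for real CLIP embeddings, and the function name is illustrative, not from the BRIDGE codebase.

```python
import numpy as np

def clip_style_score(image_emb: np.ndarray, caption_emb: np.ndarray, w: float = 2.5) -> float:
    """Reference-free score in the style of CLIP-Score:
    w * max(cosine_similarity(image, caption), 0)."""
    img = image_emb / np.linalg.norm(image_emb)
    cap = caption_emb / np.linalg.norm(caption_emb)
    return w * max(float(img @ cap), 0.0)

rng = np.random.default_rng(0)
image = rng.standard_normal(512)                 # stand-in for a CLIP image embedding
good = image + 0.1 * rng.standard_normal(512)    # caption embedding aligned with the image
bad = rng.standard_normal(512)                   # unrelated caption embedding

print(clip_style_score(image, good) > clip_style_score(image, bad))
```

Because the score drops the image-only or text-only view and compares the two embeddings directly, it is reference-free; the abstract's point is that such a score still misses fine-grained detail, which BRIDGE addresses by injecting dense visual vectors into pseudo-captions.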
Pages: 70-87
Page count: 18