BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Cited by: 1
Authors
Sarto, Sara [1 ]
Cornia, Marcella [1 ]
Baraldi, Lorenzo [1 ]
Cucchiara, Rita [1 ,2 ]
Affiliations
[1] Univ Modena & Reggio Emilia, Modena, Italy
[2] IIT CNR, Pisa, Italy
Source
Keywords
Captioning Evaluation; Vision-and-Language
DOI
10.1007/978-3-031-73229-4_5
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Effectively aligning with human judgment when evaluating machine-generated image captions is a complex yet intriguing challenge. Existing evaluation metrics such as CIDEr or CLIP-Score fall short in this regard, as they either do not take the corresponding image into account or lack the capability to encode fine-grained details and penalize hallucinations. To overcome these issues, in this paper we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.
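To make the reference-free idea concrete, the sketch below shows how the simpler CLIP-Score baseline, which the abstract contrasts against, scores a candidate caption directly against the image embedding with no reference captions involved. It is a minimal, hypothetical illustration only, assuming the Hugging Face transformers CLIP API and the standard 2.5 * max(cos, 0) CLIP-Score formulation; it is not the BRIDGE metric itself, whose official implementation is in the repository linked above.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical sketch of a reference-free caption score in the style of CLIP-Score;
# this is NOT the BRIDGE metric (see https://github.com/aimagelab/bridge-score).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def reference_free_score(image: Image.Image, candidate_caption: str) -> float:
    """Score a candidate caption against the image itself, without reference captions."""
    inputs = processor(text=[candidate_caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the projected image and text embeddings,
    # clipped at zero and rescaled by 2.5 as in the CLIP-Score definition.
    cos = torch.nn.functional.cosine_similarity(out.image_embeds, out.text_embeds)
    return float(2.5 * cos.clamp(min=0.0).item())

# Example usage (hypothetical file name):
# print(reference_free_score(Image.open("example.jpg"), "a dog chasing a ball on the grass"))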
Pages: 70-87
Number of pages: 18
Related papers
50 items in total
  • [21] Ontological Approach to Image Captioning Evaluation
    Shunkevich, D.
    Iskra, N.
    Pattern Recognition and Image Analysis, 2020, 30: 288-294
  • [22] RVAIC: Refined visual attention for improved image captioning
    Al-Qatf, Majjed
    Hawbani, Ammar
    Wang, XingFu
    Abdusallam, Amr
    Alsamhi, Saeed
    Alhabib, Mohammed
    Curry, Edward
    Journal of Intelligent & Fuzzy Systems, 2024, 46(02): 3447-3459
  • [25] Image captioning in Bengali language using visual attention
    Masud, Adiba
    Hosen, Md. Biplob
    Habibullah, Md.
    Anannya, Mehrin
    Kaiser, M. Shamim
    PLOS ONE, 2025, 20(02)
  • [26] Image Captioning With Visual-Semantic Double Attention
    He, Chen
    Hu, Haifeng
    ACM Transactions on Multimedia Computing, Communications, and Applications, 2019, 15(01)
  • [27] Visual contextual relationship augmented transformer for image captioning
    Su, Qiang
    Hu, Junbo
    Li, Zhixin
    Applied Intelligence, 2024, 54(06): 4794-4813
  • [28] Image Captioning with Text-Based Visual Attention
    He, Chen
    Hu, Haifeng
    Neural Processing Letters, 2019, 49: 177-185
  • [29] Visual saliency for image captioning in new multimedia services
    Cornia, Marcella
    Baraldi, Lorenzo
    Serra, Giuseppe
    Cucchiara, Rita
    2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2017
  • [30] A visual question answering model based on image captioning
    Zhou, Kun
    Liu, Qiongjie
    Zhao, Dexin
    Multimedia Systems, 2024, 30(06)