Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model

Cited by: 1
Authors
Cheng, Kanzhi [1 ]
Song, Wenpo [1 ]
Ma, Zheng [1 ]
Zhu, Wenhao [1 ]
Zhu, Zixuan [2 ]
Zhang, Jianbing [1 ]
Affiliations
[1] Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing, Peoples R China
[2] Univ Glasgow, Glasgow, Lanark, Scotland
Keywords
Image Captioning; Vision-Language Pre-Training; Knowledge;
DOI
10.1145/3581783.3611987
CLC number
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Current captioning approaches tend to generate correct but "generic" descriptions that lack real-world knowledge, e.g., named entities and contextual information. Considering that Vision-Language Pre-Training (VLP) models acquire massive amounts of such knowledge from large-scale web-harvested data, it is promising to exploit the generalizability of VLP models to incorporate knowledge into image descriptions. However, using VLP models faces two challenges: zero-shot inference suffers from knowledge hallucination, which yields low-quality descriptions, while the generic bias introduced by downstream fine-tuning hinders the VLP model from expressing knowledge. To address these concerns, we propose a simple yet effective method called Knowledge-guided Replay (K-Replay), which enables the retention of pre-training knowledge during fine-tuning. Our approach consists of two parts: (1) a knowledge prediction task on automatically collected replay exemplars that continuously awakens the VLP model's memory of knowledge, preventing the model from collapsing into the generic pattern; and (2) a knowledge distillation constraint that improves the faithfulness of generated descriptions, thereby alleviating knowledge hallucination. To evaluate knowledge-enhanced descriptions, we construct a novel captioning benchmark, KnowCap, containing knowledge of landmarks, famous brands, special foods, and movie characters. Experimental results show that our approach effectively incorporates knowledge into descriptions, outperforming a strong VLP baseline by 20.9 points in CIDEr score (78.7 -> 99.6) and 20.5 percentage points in knowledge recognition accuracy (34.0% -> 54.5%). Our code and data are available at https://github.com/njucckevin/KnowCap.
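The abstract describes a fine-tuning objective with three interacting terms: the usual captioning loss, a knowledge prediction loss on replay exemplars, and a distillation constraint toward the frozen pre-trained model. The following is a minimal PyTorch sketch of how such a combined objective might be assembled; all function and field names (k_replay_step, keyword_labels, the loss weights, the model call signature) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a K-Replay-style training step, assuming both models
# map (images, tokens) -> per-token vocabulary logits of shape (B, T, V).
import torch
import torch.nn.functional as F

def k_replay_step(model, frozen_vlp, caption_batch, replay_batch,
                  lambda_know=1.0, lambda_kd=0.5):
    """One fine-tuning step combining:
      (1) standard captioning loss on the downstream batch,
      (2) a knowledge prediction loss on replay exemplars, nudging the model
          to keep producing knowledge tokens (e.g., named entities),
      (3) a KL distillation term toward the frozen pre-trained VLP model,
          intended to curb knowledge hallucination.
    """
    # (1) cross-entropy captioning loss on downstream data
    logits = model(caption_batch["images"], caption_batch["tokens"])
    loss_cap = F.cross_entropy(
        logits.flatten(0, 1), caption_batch["labels"].flatten())

    # (2) knowledge prediction on automatically collected replay exemplars:
    # supervise against knowledge-keyword targets so the generic pattern
    # does not overwrite pre-training knowledge
    replay_logits = model(replay_batch["images"], replay_batch["tokens"])
    loss_know = F.cross_entropy(
        replay_logits.flatten(0, 1), replay_batch["keyword_labels"].flatten())

    # (3) distillation: keep the fine-tuned distribution close to the frozen
    # pre-trained VLP model on the replay exemplars
    with torch.no_grad():
        teacher_logits = frozen_vlp(replay_batch["images"],
                                    replay_batch["tokens"])
    loss_kd = F.kl_div(
        F.log_softmax(replay_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean")

    return loss_cap + lambda_know * loss_know + lambda_kd * loss_kd
```

The weighting between the three terms (lambda_know, lambda_kd) is a free design choice in this sketch; the paper itself should be consulted for the exact formulation and hyperparameters.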
Pages: 5038-5047
Page count: 10