Image captioning for effective use of language models in knowledge-based visual question answering

Cited by: 16
Authors
Salaberria, Ander [1]
Azkune, Gorka [1]
Lopez de Lacalle, Oier [1]
Soroa, Aitor [1]
Agirre, Eneko [1]
Affiliations
[1] Univ Basque Country UPV EHU, HiTZ Basque Ctr Language Technol, Ixa NLP Grp, M Lardizabal 1, Donostia San Sebastian 20018, Basque Country, Spain
Keywords
Visual question answering; Image captioning; Language models; Deep learning
DOI
10.1016/j.eswa.2022.118669
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task that requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models with a comparable number of parameters; (ii) confirmation that our text-only method is especially effective for tasks requiring external knowledge, as it is less effective on a standard VQA task (VQA 2.0); and (iii) state-of-the-art results when the size of the language model is increased. We also significantly outperform current multimodal systems, even when they are augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
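The text-only procedure described in the abstract (caption the image, then let a language model answer from the caption and the question) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the prompt layout, and the toy captioner and language model are all hypothetical stand-ins, since the abstract names neither the captioning system nor the input format.

```python
from typing import Callable


def build_text_input(caption: str, question: str) -> str:
    """Join caption and question into one text-only input.

    The "caption: ... question: ..." layout is an assumption made
    for illustration; the abstract does not specify the format.
    """
    return f"caption: {caption.strip()} question: {question.strip()}"


def answer_from_caption(
    image_path: str,
    question: str,
    captioner: Callable[[str], str],
    language_model: Callable[[str], str],
) -> str:
    """Answer a visual question without showing the image to the model."""
    # Step 1: verbalize the image with an off-the-shelf captioner.
    caption = captioner(image_path)
    # Step 2: the language model sees text only (caption + question).
    return language_model(build_text_input(caption, question))


# Hypothetical stand-ins so the sketch runs without any model download.
def toy_captioner(image_path: str) -> str:
    return "a man riding a wave on a surfboard"


def toy_language_model(text: str) -> str:
    return "surfing" if "surfboard" in text else "unknown"


if __name__ == "__main__":
    print(answer_from_caption(
        "beach.jpg", "What sport is this?", toy_captioner, toy_language_model
    ))
```

In practice the two callables would wrap a real captioning system and a pretrained language model. Because the model never sees the image, anything the caption omits is unavailable to it, which matches the limitation noted in the qualitative analysis.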
Pages: 10