BOK-VQA: Bilingual Outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining

Cited by: 0
Authors
Kim, MinJun [1]
Song, SeungWoo [1]
Lee, YouHan [2]
Jang, Haneol [1]
Lim, KyungTae [3]
Affiliations
[1] Hanbat Natl Univ, Daejeon, South Korea
[2] Kakao Brain, Seongnam, South Korea
[3] Seoul Natl Univ Sci & Technol, Seoul, South Korea
Funding
National Research Foundation, Singapore
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Current research on generative models, such as the recently developed GPT4, aims to find relevant knowledge for multimodal and multilingual inputs in order to provide answers. Against this background, demand has grown for multilingual evaluation of visual question answering (VQA), a representative task for multimodal systems. Accordingly, we propose a bilingual outside-knowledge VQA (BOK-VQA) dataset that can be extended to further languages. The proposed data include 17K images, 17K question-answer pairs for both Korean and English, and 280K instances of knowledge information related to the question-answer content. We also present a framework that effectively injects knowledge information into a VQA system by pretraining the knowledge information of the BOK-VQA data in the form of graph embeddings. Finally, through in-depth analysis, we demonstrate the actual effect of the knowledge information contained in the constructed training data on VQA.
Pages: 18381-18389
Page count: 9