Vision-Language-Knowledge Co-Embedding for Visual Commonsense Reasoning

Cited by: 5
Authors
Lee, JaeYun [1]
Kim, Incheol [1]
Affiliations
[1] Kyonggi Univ, Dept Comp Sci, Suwon 16227, South Korea
Keywords
visual commonsense reasoning; multimodal co-embedding; knowledge graph; graph convolutional network; pretrained multi-head self-attention network
DOI
10.3390/s21092911
Chinese Library Classification number
O65 [Analytical Chemistry]
Discipline classification codes
070302; 081704
Abstract
Visual commonsense reasoning is the task of selecting the most appropriate answer to a question, and providing the rationale for that answer, given an image, a natural language question, and candidate responses. Effective visual commonsense reasoning requires solving both the knowledge acquisition problem and the multimodal alignment problem. We therefore propose a novel Vision-Language-Knowledge Co-embedding (ViLaKC) model that extracts knowledge graphs relevant to the question from an external knowledge base, ConceptNet, and uses them together with the input image to answer the question. The proposed model uses a pretrained vision-language-knowledge embedding module, which co-embeds multimodal data, including images, natural language texts, and knowledge graphs, into a single feature vector. To reflect the structural information of the knowledge graph, the model first embeds the knowledge graph with a graph convolutional network layer and then uses multi-head self-attention layers to co-embed it with the image and the natural language question. The effectiveness and performance of the proposed model are experimentally validated on the VCR v1.0 benchmark dataset.
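The two-stage embedding described in the abstract (a graph convolution over the knowledge graph, followed by multi-head self-attention over image, text, and graph features) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`gcn_layer`, `multi_head_self_attention`, `co_embed`), the random weights standing in for pretrained parameters, the toy feature sizes, and the mean pooling used to obtain a single vector are all assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    # Random weights; a stand-in (assumption) for pretrained parameters.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def gcn_layer(adj, feats, weight):
    # One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W).
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(seq, wq, wk, wv, num_heads):
    # Scaled dot-product self-attention, computed separately per head.
    q, k, v = matmul(seq, wq), matmul(seq, wk), matmul(seq, wv)
    d_head = len(q[0]) // num_heads
    heads = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        qh = [row[lo:hi] for row in q]
        kh = [row[lo:hi] for row in k]
        vh = [row[lo:hi] for row in v]
        scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_head) for kj in kh]
                  for qi in qh]
        heads.append(matmul([softmax(r) for r in scores], vh))
    # Concatenate the per-head outputs back to the model width.
    return [[x for head in heads for x in head[i]] for i in range(len(seq))]


def co_embed(image_feats, text_feats, adj, node_feats, dim=4, num_heads=2, seed=0):
    # Hypothetical pipeline: embed the graph first, then attend over all modalities.
    rng = random.Random(seed)
    graph_feats = gcn_layer(adj, node_feats, rand_matrix(len(node_feats[0]), dim, rng))
    seq = image_feats + text_feats + graph_feats  # one joint token sequence
    wq, wk, wv = (rand_matrix(dim, dim, rng) for _ in range(3))
    attended = multi_head_self_attention(seq, wq, wk, wv, num_heads)
    # Mean pooling into a single feature vector (an assumption, not from the paper).
    return [sum(col) / len(attended) for col in zip(*attended)]


# Toy inputs: two image regions, one text token, and a three-node knowledge graph.
image_feats = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2]]
text_feats = [[0.3, 0.3, 0.1, 0.0]]
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
node_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vector = co_embed(image_feats, text_feats, adj, node_feats)
print(len(vector))
```

Embedding the graph before attention lets each node vector carry information from its neighbours, so the attention layers see structure-aware graph tokens alongside the image and text tokens.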
Pages: 19