Vision-Language-Knowledge Co-Embedding for Visual Commonsense Reasoning

Cited by: 5
Authors
Lee, JaeYun [1]
Kim, Incheol [1]
Affiliations
[1] Kyonggi Univ, Dept Comp Sci, Suwon 16227, South Korea
Keywords
visual commonsense reasoning; multimodal co-embedding; knowledge graph; graph convolutional network; pretrained multi-head self-attention network
DOI
10.3390/s21092911
CLC Number
O65 [Analytical Chemistry]
Discipline Codes
070302; 081704
Abstract
Visual commonsense reasoning is the task of selecting the most appropriate answer to a question, together with a rationale for that answer, given an image, a natural language question, and a set of candidate responses. Effective visual commonsense reasoning requires solving both the knowledge acquisition problem and the multimodal alignment problem. We therefore propose a novel Vision-Language-Knowledge Co-embedding (ViLaKC) model that extracts question-relevant knowledge graphs from an external knowledge base, ConceptNet, and uses them together with the input image to answer the question. The proposed model uses a pretrained vision-language-knowledge embedding module that co-embeds multimodal data, including images, natural language texts, and knowledge graphs, into a single feature vector. To reflect the structural information of the knowledge graph, the model first embeds the graph with a graph convolutional network layer and then co-embeds it with the image and the natural language question through multi-head self-attention layers. The effectiveness and performance of the proposed model are validated experimentally on the VCR v1.0 benchmark dataset.
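The pipeline the abstract describes (a graph convolutional layer to embed the ConceptNet subgraph first, then multi-head self-attention to co-embed it with image and text features into one vector) can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' implementation: the module names (GCNLayer, CoEmbedder), the dimensions, the mean-pooling choice, and the random toy inputs are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, a_hat):
        # h: node features (n_nodes, d); a_hat: normalized adjacency
        # with self-loops, shape (n_nodes, n_nodes)
        return torch.relu(a_hat @ self.linear(h))

class CoEmbedder(nn.Module):
    """Embed the knowledge graph with a GCN first, then co-embed graph,
    image-region, and text-token features with multi-head self-attention."""
    def __init__(self, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        self.gcn = GCNLayer(d_model, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)

    def forward(self, img_feats, txt_feats, node_feats, a_hat):
        # 1) Inject the graph's structural information into its node features.
        g = self.gcn(node_feats, a_hat)                       # (n_nodes, d)
        graph_tokens = g.unsqueeze(0).expand(img_feats.size(0), -1, -1)
        # 2) Concatenate all three modalities into one token sequence.
        seq = torch.cat([img_feats, txt_feats, graph_tokens], dim=1)
        # 3) Self-attention lets every token attend across modalities.
        fused = self.encoder(seq)                             # (B, L, d)
        # 4) Pool into the single joint feature vector used for answering.
        return fused.mean(dim=1)                              # (B, d)

# Toy usage with random stand-in features.
B, n_img, n_txt, n_nodes, d = 2, 5, 12, 7, 512
a_hat = torch.eye(n_nodes)  # placeholder for a normalized ConceptNet subgraph
model = CoEmbedder(d_model=d)
vec = model(torch.randn(B, n_img, d), torch.randn(B, n_txt, d),
            torch.randn(n_nodes, d), a_hat)
print(vec.shape)  # torch.Size([2, 512])
```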
Pages: 19
Related Papers
34 records in total
  • [1] Learning to Agree on Vision Attention for Visual Commonsense Reasoning
    Li, Zhenyang
    Guo, Yangyang
    Wang, Kejie
    Liu, Fan
    Nie, Liqiang
    Kankanhalli, Mohan
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1065 - 1075
  • [2] Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning
    Yin, Da
    Li, Liunian Harold
    Hu, Ziniu
    Peng, Nanyun
    Chang, Kai-Wei
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 2115 - 2129
  • [3] Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog
    Zhang, Shunyu
    Jiang, Xiaoze
    Yang, Zequn
    Wan, Tao
    Qin, Zengchang
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4599 - 4608
  • [4] Multi-Level Knowledge Injecting for Visual Commonsense Reasoning
    Wen, Zhang
    Peng, Yuxin
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (03) : 1042 - 1054
  • [5] A Co-Embedding Model with Variational Auto-Encoder for Knowledge Graphs
    Xie, Luodi
    Huang, Huimin
    Du, Qing
APPLIED SCIENCES-BASEL, 2022, 12 (02)
  • [6] How to Use Language Expert to Assist Inference for Visual Commonsense Reasoning
    Song, Zijie
    Hu, Wenbo
    Ye, Hao
    Hong, Richang
    2023 23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS, ICDMW 2023, 2023, : 521 - 527
  • [7] Language Generation with Multi-Hop Reasoning on Commonsense Knowledge Graph
    Ji, Haozhe
    Ke, Pei
    Huang, Shaohan
    Wei, Furu
    Zhu, Xiaoyan
    Huang, Minlie
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 725 - 736
  • [8] Multi-Modal Structure-Embedding Graph Transformer for Visual Commonsense Reasoning
    Zhu, Jian
    Wang, Hanli
    He, Bin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1295 - 1305
  • [9] Efficient and self-adaptive rationale knowledge base for visual commonsense reasoning
    Song, Zijie
    Hu, Zhenzhen
    Hong, Richang
    MULTIMEDIA SYSTEMS, 2023, 29 (05) : 3017 - 3026