Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog

Cited by: 9
Authors
Zhang, Shunyu [1]
Jiang, Xiaoze [1]
Yang, Zequn [1]
Wan, Tao [2]
Qin, Zengchang [1]
Affiliations
[1] Beihang Univ, Sch ASEE, Intelligent Comp & Machine Learning Lab, Beijing, Peoples R China
[2] Beihang Univ, Beijing Adv Innovat Ctr Biomed Engn, Sch BSME, Beijing, Peoples R China
Keywords
LANGUAGE;
DOI
10.1109/CVPRW56347.2022.00506
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Subject classification code
081202
Abstract
Visual Dialog requires an agent to hold a conversation with humans grounded in an image. Many studies on Visual Dialog focus on understanding the dialog history or the image content, while a considerable number of questions that require commonsense are overlooked. Answering such questions depends on logical reasoning with commonsense priors. How to capture relevant commonsense knowledge that complements the history and the image remains a key challenge. In this paper, we propose a novel model, Reasoning with Multi-structure Commonsense Knowledge (RMK). In our model, external knowledge is represented as both sentence-level facts and graph-level facts, to suit the composite setting of dialog history and image. On top of these multi-structure representations, our model captures relevant knowledge and incorporates it into the visual and semantic features via graph-based interaction and transformer-based fusion. Experimental results and analysis on the VisDial v1.0 and VisDialCK datasets show that our proposed model outperforms the compared methods.
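The abstract distinguishes sentence-level from graph-level representations of the same external knowledge. The sketch below illustrates only that distinction, under loudly stated assumptions: knowledge is assumed to arrive as ConceptNet-style (head, relation, tail) triples, the `TEMPLATES` table and the keyword-matching `select_relevant` filter are hypothetical stand-ins, and nothing here reproduces the paper's graph-based interaction or transformer-based fusion.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical input format: commonsense facts as (head, relation, tail) triples,
# e.g. ConceptNet-style. This is an assumption, not the paper's data pipeline.
Triple = Tuple[str, str, str]

# Hypothetical templates used to verbalize a relation as a sentence.
TEMPLATES: Dict[str, str] = {
    "UsedFor": "{h} is used for {t}",
    "AtLocation": "{h} is found in {t}",
    "IsA": "{h} is a {t}",
}


def select_relevant(triples: List[Triple], context_tokens: List[str]) -> List[Triple]:
    """Naive stand-in for knowledge retrieval: keep triples whose head concept
    appears among the dialog-context tokens (case-insensitive)."""
    vocab = {tok.lower() for tok in context_tokens}
    return [tr for tr in triples if tr[0].lower() in vocab]


def to_sentence_facts(triples: List[Triple]) -> List[str]:
    """Sentence-level view: each triple becomes one short natural-language fact."""
    facts = []
    for h, r, t in triples:
        # Fall back to "head relation tail" when no template is defined.
        template = TEMPLATES.get(r, "{h} " + r + " {t}")
        facts.append(template.format(h=h, t=t))
    return facts


def to_graph_facts(triples: List[Triple]) -> Dict[str, Set[Tuple[str, str]]]:
    """Graph-level view: concepts are nodes; each node maps to its outgoing
    (relation, tail) edges."""
    graph: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for h, r, t in triples:
        graph[h].add((r, t))
        graph.setdefault(t, set())  # tail concepts are nodes too
    return dict(graph)


if __name__ == "__main__":
    kb = [
        ("cup", "UsedFor", "drinking"),
        ("cup", "AtLocation", "kitchen"),
        ("dog", "IsA", "pet"),
    ]
    relevant = select_relevant(kb, "what is the cup used for ?".split())
    print(to_sentence_facts(relevant))  # ['cup is used for drinking', 'cup is found in kitchen']
    print(to_graph_facts(relevant))
```

The two views hold the same facts in different shapes: the sentence list suits a text encoder, while the adjacency map suits graph-style message passing. How RMK actually encodes and fuses each view with the dialog and image features is described in the paper, not in this sketch.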
Pages: 4599-4608
Number of pages: 10