DisAVR: Disentangled Adaptive Visual Reasoning Network for Diagram Question Answering

Authors
Wang, Yaxian [1]
Wei, Bifan [1]
Liu, Jun [1]
Zhang, Lingling [2]
Wang, Jiaxin [2]
Wang, Qianying [3]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Natl Engn Lab Big Data Analyt, Xian 710049, Shaanxi, Peoples R China
[2] Xi An Jiao Tong Univ, Key Lab Intelligent Networks & Network Secur, Minist Educ, Xian 710049, Shaanxi, Peoples R China
[3] Lenovo Res, Beijing 100094, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cognition; Visualization; Task analysis; Question answering (information retrieval); Semantics; Geometry; Routing; Diagram understanding; visual reasoning; diagram question answering;
DOI
10.1109/TIP.2023.3306910
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Diagram Question Answering (DQA) aims to correctly answer questions about given diagrams, which demands both good diagram understanding and effective reasoning. However, objects with the same appearance in diagrams can express different semantics. This visual semantic ambiguity makes it challenging to represent diagrams well enough to understand them. Moreover, since questions can address a diagram from different perspectives, it is also crucial to perform flexible and adaptive reasoning on content-rich diagrams. In this paper, we propose a Disentangled Adaptive Visual Reasoning Network for DQA, named DisAVR, to jointly optimize the dual process of representation and reasoning. DisAVR mainly comprises three modules: improved region feature learning, question parsing, and disentangled adaptive reasoning. Specifically, the improved region feature learning module first learns a robust diagram representation by integrating detail-aware patch features and semantically explicit text features with region features. Subsequently, the question parsing module decomposes the question into three types of question guidance (region, spatial relation, and semantic relation guidance) to dynamically guide subsequent reasoning. Next, the disentangled adaptive reasoning module decomposes the whole reasoning process by employing three visual reasoning cells to construct a soft, fully connected, multi-layer stacked routing space. The three cells in each layer reason over object regions, semantic relations, and spatial relations in the diagram under the corresponding question guidance. Moreover, an adaptive routing mechanism is designed to flexibly explore better reasoning paths for specific diagram-question pairs. Extensive experiments on three DQA datasets demonstrate the superiority of our DisAVR.
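The routing idea in the abstract (three guided cells per layer, with every cell softly connected to every cell in the previous layer) can be illustrated with a toy sketch. This is not the DisAVR implementation: the cell function `toy_cell`, the dot-product routing score in `route`, the averaging fusion, and all vectors below are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_cell(features, guidance):
    # Hypothetical stand-in for a reasoning cell: gate each feature by its guidance value.
    return [math.tanh(f * g) for f, g in zip(features, guidance)]


def route(prev_outputs, guidance):
    # Assumed routing rule: weight every previous-layer output by its
    # dot-product agreement with this cell's guidance, then mix softly.
    weights = softmax([dot(out, guidance) for out in prev_outputs])
    dim = len(guidance)
    mixed = [sum(w * out[i] for w, out in zip(weights, prev_outputs)) for i in range(dim)]
    return mixed, weights


def reason(diagram_features, guidances, num_layers=2):
    names = list(guidances)
    # Before the first layer, every cell sees the same diagram features.
    outputs = [diagram_features for _ in names]
    routing_log = []
    for _ in range(num_layers):
        new_outputs, layer_weights = [], {}
        for name in names:
            mixed, weights = route(outputs, guidances[name])
            new_outputs.append(toy_cell(mixed, guidances[name]))
            layer_weights[name] = weights
        outputs = new_outputs
        routing_log.append(layer_weights)
    # Fuse the three cell outputs by simple averaging (an assumption).
    dim = len(diagram_features)
    fused = [sum(out[i] for out in outputs) / len(outputs) for i in range(dim)]
    return fused, routing_log


# Made-up guidance vectors for the three guidance types named in the abstract.
guidances = {
    "region": [1.0, 0.0, 0.5],
    "spatial": [0.0, 1.0, 0.5],
    "semantic": [0.5, 0.5, 1.0],
}
fused, routing_log = reason([0.2, -0.4, 0.9], guidances)
```

In this sketch the routing weights depend on both the features and the guidance, so different diagram-question pairs yield different soft paths through the stacked cells; `routing_log` records the weights each cell used in each layer.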
Pages: 4812-4827
Number of pages: 16