Hierarchical reasoning based on perception action cycle for visual question answering

Cited by: 1
Authors
Mohamud, Safaa Abdullahi Moallim [1]
Jalali, Amin [3]
Lee, Minho [1, 2, 3]
Affiliations
[1] Kyungpook Natl Univ, Grad Sch Artificial Intelligence, Daegu 41566, South Korea
[2] ALI Co Ltd, Daegu 41566, South Korea
[3] Kyungpook Natl Univ, AI Inst Technol, KNU LG Elect Convergence Res Ctr, Daegu 41566, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Visual question answering; Vision language tasks; Multi-modality fusion; Attention; Bilinear fusion; RECOGNITION; ATTENTION; NETWORK;
DOI
10.1016/j.eswa.2023.122698
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Recent visual question answering (VQA) frameworks employ different attention modules to derive a correct answer. The concept of attention is well established in human cognition, which has contributed to its success in deep neural networks. In this study, we consider a VQA framework that draws on human biological and psychological concepts to achieve a good understanding of the vision and language modalities. To this end, we introduce a hierarchical reasoning method based on the perception action cycle (HIPA) framework to tackle VQA tasks. The perception action cycle (PAC) explains how humans learn about and interact with their surrounding world. The proposed framework integrates the multi-modal reasoning process with the concepts introduced in PAC over multiple phases. It comprehends the visual modality through three phases of reasoning: object-level attention, organization, and interpretation. Likewise, it comprehends the language modality through word-level attention, interpretation, and conditioning. Subsequently, the vision and language modalities are interpreted dependently in a cyclic and hierarchical way throughout the entire framework. To further assess the generated visual and language features, we argue that image-question pairs with the same answer ought to eventually have similar visual and language features. Accordingly, we conduct visual and language feature evaluation experiments using metrics such as the standard deviation of cosine similarity and of Manhattan distance. We show that employing PAC in our framework improves the standard deviation compared with other VQA frameworks. As an additional assessment, we also test the proposed HIPA on the visual relationship detection (VRD) task. The proposed method achieves state-of-the-art results on the TDIUC and VRD datasets and obtains competitive results on the VQA 2.0 dataset. The code is available at github.com/Safaa1113/HiPA-Framework.
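The feature evaluation described in the abstract (image-question pairs with the same answer should yield similar features) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the grouping of features by answer, the use of pairwise comparisons within each group, and the choice of population standard deviation are all assumptions made here for illustration.

```python
import math
from itertools import combinations
from statistics import pstdev


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def manhattan_distance(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def spread_per_answer(features_by_answer):
    """Hypothetical helper: for each answer, compare every pair of feature
    vectors that share that answer and return the standard deviation of the
    pairwise cosine similarities and of the pairwise Manhattan distances.

    Answers with fewer than two feature vectors are skipped because they
    have no pairs to compare.
    """
    result = {}
    for answer, feats in features_by_answer.items():
        pairs = list(combinations(feats, 2))
        if not pairs:
            continue
        cos = [cosine_similarity(u, v) for u, v in pairs]
        man = [manhattan_distance(u, v) for u, v in pairs]
        result[answer] = (pstdev(cos), pstdev(man))
    return result


if __name__ == "__main__":
    # Toy features (assumed values), grouped by ground-truth answer.
    toy = {
        "yes": [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
        "red": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    }
    for answer, (cos_std, man_std) in spread_per_answer(toy).items():
        print(answer, round(cos_std, 4), round(man_std, 4))
```

Under these assumptions, a smaller standard deviation for a given answer means that the features of its image-question pairs are more uniformly related to each other, which is one way to read the comparison the abstract reports.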
Pages: 16
Related papers
50 in total
  • [1] Encoder-decoder cycle for visual question answering based on perception-action cycle
    Mohamud, Safaa Abdullahi Moallim
    Jalali, Amin
    Lee, Minho
    [J]. PATTERN RECOGNITION, 2023, 144
  • [2] Learning Hierarchical Reasoning for Text-Based Visual Question Answering
    Li, Caiyuan
    Du, Qinyi
    Wang, Qingqing
    Jin, Yaohui
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2021, PT III, 2021, 12893 : 305 - 316
  • [3] Comprehensive-perception dynamic reasoning for visual question answering
    Shuang, Kai
    Guo, Jinyu
    Wang, Zihan
    [J]. PATTERN RECOGNITION, 2022, 131
  • [4] A Spatial Hierarchical Reasoning Network for Remote Sensing Visual Question Answering
    Zhang, Zixiao
    Jiao, Licheng
    Li, Lingling
    Liu, Xu
    Chen, Puhua
    Liu, Fang
    Li, Yuxuan
    Guo, Zhicheng
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [5] Hierarchical Multimodality Graph Reasoning for Remote Sensing Visual Question Answering
    Zhang, Han
    Wang, Keming
    Zhang, Laixian
    Wang, Bingshu
    Li, Xuelong
    [J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62
  • [6] Sequential Visual Reasoning for Visual Question Answering
    Liu, Jinlai
    Wu, Chenfei
    Wang, Xiaojie
    Dong, Xuan
    [J]. PROCEEDINGS OF 2018 5TH IEEE INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEMS (CCIS), 2018, : 410 - 415
  • [7] Chain of Reasoning for Visual Question Answering
    Wu, Chenfei
    Liu, Jinlai
    Wang, Xiaojie
    Dong, Xuan
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [8] HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering
    Liu, Fei
    Liu, Jing
    Wang, Weining
    Lu, Hanqing
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1678 - 1687
  • [9] Explicit Knowledge-based Reasoning for Visual Question Answering
    Wang, Peng
    Wu, Qi
    Shen, Chunhua
    Dick, Anthony
    van den Hengel, Anton
    [J]. PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 1290 - 1296
  • [10] Research on Visual Question Answering Based on GAT Relational Reasoning
    Miao, Yalin
    Cheng, Wenfang
    He, Shuyun
    Jiang, Hui
    [J]. Neural Processing Letters, 2022, 54 : 1435 - 1448