Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering

Cited by: 3
Authors
Huang, Hantao [1 ]
Han, Tao [1 ]
Han, Wei [1 ]
Yap, Deep [1 ]
Chiang, Cheng-Ming [2 ]
Affiliations
[1] MediaTek, Singapore, Singapore
[2] MediaTek, Hsinchu, Taiwan
Keywords
DOI
10.1109/ICPR48806.2021.9413078
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Visual Question Answering (VQA) is challenging due to complex cross-modal relations, and it has received extensive attention from the research community. From the human perspective, answering a visual question requires reading the question and then referring to the image to generate an answer; this answer is then checked against the question and image again for final confirmation. In this paper, we mimic this process and propose a fully attention-based VQA architecture. Moreover, an answer-checking module is proposed that performs unified attention over the joint answer, question and image representation to update the answer. This mimics the human answer-checking process of considering the answer in context. With answer-checking modules and transferred BERT layers, our model achieves state-of-the-art accuracy of 71.57% on the VQA-v2.0 test-standard split while using fewer parameters.
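The answer-checking step described in the abstract (a unified attention over the joint answer, question and image representation, whose output updates the answer) can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical rendering under assumed feature dimensions and a single attention layer; the class name AnswerCheckingModule, the layer sizes, and the answer-first token layout are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch (an assumption, not the authors' implementation) of an
# answer-checking step: one unified self-attention pass over the
# concatenated answer, question and image representations, after which
# the attended answer token serves as the updated answer.
import torch
import torch.nn as nn


class AnswerCheckingModule(nn.Module):  # hypothetical name
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, answer, question, image):
        # answer: (B, 1, D), question: (B, Lq, D), image: (B, Lv, D)
        joint = torch.cat([answer, question, image], dim=1)  # joint representation
        ctx, _ = self.attn(joint, joint, joint)              # unified attention
        joint = self.norm1(joint + ctx)
        joint = self.norm2(joint + self.ffn(joint))
        return joint[:, :1, :]                               # updated answer token


if __name__ == "__main__":
    B, Lq, Lv, D = 2, 14, 36, 768                            # toy sizes (assumed)
    module = AnswerCheckingModule(D)
    out = module(torch.randn(B, 1, D), torch.randn(B, Lq, D), torch.randn(B, Lv, D))
    print(out.shape)                                         # torch.Size([2, 1, 768])
```

The sketch captures only a single checking step; the full model described in the abstract also stacks attention layers and encodes the question with transferred BERT layers.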
Pages: 1173-1180
Page count: 8
Related papers (50 in total)
  • [41] Co-Attention Network With Question Type for Visual Question Answering
    Yang, Chao
    Jiang, Mengqi
    Jiang, Bin
    Zhou, Weixin
    Li, Keqin
    [J]. IEEE ACCESS, 2019, 7 : 40771 - 40781
  • [42] RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training
    Yuan, Zheng
    Jin, Qiao
    Tan, Chuanqi
    Zhao, Zhengyun
    Yuan, Hongyi
    Huang, Fei
    Huang, Songfang
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 547 - 556
  • [43] Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering
    Salemi, Alireza
    Rafiee, Mahta
    Zamani, Hamed
    [J]. PROCEEDINGS OF THE 2023 ACM SIGIR INTERNATIONAL CONFERENCE ON THE THEORY OF INFORMATION RETRIEVAL, ICTIR 2023, 2023, : 169 - 176
  • [44] Task-Oriented Multi-Modal Question Answering for Collaborative Applications
    Tan, Hui Li
    Leong, Mei Chee
    Xu, Qianli
    Li, Liyuan
    Fang, Fen
    Cheng, Yi
    Gauthier, Nicolas
    Sun, Ying
    Lim, Joo Hwee
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 1426 - 1430
  • [45] MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering
    Ahmad, Mobeen
    Park, Geonwoo
    Park, Dongchan
    Park, Sanguk
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 4659 - 4664
  • [46] Multi-modal Question Answering System Driven by Domain Knowledge Graph
    Zhao, Zhengwei
    Wang, Xiaodong
    Xu, Xiaowei
    Wang, Qing
    [J]. 5TH INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING AND COMMUNICATIONS (BIGCOM 2019), 2019, : 43 - 47
  • [47] A multi-scale contextual attention network for remote sensing visual question answering
    Feng, Jiangfan
    Wang, Hui
    [J]. INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2024, 126
  • [48] Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering
    Li, Yong
    Yang, Qihao
    Wang, Fu Lee
    Lee, Lap-Kei
    Qu, Yingying
    Hao, Tianyong
    [J]. ARTIFICIAL INTELLIGENCE IN MEDICINE, 2023, 144
  • [49] Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
    Xu, Huijuan
    Saenko, Kate
    [J]. COMPUTER VISION - ECCV 2016, PT VII, 2016, 9911 : 451 - 466
  • [50] A Context-aware Attention Network for Interactive Question Answering
    Li, Huayu
    Min, Martin Renqiang
    Ge, Yong
    Kadav, Asim
    [J]. KDD'17: PROCEEDINGS OF THE 23RD ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2017, : 927 - 935