Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining

Cited by: 29
Authors
Zhang, Yundong [1 ]
Niebles, Juan Carlos [1 ]
Soto, Alvaro [2 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Univ Catolica Chile, Santiago, Chile
Keywords
DOI
10.1109/WACV.2019.00043
Chinese Library Classification (CLC)
TM (Electrical Engineering); TN (Electronic and Communication Technology)
Subject Classification Codes
0808; 0809
Abstract
A key aspect of interpretable visual question answering (VQA) models is their ability to ground their answers in relevant regions of the image. Current approaches with this capability rely on supervised learning and human-annotated groundings to train the attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific to visual grounding is difficult and expensive. In this work, we demonstrate that a VQA architecture can be trained effectively with grounding supervision that is mined automatically from available region descriptions and object annotations. We also show that our model, trained with this mined supervision, generates visual groundings that correlate more strongly with manually annotated groundings, while achieving state-of-the-art VQA accuracy.
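As a rough illustration of how mined grounding supervision can be attached to a soft-attention VQA model, the sketch below adds a divergence term between the model's region-attention distribution and an attention target mined from region descriptions and object annotations, on top of the usual answer-classification loss. This is a minimal sketch only: the function names, the KL formulation, and the weighting factor `lam` are illustrative assumptions, not the exact objective or hyperparameters reported in the paper.

```python
import torch
import torch.nn.functional as F


def attention_supervision_loss(pred_attention, mined_attention, eps=1e-8):
    """KL divergence between the model's attention over image regions and a
    mined attention target, both shaped [batch, num_regions].

    pred_attention is assumed to already be a probability distribution
    (e.g. the softmax output of the attention module); mined_attention is
    an unnormalized relevance map derived from region/object annotations.
    """
    log_pred = pred_attention.clamp_min(eps).log()
    target = mined_attention.clamp_min(eps)
    target = target / target.sum(dim=1, keepdim=True)  # normalize to a distribution
    return F.kl_div(log_pred, target, reduction="batchmean")


def total_loss(answer_logits, answer_labels, pred_attention, mined_attention, lam=0.5):
    """Joint objective: standard VQA answer loss plus the mined
    attention-supervision term, weighted by a hypothetical factor lam."""
    vqa_loss = F.cross_entropy(answer_logits, answer_labels)
    att_loss = attention_supervision_loss(pred_attention, mined_attention)
    return vqa_loss + lam * att_loss
```

In this sketch the grounding signal only shapes the attention distribution; the answer head is still trained purely from question-answer pairs, so no extra human grounding annotations are required at training time.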
Pages: 349-357
Number of pages: 9
Related papers
50 records in total
  • [31] Learning Visual Question Answering by Bootstrapping Hard Attention
    Malinowski, Mateusz
    Doersch, Carl
    Santoro, Adam
    Battaglia, Peter
    [J]. COMPUTER VISION - ECCV 2018, PT VI, 2018, 11210 : 3 - 20
  • [32] Vision-Language Transformer for Interpretable Pathology Visual Question Answering
    Naseem, Usman
    Khushi, Matloob
    Kim, Jinman
    [J]. IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2023, 27 (04) : 1681 - 1690
  • [33] Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering
    Vedantam, Ramakrishna
    Desai, Karan
    Lee, Stefan
    Rohrbach, Marcus
    Batra, Dhruv
    Parikh, Devi
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97
  • [34] Co-Attention Network With Question Type for Visual Question Answering
    Yang, Chao
    Jiang, Mengqi
    Jiang, Bin
    Zhou, Weixin
    Li, Keqin
    [J]. IEEE ACCESS, 2019, 7 : 40771 - 40781
  • [35] QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document
    Mahamoud, Ibrahim Souleiman
    Coustaty, Mickael
    Joseph, Aurelie
    d'Andecy, Vincent Poulain
    Ogier, Jean-Marc
    [J]. DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 659 - 673
  • [36] GViG: Generative Visual Grounding Using Prompt-Based Language Modeling for Visual Question Answering
    Li, Yi-Ting
    Lin, Ying-Jia
    Yeh, Chia-Jen
    Lin, Chun-Yi
    Kao, Hung-Yu
    [J]. ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT VI, PAKDD 2024, 2024, 14650 : 83 - 94
  • [37] Multimodal Encoders and Decoders with Gate Attention for Visual Question Answering
    Li, Haiyan
    Han, Dezhi
    [J]. COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2021, 18 (03) : 1023 - 1040
  • [38] Local relation network with multilevel attention for visual question answering
    Sun, Bo
    Yao, Zeng
    Zhang, Yinghui
    Yu, Lejun
    [J]. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2020, 73
  • [39] Focal Visual-Text Attention for Memex Question Answering
    Liang, Junwei
    Jiang, Lu
    Cao, Liangliang
    Kalantidis, Yannis
    Li, Li-Jia
    Hauptmann, Alexander G.
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2019, 41 (08) : 1893 - 1908
  • [40] Word-to-region attention network for visual question answering
    Liang Peng
    Yang Yang
    Yi Bin
    Ning Xie
    Fumin Shen
    Yanli Ji
    Xing Xu
    [J]. Multimedia Tools and Applications, 2019, 78 : 3843 - 3858