Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining

Cited by: 29
Authors
Zhang, Yundong [1 ]
Niebles, Juan Carlos [1 ]
Soto, Alvaro [2 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Univ Catolica Chile, Santiago, Chile
DOI: 10.1109/WACV.2019.00043
Chinese Library Classification (CLC): TM (Electrical engineering); TN (Electronics and communication technology)
Discipline codes: 0808; 0809
Abstract
A key aspect of interpretable visual question answering (VQA) models is their ability to ground their answers in relevant regions of the image. Current approaches with this capability rely on supervised learning and human-annotated groundings to train the attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specifically for visual grounding is difficult and expensive. In this work, we demonstrate that a VQA architecture can be effectively trained with grounding supervision mined automatically from available region descriptions and object annotations. We also show that our model, trained with this mined supervision, generates visual groundings that correlate more strongly with manually annotated groundings while achieving state-of-the-art VQA accuracy.
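The abstract describes supervising a VQA model's internal attention with automatically mined grounding targets. A common way to realize such supervision is an auxiliary loss that pulls the model's attention distribution over image regions toward the mined target distribution. The sketch below illustrates this with a KL-divergence loss in NumPy; the function names, the specific loss form, and the smoothing constant are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_supervision_loss(pred_logits, mined_scores, eps=1e-8):
    """KL(q || p) between a mined attention target q and the model's
    attention p over image regions (illustrative sketch, not the
    paper's formulation).

    pred_logits:  (batch, num_regions) unnormalized attention logits
    mined_scores: (batch, num_regions) non-negative relevance scores
                  mined from region descriptions / object annotations
    """
    p = softmax(pred_logits)
    # Smooth the mined scores slightly so q is strictly positive,
    # then normalize to a probability distribution over regions.
    q = (mined_scores + eps) / (mined_scores + eps).sum(axis=-1, keepdims=True)
    kl = (q * (np.log(q) - np.log(p))).sum(axis=-1)
    return kl.mean()
```

In training, a loss like this would be added to the usual answer-classification loss, so the model is rewarded both for the correct answer and for attending to the mined regions.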
Pages: 349-357 (9 pages)