Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining

Cited by: 29

Authors:

Zhang, Yundong [1]

Niebles, Juan Carlos [1]

Soto, Alvaro [2]

Affiliations:

[1] Stanford Univ, Stanford, CA 94305 USA

[2] Univ Catolica Chile, Santiago, Chile

Source:

2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV) | 2019

DOI:

10.1109/WACV.2019.00043

Chinese Library Classification:

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

A key aspect of visual question answering (VQA) models that are interpretable is their ability to ground their answers to relevant regions in the image. Current approaches with this capability rely on supervised learning and human annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific for visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically obtained from available region descriptions and object annotations. We also show that our model trained with this mined supervision generates visual groundings that achieve a higher correlation with respect to manually-annotated groundings, meanwhile achieving state-of-the-art VQA accuracy.

Pages: 349-357

Page count: 9

Related literature (50 items in total):

[31] Learning Visual Question Answering by Bootstrapping Hard Attention
Malinowski, Mateusz
Doersch, Carl
Santoro, Adam
Battaglia, Peter
[J]. COMPUTER VISION - ECCV 2018, PT VI, 2018, 11210 : 3 - 20
[32] Vision-Language Transformer for Interpretable Pathology Visual Question Answering
Naseem, Usman
Khushi, Matloob
Kim, Jinman
[J]. IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2023, 27 (04) : 1681 - 1690
[33] Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering
Vedantam, Ramakrishna
Desai, Karan
Lee, Stefan
Rohrbach, Marcus
Batra, Dhruv
Parikh, Devi
[J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97
[34] Co-Attention Network With Question Type for Visual Question Answering
Yang, Chao
Jiang, Mengqi
Jiang, Bin
Zhou, Weixin
Li, Keqin
[J]. IEEE ACCESS, 2019, 7 : 40771 - 40781
[35] QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document
Mahamoud, Ibrahim Souleiman
Coustaty, Mickael
Joseph, Aurelie
d'Andecy, Vincent Poulain
Ogier, Jean-Marc
[J]. DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 659 - 673
[36] GViG: Generative Visual Grounding Using Prompt-Based Language Modeling for Visual Question Answering
Li, Yi-Ting
Lin, Ying-Jia
Yeh, Chia-Jen
Lin, Chun-Yi
Kao, Hung-Yu
[J]. ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT VI, PAKDD 2024, 2024, 14650 : 83 - 94
[37] Multimodal Encoders and Decoders with Gate Attention for Visual Question Answering
Li, Haiyan
Han, Dezhi
[J]. COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2021, 18 (03) : 1023 - 1040
[38] Local relation network with multilevel attention for visual question answering
Sun, Bo
Yao, Zeng
Zhang, Yinghui
Yu, Lejun
[J]. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2020, 73
[39] Focal Visual-Text Attention for Memex Question Answering
Liang, Junwei
Jiang, Lu
Cao, Liangliang
Kalantidis, Yannis
Li, Li-Jia
Hauptmann, Alexander G.
[J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2019, 41 (08) : 1893 - 1908
[40] Word-to-region attention network for visual question answering
Liang Peng
Yang Yang
Yi Bin
Ning Xie
Fumin Shen
Yanli Ji
Xing Xu
[J]. Multimedia Tools and Applications, 2019, 78 : 3843 - 3858