Lightweight Visual Question Answering using Scene Graphs

Cited by: 12

Authors:

Nuthalapati, Sai Vidyaranya [1]

Chandradevan, Ramraj [2]

Giunchiglia, Eleonora [1]

Li, Bowen [1]

Kayser, Maxime [1]

Lukasiewicz, Thomas [1]

Yang, Carl [2]

Affiliations:

[1] Univ Oxford, Dept Comp Sci, Oxford, England

[2] Emory Univ, Dept Comp Sci, Atlanta, GA USA

Source:

PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021 | 2021

Funding:

Engineering and Physical Sciences Research Council (EPSRC), UK

Keywords:

visual question answering; scene graphs; graph neural networks;

DOI:

10.1145/3459637.3482218

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Visual question answering (VQA) is a challenging problem in machine perception, which requires a deep joint understanding of both visual and textual data. Recent research has advanced the automatic generation of high-quality scene graphs from images, while powerful yet elegant models like graph neural networks (GNNs) have shown great power in reasoning over graph-structured data. In this work, we propose to bridge the gap between scene graph generation and VQA by leveraging GNNs. In particular, we design a new model called Conditional Enhanced Graph ATtention network (CE-GAT) to encode pairs of visual and semantic scene graphs with both node and edge features, which is seamlessly integrated with a textual question encoder to generate answers through question-graph conditioning. Moreover, to alleviate the training difficulties of CE-GAT towards VQA, we enforce more useful inductive biases in the scene graphs through novel question-guided graph enriching and pruning. Finally, we evaluate the framework on one of the largest available VQA datasets (namely, GQA) with ground-truth scene graphs, achieving an accuracy of 77.87%, compared with the state of the art (namely, the neural state machine (NSM)), which gives 63.17%. Notably, by leveraging existing scene graphs, our framework is much lighter than end-to-end VQA methods (e.g., about 95.3% fewer parameters than a typical NSM).
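As a rough illustration of the question-guided pruning idea mentioned in the abstract, the sketch below keeps only the scene-graph nodes whose labels appear in the question, plus their direct neighbours. The abstract does not specify the pruning rule, so the keyword matching, the `hops` parameter, the function name `prune_scene_graph`, and the toy graph are all assumptions made for illustration, not the authors' method.

```python
def prune_scene_graph(nodes, edges, question, hops=1):
    """Keep nodes named in the question plus neighbours within `hops` steps.

    nodes: dict mapping node id -> object label, e.g. {0: "man"}
    edges: list of (src_id, relation, dst_id) triples
    NOTE: hypothetical helper; the real CE-GAT pruning rule may differ.
    """
    # Tokenise the question crudely: lowercase words without punctuation.
    words = {w.strip("?.,!").lower() for w in question.split()}

    # Seed set: nodes whose label is literally mentioned in the question.
    keep = {n for n, label in nodes.items() if label.lower() in words}

    # Expand the seed set along edges (in either direction) `hops` times.
    for _ in range(hops):
        frontier = set()
        for src, _rel, dst in edges:
            if src in keep:
                frontier.add(dst)
            if dst in keep:
                frontier.add(src)
        keep |= frontier

    pruned_nodes = {n: label for n, label in nodes.items() if n in keep}
    pruned_edges = [(s, r, d) for s, r, d in edges if s in keep and d in keep]
    return pruned_nodes, pruned_edges


# Toy scene graph (made up for this example).
nodes = {0: "man", 1: "horse", 2: "hat", 3: "tree", 4: "bird"}
edges = [(0, "riding", 1), (0, "wearing", 2), (4, "on", 3)]

pruned_nodes, pruned_edges = prune_scene_graph(
    nodes, edges, "What is the man riding?"
)
print(pruned_nodes)
print(pruned_edges)
```

In this toy case the unrelated "bird on tree" subgraph is dropped, which is the kind of inductive bias the abstract attributes to pruning: the downstream graph encoder sees a smaller graph focused on the entities the question refers to.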


Pages: 3353-3357

Page count: 5
