Multi-scale relation reasoning for multi-modal Visual Question Answering

Cited by: 35

Authors:

Wu, Yirui [1]

Ma, Yuntao [2]

Wan, Shaohua [3]

Affiliations:

[1] Hohai Univ, Coll Comp & Informat, Fochengxi Rd, Nanjing 210093, Peoples R China

[2] Nanjing Univ, Natl Key Lab Novel Software Technol, Xianling Rd, Nanjing 210093, Peoples R China

[3] Zhongnan Univ Econ & Law, Sch Informat & Safety Engn, Wuhan, Peoples R China

Source:

SIGNAL PROCESSING-IMAGE COMMUNICATION | 2021 / Vol. 96

Funding:

National Key Research and Development Program of China;

Keywords:

Multi-modal data; Visual Question Answering; Multi-scale relation reasoning; Attention model;

DOI:

10.1016/j.image.2021.116319

Chinese Library Classification (CLC) number:

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];

Discipline classification code:

0808 ; 0809 ;

Abstract:

The goal of Visual Question Answering (VQA) is to answer questions about images. The same image can prompt completely different types of questions, so the main difficulty of the VQA task lies in reasoning properly about the relationships among multiple visual objects according to the type of the input question. To address this difficulty, this paper proposes a deep neural network that performs multi-modal relation reasoning at multiple scales, building a regional attention scheme that focuses on informative, question-related regions for better answering. Specifically, we first design a regional attention scheme that selects regions of interest based on an informativeness evaluation computed by a question-guided soft attention module. The features computed by the regional attention scheme are then fused in scaled combinations, generating more distinctive features with scalable information. Owing to the regional attention design and the multi-scale property, the proposed method is able to describe scaled relationships from multi-modal inputs and offer accurate question-guided answers. Experiments on the VQA v1 and VQA v2 datasets show that the proposed method performs better than most existing methods.
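
The pipeline outlined in the abstract (question-guided soft attention, region selection, multi-scale fusion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no equations, so the additive attention form, the top-k selection rule, the combination-based fusion, and every name and dimension below (soft_attention, select_regions, multi_scale_fuse, W_v, W_q, w, k, scales) are assumptions chosen only for illustration, with random values standing in for learned weights and real features.

```python
import itertools
import numpy as np

def soft_attention(regions, question, W_v, W_q, w):
    """Question-guided soft attention: one weight per region, summing to 1.
    The additive tanh form is an assumption, not taken from the paper."""
    hidden = np.tanh(regions @ W_v + question @ W_q)  # (n_regions, d_hidden)
    scores = hidden @ w                               # (n_regions,)
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def select_regions(regions, weights, k):
    """Keep the k regions with the highest attention weight (assumed rule)."""
    top = np.argsort(weights)[::-1][:k]
    return regions[top], weights[top]

def multi_scale_fuse(selected, scales=(1, 2, 3)):
    """For each scale s, sum the features of every s-sized group of regions,
    average over the groups, then concatenate the per-scale vectors."""
    fused = []
    for s in scales:
        groups = [selected[list(idx)].sum(axis=0)
                  for idx in itertools.combinations(range(len(selected)), s)]
        fused.append(np.mean(groups, axis=0))
    return np.concatenate(fused)

# Hypothetical sizes; random values stand in for learned weights and features.
rng = np.random.default_rng(0)
n_regions, d_v, d_q, d_h, k = 36, 16, 8, 12, 4
regions = rng.normal(size=(n_regions, d_v))
question = rng.normal(size=d_q)
W_v = rng.normal(size=(d_v, d_h))
W_q = rng.normal(size=(d_q, d_h))
w = rng.normal(size=d_h)

weights = soft_attention(regions, question, W_v, W_q, w)
selected, top_weights = select_regions(regions, weights, k)
fused = multi_scale_fuse(selected)  # one d_v-sized vector per scale, concatenated
```

In this sketch the fused vector has length len(scales) * d_v; in a full model it would feed an answer classifier, which is omitted here.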


Pages: 9
