Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering

Cited by: 3

Authors:

Huang, Hantao [1]

Han, Tao [1]

Han, Wei [1]

Yap, Deep [1]

Chiang, Cheng-Ming [2]

Affiliations:

[1] MediaTek, Singapore, Singapore

[2] MediaTek, Hsinchu, Taiwan

Source:

2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR) | 2021

Keywords:

DOI:

10.1109/ICPR48806.2021.9413078

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Visual Question Answering (VQA) is challenging because of its complex cross-modal relations, and it has received extensive attention from the research community. From a human perspective, answering a visual question means reading the question and then referring to the image to generate an answer. This answer is then checked against the question and the image again for final confirmation. In this paper, we mimic this process and propose a fully attention-based VQA architecture. Moreover, an answer-checking module is proposed that performs unified attention over the joint answer, question, and image representation to update the answer. This mimics the human answer-checking process by considering the answer in context. With answer-checking modules and transferred BERT layers, our model achieves state-of-the-art accuracy of 71.57% on the VQA-v2.0 test-standard split while using fewer parameters.
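The abstract describes the answer-checking step only at a high level: the current answer is attended jointly with the question and image representations and then updated. A minimal sketch of that idea follows. It is not the authors' implementation; the single attention head, the residual update, the function name `answer_check`, the feature dimension, and the token counts are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def answer_check(answer, question, image, w_q, w_k, w_v):
    """Hypothetical answer-checking step (illustrative only).

    answer:   (d,)     current answer representation
    question: (n_q, d) question token features
    image:    (n_v, d) image region features
    w_q, w_k, w_v: (d, d) projection matrices (assumed single head)
    """
    # Stack answer, question, and image into one joint token sequence.
    tokens = np.concatenate([answer[None, :], question, image], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Scaled dot-product self-attention over the joint sequence.
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    attended = weights @ v
    # Assumed residual update: only the answer slot (row 0) is refreshed.
    return answer + attended[0], weights


rng = np.random.default_rng(0)
d = 8  # assumed feature size
answer = rng.normal(size=d)
question = rng.normal(size=(5, d))   # assumed 5 question tokens
image = rng.normal(size=(10, d))     # assumed 10 image regions
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

updated, weights = answer_check(answer, question, image, w_q, w_k, w_v)
print(updated.shape, weights.shape)
```

Row 0 of `weights` shows how strongly the answer slot attends to itself, each question token, and each image region, which is one way to read "considering the answer in context".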


Pages: 1173-1180

Page count: 8
