Learning a Mixture of Conditional Gating Blocks for Visual Question Answering

Cited by: 0
Authors
Sun, Qiang [1, 2]
Fu, Yan-Wei [3 ]
Xue, Xiang-Yang [4 ]
Affiliations
[1] Shanghai Univ Int Business & Econ, Sch Stat & Informat, Shanghai 201620, Peoples R China
[2] Fudan Univ, Acad Engn & Technol, Shanghai 200433, Peoples R China
[3] Fudan Univ, Sch Data Sci, Shanghai 200433, Peoples R China
[4] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
visual question answering; Transformer; dynamic network
DOI
10.1007/s11390-024-2113-0
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
As a Turing test in multimedia, visual question answering (VQA) aims to answer a textual question about a given image. Recently, the "dynamic" property of neural networks has been explored as one of the most promising ways of improving the adaptability, interpretability, and capacity of neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, exploiting dynamics in the Transformers of VQA tasks through all stages in an end-to-end manner remains relatively unexplored and highly nontrivial. Typically, due to the large computation cost of Transformers, researchers tend to apply them only to extracted high-level visual features for downstream vision and language tasks. To this end, we introduce a question-guided dynamic layer into the Transformer, as it can effectively increase model capacity and requires fewer Transformer layers for the VQA task. In particular, we name the dynamics in the Transformer the Conditional Multi-Head Self-Attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with the conditional ResNeXt block (cResNeXt). Thus, a novel model, the mixture of conditional gating blocks (McG), is proposed for VQA, which keeps the best of the Transformer, the convolutional neural network (CNN), and dynamic networks. The pure conditional gating CNN model and the conditional gating Transformer model can be viewed as special cases of McG. We quantitatively and qualitatively evaluate McG on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG achieves state-of-the-art performance on these benchmark datasets.
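The abstract does not spell out how the question conditions the attention block, so the following is only a minimal sketch of one plausible reading: a question embedding produces one sigmoid gate per attention head, and each head's output is scaled by its gate. The function names (`gated_self_attention`, `attention_head`), the gate projection `W_gate`, and all shapes are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(tokens, Wq, Wk, Wv):
    # Standard scaled dot-product self-attention for a single head.
    Q = [matvec(Wq, t) for t in tokens]
    K = [matvec(Wk, t) for t in tokens]
    V = [matvec(Wv, t) for t in tokens]
    scale = math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


def gated_self_attention(tokens, question, heads, W_gate):
    # Assumption: the question embedding yields one gate in (0, 1) per head.
    gates = [sigmoid(g) for g in matvec(W_gate, question)]
    head_outputs = [attention_head(tokens, *h) for h in heads]
    out = []
    for i in range(len(tokens)):
        row = []
        for gate, head_out in zip(gates, head_outputs):
            # Scale this head's output by its question-dependent gate,
            # then concatenate the heads as in ordinary multi-head attention.
            row.extend(gate * x for x in head_out[i])
        out.append(row)
    return out, gates


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy setup: 3 visual tokens of width 4, 2 heads of width 2, question of width 3.
rng = random.Random(0)
tokens = random_matrix(3, 4, rng)
question = [0.5, -1.0, 2.0]
heads = [tuple(random_matrix(2, 4, rng) for _ in range(3)) for _ in range(2)]
W_gate = random_matrix(2, 3, rng)

output, gates = gated_self_attention(tokens, question, heads, W_gate)
```

In this sketch a different question changes `gates` and therefore reweights the heads for the same image tokens, which is the sense of "dynamic" illustrated here; the paper's actual cMHSA design may differ.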
Pages: 912-928
Number of pages: 17
Related papers
50 in total
  • [1] Medical Visual Question Answering via Conditional Reasoning and Contrastive Learning
    Liu, Bo
    Zhan, Li-Ming
    Xu, Li
    Wu, Xiao-Ming
    [J]. IEEE TRANSACTIONS ON MEDICAL IMAGING, 2023, 42 (05) : 1532 - 1545
  • [2] Learning on Structured Documents for Conditional Question Answering
    Wang, Zihan
    Qian, Hongjin
    Dou, Zhicheng
    [J]. CHINESE COMPUTATIONAL LINGUISTICS, CCL 2023, 2023, 14232 : 37 - 57
  • [3] Medical Visual Question Answering via Conditional Reasoning
    Zhan, Li-Ming
    Liu, Bo
    Fan, Lu
    Chen, Jiaxin
    Wu, Xiao-Ming
    [J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 2345 - 2354
  • [4] Multitask Learning for Visual Question Answering
    Ma, Jie
    Liu, Jun
    Lin, Qika
    Wu, Bei
    Wang, Yaxian
    You, Yang
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (03) : 1380 - 1394
  • [5] Multi-Question Learning for Visual Question Answering
    Lei, Chenyi
    Wu, Lei
    Liu, Dong
    Li, Zhao
    Wang, Guoxin
    Tang, Haihong
    Li, Houqiang
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11328 - 11335
  • [6] VQAMix: Conditional Triplet Mixup for Medical Visual Question Answering
    Gong, Haifan
    Chen, Guanqi
    Mao, Mingzhi
    Li, Zhen
    Li, Guanbin
    [J]. IEEE TRANSACTIONS ON MEDICAL IMAGING, 2022, 41 (11) : 3332 - 3343
  • [7] Learning Answer Embeddings for Visual Question Answering
    Hu, Hexiang
    Chao, Wei-Lun
    Sha, Fei
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 5428 - 5436
  • [8] A Survey on Representation Learning in Visual Question Answering
    Sahani, Manish
    Singh, Priyadarshan
    Jangpangi, Sachin
    Kumar, Shailender
    [J]. MACHINE LEARNING AND BIG DATA ANALYTICS (PROCEEDINGS OF INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND BIG DATA ANALYTICS (ICMLBDA) 2021), 2022, 256 : 326 - 336
  • [9] Multimodal Learning and Reasoning for Visual Question Answering
    Ilievski, Ilija
    Feng, Jiashi
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [10] Visual Question Answering as a Meta Learning Task
    Teney, Damien
    van den Hengel, Anton
    [J]. COMPUTER VISION - ECCV 2018, PT 15, 2018, 11219 : 229 - 245