Improving visual question answering using dropout and enhanced question encoder

Cited by: 28
Authors
Fang, Zhiwei [1, 2]
Liu, Jing [1]
Li, Yong [3]
Qiao, Yanyuan [2]
Lu, Hanqing [1]
Affiliations
[1] Chinese Academy of Sciences, Institute of Automation, National Laboratory of Pattern Recognition, 95 Zhongguancun East Rd, Beijing 100190, People's Republic of China
[2] University of Chinese Academy of Sciences, Beijing, People's Republic of China
[3] JD.com, Business Growth BU, Intelligent Advertising Lab, Beijing, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Visual question answering; Coherent dropout; Siamese dropout; Enhanced question encoder; NETWORKS;
DOI
10.1016/j.patcog.2019.01.038
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Using dropout in Visual Question Answering (VQA) is a common practice to prevent overfitting. However, the current way of using dropout in multi-path networks may cause two problems: co-adaptation of neurons and explosion of the output variance. In this paper, we propose a coherent dropout mechanism and a siamese dropout mechanism to solve the two problems, respectively. Specifically, in coherent dropout, the relevant dropout layers in multiple paths are forced to work coherently to maximize their ability to prevent neuron co-adaptation. We show that coherent dropout is simple to implement but very effective at overcoming overfitting. As for the explosion of the output variance, we develop a siamese dropout mechanism that explicitly minimizes the difference between the two output vectors produced from the same input data during the training phase. Such a mechanism can reduce the gap between the training and inference phases and make the VQA model more robust. With the help of the two techniques, we further design an enhanced question encoder called Multi-path Stacked Residual RNNs, which is deeper, wider, and more powerful than current shallow question encoders. Extensive experiments are conducted to verify the effectiveness of coherent dropout, siamese dropout, and the enhanced question encoder. The results show that our methods bring clear improvements to state-of-the-art VQA models on the VQA-v1 and VQA-v2 datasets. (C) 2019 Elsevier Ltd. All rights reserved.
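The two mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: "working coherently" is read as sharing one dropout mask across the corresponding features of every path, and the siamese term is read as the mean squared difference between two dropout-perturbed forward passes of the same input. The function names (`dropout_mask`, `coherent_dropout`, `siamese_penalty`, `toy_forward`) are hypothetical and do not come from the paper.

```python
import random


def dropout_mask(size, p, rng):
    """Inverted-dropout mask: keep each unit with probability 1 - p, scale kept units by 1 / (1 - p)."""
    keep = 1.0 - p
    return [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(size)]


def apply_mask(x, mask):
    """Element-wise product of a feature vector and a mask."""
    return [xi * mi for xi, mi in zip(x, mask)]


def coherent_dropout(paths, p, rng):
    """Assumed reading of coherent dropout: sample ONE mask and reuse it on every path."""
    mask = dropout_mask(len(paths[0]), p, rng)
    return [apply_mask(x, mask) for x in paths]


def toy_forward(x, mask):
    """Hypothetical stand-in for a network: it only applies dropout to its input."""
    return apply_mask(x, mask)


def siamese_penalty(forward, x, p, rng):
    """Assumed reading of siamese dropout: mean squared difference between two
    forward passes of the same input under independently sampled masks."""
    out_a = forward(x, dropout_mask(len(x), p, rng))
    out_b = forward(x, dropout_mask(len(x), p, rng))
    return sum((a - b) ** 2 for a, b in zip(out_a, out_b)) / len(out_a)


if __name__ == "__main__":
    rng = random.Random(0)
    paths = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
    print(coherent_dropout(paths, 0.5, rng))
    print(siamese_penalty(toy_forward, paths[0], 0.5, rng))
```

In a real model the penalty would be added to the task loss with a weighting coefficient during training only; that weighting is likewise not specified in the abstract.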
Pages: 404-414
Number of pages: 11
Related papers
50 in total
  • [1] Enhancing Visual Question Answering Using Dropout
    Fang, Zhiwei
    Liu, Jing
    Qiao, Yanyuan
    Tang, Qu
    Li, Yong
    Lu, Hanqing
    [J]. PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018, : 1002 - 1010
  • [2] On the role of question encoder sequence model in robust visual question answering
    Kv, Gouthaman
    Mittal, Anurag
    [J]. PATTERN RECOGNITION, 2022, 131
  • [3] An Enhanced Term Weighted Question Embedding for Visual Question Answering
    Manmadhan, Sruthy
    Kovoor, Binsu C.
    [J]. JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT, 2022, 21 (02)
  • [4] Improving Visual Question Answering by Semantic Segmentation
    Pham, Viet-Quoc
    Mishima, Nao
    Nakasu, Toshiaki
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2021, PT III, 2021, 12893 : 459 - 470
  • [5] Contrastive training of a multimodal encoder for medical visual question answering
    Silva, Joao Daniel
    Martins, Bruno
    Magalhaes, Joao
    [J]. INTELLIGENT SYSTEMS WITH APPLICATIONS, 2023, 18
  • [6] Question Modifiers in Visual Question Answering
    Britton, William
    Sarkhel, Somdeb
    Venugopal, Deepak
    [J]. LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1472 - 1479
  • [7] Improving Visual Question Answering using Active Perception on Static Images
    Bozinis, Theodoros
    Passalis, Nikolaos
    Tefas, Anastasios
    [J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 879 - 884
  • [8] Improving reasoning with contrastive visual information for visual question answering
    Long, Yu
    Tang, Pengjie
    Wang, Hanli
    Yu, Jian
    [J]. ELECTRONICS LETTERS, 2021, 57 (20) : 758 - 760
  • [9] Multimodal Encoder-Decoder Attention Networks for Visual Question Answering
    Chen, Chongqing
    Han, Dezhi
    Wang, Jun
    [J]. IEEE ACCESS, 2020, 8 : 35662 - 35671
  • [10] Multimodal Knowledge Reasoning for Enhanced Visual Question Answering
    Hussain, Afzaal
    Maqsood, Ifrah
    Shahzad, Muhammad
    Fraz, Muhammad Moazam
    [J]. 2022 16TH INTERNATIONAL CONFERENCE ON SIGNAL-IMAGE TECHNOLOGY & INTERNET-BASED SYSTEMS, SITIS, 2022, : 224 - 230