Deep Attention Neural Tensor Network for Visual Question Answering

Cited by: 50
Authors
Bai, Yalong [1 ,2 ]
Fu, Jianlong [3 ]
Zhao, Tiejun [1 ]
Mei, Tao [2 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] JD AI Res, Beijing, Peoples R China
[3] Microsoft Res Asia, Beijing, Peoples R China
Keywords
Visual question answering; Neural tensor network; Open-ended VQA;
DOI
10.1007/978-3-030-01258-8_2
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual question answering (VQA) has drawn great attention in cross-modal learning, as it requires a machine to answer a natural language question about a reference image. Significant progress has been made by learning rich embedding features from images and questions with bilinear models, while the key role of answers has been neglected. In this paper, we propose a novel deep attention neural tensor network (DA-NTN) for visual question answering, which can discover the joint correlations over images, questions, and answers with tensor-based representations. First, we model one pairwise interaction (e.g., image and question) by bilinear features, which are further encoded with the third dimension (e.g., answer) into a triplet by a bilinear tensor product. Second, we decompose the correlations of different triplets by answer and question type, and further propose a slice-wise attention module on the tensor to select the most discriminative reasoning process for inference. Third, we optimize the proposed DA-NTN by learning a label regression with a KL-divergence loss. Such a design enables scalable training and fast convergence over a large answer set. We integrate the proposed DA-NTN structure into state-of-the-art VQA models (e.g., MLB and MUTAN). Extensive experiments demonstrate superior accuracy over the original MLB and MUTAN models, with relative increases of 1.98% and 1.70% on the VQA-2.0 dataset, respectively.
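The two core operations the abstract describes can be illustrated with a minimal plain-Python sketch: a bilinear tensor product that scores a (question, image) pair against each slice of a tensor, and a slice-wise attention that softly selects among the resulting per-slice scores. All dimensions and variable names below are illustrative assumptions for exposition, not the authors' implementation.

```python
import math
import random

def bilinear_tensor_product(q, v, W):
    """One score per tensor slice: s_k = q^T W_k v.
    q, v are feature vectors; W is a list of K matrices (len(q) x len(v))."""
    scores = []
    for Wk in W:
        s = sum(q[i] * Wk[i][j] * v[j]
                for i in range(len(q)) for j in range(len(v)))
        scores.append(s)
    return scores

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def slice_attention(scores, att_logits):
    """Weight the K per-slice scores by an attention distribution,
    softly selecting which slice ("reasoning process") dominates."""
    weights = softmax(att_logits)
    return sum(w * s for w, s in zip(weights, scores))

# Toy example with hypothetical dimensions (dq question dims, dv image dims, K slices).
random.seed(0)
dq, dv, K = 3, 4, 2
q = [random.random() for _ in range(dq)]
v = [random.random() for _ in range(dv)]
W = [[[random.random() for _ in range(dv)] for _ in range(dq)] for _ in range(K)]

scores = bilinear_tensor_product(q, v, W)     # K per-slice scores
fused = slice_attention(scores, scores)       # attend using the scores as logits
```

Because the attention weights form a convex combination, the fused score always lies between the smallest and largest per-slice score; in the paper the attention logits come from the question/answer-type features rather than the scores themselves.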
Pages: 21-37
Page count: 17
Related Papers
50 results in total
  • [1] Deep Modular Bilinear Attention Network for Visual Question Answering
    Yan, Feng
    Silamu, Wushouer
    Li, Yanbing
    [J]. SENSORS, 2022, 22 (03)
  • [2] Visual question answering model based on graph neural network and contextual attention
    Sharma, Himanshu
    Jalal, Anand Singh
    [J]. IMAGE AND VISION COMPUTING, 2021, 110
  • [3] DecomVQANet: Decomposing visual question answering deep network via tensor decomposition and regression
    Bai, Zongwen
    Li, Ying
    Wozniak, Marcin
    Zhou, Meili
    Li, Di
    [J]. PATTERN RECOGNITION, 2021, 110
  • [4] Collaborative Attention Network to Enhance Visual Question Answering
    Gu, Rui
    [J]. BASIC & CLINICAL PHARMACOLOGY & TOXICOLOGY, 2019, 124 : 304 - 305
  • [5] Triple attention network for sentimental visual question answering
    Ruwa, Nelson
    Mao, Qirong
    Song, Heping
    Jia, Hongjie
    Dong, Ming
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2019, 189
  • [6] ADAPTIVE ATTENTION FUSION NETWORK FOR VISUAL QUESTION ANSWERING
    Gu, Geonmo
    Kim, Seong Tae
    Ro, Yong Man
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2017, : 997 - 1002
  • [7] Fair Attention Network for Robust Visual Question Answering
    Bi, Yandong
    Jiang, Huajie
    Hu, Yongli
    Sun, Yanfeng
    Yin, Baocai
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (09) : 7870 - 7881
  • [8] Co-Attention Network With Question Type for Visual Question Answering
    Yang, Chao
    Jiang, Mengqi
    Jiang, Bin
    Zhou, Weixin
    Li, Keqin
    [J]. IEEE ACCESS, 2019, 7 : 40771 - 40781
  • [9] Optimal Deep Neural Network-Based Model for Answering Visual Medical Question
    Gasmi, Karim
    Ben Ltaifa, Ibtihel
    Lejeune, Gael
    Alshammari, Hamoud
    Ben Ammar, Lassaad
    Mahmood, Mahmood A.
    [J]. CYBERNETICS AND SYSTEMS, 2022, 53 (05) : 403 - 424