Multi-source Multi-level Attention Networks for Visual Question Answering

Cited: 16
Authors
Yu, Dongfei [1 ]
Fu, Jianlong [2 ]
Tian, Xinmei [3 ]
Mei, Tao [4 ]
Affiliations
[1] Univ Sci & Technol China, Bldg 8,West Campus, Hefei, Anhui, Peoples R China
[2] Microsoft Res Asia, Microsoft Bldg 2,Danling St, Beijing, Peoples R China
[3] Univ Sci & Technol China, Room 1203,Tech Bldg, Hefei, Anhui, Peoples R China
[4] North Star Century Ctr, JD AI Res 8F,Bldg A,8 Beichen West St, Beijing 100105, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Visual question answering; attention model; multi-modal representations; visual relationship;
DOI
10.1145/3316767
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
In recent years, Visual Question Answering (VQA) has attracted increasing attention because it requires cross-modal understanding and reasoning over vision and language. VQA aims to automatically answer natural language questions with reference to a given image. The task is challenging because reasoning in the visual domain requires a full understanding of spatial relationships, semantic concepts, and common sense about a real image. However, most existing approaches jointly embed abstract low-level visual features and high-level question features to infer answers. These works have limited reasoning ability because they do not model the rich spatial context of regions, the high-level semantics of images, or knowledge across multiple sources. To address these challenges, we propose multi-source multi-level attention networks for visual question answering that benefit both spatial inference, via visual attention on context-aware region representations, and semantic reasoning, via attention on concepts as well as external knowledge. Specifically, we learn to reason over image representations by question-guided attention at different levels and across multiple sources, including region-level and concept-level representations from the image source as well as sentence-level representations from an external knowledge base. First, we encode region-based middle-level outputs of Convolutional Neural Networks (CNNs) into a spatially embedded representation with a multi-directional two-dimensional recurrent neural network and then locate the answer-related regions with a Multi-Layer Perceptron as visual attention. Second, we generate semantic concepts from the high-level semantics in CNNs and select question-related concepts as concept attention. Third, we query semantic knowledge from a general knowledge base using these concepts and select question-related knowledge as knowledge attention. Finally, we jointly optimize visual attention, concept attention, knowledge attention, and the question embedding with a softmax classifier to infer the final answer. Extensive experiments show that the proposed approach achieves significant improvements on two very challenging VQA datasets.
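The question-guided visual attention step described in the abstract can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's implementation: the dimensions, the single-hidden-layer MLP scorer, and the random placeholder weights are all assumptions (in the actual model the weights are learned end-to-end, and analogous attention is applied to concepts and knowledge sentences).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical sizes: 14x14 = 196 image regions, 512-d features.
num_regions, d, hidden = 196, 512, 256
V = rng.standard_normal((num_regions, d))  # region features (e.g. spatially encoded CNN outputs)
q = rng.standard_normal(d)                 # question embedding

# MLP that scores each region against the question.
# Weights here are random placeholders; in the model they are learned.
W_v = rng.standard_normal((d, hidden)) * 0.01
W_q = rng.standard_normal((d, hidden)) * 0.01
w = rng.standard_normal(hidden) * 0.01

h = np.tanh(V @ W_v + q @ W_q)  # fuse each region with the question
alpha = softmax(h @ w)          # attention weights over regions (sum to 1)
v_att = alpha @ V               # attended visual representation, shape (d,)

print(alpha.sum(), v_att.shape)
```

The attended vector `v_att` would then be combined with the concept-attention and knowledge-attention outputs and the question embedding before the final softmax classifier.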
Pages: 20
Related Papers
50 in total
  • [21] Adversarial Learning With Multi-Modal Attention for Visual Question Answering
    Liu, Yun
    Zhang, Xiaoming
    Huang, Feiran
    Cheng, Lei
    Li, Zhoujun
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2021, 32 (09) : 3894 - 3908
  • [22] A multi-level feature attention network for COVID-19 detection based on multi-source medical images
    Zhao, Aite
    Wu, Huimin
    Chen, Ming
    Wang, Nana
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (27) : 70419 - 70450
  • [23] Multi-Modal fusion with multi-level attention for Visual Dialog
    Zhang, Jingping
    Wang, Qiang
    Han, Yahong
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2020, 57 (04)
  • [24] Multi-Channel Co-Attention Network for Visual Question Answering
    Tian, Weidong
    He, Bin
    Wang, Nanxun
    Zhao, Zhongqiu
    [J]. 2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [25] Multi-Question Learning for Visual Question Answering
    Lei, Chenyi
    Wu, Lei
    Liu, Dong
    Li, Zhao
    Wang, Guoxin
    Tang, Haihong
    Li, Houqiang
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11328 - 11335
  • [26] Efficient Multi-step Reasoning Attention Network for Visual Question Answering
    Zhang, Haotian
    Wu, Wei
    Zhang, Meng
    [J]. THIRTEENTH INTERNATIONAL CONFERENCE ON GRAPHICS AND IMAGE PROCESSING (ICGIP 2021), 2022, 12083
  • [27] Multi-Modality Global Fusion Attention Network for Visual Question Answering
    Yang, Cheng
    Wu, Weijia
    Wang, Yuxing
    Zhou, Hong
    [J]. ELECTRONICS, 2020, 9 (11) : 1 - 12
  • [28] The multi-modal fusion in visual question answering: a review of attention mechanisms
    Lu, Siyu
    Liu, Mingzhe
    Yin, Lirong
    Yin, Zhengtong
    Liu, Xuan
    Zheng, Wenfeng
    [J]. PEERJ COMPUTER SCIENCE, 2023, 9
  • [29] Multi-scale Relational Reasoning with Regional Attention for Visual Question Answering
    Ma, Yuntao
    Lu, Tong
    Wu, Yirui
    [J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 5642 - 5649
  • [30] App Popularity Prediction with Multi-Level Attention Networks
    Zhang Y.
    Guo B.
    Liu J.
    Ouyang Y.
    Yu Z.
    [J]. Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2020, 57 (05): : 984 - 995