Multi-source Multi-level Attention Networks for Visual Question Answering

Cited by: 16
Authors
Yu, Dongfei [1 ]
Fu, Jianlong [2 ]
Tian, Xinmei [3 ]
Mei, Tao [4 ]
Affiliations
[1] Univ Sci & Technol China, Bldg 8,West Campus, Hefei, Anhui, Peoples R China
[2] Microsoft Res Asia, Microsoft Bldg 2,Danling St, Beijing, Peoples R China
[3] Univ Sci & Technol China, Room 1203,Tech Bldg, Hefei, Anhui, Peoples R China
[4] North Star Century Ctr, JD AI Res 8F,Bldg A,8 Beichen West St, Beijing 100105, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Visual question answering; attention model; multi-modal representations; visual relationship;
DOI
10.1145/3316767
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
In recent years, Visual Question Answering (VQA) has attracted increasing attention because it requires cross-modal understanding and reasoning over vision and language. VQA aims to automatically answer natural language questions with reference to a given image. VQA is challenging because reasoning in the visual domain requires a full understanding of spatial relationships, semantic concepts, and common sense about a real image. However, most existing approaches jointly embed abstract low-level visual features and high-level question features to infer answers. These works have limited reasoning ability because they do not model the rich spatial context of regions, the high-level semantics of images, or knowledge drawn from multiple sources. To address these challenges, we propose multi-source multi-level attention networks for visual question answering, which benefit both spatial inference, via visual attention over context-aware region representations, and reasoning, via semantic attention over concepts as well as external knowledge. Specifically, we learn to reason over image representations through question-guided attention at different levels across multiple sources, including region- and concept-level representations from the image source as well as sentence-level representations from an external knowledge base. First, we encode region-based middle-level outputs of Convolutional Neural Networks (CNNs) into a spatially embedded representation with a multi-directional two-dimensional recurrent neural network and then locate answer-related regions with a Multilayer Perceptron as visual attention. Second, we generate semantic concepts from high-level semantics in CNNs and select question-related concepts as concept attention. Third, we query semantic knowledge from a general knowledge base using these concepts and select question-related knowledge as knowledge attention.
Finally, we jointly optimize visual attention, concept attention, knowledge attention, and question embedding with a softmax classifier to infer the final answer. Extensive experiments show that the proposed approach achieves significant improvements on two very challenging VQA datasets.
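The pipeline the abstract describes, question-guided attention applied separately to regions, concepts, and knowledge-base sentences, with the attended features fused alongside the question embedding and fed to a softmax answer classifier, can be sketched as follows. All array shapes, the bilinear scoring form, and the simple concatenation fusion are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_attention(q, feats, W):
    # Score each candidate (region / concept / knowledge sentence)
    # against the question with a bilinear form, then return the
    # attention-weighted sum of the candidate features.
    scores = feats @ (W @ q)          # (n_candidates,)
    weights = softmax(scores)         # attention distribution
    return weights @ feats            # attended feature, (d,)

d = 8                                 # shared embedding size (assumed)
q = rng.standard_normal(d)            # question embedding
regions = rng.standard_normal((5, d))   # region-level image features
concepts = rng.standard_normal((4, d))  # concept-level features
knowledge = rng.standard_normal((3, d)) # sentence-level KB features

W = 0.1 * rng.standard_normal((d, d))   # shared bilinear weights (assumed)
v_att = question_guided_attention(q, regions, W)    # visual attention
c_att = question_guided_attention(q, concepts, W)   # concept attention
k_att = question_guided_attention(q, knowledge, W)  # knowledge attention

# Fuse the three attended features with the question embedding and
# classify over a fixed answer vocabulary with a softmax layer.
fused = np.concatenate([v_att, c_att, k_att, q])    # (4 * d,)
W_cls = 0.1 * rng.standard_normal((10, fused.size)) # 10 candidate answers
answer_probs = softmax(W_cls @ fused)
predicted_answer = int(answer_probs.argmax())
```

In training, `W` and `W_cls` (and the underlying CNN/RNN encoders) would be optimized jointly against the answer labels; here they are random placeholders to show only the data flow.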
Pages: 20
Related Papers
50 records in total
  • [31] Lang, Qi; Liu, Xiaodong; Deng, Yingjie. Multi-level retrieval with semantic Axiomatic Fuzzy Set clustering for question answering. APPLIED SOFT COMPUTING, 2021, 111.
  • [32] Xia, Qihao; Yu, Chao; Hou, Yinong; Peng, Pingping; Zheng, Zhengqi; Chen, Wen. Multi-Modal Alignment of Visual Question Answering Based on Multi-Hop Attention Mechanism. ELECTRONICS, 2022, 11 (11).
  • [33] Roy, Anthony; Auger, Francois; Dupriez-Robin, Florian; Bourguet, Salvy; Quoc Tuan Tran. A multi-level Demand-Side Management algorithm for offgrid multi-source systems. ENERGY, 2020, 191.
  • [34] Gorgievski, Marjan; Laguna, Mariola; Antonio Moriano, Juan; Stephan, Ute; Lukes, Matin. Innovative behaviour in small and medium sized businesses, a multi-level, multi-source approach. INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2023, 58: 397.
  • [35] Ye, Hai; Xie, Qizhe; Ng, Hwee Tou. Multi-Source Test-Time Adaptation as Dueling Bandits for Extractive Question Answering. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023: 9647-9660.
  • [36] Zhou, Dongming; Zhang, Canlong; Li, Zhixin; Wang, Zhiwen. Multi-level Visual Fusion Networks for Image Captioning. 2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020.
  • [37] Ye, Zheng; Che, Linwei; Ge, Jun; Qin, Jun; Liu, Jing. Integration of multi-level semantics in PTMs with an attention model for question matching. PLOS ONE, 2024, 19 (08).
  • [38] Yang, Yingli; Jin, Jingxuan; Li, De. A Study of Visual Question Answering Techniques Based on Collaborative Multi-Head Attention. 2023 3RD ASIA-PACIFIC CONFERENCE ON COMMUNICATIONS TECHNOLOGY AND COMPUTER SCIENCE (ACCTCS), 2023: 552-555.
  • [39] Feng, Jiangfan; Wang, Hui. A multi-scale contextual attention network for remote sensing visual question answering. INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2024, 126.
  • [40] Xu, Jie; Huang, Feiran; Zhang, Xiaoming; Wang, Senzhang; Li, Chaozhuo; Li, Zhoujun; He, Yueying. Visual-textual sentiment classification with bi-directional multi-level attention networks. KNOWLEDGE-BASED SYSTEMS, 2019, 178: 61-73.