Word-to-region attention network for visual question answering

Cited by: 20
Authors
Peng, Liang [1 ,2 ]
Yang, Yang [1 ,2 ]
Bin, Yi [1 ,2 ]
Xie, Ning [1 ,2 ]
Shen, Fumin [1 ,2 ]
Ji, Yanli [1 ,2 ]
Xu, Xing [1 ,2 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Chengdu, Sichuan, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual question answering; Word attention; Image attention; Word-to-region;
DOI
10.1007/s11042-018-6389-3
CLC Classification Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Visual attention, which concentrates on the image regions relevant to a reference question, brings remarkable performance improvements in Visual Question Answering (VQA). Most VQA attention models employ the entire question representation to query relevant image regions. However, only certain salient words of the question play an effective role in the attention operation. In this paper, we propose a novel Word-to-Region Attention Network (WRAN), which can 1) simultaneously locate pertinent object regions, rather than a uniform grid of equally sized image regions, and identify the corresponding words of the reference question; and 2) enforce consistency between image object regions and the core semantics of the question. We evaluate the proposed model on the VQA v1.0 and VQA v2.0 datasets. Experimental results demonstrate the superiority of the proposed model over state-of-the-art methods.
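For intuition, below is a minimal PyTorch sketch of the word-to-region idea the abstract describes: a word-attention branch scores each question token, and the resulting salient-word summary attends over detected object regions instead of a uniform grid. All layer names, feature dimensions, and the additive-tanh fusion are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordToRegionAttention(nn.Module):
    """Illustrative sketch: word attention selects salient question words;
    their weighted summary then queries object-region features."""

    def __init__(self, word_dim=512, region_dim=2048, hidden_dim=512):
        super().__init__()
        self.word_score = nn.Linear(word_dim, 1)         # per-word salience
        self.q_proj = nn.Linear(word_dim, hidden_dim)    # question -> joint space
        self.v_proj = nn.Linear(region_dim, hidden_dim)  # regions  -> joint space
        self.region_score = nn.Linear(hidden_dim, 1)     # per-region relevance

    def forward(self, words, regions):
        # words:   (B, T, word_dim),   e.g. RNN states over question tokens
        # regions: (B, K, region_dim), e.g. detector features for K objects
        word_att = F.softmax(self.word_score(words), dim=1)      # (B, T, 1)
        q = (word_att * words).sum(dim=1)                        # salient-word summary
        joint = torch.tanh(self.v_proj(regions) +
                           self.q_proj(q).unsqueeze(1))          # (B, K, hidden_dim)
        region_att = F.softmax(self.region_score(joint), dim=1)  # (B, K, 1)
        v = (region_att * regions).sum(dim=1)                    # attended visual feature
        return v, word_att.squeeze(-1), region_att.squeeze(-1)

# Toy usage: a batch of 8 questions (14 tokens) with 36 detected regions each.
model = WordToRegionAttention()
v, w_att, r_att = model(torch.randn(8, 14, 512), torch.randn(8, 36, 2048))
```

The returned word and region weights are the two quantities one would inspect for the word-region consistency the paper enforces; how that consistency is imposed as a training objective is not specified in the abstract.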
Pages: 3843 - 3858
Page count: 16
Related Papers
50 records in total
  • [1] Peng, Liang; Yang, Yang; Bin, Yi; Xie, Ning; Shen, Fumin; Ji, Yanli; Xu, Xing. Word-to-region attention network for visual question answering. Multimedia Tools and Applications, 2019, 78: 3843-3858.
  • [2] Gu, Rui. Collaborative Attention Network to Enhance Visual Question Answering. Basic & Clinical Pharmacology & Toxicology, 2019, 124: 304-305.
  • [3] Gu, Geonmo; Kim, Seong Tae; Ro, Yong Man. Adaptive Attention Fusion Network for Visual Question Answering. 2017 IEEE International Conference on Multimedia and Expo (ICME), 2017: 997-1002.
  • [4] Ruwa, Nelson; Mao, Qirong; Song, Heping; Jia, Hongjie; Dong, Ming. Triple attention network for sentimental visual question answering. Computer Vision and Image Understanding, 2019, 189.
  • [5] Bi, Y.; Jiang, H.; Hu, Y.; Sun, Y.; Yin, B. Fair Attention Network for Robust Visual Question Answering. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(09): 1-1.
  • [6] Huang, Xiaofei; Gong, Hongfang. A Dual-Attention Learning Network With Word and Sentence Embedding for Medical Visual Question Answering. IEEE Transactions on Medical Imaging, 2024, 43(02): 832-845.
  • [7] Yang, Chao; Jiang, Mengqi; Jiang, Bin; Zhou, Weixin; Li, Keqin. Co-Attention Network With Question Type for Visual Question Answering. IEEE Access, 2019, 7: 40771-40781.
  • [8] Sun, Bo; Yao, Zeng; Zhang, Yinghui; Yu, Lejun. Local relation network with multilevel attention for visual question answering. Journal of Visual Communication and Image Representation, 2020, 73.
  • [9] Bai, Yalong; Fu, Jianlong; Zhao, Tiejun; Mei, Tao. Deep Attention Neural Tensor Network for Visual Question Answering. Computer Vision - ECCV 2018, Pt XII, 2018, 11216: 21-37.
  • [10] Yan, Feng; Silamu, Wushouer; Li, Yanbing. Deep Modular Bilinear Attention Network for Visual Question Answering. Sensors, 2022, 22(03).