Multi-grained Attention with Object-level Grounding for Visual Question Answering

Cited by: 0
Authors
Huang, Pingping [1 ]
Huang, Jianhui [1 ]
Guo, Yuqing [1 ]
Qiao, Min [1 ]
Zhu, Yong [1 ]
Affiliations
[1] Baidu Inc, Beijing, People's Republic of China
Keywords
DOI
Not available
Chinese Library Classification (CLC) Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Attention mechanisms are widely used in Visual Question Answering (VQA) to search for visual clues related to the question. Most approaches train attention models from a coarse-grained association between sentences and images, which tends to fail on small objects or uncommon concepts. To address this problem, this paper proposes a multi-grained attention method. It learns explicit word-object correspondence through two types of word-level attention that complement the sentence-image association. Evaluated on the VQA benchmark, the multi-grained attention model achieves competitive performance with state-of-the-art models. The visualized attention maps further demonstrate that adding object-level grounding leads to a better understanding of the images and locates the attended objects more precisely.
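The abstract describes two granularities of attention: a coarse sentence-image association and a finer word-level attention over detected objects. The sketch below illustrates that general idea in PyTorch; the module structure, feature dimensions, score functions, and fusion by concatenation are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of multi-grained attention for VQA, assuming
# region/object features from a detector and per-word question embeddings.
# All names and dimensions here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGrainedAttention(nn.Module):
    def __init__(self, q_dim=512, v_dim=2048, hid=512, n_answers=3000):
        super().__init__()
        # Coarse branch: the whole question sentence attends over region features.
        self.sent_q = nn.Linear(q_dim, hid)
        self.sent_v = nn.Linear(v_dim, hid)
        self.sent_score = nn.Linear(hid, 1)
        # Fine branch: individual question words attend over object features.
        self.word_q = nn.Linear(q_dim, hid)
        self.word_v = nn.Linear(v_dim, hid)
        # Answer classifier over the fused representation.
        self.classifier = nn.Sequential(
            nn.Linear(hid * 2, hid), nn.ReLU(), nn.Linear(hid, n_answers)
        )

    def forward(self, sent_emb, word_embs, obj_feats):
        # sent_emb:  (B, q_dim)      pooled question representation
        # word_embs: (B, T, q_dim)   per-word question embeddings
        # obj_feats: (B, K, v_dim)   per-object (region) visual features

        # Coarse sentence-image attention: one weight per region.
        joint = torch.tanh(self.sent_q(sent_emb).unsqueeze(1) + self.sent_v(obj_feats))
        alpha = F.softmax(self.sent_score(joint).squeeze(-1), dim=1)          # (B, K)
        coarse_ctx = torch.bmm(alpha.unsqueeze(1), self.sent_v(obj_feats)).squeeze(1)

        # Fine word-object attention: each word scores each object, and each
        # object keeps its best-matching word, so small objects named in the
        # question can still receive high weight.
        w = self.word_q(word_embs)                   # (B, T, hid)
        o = self.word_v(obj_feats)                   # (B, K, hid)
        sim = torch.bmm(w, o.transpose(1, 2))        # (B, T, K) word-object scores
        beta = F.softmax(sim.max(dim=1).values, dim=1)                        # (B, K)
        fine_ctx = torch.bmm(beta.unsqueeze(1), o).squeeze(1)

        # Fuse both granularities and predict an answer distribution.
        fused = torch.cat([coarse_ctx, fine_ctx], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = MultiGrainedAttention()
    logits = model(torch.randn(2, 512), torch.randn(2, 14, 512), torch.randn(2, 36, 2048))
    print(logits.shape)  # torch.Size([2, 3000])
```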
Pages: 3595-3600
Number of pages: 6
Related Papers (50 records in total)
  • [1] Xiao, Shaoning; Li, Yimeng; Ye, Yunan; Chen, Long; Pu, Shiliang; Zhao, Zhou; Shao, Jian; Xiao, Jun. Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering. Neural Processing Letters, 2020, 52(2): 993-1003.
  • [2] You, Hao. Multi-grained Unsupervised Evidence Retrieval for Question Answering. Neural Computing and Applications, 2023, 35(28): 21247-21257.
  • [3] Gao, Lianli; Cao, Liangfu; Xu, Xing; Shao, Jie; Song, Jingkuan. Question-Led Object Attention for Visual Question Answering. Neurocomputing, 2020, 391: 227-233.
  • [4] Yu, Dongfei; Fu, Jianlong; Mei, Tao; Rui, Yong. Multi-level Attention Networks for Visual Question Answering. 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017: 4187-4195.
  • [5] Zhang, Yundong; Niebles, Juan Carlos; Soto, Alvaro. Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining. 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 2019: 349-357.
  • [6] Yu, Dongfei; Fu, Jianlong; Tian, Xinmei; Mei, Tao. Multi-source Multi-level Attention Networks for Visual Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications, 2019, 15(2).
  • [7] Lei, Zhi; Zhang, Guixian; Wu, Lijuan; Zhang, Kui; Liang, Rongjiao. A Multi-level Mesh Mutual Attention Model for Visual Question Answering. Data Science and Engineering, 2022, 7(4): 339-353.