TRAR: Routing the Attention Spans in Transformer for Visual Question Answering

Cited by: 75
Authors
Zhou, Yiyi [1,2]
Ren, Tianhe [1,2]
Zhu, Chaoyang [1,2]
Sun, Xiaoshuai [1,2]
Liu, Jianzhuang [3]
Ding, Xinghao [2]
Xu, Mingliang [4]
Ji, Rongrong [1,2]
Affiliations
[1] Xiamen Univ, Sch Informat, Media Analyt & Comp Lab, Xiamen, Peoples R China
[2] Xiamen Univ, Sch Informat, Xiamen, Peoples R China
[3] Huawei Technol, Noah's Ark Lab, Shenzhen, Peoples R China
[4] Zhengzhou Univ, Zhengzhou, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
DOI
10.1109/ICCV48922.2021.00208
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Due to its superior ability to model global dependencies, the Transformer and its variants have become the primary choice for many vision-and-language tasks. However, in tasks like Visual Question Answering (VQA) and Referring Expression Comprehension (REC), multimodal prediction often requires visual information ranging from macro- to micro-views. How to dynamically schedule global and local dependency modeling in the Transformer has therefore become an emerging issue. In this paper, we propose an example-dependent routing scheme called TRAnsformer Routing (TRAR) to address it. Specifically, in TRAR each visual Transformer layer is equipped with a routing module offering attention paths of different spans. The model dynamically selects among these attentions based on the output of the previous inference step, thereby forming an optimal routing path for each example. Notably, with careful design, TRAR reduces the additional computation and memory overhead to a nearly negligible level. To validate TRAR, we conduct extensive experiments on five benchmark datasets for VQA and REC, achieving larger performance gains than both standard Transformers and a number of state-of-the-art methods.
Pages: 2054 - 2064
Page count: 11
Related Papers
50 in total
  • [31] Learning Visual Question Answering by Bootstrapping Hard Attention
    Malinowski, Mateusz
    Doersch, Carl
    Santoro, Adam
    Battaglia, Peter
    COMPUTER VISION - ECCV 2018, PT VI, 2018, 11210 : 3 - 20
  • [32] Co-Attention Network With Question Type for Visual Question Answering
    Yang, Chao
    Jiang, Mengqi
    Jiang, Bin
    Zhou, Weixin
    Li, Keqin
    IEEE ACCESS, 2019, 7 : 40771 - 40781
  • [33] Dual Attention and Question Categorization-Based Visual Question Answering
    Mishra, A.
    Anand, A.
    Guha, P.
    IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, 2023, 4 (01) : 81 - 91
  • [34] A lightweight Transformer-based visual question answering network with Weight-Sharing Hybrid Attention
    Zhu, Yue
    Chen, Dongyue
    Jia, Tong
    Deng, Shizhuo
    NEUROCOMPUTING, 2024, 608
  • [35] Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering
    Yu, Zhou
    Jin, Zitian
    Yu, Jun
    Xu, Mingliang
    Wang, Hongbo
    Fan, Jianping
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 9543 - 9556
  • [36] A Transformer-based Medical Visual Question Answering Model
    Liu, Lei
    Su, Xiangdong
    Guo, Hui
    Zhu, Daobin
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 1712 - 1718
  • [37] Vision and Text Transformer for Predicting Answerability on Visual Question Answering
    Le, Tung
    Nguyen, Huy Tien
    Nguyen, Minh Le
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 934 - 938
  • [38] From Pixels to Objects: Cubic Visual Attention for Visual Question Answering
    Song, Jingkuan
    Zeng, Pengpeng
    Gao, Lianli
    Shen, Heng Tao
    PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 906 - 912
  • [39] Transformer Module Networks for Systematic Generalization in Visual Question Answering
    Yamada, Moyuru
    D'Amario, Vanessa
    Takemoto, Kentaro
    Boix, Xavier
    Sasaki, Tomotake
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (12) : 10096 - 10105
  • [40] Hierarchical Question-Image Co-Attention for Visual Question Answering
    Lu, Jiasen
    Yang, Jianwei
    Batra, Dhruv
    Parikh, Devi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 29 (NIPS 2016), 2016, 29