TRAR: Routing the Attention Spans in Transformer for Visual Question Answering

Cited by: 75
Authors
Zhou, Yiyi [1 ,2 ]
Ren, Tianhe [1 ,2 ]
Zhu, Chaoyang [1 ,2 ]
Sun, Xiaoshuai [1 ,2 ]
Liu, Jianzhuang [3 ]
Ding, Xinghao [2 ]
Xu, Mingliang [4 ]
Ji, Rongrong [1 ,2 ]
Affiliations
[1] Xiamen Univ, Sch Informat, Media Analyt & Comp Lab, Xiamen, Peoples R China
[2] Xiamen Univ, Sch Informat, Xiamen, Peoples R China
[3] Huawei Technol, Noahs Ark Lab, Shenzhen, Peoples R China
[4] Zhengzhou Univ, Zhengzhou, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
DOI
10.1109/ICCV48922.2021.00208
CLC number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Due to their superior ability in global dependency modeling, Transformers and their variants have become the primary choice for many vision-and-language tasks. However, in tasks like Visual Question Answering (VQA) and Referring Expression Comprehension (REC), multimodal prediction often requires visual information ranging from macro to micro views. How to dynamically schedule global and local dependency modeling in a Transformer has therefore become an emerging issue. In this paper, we propose an example-dependent routing scheme called TRAnsformer Routing (TRAR) to address this issue. Specifically, in TRAR, each visual Transformer layer is equipped with a routing module offering attention spans of different sizes. The model dynamically selects the corresponding attention span based on the output of the previous inference step, thereby forming an optimal routing path for each example. Notably, with careful design, TRAR reduces the additional computation and memory overhead to an almost negligible level. To validate TRAR, we conduct extensive experiments on five benchmark datasets for VQA and REC, achieving notable performance gains over standard Transformers and a range of state-of-the-art methods.
Pages: 2054-2064
Page count: 11
Related papers (showing items 21-30 of 50)
  • [21] RVT-Transformer: Residual Attention in Answerability Prediction on Visual Question Answering for Blind People
    Duy-Minh Nguyen-Tran
    Tung Le
    Khoa Pho
    Minh Le Nguyen
    Huy Tien Nguyen
    ADVANCES IN COMPUTATIONAL COLLECTIVE INTELLIGENCE, ICCCI 2022, 2022, 1653 : 423 - 435
  • [22] Advancing Vietnamese Visual Question Answering with Transformer and Convolutional
    Nguyen, Ngoc Son
    Nguyen, Van Son
    Le, Tung
    COMPUTERS & ELECTRICAL ENGINEERING, 2024, 119
  • [23] Focal Visual-Text Attention for Visual Question Answering
    Liang, Junwei
    Jiang, Lu
    Cao, Liangliang
    Li, Li-Jia
    Hauptmann, Alexander
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6135 - 6143
  • [24] QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document
    Mahamoud, Ibrahim Souleiman
    Coustaty, Mickael
    Joseph, Aurelie
    d'Andecy, Vincent Poulain
    Ogier, Jean-Marc
    DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 659 - 673
  • [25] Adaptive Attention Fusion Network for Visual Question Answering
    Gu, Geonmo
    Kim, Seong Tae
    Ro, Yong Man
    2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2017, : 997 - 1002
  • [26] Triple attention network for sentimental visual question answering
    Ruwa, Nelson
    Mao, Qirong
    Song, Heping
    Jia, Hongjie
    Dong, Ming
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2019, 189
  • [27] Collaborative Attention Network to Enhance Visual Question Answering
    Gu, Rui
    BASIC & CLINICAL PHARMACOLOGY & TOXICOLOGY, 2019, 124 : 304 - 305
  • [28] Densely Connected Attention Flow for Visual Question Answering
    Liu, Fei
    Liu, Jing
    Fang, Zhiwei
    Hong, Richang
    Lu, Hanqing
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 869 - 875
  • [29] Adversarial Learning with Bidirectional Attention for Visual Question Answering
    Li, Qifeng
    Tang, Xinyi
    Jian, Yi
    SENSORS, 2021, 21 (21)
  • [30] Fair Attention Network for Robust Visual Question Answering
    Bi, Yandong
    Jiang, Huajie
    Hu, Yongli
    Sun, Yanfeng
    Yin, Baocai
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (09) : 7870 - 7881