TRAR: Routing the Attention Spans in Transformer for Visual Question Answering

Cited: 75
Authors
Zhou, Yiyi [1 ,2 ]
Ren, Tianhe [1 ,2 ]
Zhu, Chaoyang [1 ,2 ]
Sun, Xiaoshuai [1 ,2 ]
Liu, Jianzhuang [3 ]
Ding, Xinghao [2 ]
Xu, Mingliang [4 ]
Ji, Rongrong [1 ,2 ]
Affiliations
[1] Xiamen Univ, Sch Informat, Media Analyt & Comp Lab, Xiamen, Peoples R China
[2] Xiamen Univ, Sch Informat, Xiamen, Peoples R China
[3] Huawei Technol, Noah's Ark Lab, Shenzhen, Peoples R China
[4] Zhengzhou Univ, Zhengzhou, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
DOI
10.1109/ICCV48922.2021.00208
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Due to their superior ability to model global dependencies, Transformers and their variants have become the primary choice for many vision-and-language tasks. However, in tasks like Visual Question Answering (VQA) and Referring Expression Comprehension (REC), the multimodal prediction often requires visual information from macro- to micro-views. How to dynamically schedule global and local dependency modeling in a Transformer has therefore become an emerging issue. In this paper, we propose an example-dependent routing scheme called TRAnsformer Routing (TRAR) to address it. Specifically, in TRAR, each visual Transformer layer is equipped with a routing module offering attentions of different spans. The model dynamically selects among these attentions based on the output of the previous inference step, thereby formulating an optimal routing path for each example. Notably, with careful design, TRAR reduces the additional computation and memory overhead to an almost negligible level. To validate TRAR, we conduct extensive experiments on five benchmark datasets for VQA and REC, achieving clear performance gains over standard Transformers and a range of state-of-the-art methods.
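The abstract describes the mechanism only at a high level, so here is a minimal PyTorch sketch of the core idea it names: per-layer routing over self-attentions with different spans, conditioned on the previous step's output. All names below (RoutedSelfAttention, local_mask, router, the span set (1, 3, None)) are illustrative assumptions of ours, not the authors' code; the sketch uses soft routing, i.e. a learned weighted sum over span-masked attentions.

```python
# Illustrative sketch of example-dependent span routing (not the authors'
# implementation). Assumes PyTorch; all module and function names are ours.
import torch
import torch.nn as nn
import torch.nn.functional as F


def local_mask(n: int, span: int) -> torch.Tensor:
    # True where token j lies within +-span of token i (a "local" window).
    idx = torch.arange(n)
    return (idx[None, :] - idx[:, None]).abs() <= span


class RoutedSelfAttention(nn.Module):
    def __init__(self, dim: int, heads: int, spans=(1, 3, None)):
        super().__init__()
        self.spans = spans  # ints = local window sizes, None = global span
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.router = nn.Linear(dim, len(spans))  # tiny path controller

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, dim), the previous layer's output, which is
        # what conditions the routing decision for this example.
        n = x.size(1)
        w = F.softmax(self.router(x.mean(dim=1)), dim=-1)  # (batch, n_spans)
        out = torch.zeros_like(x)
        for i, span in enumerate(self.spans):
            if span is None:
                mask = None  # unrestricted: global attention
            else:
                # attn_mask convention: True entries are disallowed positions.
                mask = ~local_mask(n, span).to(x.device)
            y, _ = self.attn(x, x, x, attn_mask=mask)
            out = out + w[:, i, None, None] * y  # soft routing over spans
        return out
```

As a usage check, layer = RoutedSelfAttention(dim=512, heads=8) applied to a 7x7 visual token grid is layer(torch.randn(2, 49, 512)). Note that this sketch recomputes attention once per span for readability; the near-negligible overhead reported in the abstract suggests the official design instead shares the attention computation across spans and routes only over the masks.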
Pages: 2054 - 2064
Page count: 11
Related Papers
50 records in total (entries 41-50 shown below)
  • [41] Multimodal Encoders and Decoders with Gate Attention for Visual Question Answering
    Li, Haiyan
    Han, Dezhi
    COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2021, 18 (03) : 1023 - 1040
  • [42] DHHG-TAC: Fusion of Dynamic Heterogeneous Hypergraphs and Transformer Attention Mechanism for Visual Question Answering Tasks
    Liu, Xuetao
    Dong, Ruiliang
    Yang, Hongyan
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2025, 21 (01) : 545 - 553
  • [43] Local relation network with multilevel attention for visual question answering
    Sun, Bo
    Yao, Zeng
    Zhang, Yinghui
    Yu, Lejun
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2020, 73
  • [44] Focal Visual-Text Attention for Memex Question Answering
    Liang, Junwei
    Jiang, Lu
    Cao, Liangliang
    Kalantidis, Yannis
    Li, Li-Jia
    Hauptmann, Alexander G.
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2019, 41 (08) : 1893 - 1908
  • [45] Latent Attention Network With Position Perception for Visual Question Answering
    Zhang, Jing
    Liu, Xiaoqiang
    Wang, Zhe
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2025, 36 (03) : 5059 - 5069
  • [46] Stacked Self-Attention Networks for Visual Question Answering
    Sun, Qiang
    Fu, Yanwei
    ICMR'19: PROCEEDINGS OF THE 2019 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2019, : 207 - 211
  • [47] Stacked Attention based Textbook Visual Question Answering with BERT
    Aishwarya, R.
    Sarath, P.
    Rahman, Shibil P.
    Sneha, U.
    Manmadhan, Sruthy
    2022 IEEE 19TH INDIA COUNCIL INTERNATIONAL CONFERENCE, INDICON, 2022,
  • [48] Multi-stage Attention based Visual Question Answering
    Mishra, Aakansha
    Anand, Ashish
    Guha, Prithwijit
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 9407 - 9414
  • [49] Multimodal attention-driven visual question answering for Malayalam
    Kovath, A. G.
    Nayyar, A.
    Sikha, O. K.
    NEURAL COMPUTING AND APPLICATIONS, 2024, 36 (24) : 14691 - 14708
  • [50] Deep Attention Neural Tensor Network for Visual Question Answering
    Bai, Yalong
    Fu, Jianlong
    Zhao, Tiejun
    Mei, Tao
    COMPUTER VISION - ECCV 2018, PT XII, 2018, 11216 : 21 - 37