An Efficient and Effective Transformer Decoder-Based Framework for Multi-task Visual Grounding

Cited by: 0
Authors:
Chen, Wei [1 ]
Chen, Long [2 ]
Wu, Yu [1 ]
Affiliations:
[1] Wuhan Univ, Wuhan, Peoples R China
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Source:
Funding:
National Natural Science Foundation of China
Keywords:
Visual Grounding; Transformer Decoder; Token Elimination
DOI:
10.1007/978-3-031-72995-9_8
Chinese Library Classification:
TP18 [Artificial Intelligence Theory]
Discipline Codes:
081104; 0812; 0835; 1405
Abstract:
Most advanced visual grounding methods rely on Transformers for visual-linguistic feature fusion. However, these Transformer-based approaches encounter a significant drawback: the computational costs escalate quadratically due to the self-attention mechanism in the Transformer Encoder, particularly when dealing with high-resolution images or long context sentences. This quadratic increase in computational burden restricts the applicability of visual grounding to more intricate scenes, such as conversation-based reasoning segmentation, which involves lengthy language expressions. In this paper, we propose an efficient and effective multi-task visual grounding (EEVG) framework based on the Transformer Decoder to address this issue, reducing costs on both the language and visual sides. On the language side, we employ the Transformer Decoder to fuse visual and linguistic features, where linguistic features are input as memory and visual features as queries. This allows fusion to scale linearly with language expression length. On the visual side, we introduce a parameter-free approach that reduces computation by eliminating background visual tokens based on attention scores. We then design a light mask head to directly predict segmentation masks from the remaining sparse feature maps. Extensive results and ablation studies on benchmarks demonstrate the efficiency and effectiveness of our approach. Code is available at https://github.com/chenwei746/EEVG.
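To make the abstract's two mechanisms concrete, below is a minimal PyTorch sketch; it is not the authors' released implementation (see the repository linked above). The module names (DecoderFusion, eliminate_background_tokens), the feature shapes, the keep ratio, and the scoring signal used for ranking are all illustrative assumptions.

import torch
import torch.nn as nn

class DecoderFusion(nn.Module):
    """Decoder-based fusion as described in the abstract: visual tokens
    act as the queries (tgt) and linguistic tokens as the memory, so the
    cross-attention cost is O(N * L), linear in the expression length L,
    instead of the O((N + L)^2) of encoder-style joint self-attention."""
    def __init__(self, d_model=256, nhead=8, num_layers=3):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, N, d); text_tokens: (B, L, d)
        return self.decoder(tgt=visual_tokens, memory=text_tokens)

def eliminate_background_tokens(visual_tokens, attn_scores, keep_ratio=0.5):
    """Parameter-free token elimination: rank visual tokens by an
    attention score (assumed here to be one scalar per token) and keep
    only the top fraction, discarding likely-background tokens. Returns
    the kept tokens and their indices so a light mask head can scatter
    its sparse predictions back onto the full grid."""
    B, N, d = visual_tokens.shape
    k = max(1, int(N * keep_ratio))
    kept_idx = attn_scores.topk(k, dim=1).indices                  # (B, k)
    kept = torch.gather(visual_tokens, 1,
                        kept_idx.unsqueeze(-1).expand(-1, -1, d))  # (B, k, d)
    return kept, kept_idx

# Illustrative usage: 400 visual tokens, a 20-token expression, 256-d features.
fuser = DecoderFusion()
v, t = torch.randn(2, 400, 256), torch.randn(2, 20, 256)
fused = fuser(v, t)                    # (2, 400, 256)
scores = fused.norm(dim=-1)            # stand-in for real attention scores
sparse, idx = eliminate_background_tokens(fused, scores)  # (2, 200, 256)

The key design point is the role assignment: with visual tokens as queries and text as memory, doubling the expression length only doubles the cross-attention cost, whereas a joint encoder over the concatenated token sequence would grow quadratically in the combined length.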
Pages: 125-141
Page count: 17
Related Papers
50 records in total (entries [21]-[30] shown)
  • [21] Paraphrase Bidirectional Transformer with Multi-Task Learning
    Ko, Bowon
    Choi, Ho-Jin
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP 2020), 2020, : 217 - 220
  • [22] InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding
    Ye, Hanrong
    Xu, Dan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (12) : 7493 - 7508
  • [23] Bidirectional Transformer Based Multi-Task Learning for Natural Language Understanding
    Tripathi, Suraj
    Singh, Chirag
    Kumar, Abhay
    Pandey, Chandan
    Jain, Nishant
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS (NLDB 2019), 2019, 11608 : 54 - 65
  • [24] A Framework for Area-efficient Multi-task BERT Execution on ReRAM-based Accelerators
    Kang, Myeonggu
    Shin, Hyein
    Shin, Jaekang
    Kim, Lee-Sup
    2021 IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER AIDED DESIGN (ICCAD), 2021,
  • [25] iPCa-Former: A Multi-Task Transformer Framework for Perceiving Incidental Prostate Cancer
    Pan, Xianwei
    Wang, Simiao
    Liu, Yunan
    Wen, Lijie
    Lu, Mingyu
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 785 - 789
  • [26] Multi-task framework of precipitation nowcasting
    Zhang, Zheng
    Luo, Chuyao
    Zhang, Baoquan
    Jiang, Hao
    Zhang, Bowen
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2023, 8 (04) : 1350 - 1363
  • [27] Dual-decoder transformer network for answer grounding in visual question answering
    Zhu, Liangjun
    Peng, Li
    Zhou, Weinan
    Yang, Jielong
    PATTERN RECOGNITION LETTERS, 2023, 171 : 53 - 60
  • [28] A Multi-Task Framework for Action Prediction
    Yu, Tianyu
    Liu, Cuiwei
    Yan, Zhuo
    Shi, Xiangbin
    INFORMATION, 2020, 11 (03)
  • [29] A Multi-Task Framework for Weather Recognition
    Li, Xuelong
    Wang, Zhigang
    Lu, Xiaoqiang
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1318 - 1326
  • [30] Multi-task neural framework for sexism
    Abburi, Harika
    Parikh, Pulkit
    Chhaya, Niyati
    Varma, Vasudeva
    COMPUTER SPEECH AND LANGUAGE, 2023, 83