An Efficient and Effective Transformer Decoder-Based Framework for Multi-task Visual Grounding

Citations: 0
Authors
Chen, Wei [1 ]
Chen, Long [2 ]
Wu, Yu [1 ]
Affiliations
[1] Wuhan Univ, Wuhan, Peoples R China
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual Grounding; Transformer Decoder; Token Elimination
DOI
10.1007/978-3-031-72995-9_8
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Most advanced visual grounding methods rely on Transformers for visual-linguistic feature fusion. However, these Transformer-based approaches suffer a significant drawback: computational cost escalates quadratically with sequence length due to the self-attention mechanism in the Transformer Encoder, particularly when dealing with high-resolution images or long context sentences. This quadratic growth restricts the applicability of visual grounding to more intricate scenes, such as conversation-based reasoning segmentation, which involves lengthy language expressions. In this paper, we propose an efficient and effective multi-task visual grounding (EEVG) framework based on the Transformer Decoder to address this issue, reducing cost in both the language and visual aspects. On the language side, we employ the Transformer Decoder to fuse visual and linguistic features, where linguistic features are input as memory and visual features as queries; this allows fusion to scale linearly with the length of the language expression. On the visual side, we introduce a parameter-free approach that reduces computation by eliminating background visual tokens based on attention scores. We then design a lightweight mask head to directly predict segmentation masks from the remaining sparse feature maps. Extensive results and ablation studies on benchmarks demonstrate the efficiency and effectiveness of our approach. Code is available at https://github.com/chenwei746/EEVG.
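The abstract describes two mechanisms worth making concrete: decoder-based fusion in which visual tokens act as queries over linguistic features supplied as memory (so cross-attention cost grows linearly with expression length), and a parameter-free elimination of background visual tokens ranked by attention scores. The following is a minimal PyTorch sketch of both ideas, not the official EEVG implementation: the module and parameter names (DecoderFusion, keep_ratio) are illustrative assumptions, and the token scores here are approximated with a plain language-to-vision similarity rather than the decoder's internal attention maps.

```python
import torch
import torch.nn as nn

class DecoderFusion(nn.Module):
    """Illustrative sketch (not the official EEVG code): visual tokens are
    the decoder queries, linguistic tokens are the memory, and background
    visual tokens are pruned with a parameter-free attention-style score."""

    def __init__(self, dim: int = 256, heads: int = 8,
                 layers: int = 3, keep_ratio: float = 0.7):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.keep_ratio = keep_ratio  # fraction of visual tokens retained

    def forward(self, visual: torch.Tensor, linguistic: torch.Tensor):
        # visual: (B, Nv, dim) patch tokens; linguistic: (B, Nl, dim) word
        # tokens. Cross-attention inside the decoder costs O(Nv * Nl), i.e.
        # linear in expression length Nl, unlike encoder self-attention over
        # the concatenated sequence, which is O((Nv + Nl)^2).
        fused = self.decoder(tgt=visual, memory=linguistic)

        # Parameter-free token elimination (approximation): score each
        # visual token by how strongly the language tokens attend to it,
        # using a dot-product similarity in place of the decoder's internal
        # attention maps, then keep the top-k tokens as foreground.
        sim = torch.einsum('bld,bvd->blv', linguistic, fused)   # (B, Nl, Nv)
        scores = sim.softmax(dim=-1).mean(dim=1)                # (B, Nv)
        k = max(1, int(self.keep_ratio * fused.size(1)))
        keep = scores.topk(k, dim=1).indices                    # (B, k)
        sparse = torch.gather(
            fused, 1, keep.unsqueeze(-1).expand(-1, -1, fused.size(-1)))
        return sparse, keep  # sparse tokens feed a lightweight mask head
```

For example, with a 20x20 patch grid and a 12-token expression, DecoderFusion()(torch.randn(2, 400, 256), torch.randn(2, 12, 256)) retains 280 of 400 visual tokens per image at keep_ratio 0.7; the returned indices would let a mask head scatter its sparse predictions back onto the full grid.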
Pages: 125-141
Page count: 17
Related Papers
50 records in total
  • [31] A Transformer-Based Multi-Task Learning Framework for Myoelectric Pattern Recognition Supporting Muscle Force Estimation
    Li, Xinhui
    Zhang, Xu
    Zhang, Liwei
    Chen, Xiang
    Zhou, Ping
    IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, 2023, 31 : 3255 - 3264
  • [32] Adaptive transformer-based multi-task learning framework for synchronous prediction of substation flooding and outage risks
    Shi, Yu
    Shi, Ying
    Yao, Degui
    Lu, Ming
    Liang, Yun
    ELECTRIC POWER SYSTEMS RESEARCH, 2025, 242
  • [33] MTT: an efficient model for encrypted network traffic classification using multi-task transformer
    Zheng, Weiping
    Zhong, Jianhao
    Zhang, Qizhi
    Zhao, Gansen
    APPLIED INTELLIGENCE, 2022, 52 (09) : 10741 - 10756
  • [34] Probabilistic movement primitives based multi-task learning framework
    Yue, Chengfei
    Gao, Tian
    Lu, Lang
    Lin, Tao
    Wu, Yunhua
    COMPUTERS & INDUSTRIAL ENGINEERING, 2024, 191
  • [35] Iterative framework based on multi-task learning for service recommendation
    Yu, Ting
    Yu, Dongjin
    Wang, Dongjing
    Yang, Quanxin
    Hu, Xueyou
    JOURNAL OF SYSTEMS AND SOFTWARE, 2024, 207
  • [38] Efficient Controllable Multi-Task Architectures
    Aich, Abhishek
    Schulter, Samuel
    Roy-Chowdhury, Amit K.
    Chandraker, Manmohan
    Suh, Yumin
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 5717 - 5728
  • [39] End-to-end Multi-task Learning Framework for Spatio-Temporal Grounding in Video Corpus
    Gao, Yingqi
    Luo, Zhiling
    Chen, Shiqian
    Zhou, Wei
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022, 2022, : 3958 - 3962
  • [40] Prompt Guided Transformer for Multi-Task Dense Prediction
    Lu, Yuxiang
    Sirejiding, Shalayiding
    Ding, Yue
    Wang, Chunlin
    Lu, Hongtao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 6375 - 6385