An Efficient and Effective Transformer Decoder-Based Framework for Multi-task Visual Grounding

Citations: 0
Authors
Chen, Wei [1 ]
Chen, Long [2 ]
Wu, Yu [1 ]
Affiliations
[1] Wuhan Univ, Wuhan, Peoples R China
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual Grounding; Transformer Decoder; Token Elimination
DOI
10.1007/978-3-031-72995-9_8
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Most advanced visual grounding methods rely on Transformers for visual-linguistic feature fusion. However, these Transformer-based approaches suffer a significant drawback: computational cost escalates quadratically with sequence length due to the self-attention mechanism in the Transformer Encoder, particularly when dealing with high-resolution images or long context sentences. This quadratic growth restricts the applicability of visual grounding to more intricate scenes, such as conversation-based reasoning segmentation, which involves lengthy language expressions. In this paper, we propose an efficient and effective multi-task visual grounding (EEVG) framework based on the Transformer Decoder to address this issue, reducing cost in both the language and visual aspects. On the language side, we employ the Transformer Decoder to fuse visual and linguistic features, where linguistic features are input as memory and visual features as queries; this allows fusion to scale linearly with the length of the language expression. On the visual side, we introduce a parameter-free approach that reduces computation by eliminating background visual tokens based on attention scores. We then design a lightweight mask head to directly predict segmentation masks from the remaining sparse feature maps. Extensive results and ablation studies on benchmarks demonstrate the efficiency and effectiveness of our approach. Code is available at https://github.com/chenwei746/EEVG.
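The abstract describes two mechanisms worth making concrete: decoder-based fusion in which visual tokens act as queries over linguistic features supplied as memory (so cross-attention cost grows linearly with expression length), and a parameter-free elimination of background visual tokens ranked by attention scores. The following is a minimal PyTorch sketch of both ideas, not the official EEVG implementation: the module and parameter names (DecoderFusion, keep_ratio) are illustrative assumptions, and the token scores here are approximated with a plain language-to-vision similarity rather than the decoder's internal attention maps.

```python
import torch
import torch.nn as nn

class DecoderFusion(nn.Module):
    """Illustrative sketch (not the official EEVG code): visual tokens are
    the decoder queries, linguistic tokens are the memory, and background
    visual tokens are pruned with a parameter-free attention-style score."""

    def __init__(self, dim: int = 256, heads: int = 8,
                 layers: int = 3, keep_ratio: float = 0.7):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.keep_ratio = keep_ratio  # fraction of visual tokens retained

    def forward(self, visual: torch.Tensor, linguistic: torch.Tensor):
        # visual: (B, Nv, dim) patch tokens; linguistic: (B, Nl, dim) word
        # tokens. Cross-attention inside the decoder costs O(Nv * Nl), i.e.
        # linear in expression length Nl, unlike encoder self-attention over
        # the concatenated sequence, which is O((Nv + Nl)^2).
        fused = self.decoder(tgt=visual, memory=linguistic)

        # Parameter-free token elimination (approximation): score each
        # visual token by how strongly the language tokens attend to it,
        # using a dot-product similarity in place of the decoder's internal
        # attention maps, then keep the top-k tokens as foreground.
        sim = torch.einsum('bld,bvd->blv', linguistic, fused)   # (B, Nl, Nv)
        scores = sim.softmax(dim=-1).mean(dim=1)                # (B, Nv)
        k = max(1, int(self.keep_ratio * fused.size(1)))
        keep = scores.topk(k, dim=1).indices                    # (B, k)
        sparse = torch.gather(
            fused, 1, keep.unsqueeze(-1).expand(-1, -1, fused.size(-1)))
        return sparse, keep  # sparse tokens feed a lightweight mask head
```

For example, with a 20x20 patch grid and a 12-token expression, DecoderFusion()(torch.randn(2, 400, 256), torch.randn(2, 12, 256)) retains 280 of 400 visual tokens per image at keep_ratio 0.7; the returned indices would let a mask head scatter its sparse predictions back onto the full grid.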
Pages: 125-141
Page count: 17
Related Papers
50 records in total
  • [31] A Transformer-Based Multi-Task Learning Framework for Myoelectric Pattern Recognition Supporting Muscle Force Estimation
    Li, Xinhui
    Zhang, Xu
    Zhang, Liwei
    Chen, Xiang
    Zhou, Ping
    IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, 2023, 31 : 3255 - 3264
  • [32] Adaptive transformer-based multi-task learning framework for synchronous prediction of substation flooding and outage risks
    Shi, Yu
    Shi, Ying
    Yao, Degui
    Lu, Ming
    Liang, Yun
    ELECTRIC POWER SYSTEMS RESEARCH, 2025, 242
  • [33] MTT: an efficient model for encrypted network traffic classification using multi-task transformer
    Zheng, Weiping
    Zhong, Jianhao
    Zhang, Qizhi
    Zhao, Gansen
    APPLIED INTELLIGENCE, 2022, 52 (09) : 10741 - 10756
  • [34] Probabilistic movement primitives based multi-task learning framework
    Yue, Chengfei
    Gao, Tian
    Lu, Lang
    Lin, Tao
    Wu, Yunhua
    COMPUTERS & INDUSTRIAL ENGINEERING, 2024, 191
  • [35] Iterative framework based on multi-task learning for service recommendation
    Yu, Ting
    Yu, Dongjin
    Wang, Dongjing
    Yang, Quanxin
    Hu, Xueyou
    JOURNAL OF SYSTEMS AND SOFTWARE, 2024, 207
  • [38] Efficient Controllable Multi-Task Architectures
    Aich, Abhishek
    Schulter, Samuel
    Roy-Chowdhury, Amit K.
    Chandraker, Manmohan
    Suh, Yumin
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 5717 - 5728
  • [39] End-to-end Multi-task Learning Framework for Spatio-Temporal Grounding in Video Corpus
    Gao, Yingqi
    Luo, Zhiling
    Chen, Shiqian
    Zhou, Wei
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022, 2022, : 3958 - 3962
  • [40] Prompt Guided Transformer for Multi-Task Dense Prediction
    Lu, Yuxiang
    Sirejiding, Shalayiding
    Ding, Yue
    Wang, Chunlin
    Lu, Hongtao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 6375 - 6385