An Efficient and Effective Transformer Decoder-Based Framework for Multi-task Visual Grounding

Cited by: 0
Authors
Chen, Wei [1 ]
Chen, Long [2 ]
Wu, Yu [1 ]
Affiliations
[1] Wuhan Univ, Wuhan, Peoples R China
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual Grounding; Transformer Decoder; Token Elimination;
DOI
10.1007/978-3-031-72995-9_8
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Most advanced visual grounding methods rely on Transformers for visual-linguistic feature fusion. However, these Transformer-based approaches suffer a significant drawback: computational cost grows quadratically with sequence length due to the self-attention mechanism in the Transformer Encoder, particularly when dealing with high-resolution images or long context sentences. This quadratic growth restricts the applicability of visual grounding to more intricate scenes, such as conversation-based reasoning segmentation, which involves lengthy language expressions. In this paper, we propose an efficient and effective multi-task visual grounding (EEVG) framework based on the Transformer Decoder that reduces cost in both the language and visual aspects. On the language side, we employ the Transformer Decoder to fuse visual and linguistic features, with linguistic features input as memory and visual features as queries; fusion therefore scales linearly with the length of the language expression. On the visual side, we introduce a parameter-free approach that reduces computation by eliminating background visual tokens based on attention scores. We then design a light mask head that predicts segmentation masks directly from the remaining sparse feature maps. Extensive results and ablation studies on benchmarks demonstrate the efficiency and effectiveness of our approach. Code is available at https://github.com/chenwei746/EEVG.
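The two cost-saving ideas in the abstract can be illustrated with a minimal numpy sketch. This is not the EEVG implementation: the single-head dot-product attention, the use of the per-token maximum attention score as the saliency measure, and the `keep_ratio` parameter are all illustrative assumptions. It only shows why decoder-style fusion is linear in the expression length (the score matrix is N_v x N_l, with no N_l x N_l self-attention) and how background visual tokens can be pruned parameter-free from attention scores.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoder_fusion(visual, linguistic):
    """Cross-attention with visual tokens as queries and linguistic
    tokens as memory (keys/values). The score matrix is (N_v, N_l),
    so cost is linear in the expression length N_l."""
    d = visual.shape[-1]
    scores = visual @ linguistic.T / np.sqrt(d)   # (N_v, N_l)
    fused = softmax(scores) @ linguistic          # (N_v, d)
    return fused, scores

def eliminate_tokens(tokens, scores, keep_ratio=0.25):
    """Parameter-free pruning: keep only the visual tokens with the
    highest attention toward the language memory (saliency here is
    an assumed proxy: each token's max raw score)."""
    saliency = scores.max(axis=1)                 # (N_v,)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(saliency)[-k:])     # indices, in order
    return tokens[keep], keep

rng = np.random.default_rng(0)
vis = rng.normal(size=(16, 32))   # 16 visual tokens, dim 32
lang = rng.normal(size=(5, 32))   # 5 language tokens
fused, attn = decoder_fusion(vis, lang)
kept, idx = eliminate_tokens(fused, attn, keep_ratio=0.25)
print(fused.shape, kept.shape)    # (16, 32) (4, 32)
```

In the paper's setting the surviving sparse tokens would then be fed to the light mask head; here the sketch stops at the pruned feature map.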
Pages
125-141 (17 pages)