An Efficient and Effective Transformer Decoder-Based Framework for Multi-task Visual Grounding

Times Cited: 0
Authors
Chen, Wei [1 ]
Chen, Long [2 ]
Wu, Yu [1 ]
Affiliations
[1] Wuhan Univ, Wuhan, Peoples R China
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Source
Funding
National Natural Science Foundation of China;
Keywords
Visual Grounding; Transformer Decoder; Token Elimination;
DOI
10.1007/978-3-031-72995-9_8
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most advanced visual grounding methods rely on Transformers for visual-linguistic feature fusion. However, these Transformer-based approaches encounter a significant drawback: the computational costs escalate quadratically due to the self-attention mechanism in the Transformer Encoder, particularly when dealing with high-resolution images or long context sentences. This quadratic increase in computational burden restricts the applicability of visual grounding to more intricate scenes, such as conversation-based reasoning segmentation, which involves lengthy language expressions. In this paper, we propose an efficient and effective multi-task visual grounding (EEVG) framework based on the Transformer Decoder to address this issue, which reduces cost in both the language and visual aspects. In the language aspect, we employ the Transformer Decoder to fuse visual and linguistic features, where linguistic features are input as memory and visual features as queries. This allows fusion to scale linearly with language expression length. In the visual aspect, we introduce a parameter-free approach to reduce computation by eliminating background visual tokens based on attention scores. We then design a light mask head to directly predict segmentation masks from the remaining sparse feature maps. Extensive results and ablation studies on benchmarks demonstrate the efficiency and effectiveness of our approach. Code is available at https://github.com/chenwei746/EEVG.
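The two mechanisms described in the abstract — decoder-style cross-attention that treats linguistic features as memory and visual features as queries, and parameter-free elimination of background visual tokens by attention score — can be sketched roughly as below. This is a minimal NumPy illustration, not the paper's implementation: the function names, the keep ratio, and the language-to-visual scoring rule used to rank tokens are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoder_fusion(visual, lang):
    """Cross-attention with visual tokens as queries and linguistic
    tokens as memory. Cost is O(Nv * Nl), i.e. linear in the expression
    length Nl, unlike encoder self-attention over the concatenated
    sequence, which is quadratic in Nv + Nl."""
    d = visual.shape[1]
    scores = visual @ lang.T / np.sqrt(d)      # (Nv, Nl)
    fused = softmax(scores, axis=-1) @ lang    # (Nv, d)
    return fused, scores

def eliminate_background_tokens(visual, scores, keep_ratio=0.5):
    """Parameter-free elimination: rank each visual token by the mean
    attention it receives from the language tokens and keep the top
    fraction. The exact ranking criterion here is an assumption."""
    attn_l2v = softmax(scores.T, axis=-1)      # (Nl, Nv)
    importance = attn_l2v.mean(axis=0)         # (Nv,)
    k = max(1, int(keep_ratio * visual.shape[0]))
    keep = np.sort(np.argsort(importance)[::-1][:k])
    return visual[keep], keep
```

In this sketch the surviving sparse tokens (and their indices) would then feed a lightweight mask head; no learned parameters are involved in the elimination step itself, matching the "parameter-free" claim.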
Pages: 125-141
Page count: 17
Related Papers
50 in total
  • [41] Efficient Multi-Task and Transfer Reinforcement Learning With Parameter-Compositional Framework
    Sun, Lingfeng
    Zhang, Haichao
    Xu, Wei
    Tomizuka, Masayoshi
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2023, 8 (08): : 4569 - 4576
  • [42] Multi-Task Learning with Personalized Transformer for Review Recommendation
    Wang, Haiming
    Liu, Wei
    Yin, Jian
    WEB INFORMATION SYSTEMS ENGINEERING - WISE 2021, PT II, 2021, 13081 : 162 - 176
  • [43] TransNuSeg: A Lightweight Multi-task Transformer for Nuclei Segmentation
    He, Zhenqi
    Unberath, Mathias
    Ke, Jing
    Shen, Yiqing
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT IV, 2023, 14223 : 206 - 215
  • [44] A Reinforcement-Learning-Based Energy-Efficient Framework for Multi-Task Video Analytics Pipeline
    Zhao, Yingying
    Dong, Mingzhi
    Wang, Yujiang
    Feng, Da
    Lv, Qin
    Dick, Robert P.
    Li, Dongsheng
    Lu, Tun
    Gu, Ning
    Shang, Li
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 2150 - 2163
  • [45] Transformer Decoder-Based Enhanced Exploration Method to Alleviate Initial Exploration Problems in Reinforcement Learning
    Kyoung, Dohyun
    Sung, Yunsick
    SENSORS, 2023, 23 (17)
  • [46] Autism spectrum disorders detection based on multi-task transformer neural network
    Gao, Le
    Wang, Zhimin
    Long, Yun
    Zhang, Xin
    Su, Hexing
    Yu, Yong
    Hong, Jin
    BMC NEUROSCIENCE, 2024, 25 (01):
  • [47] Multi-Task Mean Teacher Medical Image Segmentation Based on Swin Transformer
    Zhang, Jie
    Li, Fan
    Zhang, Xin
    Cheng, Yue
    Hei, Xinhong
    APPLIED SCIENCES-BASEL, 2024, 14 (07):
  • [48] PARFormer: Transformer-Based Multi-Task Network for Pedestrian Attribute Recognition
    Fan, Xinwen
    Zhang, Yukang
    Lu, Yang
    Wang, Hanzi
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (01) : 411 - 423
  • [49] Predicting Outcomes for Cancer Patients with Transformer-Based Multi-task Learning
    Gerrard, Leah
    Peng, Xueping
    Clarke, Allison
    Schlegel, Clement
    Jiang, Jing
    AI 2021: ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, 13151 : 381 - 392
  • [50] TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework Using Self-Supervised Multi-Task Learning
    Qu, Linhao
    Liu, Shaolei
    Wang, Manning
    Song, Zhijian
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 2126 - 2134