An Efficient and Effective Transformer Decoder-Based Framework for Multi-task Visual Grounding

Cited by: 0
Authors:
Chen, Wei [1 ]
Chen, Long [2 ]
Wu, Yu [1 ]
Affiliations:
[1] Wuhan Univ, Wuhan, Peoples R China
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Source:
Funding:
National Natural Science Foundation of China
Keywords:
Visual Grounding; Transformer Decoder; Token Elimination
DOI:
10.1007/978-3-031-72995-9_8
Chinese Library Classification:
TP18 [Artificial Intelligence Theory]
Discipline Codes:
081104; 0812; 0835; 1405
Abstract:
Most advanced visual grounding methods rely on Transformers for visual-linguistic feature fusion. However, these Transformer-based approaches encounter a significant drawback: the computational costs escalate quadratically due to the self-attention mechanism in the Transformer Encoder, particularly when dealing with high-resolution images or long context sentences. This quadratic increase in computational burden restricts the applicability of visual grounding to more intricate scenes, such as conversation-based reasoning segmentation, which involves lengthy language expressions. In this paper, we propose an efficient and effective multi-task visual grounding (EEVG) framework based on the Transformer Decoder to address this issue, reducing costs on both the language and visual sides. On the language side, we employ the Transformer Decoder to fuse visual and linguistic features, where linguistic features are input as memory and visual features as queries. This allows fusion to scale linearly with language expression length. On the visual side, we introduce a parameter-free approach that reduces computation by eliminating background visual tokens based on attention scores. We then design a light mask head to directly predict segmentation masks from the remaining sparse feature maps. Extensive results and ablation studies on benchmarks demonstrate the efficiency and effectiveness of our approach. Code is available at https://github.com/chenwei746/EEVG.
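To make the abstract's two mechanisms concrete, below is a minimal PyTorch sketch; it is not the authors' released implementation (see the repository linked above). The module names (DecoderFusion, eliminate_background_tokens), the feature shapes, the keep ratio, and the scoring signal used for ranking are all illustrative assumptions.

import torch
import torch.nn as nn

class DecoderFusion(nn.Module):
    """Decoder-based fusion as described in the abstract: visual tokens
    act as the queries (tgt) and linguistic tokens as the memory, so the
    cross-attention cost is O(N * L), linear in the expression length L,
    instead of the O((N + L)^2) of encoder-style joint self-attention."""
    def __init__(self, d_model=256, nhead=8, num_layers=3):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, N, d); text_tokens: (B, L, d)
        return self.decoder(tgt=visual_tokens, memory=text_tokens)

def eliminate_background_tokens(visual_tokens, attn_scores, keep_ratio=0.5):
    """Parameter-free token elimination: rank visual tokens by an
    attention score (assumed here to be one scalar per token) and keep
    only the top fraction, discarding likely-background tokens. Returns
    the kept tokens and their indices so a light mask head can scatter
    its sparse predictions back onto the full grid."""
    B, N, d = visual_tokens.shape
    k = max(1, int(N * keep_ratio))
    kept_idx = attn_scores.topk(k, dim=1).indices                  # (B, k)
    kept = torch.gather(visual_tokens, 1,
                        kept_idx.unsqueeze(-1).expand(-1, -1, d))  # (B, k, d)
    return kept, kept_idx

# Illustrative usage: 400 visual tokens, a 20-token expression, 256-d features.
fuser = DecoderFusion()
v, t = torch.randn(2, 400, 256), torch.randn(2, 20, 256)
fused = fuser(v, t)                    # (2, 400, 256)
scores = fused.norm(dim=-1)            # stand-in for real attention scores
sparse, idx = eliminate_background_tokens(fused, scores)  # (2, 200, 256)

The key design point is the role assignment: with visual tokens as queries and text as memory, doubling the expression length only doubles the cross-attention cost, whereas a joint encoder over the concatenated token sequence would grow quadratically in the combined length.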
Pages: 125-141
Page count: 17
Related Papers
50 records in total (entries [21]-[30] shown)
  • [21] Paraphrase Bidirectional Transformer with Multi-Task Learning
    Ko, Bowon
    Choi, Ho-Jin
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP 2020), 2020, : 217 - 220
  • [22] InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding
    Ye, Hanrong
    Xu, Dan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (12) : 7493 - 7508
  • [23] Bidirectional Transformer Based Multi-Task Learning for Natural Language Understanding
    Tripathi, Suraj
    Singh, Chirag
    Kumar, Abhay
    Pandey, Chandan
    Jain, Nishant
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS (NLDB 2019), 2019, 11608 : 54 - 65
  • [24] A Framework for Area-efficient Multi-task BERT Execution on ReRAM-based Accelerators
    Kang, Myeonggu
    Shin, Hyein
    Shin, Jaekang
    Kim, Lee-Sup
    2021 IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER AIDED DESIGN (ICCAD), 2021,
  • [25] iPCa-Former: A Multi-Task Transformer Framework for Perceiving Incidental Prostate Cancer
    Pan, Xianwei
    Wang, Simiao
    Liu, Yunan
    Wen, Lijie
    Lu, Mingyu
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 785 - 789
  • [26] Multi-task framework of precipitation nowcasting
    Zhang, Zheng
    Luo, Chuyao
    Zhang, Baoquan
    Jiang, Hao
    Zhang, Bowen
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2023, 8 (04) : 1350 - 1363
  • [27] Dual-decoder transformer network for answer grounding in visual question answering
    Zhu, Liangjun
    Peng, Li
    Zhou, Weinan
    Yang, Jielong
    PATTERN RECOGNITION LETTERS, 2023, 171 : 53 - 60
  • [28] A Multi-Task Framework for Action Prediction
    Yu, Tianyu
    Liu, Cuiwei
    Yan, Zhuo
    Shi, Xiangbin
    INFORMATION, 2020, 11 (03)
  • [29] A Multi-Task Framework for Weather Recognition
    Li, Xuelong
    Wang, Zhigang
    Lu, Xiaoqiang
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1318 - 1326
  • [30] Multi-task neural framework for sexism
    Abburi, Harika
    Parikh, Pulkit
    Chhaya, Niyati
    Varma, Vasudeva
    COMPUTER SPEECH AND LANGUAGE, 2023, 83