An Efficient and Effective Transformer Decoder-Based Framework for Multi-task Visual Grounding

Cited by: 0
Authors
Chen, Wei [1 ]
Chen, Long [2 ]
Wu, Yu [1 ]
Affiliations
[1] Wuhan Univ, Wuhan, Peoples R China
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual Grounding; Transformer Decoder; Token Elimination;
DOI
10.1007/978-3-031-72995-9_8
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Most advanced visual grounding methods rely on Transformers for visual-linguistic feature fusion. However, these Transformer-based approaches suffer a significant drawback: computational cost grows quadratically with sequence length due to the self-attention mechanism in the Transformer Encoder, particularly when dealing with high-resolution images or long context sentences. This quadratic growth restricts the applicability of visual grounding to more intricate scenes, such as conversation-based reasoning segmentation, which involves lengthy language expressions. In this paper, we propose an efficient and effective multi-task visual grounding (EEVG) framework based on the Transformer Decoder that reduces cost in both the language and visual aspects. On the language side, we employ the Transformer Decoder to fuse visual and linguistic features, with linguistic features input as memory and visual features as queries; fusion therefore scales linearly with the length of the language expression. On the visual side, we introduce a parameter-free approach that reduces computation by eliminating background visual tokens based on attention scores. We then design a light mask head that predicts segmentation masks directly from the remaining sparse feature maps. Extensive results and ablation studies on benchmarks demonstrate the efficiency and effectiveness of our approach. Code is available at https://github.com/chenwei746/EEVG.
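The two cost-saving ideas in the abstract can be illustrated with a minimal numpy sketch. This is not the EEVG implementation: the single-head dot-product attention, the use of the per-token maximum attention score as the saliency measure, and the `keep_ratio` parameter are all illustrative assumptions. It only shows why decoder-style fusion is linear in the expression length (the score matrix is N_v x N_l, with no N_l x N_l self-attention) and how background visual tokens can be pruned parameter-free from attention scores.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoder_fusion(visual, linguistic):
    """Cross-attention with visual tokens as queries and linguistic
    tokens as memory (keys/values). The score matrix is (N_v, N_l),
    so cost is linear in the expression length N_l."""
    d = visual.shape[-1]
    scores = visual @ linguistic.T / np.sqrt(d)   # (N_v, N_l)
    fused = softmax(scores) @ linguistic          # (N_v, d)
    return fused, scores

def eliminate_tokens(tokens, scores, keep_ratio=0.25):
    """Parameter-free pruning: keep only the visual tokens with the
    highest attention toward the language memory (saliency here is
    an assumed proxy: each token's max raw score)."""
    saliency = scores.max(axis=1)                 # (N_v,)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(saliency)[-k:])     # indices, in order
    return tokens[keep], keep

rng = np.random.default_rng(0)
vis = rng.normal(size=(16, 32))   # 16 visual tokens, dim 32
lang = rng.normal(size=(5, 32))   # 5 language tokens
fused, attn = decoder_fusion(vis, lang)
kept, idx = eliminate_tokens(fused, attn, keep_ratio=0.25)
print(fused.shape, kept.shape)    # (16, 32) (4, 32)
```

In the paper's setting the surviving sparse tokens would then be fed to the light mask head; here the sketch stops at the pruned feature map.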
Pages
125-141 (17 pages)