An Efficient and Effective Transformer Decoder-Based Framework for Multi-task Visual Grounding

Times Cited: 0
Authors
Chen, Wei [1 ]
Chen, Long [2 ]
Wu, Yu [1 ]
Affiliations
[1] Wuhan Univ, Wuhan, Peoples R China
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Source
Funding
National Natural Science Foundation of China;
Keywords
Visual Grounding; Transformer Decoder; Token Elimination;
DOI
10.1007/978-3-031-72995-9_8
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most advanced visual grounding methods rely on Transformers for visual-linguistic feature fusion. However, these Transformer-based approaches encounter a significant drawback: the computational costs escalate quadratically due to the self-attention mechanism in the Transformer Encoder, particularly when dealing with high-resolution images or long context sentences. This quadratic increase in computational burden restricts the applicability of visual grounding to more intricate scenes, such as conversation-based reasoning segmentation, which involves lengthy language expressions. In this paper, we propose an efficient and effective multi-task visual grounding (EEVG) framework based on the Transformer Decoder to address this issue, which reduces cost in both the language and visual aspects. In the language aspect, we employ the Transformer Decoder to fuse visual and linguistic features, where linguistic features are input as memory and visual features as queries. This allows fusion to scale linearly with language expression length. In the visual aspect, we introduce a parameter-free approach to reduce computation by eliminating background visual tokens based on attention scores. We then design a light mask head to directly predict segmentation masks from the remaining sparse feature maps. Extensive results and ablation studies on benchmarks demonstrate the efficiency and effectiveness of our approach. Code is available at https://github.com/chenwei746/EEVG.
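The two mechanisms described in the abstract — decoder-style cross-attention that treats linguistic features as memory and visual features as queries, and parameter-free elimination of background visual tokens by attention score — can be sketched roughly as below. This is a minimal NumPy illustration, not the paper's implementation: the function names, the keep ratio, and the language-to-visual scoring rule used to rank tokens are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoder_fusion(visual, lang):
    """Cross-attention with visual tokens as queries and linguistic
    tokens as memory. Cost is O(Nv * Nl), i.e. linear in the expression
    length Nl, unlike encoder self-attention over the concatenated
    sequence, which is quadratic in Nv + Nl."""
    d = visual.shape[1]
    scores = visual @ lang.T / np.sqrt(d)      # (Nv, Nl)
    fused = softmax(scores, axis=-1) @ lang    # (Nv, d)
    return fused, scores

def eliminate_background_tokens(visual, scores, keep_ratio=0.5):
    """Parameter-free elimination: rank each visual token by the mean
    attention it receives from the language tokens and keep the top
    fraction. The exact ranking criterion here is an assumption."""
    attn_l2v = softmax(scores.T, axis=-1)      # (Nl, Nv)
    importance = attn_l2v.mean(axis=0)         # (Nv,)
    k = max(1, int(keep_ratio * visual.shape[0]))
    keep = np.sort(np.argsort(importance)[::-1][:k])
    return visual[keep], keep
```

In this sketch the surviving sparse tokens (and their indices) would then feed a lightweight mask head; no learned parameters are involved in the elimination step itself, matching the "parameter-free" claim.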
Pages: 125-141
Page count: 17
Related Papers
50 in total
  • [41] Efficient Multi-Task and Transfer Reinforcement Learning With Parameter-Compositional Framework
    Sun, Lingfeng
    Zhang, Haichao
    Xu, Wei
    Tomizuka, Masayoshi
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2023, 8 (08): : 4569 - 4576
  • [42] Multi-Task Learning with Personalized Transformer for Review Recommendation
    Wang, Haiming
    Liu, Wei
    Yin, Jian
    WEB INFORMATION SYSTEMS ENGINEERING - WISE 2021, PT II, 2021, 13081 : 162 - 176
  • [43] TransNuSeg: A Lightweight Multi-task Transformer for Nuclei Segmentation
    He, Zhenqi
    Unberath, Mathias
    Ke, Jing
    Shen, Yiqing
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT IV, 2023, 14223 : 206 - 215
  • [44] A Reinforcement-Learning-Based Energy-Efficient Framework for Multi-Task Video Analytics Pipeline
    Zhao, Yingying
    Dong, Mingzhi
    Wang, Yujiang
    Feng, Da
    Lv, Qin
    Dick, Robert P.
    Li, Dongsheng
    Lu, Tun
    Gu, Ning
    Shang, Li
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 2150 - 2163
  • [45] Transformer Decoder-Based Enhanced Exploration Method to Alleviate Initial Exploration Problems in Reinforcement Learning
    Kyoung, Dohyun
    Sung, Yunsick
    SENSORS, 2023, 23 (17)
  • [46] Autism spectrum disorders detection based on multi-task transformer neural network
    Gao, Le
    Wang, Zhimin
    Long, Yun
    Zhang, Xin
    Su, Hexing
    Yu, Yong
    Hong, Jin
    BMC NEUROSCIENCE, 2024, 25 (01):
  • [47] Multi-Task Mean Teacher Medical Image Segmentation Based on Swin Transformer
    Zhang, Jie
    Li, Fan
    Zhang, Xin
    Cheng, Yue
    Hei, Xinhong
    APPLIED SCIENCES-BASEL, 2024, 14 (07):
  • [48] PARFormer: Transformer-Based Multi-Task Network for Pedestrian Attribute Recognition
    Fan, Xinwen
    Zhang, Yukang
    Lu, Yang
    Wang, Hanzi
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (01) : 411 - 423
  • [49] Predicting Outcomes for Cancer Patients with Transformer-Based Multi-task Learning
    Gerrard, Leah
    Peng, Xueping
    Clarke, Allison
    Schlegel, Clement
    Jiang, Jing
    AI 2021: ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, 13151 : 381 - 392
  • [50] TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework Using Self-Supervised Multi-Task Learning
    Qu, Linhao
    Liu, Shaolei
    Wang, Manning
    Song, Zhijian
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 2126 - 2134