InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding

Cited by: 0
Authors
Ye, Hanrong [1 ]
Xu, Dan [1 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China
Keywords
Dense prediction; multi-task learning; scene understanding; transformer;
DOI
10.1109/TPAMI.2024.3397031
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multi-task scene understanding aims to design models that can simultaneously predict several scene understanding tasks with one versatile model. Previous studies typically process multi-task features in a more local way, and thus cannot effectively learn spatially global and cross-task interactions, which hampers the models' ability to fully leverage the consistency of various tasks in multi-task learning. To tackle this problem, we propose an Inverted Pyramid multi-task Transformer, capable of modeling cross-task interaction among the spatial features of different tasks in a global context. Specifically, we first utilize a transformer encoder to capture task-generic features for all tasks. Then, we design a transformer decoder to establish spatial and cross-task interaction globally, and devise a novel UP-Transformer block that gradually increases the resolution of multi-task features while establishing cross-task interaction at different scales. Furthermore, two types of Cross-Scale Self-Attention modules, i.e., Fusion Attention and Selective Attention, are proposed to efficiently facilitate cross-task interaction across different feature scales. An Encoder Feature Aggregation strategy is further introduced to better model multi-scale information in the decoder. Comprehensive experiments on several 2D/3D multi-task benchmarks clearly demonstrate our proposal's effectiveness, establishing new state-of-the-art performance.
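The core idea the abstract describes — letting every spatial token of every task attend to every token of all tasks, rather than processing each task's features locally — can be sketched with plain self-attention over the concatenated task token sequences. This is a minimal NumPy illustration of globally modeled cross-task interaction, not the paper's actual InvPT++ implementation; the function name, single-head formulation, and shapes are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_task_self_attention(task_feats, w_q, w_k, w_v):
    """Joint single-head self-attention over the concatenated tokens of
    all tasks, so each spatial token can interact with the tokens of
    every other task in one global attention map."""
    tokens = np.concatenate(task_feats, axis=0)        # (T*N, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))     # (T*N, T*N)
    out = attn @ v                                     # (T*N, C)
    n = task_feats[0].shape[0]                         # tokens per task
    return [out[i * n:(i + 1) * n] for i in range(len(task_feats))]

# Toy setup: 3 tasks, 16 spatial tokens each, 8 channels.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((16, 8)) for _ in range(3)]
w_q, w_k, w_v = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
outs = cross_task_self_attention(feats, w_q, w_k, w_v)
```

Because the attention map spans all `T*N` tokens, the memory cost grows quadratically with the number of tasks and the feature resolution, which is presumably why the paper introduces Cross-Scale Self-Attention variants (Fusion and Selective Attention) to keep this interaction efficient at higher decoder resolutions.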
Pages: 7493-7508
Page count: 16
Related Papers
50 records total
  • [21] Referring Transformer: A One-step Approach to Multi-task Visual Grounding
    Li, Muchen
    Sigal, Leonid
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [22] GIFGIF+: Collecting Emotional Animated GIFs with Clustered Multi-Task Learning
    Chen, Weixuan
    Rudovic, Ognjen
    Picard, Rosalind W.
    2017 SEVENTH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2017, : 510 - 517
  • [23] MTMamba: Enhancing Multi-task Dense Scene Understanding by Mamba-Based Decoders
    Lin, Baijiong
    Jiang, Weisen
    Chen, Pengguang
    Zhang, Yu
    Liu, Shu
    Chen, Ying-Cong
    COMPUTER VISION - ECCV 2024, PT LXX, 2025, 15128 : 314 - 330
  • [24] Semi-Supervised Learning for Multi-Task Scene Understanding by Neural Graph Consensus
    Leordeanu, Marius
    Pirvu, Mihai Cristian
    Costea, Dragos
    Marcu, Alina E.
    Slusanschi, Emil
    Sukthankar, Rahul
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 1882 - 1892
  • [25] Improving Vision Transformer with Multi-Task Training
    Ahn, Woo Jin
    Yang, Geun Yeong
    Choi, Hyun Duck
    Lim, Myo Taeg
    Kang, Tae Koo
    2022 22ND INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION AND SYSTEMS (ICCAS 2022), 2022, : 1963 - 1965
  • [26] Paraphrase Bidirectional Transformer with Multi-Task Learning
    Ko, Bowon
    Choi, Ho-Jin
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP 2020), 2020, : 217 - 220
  • [27] An Efficient and Effective Transformer Decoder-Based Framework for Multi-task Visual Grounding
    Chen, Wei
    Chen, Long
    Wu, Yu
    COMPUTER VISION - ECCV 2024, PT XLV, 2025, 15103 : 125 - 141
  • [28] P2T: Pyramid Pooling Transformer for Scene Understanding
    Wu, Yu-Huan
    Liu, Yun
    Zhan, Xin
    Cheng, Ming-Ming
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (11) : 12760 - 12771
  • [29] Multi-view representation learning in multi-task scene
    Run-kun Lu
    Jian-wei Liu
    Si-ming Lian
    Xin Zuo
    Neural Computing and Applications, 2020, 32 : 10403 - 10422
  • [30] Multi-view representation learning in multi-task scene
    Lu, Run-kun
    Liu, Jian-wei
    Lian, Si-ming
    Zuo, Xin
    NEURAL COMPUTING & APPLICATIONS, 2020, 32 (14): : 10403 - 10422