InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding

Cited by: 0
Authors
Ye, Hanrong [1 ]
Xu, Dan [1 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China
Keywords
Dense prediction; multi-task learning; scene understanding; transformer;
DOI
10.1109/TPAMI.2024.3397031
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multi-task scene understanding aims to design models that can simultaneously predict several scene understanding tasks with one versatile model. Previous studies typically process multi-task features in a more local way, and thus cannot effectively learn spatially global and cross-task interactions, which hampers the models' ability to fully leverage the consistency of various tasks in multi-task learning. To tackle this problem, we propose an Inverted Pyramid multi-task Transformer, capable of modeling cross-task interaction among the spatial features of different tasks in a global context. Specifically, we first utilize a transformer encoder to capture task-generic features for all tasks. Then, we design a transformer decoder to establish spatial and cross-task interaction globally, and devise a novel UP-Transformer block that gradually increases the resolution of multi-task features while establishing cross-task interaction at different scales. Furthermore, two types of Cross-Scale Self-Attention modules, i.e., Fusion Attention and Selective Attention, are proposed to efficiently facilitate cross-task interaction across different feature scales. An Encoder Feature Aggregation strategy is further introduced to better model multi-scale information in the decoder. Comprehensive experiments on several 2D/3D multi-task benchmarks clearly demonstrate our proposal's effectiveness, establishing new state-of-the-art performance.
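The core idea the abstract describes — letting every spatial token of every task attend to every token of all tasks, rather than processing each task's features locally — can be sketched with plain self-attention over the concatenated task token sequences. This is a minimal NumPy illustration of globally modeled cross-task interaction, not the paper's actual InvPT++ implementation; the function name, single-head formulation, and shapes are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_task_self_attention(task_feats, w_q, w_k, w_v):
    """Joint single-head self-attention over the concatenated tokens of
    all tasks, so each spatial token can interact with the tokens of
    every other task in one global attention map."""
    tokens = np.concatenate(task_feats, axis=0)        # (T*N, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))     # (T*N, T*N)
    out = attn @ v                                     # (T*N, C)
    n = task_feats[0].shape[0]                         # tokens per task
    return [out[i * n:(i + 1) * n] for i in range(len(task_feats))]

# Toy setup: 3 tasks, 16 spatial tokens each, 8 channels.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((16, 8)) for _ in range(3)]
w_q, w_k, w_v = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
outs = cross_task_self_attention(feats, w_q, w_k, w_v)
```

Because the attention map spans all `T*N` tokens, the memory cost grows quadratically with the number of tasks and the feature resolution, which is presumably why the paper introduces Cross-Scale Self-Attention variants (Fusion and Selective Attention) to keep this interaction efficient at higher decoder resolutions.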
Pages: 7493-7508
Page count: 16
Related Papers
50 records total
  • [21] Referring Transformer: A One-step Approach to Multi-task Visual Grounding
    Li, Muchen
    Sigal, Leonid
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [22] GIFGIF+: Collecting Emotional Animated GIFs with Clustered Multi-Task Learning
    Chen, Weixuan
    Rudovic, Ognjen
    Picard, Rosalind W.
    2017 SEVENTH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2017, : 510 - 517
  • [23] MTMamba: Enhancing Multi-task Dense Scene Understanding by Mamba-Based Decoders
    Lin, Baijiong
    Jiang, Weisen
    Chen, Pengguang
    Zhang, Yu
    Liu, Shu
    Chen, Ying-Cong
    COMPUTER VISION - ECCV 2024, PT LXX, 2025, 15128 : 314 - 330
  • [24] Semi-Supervised Learning for Multi-Task Scene Understanding by Neural Graph Consensus
    Leordeanu, Marius
    Pirvu, Mihai Cristian
    Costea, Dragos
    Marcu, Alina E.
    Slusanschi, Emil
    Sukthankar, Rahul
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 1882 - 1892
  • [25] Improving Vision Transformer with Multi-Task Training
    Ahn, Woo Jin
    Yang, Geun Yeong
    Choi, Hyun Duck
    Lim, Myo Taeg
    Kang, Tae Koo
    2022 22ND INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION AND SYSTEMS (ICCAS 2022), 2022, : 1963 - 1965
  • [26] Paraphrase Bidirectional Transformer with Multi-Task Learning
    Ko, Bowon
    Choi, Ho-Jin
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP 2020), 2020, : 217 - 220
  • [27] An Efficient and Effective Transformer Decoder-Based Framework for Multi-task Visual Grounding
    Chen, Wei
    Chen, Long
    Wu, Yu
    COMPUTER VISION - ECCV 2024, PT XLV, 2025, 15103 : 125 - 141
  • [28] P2T: Pyramid Pooling Transformer for Scene Understanding
    Wu, Yu-Huan
    Liu, Yun
    Zhan, Xin
    Cheng, Ming-Ming
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (11) : 12760 - 12771
  • [29] Multi-view representation learning in multi-task scene
    Run-kun Lu
    Jian-wei Liu
    Si-ming Lian
    Xin Zuo
    Neural Computing and Applications, 2020, 32 : 10403 - 10422
  • [30] Multi-view representation learning in multi-task scene
    Lu, Run-kun
    Liu, Jian-wei
    Lian, Si-ming
    Zuo, Xin
    NEURAL COMPUTING & APPLICATIONS, 2020, 32 (14): : 10403 - 10422