End-to-End Multitask Learning With Vision Transformer

Cited by: 4
Authors
Tian, Yingjie [1,2,3]
Bai, Kunlong [4]
Affiliations
[1] Univ Chinese Acad Sci, Sch Econ & Management, Beijing 100190, Peoples R China
[2] Chinese Acad Sci, Res Ctr Fictitious Econ & Data Sci, Beijing 100190, Peoples R China
[3] Chinese Acad Sci, Key Lab Big Data Min & Knowledge Management, Beijing 100190, Peoples R China
[4] Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Task analysis; Transformers; Neural networks; Visualization; Biological system modeling; Benchmark testing; Correlation; Deep neural network algorithms; machine learning applications; multitask learning (MTL);
DOI
10.1109/TNNLS.2023.3234166
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Multitask learning (MTL) is a challenging problem, particularly in the realm of computer vision (CV). Setting up vanilla deep MTL requires either hard or soft parameter-sharing schemes that employ greedy search to find the optimal network designs. Despite its widespread application, the performance of MTL models is vulnerable to under-constrained parameters. In this article, we draw on the recent success of the vision transformer (ViT) to propose a multitask representation learning method called multitask ViT (MTViT), which uses a multiple-branch transformer to sequentially process the image patches (i.e., tokens in the transformer) associated with the various tasks. Through the proposed cross-task attention (CA) module, a task token from each task branch is regarded as a query for exchanging information with the other task branches. In contrast to prior models, our proposed method extracts intrinsic features using the built-in self-attention mechanism of the ViT and incurs linear, rather than quadratic, memory and computational complexity. Comprehensive experiments are carried out on two benchmark datasets, NYU-Depth V2 (NYUDv2) and CityScapes, showing that the proposed MTViT outperforms or is on par with existing convolutional neural network (CNN)-based MTL methods. In addition, we apply our method to a synthetic dataset in which task relatedness is controlled. Surprisingly, the experimental results reveal that MTViT performs excellently when tasks are less related.
Pages: 9579-9590
Page count: 12