End-to-End Multitask Learning With Vision Transformer

Cited by: 4
Authors
Tian, Yingjie [1 ,2 ,3 ]
Bai, Kunlong [4 ]
Affiliations
[1] Univ Chinese Acad Sci, Sch Econ & Management, Beijing 100190, Peoples R China
[2] Chinese Acad Sci, Res Ctr Fictitious Econ & Data Sci, Beijing 100190, Peoples R China
[3] Chinese Acad Sci, Key Lab Big Data Min & Knowledge Management, Beijing 100190, Peoples R China
[4] Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Task analysis; Transformers; Neural networks; Visualization; Biological system modeling; Benchmark testing; Correlation; Deep neural network algorithms; machine learning applications; multitask learning (MTL);
DOI
10.1109/TNNLS.2023.3234166
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Multitask learning (MTL) is a challenging problem, particularly in the realm of computer vision (CV). Setting up vanilla deep MTL requires either hard or soft parameter-sharing schemes that employ greedy search to find the optimal network design. Despite its widespread application, the performance of MTL models is vulnerable to underconstrained parameters. In this article, we draw on the recent success of the vision transformer (ViT) to propose a multitask representation learning method called multitask ViT (MTViT), which uses a multiple-branch transformer to sequentially process the image patches (i.e., tokens in the transformer) associated with the various tasks. Through the proposed cross-task attention (CA) module, a task token from each task branch is regarded as a query for exchanging information with the other task branches. In contrast to prior models, our proposed method extracts intrinsic features with the built-in self-attention mechanism of the ViT and requires only linear, rather than quadratic, memory and computational complexity. Comprehensive experiments on two benchmark datasets, NYU-Depth V2 (NYUDv2) and CityScapes, show that the proposed MTViT outperforms or is on par with existing convolutional neural network (CNN)-based MTL methods. In addition, we apply our method to a synthetic dataset in which task relatedness is controlled. Surprisingly, the experimental results reveal that MTViT performs particularly well when tasks are less related.
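The linear-complexity claim in the abstract follows from the CA module using a single task token as the query: attending one query over n tokens of another branch costs O(n), not the O(n^2) of full token-to-token attention. The sketch below is illustrative only, not the authors' implementation; it assumes single-head scaled dot-product attention and omits the learned query/key/value projections and multi-head structure a real ViT block would have.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_task_attention(task_token, other_tokens):
    """One CA step: the task token of one branch (shape (d,)) queries the
    token sequence of another branch (shape (n, d)). A single query means
    cost linear in n, matching the abstract's complexity claim."""
    d = task_token.shape[0]
    scores = other_tokens @ task_token / np.sqrt(d)  # (n,) similarity scores
    weights = softmax(scores)                        # attention distribution
    return weights @ other_tokens                    # (d,) fused message

# Hypothetical example: a segmentation-branch task token attends over
# the tokens of a depth-estimation branch (names are illustrative).
rng = np.random.default_rng(0)
t_seg = rng.standard_normal(8)               # task token, d = 8
tokens_depth = rng.standard_normal((16, 8))  # n = 16 tokens from the other branch
msg = cross_task_attention(t_seg, tokens_depth)
print(msg.shape)  # (8,)
```

The fused message would then be injected back into the querying branch (e.g., added to its task token), so each branch absorbs information from the others at only linear extra cost per branch pair.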
Pages: 9579-9590 (12 pages)