Optimizing Deeper Transformers on Small Datasets

Cited by: 0
Authors
Xu, Peng [1]
Kumar, Dhruv [1,2]
Yang, Wei [1]
Zi, Wenjie [1]
Tang, Keyi [1]
Huang, Chenyang [1,5]
Cheung, Jackie Chi Kit [1,3,4]
Prince, Simon J. D. [1]
Cao, Yanshuai [1]
Affiliations
[1] Borealis AI, Toronto, ON, Canada
[2] Univ Waterloo, Waterloo, ON, Canada
[3] McGill Univ, Montreal, PQ, Canada
[4] Mila, Canada CIFAR Chair, Montreal, PQ, Canada
[5] Univ Alberta, Edmonton, AB, Canada
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
It is a common belief that training deep transformers from scratch requires large datasets. Consequently, for small datasets, people usually use shallow and simple additional layers on top of pre-trained models during fine-tuning. This work shows that this does not always need to be the case: with proper initialization and optimization, the benefits of very deep transformers can carry over to challenging tasks with small datasets, including Text-to-SQL semantic parsing and logical reading comprehension. In particular, we successfully train 48 layers of transformers, comprising 24 fine-tuned layers from pre-trained RoBERTa and 24 relation-aware layers trained from scratch. With fewer training steps and no task-specific pre-training, we obtain state-of-the-art performance on the challenging cross-domain Text-to-SQL parsing benchmark Spider. We achieve this by deriving a novel Data-dependent Transformer Fixed-update initialization scheme (DT-Fixup), inspired by the prior T-Fixup work (Huang et al., 2020). Further error analysis shows that increasing depth can help improve generalization on small datasets for hard cases that require reasoning and structural understanding.
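For intuition, the PyTorch sketch below mirrors the setup the abstract describes: a stack of transformer layers trained from scratch on top of a fine-tuned pre-trained encoder, with the new layers rescaled once at initialization using a factor that depends on the stack depth and on a statistic computed from real training data. This is a minimal sketch under stated assumptions only: the class name, the choice of weights that get rescaled, and the factor `(n * max_norm ** 2) ** -0.5` are illustrative placeholders, not the exact DT-Fixup rule derived in the paper, and plain `nn.TransformerEncoderLayer` blocks stand in for the paper's relation-aware layers.

```python
# Minimal sketch (assumptions noted inline): freshly initialized transformer
# layers stacked on a pre-trained encoder, shrunk at initialization by a
# depth- and data-dependent factor before training. Not the paper's exact
# DT-Fixup formula.
import torch
import torch.nn as nn


class DeepStackedEncoder(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, d_model: int = 1024,
                 n_new_layers: int = 24, n_heads: int = 8):
        super().__init__()
        # Assumed: a fine-tuned pre-trained body (e.g. 24-layer RoBERTa) that
        # maps an input batch to [batch, seq, d_model] representations.
        self.pretrained = pretrained_encoder
        # Plain transformer layers stand in for relation-aware layers.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.new_stack = nn.TransformerEncoder(layer, num_layers=n_new_layers)

    @torch.no_grad()
    def data_dependent_rescale(self, sample_batch: torch.Tensor) -> None:
        """Rescale the new layers using the stack depth and the norm of the
        pre-trained representations on a sample of training data
        (illustrative rule, not the paper's derivation)."""
        reps = self.pretrained(sample_batch)          # [batch, seq, d_model]
        max_norm = reps.norm(dim=-1).max().item()     # data-dependent statistic
        n = len(self.new_stack.layers)
        scale = (n * max_norm ** 2) ** -0.5           # assumed form of the factor
        for layer in self.new_stack.layers:
            for w in (layer.self_attn.out_proj.weight,
                      layer.linear1.weight,
                      layer.linear2.weight):
                w.mul_(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.new_stack(self.pretrained(x))
```

In such a setup, `data_dependent_rescale` would be called once on a small sample of the training set before the first optimizer step, so the from-scratch stack starts with updates of controlled magnitude relative to the pre-trained representations.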
Pages: 2089 - 2102
Page count: 14
Related papers
50 items in total
  • [31] GITPose: going shallow and deeper using vision transformers for human pose estimation
    Evans Aidoo
    Xun Wang
    Zhenguang Liu
    Abraham Opanfo Abbam
    Edwin Kwadwo Tenagyei
    Victor Nonso Ejianya
    Seth Larweh Kodjiku
    Esther Stacy E. B. Aggrey
    [J]. Complex & Intelligent Systems, 2024, 10 : 4507 - 4520
  • [32] Digging Deeper: Operator Analysis for Optimizing Nonlinearity of Boolean Functions
    Durasevic, Marko
    Jakobovic, Domagoj
    Mariot, Luca
    Picek, Stjepan
    [J]. PROCEEDINGS OF THE 2023 GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE COMPANION, GECCO 2023 COMPANION, 2023, : 199 - 202
  • [33] Thermal modelling of small transformers
    Fard, H.G.
    Oraee, H.
    [J]. 2001, Universidade Federal de Uberlândia (10):
  • [34] Voltage Regulation in Small Transformers
    Banach, Henryk
    [J]. 2017 INTERNATIONAL SYMPOSIUM ON ELECTRICAL MACHINES (SME), 2017,
  • [35] Enhancing Transformers Loadability for Optimizing Assets Utilization and Efficiency
    Sbravati, Alan
    Oka, Marcelo Hisao
    Maso, Jair Anisio
    Valmus, Jeff
    [J]. 2018 IEEE ELECTRICAL INSULATION CONFERENCE (EIC), 2018, : 144 - 149
  • [36] Optimizing Computations of Ferrite-Based Transformers.
    Artem'ev, A.A.
    Artem'ev, A.I.
    [J]. Izvestiya Vysshikh Uchebnykh Zavedenii, Elektromekhanika, 1979, (03): 231 - 236
  • [37] Optimizing Mobile Vision Transformers for Land Cover Classification
    Rozario, Papia F.
    Gadgil, Ravi
    Lee, Junsu
    Gomes, Rahul
    Keller, Paige
    Liu, Yiheng
    Sipos, Gabriel
    Mcdonnell, Grace
    Impola, Westin
    Rudolph, Joseph
    [J]. APPLIED SCIENCES-BASEL, 2024, 14 (13):
  • [38] Optimizing core and winding design in high frequency transformers
    Hurley, WG
    [J]. CIEP 96 - V IEEE INTERNATIONAL POWER ELECTRONICS CONGRESS, TECHNICAL PROCEEDINGS, 1996, : 2 - 13
  • [39] Conditional GAN for Small Datasets
    Hiruta, Komei
    Saito, Ryusuke
    Hatakeyama, Taro
    Hashimoto, Atsushi
    Kurihara, Satoshi
    [J]. 2022 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM), 2022, : 278 - 281
  • [40] Hebbian dreaming for small datasets
    Agliari, Elena
    Alemanno, Francesco
    Aquaro, Miriam
    Barra, Adriano
    Durante, Fabrizio
    Kanter, Ido
    [J]. NEURAL NETWORKS, 2024, 173