Optimizing Deeper Transformers on Small Datasets

Cited by: 0
Authors
Xu, Peng [1]
Kumar, Dhruv [1,2]
Yang, Wei [1]
Zi, Wenjie [1]
Tang, Keyi [1]
Huang, Chenyang [1,5]
Cheung, Jackie Chi Kit [1,3,4]
Prince, Simon J. D. [1]
Cao, Yanshuai [1]
Affiliations
[1] Borealis AI, Toronto, ON, Canada
[2] Univ Waterloo, Waterloo, ON, Canada
[3] McGill Univ, Montreal, PQ, Canada
[4] Mila, Canada CIFAR Chair, Montreal, PQ, Canada
[5] Univ Alberta, Edmonton, AB, Canada
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
It is a common belief that training deep transformers from scratch requires large datasets. Consequently, for small datasets, people usually use shallow and simple additional layers on top of pre-trained models during fine-tuning. This work shows that this does not always need to be the case: with proper initialization and optimization, the benefits of very deep transformers can carry over to challenging tasks with small datasets, including Text-to-SQL semantic parsing and logical reading comprehension. In particular, we successfully train 48 layers of transformers, comprising 24 fine-tuned layers from pre-trained RoBERTa and 24 relation-aware layers trained from scratch. With fewer training steps and no task-specific pre-training, we obtain state-of-the-art performance on the challenging cross-domain Text-to-SQL parsing benchmark Spider. We achieve this by deriving a novel Data-dependent Transformer Fixed-update initialization scheme (DT-Fixup), inspired by the prior T-Fixup work (Huang et al., 2020). Further error analysis shows that increasing depth can help improve generalization on small datasets for hard cases that require reasoning and structural understanding.
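For intuition, the PyTorch sketch below mirrors the setup the abstract describes: a stack of transformer layers trained from scratch on top of a fine-tuned pre-trained encoder, with the new layers rescaled once at initialization using a factor that depends on the stack depth and on a statistic computed from real training data. This is a minimal sketch under stated assumptions only: the class name, the choice of weights that get rescaled, and the factor `(n * max_norm ** 2) ** -0.5` are illustrative placeholders, not the exact DT-Fixup rule derived in the paper, and plain `nn.TransformerEncoderLayer` blocks stand in for the paper's relation-aware layers.

```python
# Minimal sketch (assumptions noted inline): freshly initialized transformer
# layers stacked on a pre-trained encoder, shrunk at initialization by a
# depth- and data-dependent factor before training. Not the paper's exact
# DT-Fixup formula.
import torch
import torch.nn as nn


class DeepStackedEncoder(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, d_model: int = 1024,
                 n_new_layers: int = 24, n_heads: int = 8):
        super().__init__()
        # Assumed: a fine-tuned pre-trained body (e.g. 24-layer RoBERTa) that
        # maps an input batch to [batch, seq, d_model] representations.
        self.pretrained = pretrained_encoder
        # Plain transformer layers stand in for relation-aware layers.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.new_stack = nn.TransformerEncoder(layer, num_layers=n_new_layers)

    @torch.no_grad()
    def data_dependent_rescale(self, sample_batch: torch.Tensor) -> None:
        """Rescale the new layers using the stack depth and the norm of the
        pre-trained representations on a sample of training data
        (illustrative rule, not the paper's derivation)."""
        reps = self.pretrained(sample_batch)          # [batch, seq, d_model]
        max_norm = reps.norm(dim=-1).max().item()     # data-dependent statistic
        n = len(self.new_stack.layers)
        scale = (n * max_norm ** 2) ** -0.5           # assumed form of the factor
        for layer in self.new_stack.layers:
            for w in (layer.self_attn.out_proj.weight,
                      layer.linear1.weight,
                      layer.linear2.weight):
                w.mul_(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.new_stack(self.pretrained(x))
```

In such a setup, `data_dependent_rescale` would be called once on a small sample of the training set before the first optimizer step, so the from-scratch stack starts with updates of controlled magnitude relative to the pre-trained representations.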
Pages: 2089 - 2102
Page count: 14
Related papers
50 items in total
  • [31] GITPose: going shallow and deeper using vision transformers for human pose estimation
    Evans Aidoo
    Xun Wang
    Zhenguang Liu
    Abraham Opanfo Abbam
    Edwin Kwadwo Tenagyei
    Victor Nonso Ejianya
    Seth Larweh Kodjiku
    Esther Stacy E. B. Aggrey
    [J]. Complex & Intelligent Systems, 2024, 10 : 4507 - 4520
  • [32] Digging Deeper: Operator Analysis for Optimizing Nonlinearity of Boolean Functions
    Durasevic, Marko
    Jakobovic, Domagoj
    Mariot, Luca
    Picek, Stjepan
    [J]. PROCEEDINGS OF THE 2023 GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE COMPANION, GECCO 2023 COMPANION, 2023, : 199 - 202
  • [33] Thermal modelling of small transformers
    Fard, H.G.
    Oraee, H.
    [J]. 2001, Universidade Federal de Uberlândia (10):
  • [34] Voltage Regulation in Small Transformers
    Banach, Henryk
    [J]. 2017 INTERNATIONAL SYMPOSIUM ON ELECTRICAL MACHINES (SME), 2017,
  • [35] Enhancing Transformers Loadability for Optimizing Assets Utilization and Efficiency
    Sbravati, Alan
    Oka, Marcelo Hisao
    Maso, Jair Anisio
    Valmus, Jeff
    [J]. 2018 IEEE ELECTRICAL INSULATION CONFERENCE (EIC), 2018, : 144 - 149
  • [36] Optimizing Computations of Ferrite-Based Transformers.
    Artem'ev, A.A.
    Artem'ev, A.I.
    [J]. Izvestiya Vysshikh Uchebnykh Zavedenii, Elektromekhanika, 1979, (03): 231 - 236
  • [37] Optimizing Mobile Vision Transformers for Land Cover Classification
    Rozario, Papia F.
    Gadgil, Ravi
    Lee, Junsu
    Gomes, Rahul
    Keller, Paige
    Liu, Yiheng
    Sipos, Gabriel
    Mcdonnell, Grace
    Impola, Westin
    Rudolph, Joseph
    [J]. APPLIED SCIENCES-BASEL, 2024, 14 (13):
  • [38] Optimizing core and winding design in high frequency transformers
    Hurley, WG
    [J]. CIEP 96 - V IEEE INTERNATIONAL POWER ELECTRONICS CONGRESS, TECHNICAL PROCEEDINGS, 1996, : 2 - 13
  • [39] Conditional GAN for Small Datasets
    Hiruta, Komei
    Saito, Ryusuke
    Hatakeyama, Taro
    Hashimoto, Atsushi
    Kurihara, Satoshi
    [J]. 2022 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM), 2022, : 278 - 281
  • [40] Hebbian dreaming for small datasets
    Agliari, Elena
    Alemanno, Francesco
    Aquaro, Miriam
    Barra, Adriano
    Durante, Fabrizio
    Kanter, Ido
    [J]. NEURAL NETWORKS, 2024, 173