Optimizing Deeper Transformers on Small Datasets

Cited by: 0
Authors:
Xu, Peng [1 ]
Kumar, Dhruv [1 ,2 ]
Yang, Wei [1 ]
Zi, Wenjie [1 ]
Tang, Keyi [1 ]
Huang, Chenyang [1 ,5 ]
Cheung, Jackie Chi Kit [1 ,3 ,4 ]
Prince, Simon J. D. [1 ]
Cao, Yanshuai [1 ]
Affiliations:
[1] Borealis AI, Toronto, ON, Canada
[2] Univ Waterloo, Waterloo, ON, Canada
[3] McGill Univ, Montreal, PQ, Canada
[4] Mila, Canada CIFAR Chair, Montreal, PQ, Canada
[5] Univ Alberta, Edmonton, AB, Canada
DOI: not available
Chinese Library Classification (CLC): TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
It is a common belief that training deep transformers from scratch requires large datasets. Consequently, for small datasets, people usually use shallow and simple additional layers on top of pre-trained models during fine-tuning. This work shows that this does not always need to be the case: with proper initialization and optimization, the benefits of very deep transformers can carry over to challenging tasks with small datasets, including Text-to-SQL semantic parsing and logical reading comprehension. In particular, we successfully train 48 layers of transformers, comprising 24 fine-tuned layers from pre-trained RoBERTa and 24 relation-aware layers trained from scratch. With fewer training steps and no task-specific pre-training, we obtain state-of-the-art performance on the challenging cross-domain Text-to-SQL parsing benchmark Spider. We achieve this by deriving a novel Data-dependent Transformer Fixed-update initialization scheme (DT-Fixup), inspired by the prior T-Fixup work (Huang et al., 2020). Further error analysis shows that increasing depth can help improve generalization on small datasets for hard cases that require reasoning and structural understanding.
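The abstract describes an initialization scheme that shrinks weights by a depth-dependent factor so that residual updates stay bounded in a very deep stack. The exact DT-Fixup factor is data-dependent and derived in the paper; as a rough illustrative sketch only, a T-Fixup-style depth-scaled initialization (the non-data-dependent precursor) can be written as below. The function name and the (9N)^{-1/4} shrink factor are assumptions for illustration, not the paper's exact scheme:

```python
import math
import numpy as np

def depth_scaled_init(fan_in, fan_out, num_layers, rng=None):
    """Sample a Xavier-uniform weight matrix, then shrink it by a
    depth-dependent factor in the spirit of T-Fixup-style schemes.

    Illustrative only: DT-Fixup derives a data-dependent scale from the
    inputs; here we use a fixed (9 * num_layers) ** -0.25 shrink, so
    deeper stacks start with proportionally smaller weights.
    """
    rng = np.random.default_rng() if rng is None else rng
    bound = math.sqrt(6.0 / (fan_in + fan_out))          # Xavier-uniform bound
    w = rng.uniform(-bound, bound, size=(fan_out, fan_in))
    return w * (9 * num_layers) ** -0.25                 # depth-dependent shrink
```

Intuitively, the shrink factor keeps the sum of per-layer residual contributions from growing with depth, which is what allows training such stacks without a learning-rate warm-up stage.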
Pages: 2089-2102 (14 pages)