Improving Pretraining Techniques for Code-Switched NLP

Cited by: 0
Authors
Das, Richeek [1 ]
Ranjan, Sahasra [1 ]
Pathak, Shreya [2 ]
Jyothi, Preethi [1 ]
Affiliations
[1] Indian Inst Technol, Mumbai, Maharashtra, India
[2] DeepMind, Mumbai, Maharashtra, India
Source
PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1 | 2023
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Pretrained models are a mainstay in modern NLP applications. Pretraining requires access to large volumes of unlabeled text. While monolingual text is readily available for many of the world's languages, access to large quantities of code-switched text (i.e., text with tokens of multiple languages interspersed within a sentence) is much more scarce. Given this resource constraint, the question of how pretraining using limited amounts of code-switched text could be altered to improve performance for code-switched NLP becomes important to tackle. In this paper, we explore different masked language modeling (MLM) pretraining techniques for code-switched text that are cognizant of language boundaries prior to masking. The language identity of the tokens can either come from human annotators, trained language classifiers, or simple relative frequency-based estimates. We also present an MLM variant by introducing a residual connection from an earlier layer in the pretrained model that uniformly boosts performance on downstream tasks. Experiments on two downstream tasks, Question Answering (QA) and Sentiment Analysis (SA), involving four code-switched language pairs (Hindi-English, Spanish-English, Tamil-English, Malayalam-English) yield relative improvements of up to 5.8 and 2.7 F1 scores on QA (Hindi-English) and SA (Tamil-English), respectively, compared to standard pretraining techniques. To understand our task improvements better, we use a series of probes to study what additional information is encoded by our pretraining techniques and also introduce an auxiliary loss function that explicitly models language identification to further aid the residual MLM variants.
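The abstract describes three components: MLM masking that is aware of language boundaries, a residual connection from an earlier layer feeding the MLM head, and an auxiliary language-identification loss. The sketch below is a minimal PyTorch illustration of how such components could be wired together; it is not the authors' implementation. The function and class names (switch_aware_mask, ResidualMLMHead), the 3x boost on switch-point masking probability, and the 0.1 weight on the auxiliary loss are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch only (not the authors' code): language-boundary-aware MLM
# masking plus a residual MLM head with an auxiliary language-ID loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


def switch_aware_mask(input_ids, lang_ids, mask_token_id, mask_prob=0.15):
    """Prefer masking tokens at code-switch boundaries.

    lang_ids holds per-token language labels (e.g., 0 = English, 1 = Hindi),
    which per the abstract may come from human annotators, a trained LID
    classifier, or relative word-frequency estimates.
    """
    # A token is a switch point if its language differs from its left neighbour.
    switch = torch.zeros_like(lang_ids, dtype=torch.bool)
    switch[:, 1:] = lang_ids[:, 1:] != lang_ids[:, :-1]

    # Keep the usual MLM budget elsewhere, but boost the masking probability
    # at switch points (3x here, clamped to 1.0) -- an assumed heuristic.
    base = torch.full(input_ids.shape, mask_prob, device=input_ids.device)
    probs = torch.where(switch, (3 * base).clamp(max=1.0), base)
    mask = torch.bernoulli(probs).bool()

    labels = input_ids.clone()
    labels[~mask] = -100                    # ignore unmasked positions in the loss
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id
    return masked_inputs, labels


class ResidualMLMHead(nn.Module):
    """MLM head with a residual from an earlier transformer layer and an
    auxiliary per-token language-identification objective."""

    def __init__(self, hidden_size, vocab_size, num_langs=2, lid_weight=0.1):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.mlm_out = nn.Linear(hidden_size, vocab_size)
        self.lid_out = nn.Linear(hidden_size, num_langs)
        self.lid_weight = lid_weight

    def forward(self, final_hidden, early_hidden, labels=None, lang_ids=None):
        # Residual connection: combine the final layer with an earlier layer's
        # hidden states before predicting masked tokens.
        h = self.proj(final_hidden) + early_hidden
        mlm_logits = self.mlm_out(h)
        lid_logits = self.lid_out(h)

        loss = None
        if labels is not None:
            loss = F.cross_entropy(mlm_logits.transpose(1, 2), labels,
                                   ignore_index=-100)
            if lang_ids is not None:
                # Auxiliary LID loss, down-weighted relative to the MLM loss.
                loss = loss + self.lid_weight * F.cross_entropy(
                    lid_logits.transpose(1, 2), lang_ids)
        return mlm_logits, loss
```

In practice the masking boost, the choice of which earlier layer supplies early_hidden, and the auxiliary-loss weight would all be hyperparameters tuned per language pair; the paper itself should be consulted for the exact formulations and settings.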
Pages: 1176-1191 (16 pages)