Improving Pretraining Techniques for Code-Switched NLP

Cited by: 0
Authors
Das, Richeek [1 ]
Ranjan, Sahasra [1 ]
Pathak, Shreya [2 ]
Jyothi, Preethi [1 ]
Affiliations
[1] Indian Inst Technol, Mumbai, Maharashtra, India
[2] DeepMind, Mumbai, Maharashtra, India
Source
PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1 | 2023
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Pretrained models are a mainstay in modern NLP applications. Pretraining requires access to large volumes of unlabeled text. While monolingual text is readily available for many of the world's languages, access to large quantities of code-switched text (i.e., text with tokens of multiple languages interspersed within a sentence) is much more scarce. Given this resource constraint, the question of how pretraining using limited amounts of code-switched text could be altered to improve performance for code-switched NLP becomes important to tackle. In this paper, we explore different masked language modeling (MLM) pretraining techniques for code-switched text that are cognizant of language boundaries prior to masking. The language identity of the tokens can either come from human annotators, trained language classifiers, or simple relative frequency-based estimates. We also present an MLM variant by introducing a residual connection from an earlier layer in the pretrained model that uniformly boosts performance on downstream tasks. Experiments on two downstream tasks, Question Answering (QA) and Sentiment Analysis (SA), involving four code-switched language pairs (Hindi-English, Spanish-English, Tamil-English, Malayalam-English) yield relative improvements of up to 5.8 and 2.7 F1 scores on QA (Hindi-English) and SA (Tamil-English), respectively, compared to standard pretraining techniques. To understand our task improvements better, we use a series of probes to study what additional information is encoded by our pretraining techniques and also introduce an auxiliary loss function that explicitly models language identification to further aid the residual MLM variants.
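The abstract describes three components: MLM masking that is aware of language boundaries, a residual connection from an earlier layer feeding the MLM head, and an auxiliary language-identification loss. The sketch below is a minimal PyTorch illustration of how such components could be wired together; it is not the authors' implementation. The function and class names (switch_aware_mask, ResidualMLMHead), the 3x boost on switch-point masking probability, and the 0.1 weight on the auxiliary loss are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch only (not the authors' code): language-boundary-aware MLM
# masking plus a residual MLM head with an auxiliary language-ID loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


def switch_aware_mask(input_ids, lang_ids, mask_token_id, mask_prob=0.15):
    """Prefer masking tokens at code-switch boundaries.

    lang_ids holds per-token language labels (e.g., 0 = English, 1 = Hindi),
    which per the abstract may come from human annotators, a trained LID
    classifier, or relative word-frequency estimates.
    """
    # A token is a switch point if its language differs from its left neighbour.
    switch = torch.zeros_like(lang_ids, dtype=torch.bool)
    switch[:, 1:] = lang_ids[:, 1:] != lang_ids[:, :-1]

    # Keep the usual MLM budget elsewhere, but boost the masking probability
    # at switch points (3x here, clamped to 1.0) -- an assumed heuristic.
    base = torch.full(input_ids.shape, mask_prob, device=input_ids.device)
    probs = torch.where(switch, (3 * base).clamp(max=1.0), base)
    mask = torch.bernoulli(probs).bool()

    labels = input_ids.clone()
    labels[~mask] = -100                    # ignore unmasked positions in the loss
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id
    return masked_inputs, labels


class ResidualMLMHead(nn.Module):
    """MLM head with a residual from an earlier transformer layer and an
    auxiliary per-token language-identification objective."""

    def __init__(self, hidden_size, vocab_size, num_langs=2, lid_weight=0.1):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.mlm_out = nn.Linear(hidden_size, vocab_size)
        self.lid_out = nn.Linear(hidden_size, num_langs)
        self.lid_weight = lid_weight

    def forward(self, final_hidden, early_hidden, labels=None, lang_ids=None):
        # Residual connection: combine the final layer with an earlier layer's
        # hidden states before predicting masked tokens.
        h = self.proj(final_hidden) + early_hidden
        mlm_logits = self.mlm_out(h)
        lid_logits = self.lid_out(h)

        loss = None
        if labels is not None:
            loss = F.cross_entropy(mlm_logits.transpose(1, 2), labels,
                                   ignore_index=-100)
            if lang_ids is not None:
                # Auxiliary LID loss, down-weighted relative to the MLM loss.
                loss = loss + self.lid_weight * F.cross_entropy(
                    lid_logits.transpose(1, 2), lang_ids)
        return mlm_logits, loss
```

In practice the masking boost, the choice of which earlier layer supplies early_hidden, and the auxiliary-loss weight would all be hyperparameters tuned per language pair; the paper itself should be consulted for the exact formulations and settings.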
Pages: 1176-1191 (16 pages)