Towards Cross-Corpora Generalization for Low-Resource Spoken Language Identification

Authors
Dey, Spandan [1, 2]
Sahidullah, Md [3, 4]
Saha, Goutam [1]
Affiliations
[1] Indian Inst Technol Kharagpur, Dept E & ECE, Kharagpur 721302, India
[2] Samsung R&D Inst India Bangalore, Bengaluru 560037, India
[3] TCG CREST, Inst Adv Intelligence, Bidhannagar 700091, India
[4] Acad Sci & Innovat Res, Ghaziabad 201002, India
Keywords
Vectors; Training; Correlation; Speech processing; Recording; Noise; Measurement; Databases; Training data; NIST; Spoken language identification; low-resource; cross-corpora evaluation; corpora mismatch; domain invariance;
DOI
10.1109/TASLP.2024.3492807
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206; 082403;
Abstract
Low-resource spoken language identification (LID) systems are prone to poor generalization across unknown domains. In this study, using multiple widely used low-resource South Asian LID corpora, we conduct an in-depth analysis to understand the key non-lingual bias factors that create corpora mismatch and degrade LID generalization. To quantify the biases, we extract different data-driven and rule-based summary vectors that capture non-lingual aspects, such as speaker characteristics, spoken context, accents or dialects, recording channels, background noise, and environments. We then conduct a statistical analysis to identify the most crucial non-lingual bias factors and corpora mismatch components that impact LID performance. Following these analyses, we propose effective bias compensation approaches for the most relevant summary vectors. We generate pseudo-labels using hierarchical clustering over language-domain-gender constrained summary vectors and use them to train adversarial networks with conditioned metric loss. The compensations learn invariance to the corpora mismatches caused by the non-lingual biases and help improve generalization. With the proposed compensation method, we improve the equal error rate by up to 5.22% and 8.14% for the same-corpora and cross-corpora evaluations, respectively.
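The pseudo-label step in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that "language-domain-gender constrained" means clustering is run separately inside each (language, domain, gender) group, uses a plain average-linkage agglomerative procedure with Euclidean distance, and fixes the number of clusters per group by hand. The function names `agglomerative_labels` and `pseudo_labels`, the `n_clusters` parameter, and the toy vectors are all hypothetical.

```python
from collections import defaultdict
from math import dist


def agglomerative_labels(vectors, n_clusters):
    """Average-linkage agglomerative clustering (illustrative, O(n^3))."""
    clusters = [[i] for i in range(len(vectors))]

    def avg_dist(a, b):
        # Mean pairwise Euclidean distance between two clusters.
        total = sum(dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        _, p, q = min(
            (avg_dist(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] += clusters.pop(q)  # q > p, so index p stays valid

    labels = [0] * len(vectors)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels


def pseudo_labels(samples, n_clusters=2):
    """samples: list of (language, domain, gender, summary_vector).

    Assumption: clustering is constrained to each (language, domain, gender)
    group, and the pseudo-label is that group key plus the local cluster id.
    """
    groups = defaultdict(list)
    for idx, (lang, domain, gender, _) in enumerate(samples):
        groups[(lang, domain, gender)].append(idx)

    labels = [None] * len(samples)
    for key, idxs in groups.items():
        local = agglomerative_labels([samples[i][3] for i in idxs], n_clusters)
        for i, cluster_id in zip(idxs, local):
            labels[i] = (*key, cluster_id)
    return labels


# Toy data: hypothetical 2-D summary vectors for two groups.
samples = [
    ("bn", "corpusA", "f", (0.0, 0.1)),
    ("bn", "corpusA", "f", (0.1, 0.0)),
    ("bn", "corpusA", "f", (5.0, 5.1)),
    ("bn", "corpusA", "f", (5.1, 5.0)),
    ("hi", "corpusB", "m", (1.0, 1.0)),
]
labels = pseudo_labels(samples, n_clusters=2)
```

In a pipeline like the one the abstract describes, labels of this kind would serve as targets for the adversarial branch. The paper itself should be consulted for the actual clustering criterion, distance measure, and number of clusters.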
Pages: 5040-5050
Number of pages: 11