Towards Cross-Corpora Generalization for Low-Resource Spoken Language Identification

Authors
Dey, Spandan [1, 2]
Sahidullah, Md [3, 4]
Saha, Goutam [1]
Affiliations
[1] Indian Inst Technol Kharagpur, Dept E & ECE, Kharagpur 721302, India
[2] Samsung R&D Inst India Bangalore, Bengaluru 560037, India
[3] TCG CREST, Inst Adv Intelligence, Bidhannagar 700091, India
[4] Acad Sci & Innovat Res, Ghaziabad 201002, India
Keywords
Vectors; Training; Correlation; Speech processing; Recording; Noise; Measurement; Databases; Training data; NIST; Spoken language identification; low-resource; cross-corpora evaluation; corpora mismatch; domain invariance
DOI
10.1109/TASLP.2024.3492807
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Low-resource spoken language identification (LID) systems are prone to poor generalization across unknown domains. In this study, using multiple widely used low-resource South Asian LID corpora, we conduct an in-depth analysis to understand the key non-lingual bias factors that create corpora mismatch and degrade LID generalization. To quantify the biases, we extract different data-driven and rule-based summary vectors that capture non-lingual aspects, such as speaker characteristics, spoken context, accents or dialects, recording channels, background noise, and environments. We then conduct a statistical analysis to identify the most crucial non-lingual bias factors and corpora mismatch components that impact LID performance. Following these analyses, we propose effective bias compensation approaches for the most relevant summary vectors. We generate pseudo-labels using hierarchical clustering over language-domain-gender constrained summary vectors and use them to train adversarial networks with conditioned metric loss. The compensations learn invariance to the corpora mismatches caused by the non-lingual biases and help improve generalization. With the proposed compensation method, we improve the equal error rate by up to 5.22% and 8.14% for the same-corpora and cross-corpora evaluations, respectively.
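The abstract reports its gains in terms of equal error rate (EER), the operating point at which the miss rate and the false-alarm rate are equal. As a minimal sketch of how that metric can be computed from detection scores, the pure-Python snippet below sweeps thresholds over the observed scores. The function name `equal_error_rate` and the toy scores are illustrative assumptions, not taken from the paper, and the paper's own evaluation protocol may differ.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    Hypothetical helper for illustration only. A trial is accepted when
    its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # Miss rate: target trials rejected at this threshold.
        fnr = sum(s < t for s in target_scores) / len(target_scores)
        # False-alarm rate: non-target trials accepted at this threshold.
        fpr = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(fnr - fpr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest
            # and report their average as the EER estimate.
            best_gap, eer = gap, (fnr + fpr) / 2
    return eer


# Toy scores (made up): one target trial and one non-target trial overlap.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))
```

With only a handful of trials the two error curves rarely cross exactly, so averaging the two rates at the closest threshold is one common approximation. Larger evaluations often interpolate along the ROC or DET curve instead.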
Pages: 5040-5050
Page count: 11
Related papers
50 in total
  • [1] Cross-corpora spoken language identification with domain diversification and generalization
    Dey, Spandan
    Sahidullah, Md
    Saha, Goutam
    COMPUTER SPEECH AND LANGUAGE, 2023, 81
  • [2] Towards dialect-inclusive recognition in a low-resource language: are balanced corpora the answer?
    Lonergan, Liam
    Qian, Mengjie
    Chiarain, Neasa Ni
    Gobl, Christer
    Chasaide, Ailbhe Ni
    INTERSPEECH 2023, 2023, : 5082 - 5086
  • [3] Bidirectional Representations for Low-Resource Spoken Language Understanding
    Meeus, Quentin
    Moens, Marie-Francine
    Van Hamme, Hugo
    APPLIED SCIENCES-BASEL, 2023, 13 (20):
  • [4] A Study on Low-resource Language Identification
    Qi, Zhaodi
    Ma, Yong
    Gu, Mingliang
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 1897 - 1902
  • [5] Data Selection using Spoken Language Identification for Low-Resource and Zero-Resource Speech Recognition
    Chen, Jianan
    Chu, Chenhui
    Li, Sheng
    Kawahara, Tatsuya
    APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024, 2024,
  • [6] Meta Auxiliary Learning for Low-resource Spoken Language Understanding
    Gao, Yingying
    Feng, Junlan
    Deng, Chao
    Zhang, Shilei
    INTERSPEECH 2022, 2022, : 2703 - 2707
  • [7] GlotLID: Language Identification for Low-Resource Languages
    Kargaran, Amir Hossein
    Imani, Ayyoob
    Yvon, Francois
    Schuetze, Hinrich
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 6155 - 6218
  • [8] Cross-Corpora Language Recognition: A Preliminary Investigation with Indian Languages
    Dey, Spandan
    Saha, Goutam
    Sahidullah, Md
    29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 546 - 550
  • [9] Bottleneck Low-rank Transformers for Low-resource Spoken Language Understanding
    Wang, Pu
    Van Hamme, Hugo
    INTERSPEECH 2022, 2022, : 1248 - 1252
  • [10] Multilingual Offensive Language Identification for Low-resource Languages
    Ranasinghe, Tharindu
    Zampieri, Marcos
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2022, 21 (01)