Domain Generalization for Language-Independent Automatic Speech Recognition

Cited by: 1
Authors
Gao, Heting [1]
Ni, Junrui [1]
Zhang, Yang [2]
Qian, Kaizhi [2]
Chang, Shiyu [2, 3]
Hasegawa-Johnson, Mark [1]
Affiliations
[1] Univ Illinois, Beckman Inst, Dept Elect & Comp Engn ECE, Urbana, IL 61820 USA
[2] MIT, IBM Watson Lab, Cambridge, MA USA
[3] Univ Calif Santa Barbara, Dept Comp Sci, Santa Barbara, CA USA
Keywords
automatic speech recognition; under-resourced languages; invariant risk minimization; distributionally robust optimization; regret minimization; domain generalization
DOI
10.3389/frai.2022.806274
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A language-independent automatic speech recognizer (ASR) is one that can be used for phonetic transcription in languages other than the languages in which it was trained. Language-independent ASR is difficult to train, because different languages implement phones differently: even when phonemes in two different languages are written using the same symbols in the International Phonetic Alphabet, they are differentiated by different distributions of language-dependent redundant articulatory features. This article demonstrates that the goal of language independence may be approximated in different ways, depending on the size of the training set, the presence or absence of familial relationships between the training and test languages, and the method used to implement phone recognition or classification. When the training set contains many languages, and when every language in the test set is related to (shares a language family with) a language in the training set, then language-independent ASR may be trained using an empirical risk minimization strategy (e.g., using connectionist temporal classification without extra regularizers). When the training set is limited to a small number of languages from one language family, however, and the test languages are not from the same language family, then the best performance is achieved by using domain-invariant representation learning strategies. Two different representation learning strategies are tested in this article: invariant risk minimization and regret minimization. We find that invariant risk minimization is better at the task of phone token classification (given known segment boundary times), while regret minimization is better at the task of phone token recognition.
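The abstract names several training objectives without stating their formulas. As a rough illustration only, the sketch below contrasts them on per-language (per-environment) risks, under assumptions that are not taken from this record: a squared-error loss stands in for the ASR loss, the invariance penalty follows the commonly used IRMv1 form (squared gradient of the risk with respect to a scalar multiplier fixed at 1), and regret is taken as the gap between the shared model's risk and a per-language baseline risk. All function names are hypothetical.

```python
import numpy as np

def env_risk(pred, target):
    """Mean squared error for one training language (environment)."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.mean((pred - target) ** 2))

def irmv1_penalty(pred, target):
    """Assumed IRMv1-style penalty: squared derivative of
    mean((w * pred - target)^2) with respect to w, evaluated at w = 1."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    grad = np.mean(2.0 * (pred - target) * pred)
    return float(grad ** 2)

def erm_objective(risks):
    """Empirical risk minimization: average risk over languages."""
    return float(np.mean(risks))

def worst_case_objective(risks):
    """Distributionally robust flavor: risk of the worst language."""
    return float(np.max(risks))

def regret_objective(risks, baseline_risks):
    """Assumed regret: worst gap to a per-language baseline risk."""
    gaps = np.asarray(risks, float) - np.asarray(baseline_risks, float)
    return float(np.max(gaps))

# Two toy "languages": one fit perfectly, one off by a constant.
risks = [env_risk([1.0, 2.0], [1.0, 2.0]), env_risk([1.0, 1.0], [2.0, 2.0])]
penalties = [irmv1_penalty([1.0, 2.0], [1.0, 2.0]),
             irmv1_penalty([1.0, 1.0], [2.0, 2.0])]
irm_total = erm_objective(risks) + 1.0 * sum(penalties)  # penalty weight 1.0 is arbitrary
```

The averaged objective hides the poorly fit language, while the worst-case, regret, and penalty terms each single it out in a different way; which of these matches the article's exact formulation should be checked against the full text.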
Pages: 13