A Comparative Study on Selecting Acoustic Modeling Units for WFST-based Mongolian Speech Recognition

Cited by: 0
Authors
Wang, Yonghe [1]
Bao, Feilong [1]
Gao, Guanglai [1]
Affiliations
[1] Inner Mongolia Univ, Coll Comp Sci, 235 West Coll Rd, Hohhot 010021, Inner Mongolia, Peoples R China
Keywords
Mongolian; speech recognition; acoustic modeling unit; alignment model; WFST
DOI
10.1145/3617830
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Traditional weighted finite-state transducer (WFST)-based Mongolian automatic speech recognition (ASR) systems use phonemes as the modeling units of the pronunciation lexicon. However, Mongolian is an agglutinative, low-resource language, and building an ASR system on a phoneme pronunciation lexicon remains challenging for two reasons. First, the phoneme pronunciation lexicon manually constructed by Mongolian linguists is finite, so it is usually used to train a grapheme-to-phoneme (G2P) conversion model that extends the lexicon to new words. Data sparsity, however, reduces the robustness of the G2P model and degrades the performance of the final ASR system. Second, homophones and polysyllabic words are common in Mongolian, which complicates the construction of the Mongolian acoustic model. To address these problems, we first propose a grapheme-to-phoneme alignment model to obtain the mapping between phonemes and subword units. We then construct an acoustic subword segmentation set and use it to segment words directly, instead of using the traditional G2P method to predict phoneme sequences when expanding the pronunciation lexicon. In addition, by analyzing the Mongolian encoding form, we propose a method for constructing acoustic subword modeling units that removes control characters. Finally, we investigate various acoustic subword modeling units for pronunciation lexicon construction in the Mongolian ASR system. Experiments on a Mongolian dataset with 325 hours of training data show that a pronunciation lexicon based on acoustic subword modeling units can effectively support a WFST-based Mongolian ASR system, and that removing control characters when building the acoustic subword modeling units further improves ASR performance.
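The two ideas in the abstract, removing control characters and segmenting words directly with a subword set, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' method: the code points in `CONTROL_CHARS`, the greedy longest-match strategy, and the names `strip_control_chars`, `segment`, and `build_lexicon` are all hypothetical, and the abstract does not say which control characters or which segmentation procedure the paper uses.

```python
# Assumed set of Mongolian format/control code points (NOT taken from the paper):
# free variation selectors 1-3, Mongolian vowel separator, ZWNJ, ZWJ.
CONTROL_CHARS = {"\u180b", "\u180c", "\u180d", "\u180e", "\u200c", "\u200d"}


def strip_control_chars(word):
    """Return the word with the assumed control characters removed."""
    return "".join(ch for ch in word if ch not in CONTROL_CHARS)


def segment(word, units):
    """Greedy longest-match segmentation of a word into subword units.

    Falls back to a single character when no listed unit matches.
    """
    max_len = max(map(len, units))
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if size == 1 or piece in units:
                pieces.append(piece)
                i += size
                break
    return pieces


def build_lexicon(words, units):
    """Map each word to its subword sequence, like a lexicon entry."""
    return {w: segment(strip_control_chars(w), units) for w in words}
```

In a WFST pipeline, each entry of such a dictionary would become one line of the lexicon from which the lexicon transducer is compiled; the subword set itself would come from the alignment model, which this sketch does not attempt to reproduce.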
Pages: 20