A Comparative Study on Selecting Acoustic Modeling Units for WFST-based Mongolian Speech Recognition

Authors
Wang, Yonghe [1]
Bao, Feilong [1]
Gao, Guanglai [1]
Affiliations
[1] Inner Mongolia Univ, Coll Comp Sci, 235 West Coll Rd, Hohhot 010021, Inner Mongolia, Peoples R China
Keywords
Mongolian; speech recognition; acoustic modeling unit; alignment model; WFST
DOI
10.1145/3617830
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Traditional weighted finite-state transducer (WFST)-based Mongolian automatic speech recognition (ASR) systems use phonemes as the modeling units of the pronunciation lexicon. However, Mongolian is an agglutinative, low-resource language, and building an ASR system on a phoneme pronunciation lexicon remains challenging for two reasons. First, the phoneme pronunciation lexicon manually constructed by Mongolian linguists is finite, so it is usually used to train a grapheme-to-phoneme (G2P) conversion model that extends the lexicon to new words. However, data sparsity reduces the robustness of the G2P model and degrades the performance of the final ASR system. Second, homophones and polysyllabic words are common in Mongolian, which complicates the construction of the Mongolian acoustic model. To address these problems, we first propose a grapheme-to-phoneme alignment model to obtain the mapping between phonemes and subword units. We then construct an acoustic subword segmentation set and use it to segment words directly, instead of using the traditional G2P method to predict phoneme sequences when expanding the pronunciation lexicon. Further, by analyzing the Mongolian encoding form, we propose a method for constructing acoustic subword modeling units that removes control characters. Finally, we investigate various acoustic subword modeling units for constructing the pronunciation lexicon of the Mongolian ASR system. Experiments on a Mongolian dataset with 325 hours of training data show that a pronunciation lexicon based on acoustic subword modeling units can effectively support a WFST-based Mongolian ASR system, and that removing control characters when building the acoustic subword modeling units further improves ASR performance.
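The two ideas in the abstract, segmenting words directly with a subword set and dropping control characters before building lexicon entries, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the greedy longest-match strategy, and the exact set of control characters (Mongolian free variation selectors, the Mongolian vowel separator, and the zero-width joiner and non-joiner) are assumptions chosen for illustration, and the example uses Latin placeholder strings rather than real Mongolian subwords.

```python
# Hypothetical sketch; names and strategy are assumptions, not the paper's method.

# Assumed set of control characters that can occur in Mongolian Unicode text.
CONTROL_CHARS = {
    "\u180b", "\u180c", "\u180d",  # free variation selectors FVS1 to FVS3
    "\u180e",                      # Mongolian vowel separator (MVS)
    "\u200c", "\u200d",            # zero-width non-joiner / joiner
}


def strip_control_chars(word):
    """Return the word with all assumed control characters removed."""
    return "".join(ch for ch in word if ch not in CONTROL_CHARS)


def segment(word, subword_set):
    """Greedy longest-match segmentation into subword units.

    Characters not covered by any unit fall back to single-character units.
    """
    units = []
    max_len = max((len(s) for s in subword_set), default=1)
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking down to one character.
        for n in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + n]
            if n == 1 or piece in subword_set:
                units.append(piece)
                i += n
                break
    return units


def build_lexicon(words, subword_set, remove_control=True):
    """Map each surface word to its subword unit sequence."""
    lexicon = {}
    for w in words:
        form = strip_control_chars(w) if remove_control else w
        lexicon[w] = segment(form, subword_set)
    return lexicon


if __name__ == "__main__":
    subwords = {"sa", "in", "ba"}          # placeholder subword set
    words = ["sain", "sa\u180bin", "bax"]  # second word carries an FVS1
    for word, units in build_lexicon(words, subwords).items():
        print(repr(word), units)
```

With control characters removed, both spellings of the placeholder word map to the same unit sequence, which is the kind of lexicon simplification the abstract attributes to this step.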
Pages: 20
Related papers
50 in total
  • [1] Improved subword modeling for WFST-based speech recognition
    Smit, Peter
    Virpioja, Sami
    Kurimo, Mikko
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 2551 - 2555
  • [2] AN ASYNCHRONOUS WFST-BASED DECODER FOR AUTOMATIC SPEECH RECOGNITION
    Lv, Hang
    Chen, Zhehuai
    Xu, Hainan
    Povey, Daniel
    Xie, Lei
    Khudanpur, Sanjeev
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6019 - 6023
  • [3] Dynamic Grammars with Lookahead Composition for WFST-based Speech Recognition
    Novak, Josef R.
    Minematsu, Nobuaki
    Hirose, Keikichi
    13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 1078 - 1081
  • [4] WFST-BASED STRUCTURAL CLASSIFICATION INTEGRATING DNN ACOUSTIC FEATURES AND RNN LANGUAGE FEATURES FOR SPEECH RECOGNITION
    Quoc Truong Do
    Nakamura, Satoshi
    Delcroix, Marc
    Hori, Takaaki
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4959 - 4963
  • [5] A comparative study on selecting acoustic modeling units in deep neural networks based large vocabulary Chinese speech recognition
    Li, Xiangang
    Yang, Yuning
    Pang, Zaihu
    Wu, Xihong
    NEUROCOMPUTING, 2015, 170 : 251 - 256
  • [6] H- AND C-LEVEL WFST-BASED LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION ON GRAPHICS PROCESSING UNITS
    Kim, Jungsuk
    You, Kisun
    Sung, Wonyong
    2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2011, : 1733 - 1736
  • [7] Tied-State Mixture Language Model for WFST-based Speech Recognition
    Yamamoto, Hitoshi
    Dixon, Paul R.
    Matsuda, Shigeki
    Hori, Chiori
    Kashioka, Hideki
    13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 174 - 177
  • [8] Compact and Efficient WFST-based Decoders for Handwriting Recognition
    Cai, Meng
    Huo, Qiang
    2017 14TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), VOL 1, 2017, : 143 - 148
  • [9] Large Vocabulary Continuous Speech Recognition Using WFST-based Linear Classifier for Structured Data
    Watanabe, Shinji
    Hori, Takaaki
    Nakamura, Atsushi
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, 2010, : 346 - 349
  • [10] SILENCE IS GOLDEN: MODELING NON-SPEECH EVENTS IN WFST-BASED DYNAMIC NETWORK DECODERS
    Rybach, David
    Schlueter, Ralf
    Ney, Hermann
    2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2012, : 4205 - 4208