A Comparative Study on Selecting Acoustic Modeling Units for WFST-based Mongolian Speech Recognition

Cited by: 0
Authors
Wang, Yonghe [1]
Bao, Feilong [1]
Gao, Guanglai [1]
Affiliation
[1] Inner Mongolia Univ, Coll Comp Sci, 235 West Coll Rd, Hohhot 010021, Inner Mongolia, Peoples R China
Keywords
Mongolian; speech recognition; acoustic modeling unit; alignment model; WFST;
DOI
10.1145/3617830
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Traditional weighted finite-state transducer (WFST)-based Mongolian automatic speech recognition (ASR) systems use phonemes as the modeling units of the pronunciation lexicon. However, Mongolian is an agglutinative, low-resource language, and building an ASR system on a phoneme pronunciation lexicon remains challenging for several reasons. First, the phoneme pronunciation lexicon manually constructed by Mongolian linguists is finite, so a grapheme-to-phoneme conversion (G2P) model is usually trained on it to expand the lexicon with new words; the resulting data sparsity reduces the robustness of the G2P model and degrades the performance of the final ASR system. Second, homophones and polysyllabic words are common in Mongolian, which complicates the construction of the Mongolian acoustic model. To address these problems, we first propose a grapheme-to-phoneme alignment model to obtain the mapping between phonemes and subword units. We then construct an acoustic subword segmentation set and segment words directly, instead of using the traditional G2P method to predict phoneme sequences, to expand the pronunciation lexicon. Further, by analyzing the Mongolian encoding scheme, we propose a method for constructing acoustic subword modeling units that removes control characters. Finally, we investigate various acoustic subword modeling units for building the pronunciation lexicon of the Mongolian ASR system. Experiments on a Mongolian dataset with 325 hours of training data show that a pronunciation lexicon based on acoustic subword modeling units can effectively build a WFST-based Mongolian ASR system, and that removing control characters when constructing the acoustic subword modeling units further improves ASR performance.
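To make the control-character step concrete, the following is a minimal Python sketch, not the authors' implementation: it strips the standard Mongolian Unicode control characters (the free variation selectors U+180B-U+180D, the Mongolian vowel separator U+180E, and the zero-width joiners) from a word and then segments the result by greedy longest match against a hypothetical acoustic subword set. The subword set contents, the greedy matching strategy, and the function names are illustrative assumptions; the paper's actual segmentation procedure may differ.

# Sketch only: not the paper's released code.
MONGOLIAN_CONTROL_CHARS = {
    "\u180B",  # FVS1 -- Mongolian free variation selector one
    "\u180C",  # FVS2
    "\u180D",  # FVS3
    "\u180E",  # MVS  -- Mongolian vowel separator
    "\u200C",  # ZWNJ -- zero-width non-joiner
    "\u200D",  # ZWJ  -- zero-width joiner
}


def strip_control_chars(word: str) -> str:
    """Remove presentation-only control characters from a Mongolian word."""
    return "".join(ch for ch in word if ch not in MONGOLIAN_CONTROL_CHARS)


def segment_with_subword_set(word, subword_set, max_len=6):
    """Greedy longest-match segmentation against an acoustic subword set
    (a stand-in for the paper's segmentation step; falls back to single
    characters when no longer unit matches)."""
    units, i = [], 0
    while i < len(word):
        for length in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + length]
            if length == 1 or piece in subword_set:
                units.append(piece)
                i += length
                break
    return units


# Example with a hypothetical subword set over a Latin transliteration:
# segment_with_subword_set(strip_control_chars("baina\u180b"), {"bai", "na"})
# -> ["bai", "na"]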
Pages: 20