Monophone-based connected word Hindi speech recognition improvement

Cited by: 3
Authors
Bhatt S. [1]
Jain A. [1]
Dev A. [2]
Affiliations
[1] University School of Information and Communication Technology, GGSIP University, New Delhi
[2] Indira Gandhi Delhi Technical University for Women, New Delhi
Keywords
connected word; Hidden Markov Model; Hindi; MFCCs; monophone; PLP; speech recognition
DOI
10.1007/s12046-021-01614-3
Abstract
This paper proposes a model to improve monophone-based connected-word speech recognition for Hindi using the Hidden Markov Model (HMM). The model combines hybrid subword units with domain-specific syntactic structures. The hybrid units contain both phoneme- and syllable-based subword units. Because syllable-based subword units cover a larger acoustic span, contextual effects are reduced. In the hybrid model, syllable-based acoustic units are used only to model nasal sounds, with the aim of improving their recognition score. A further improvement is proposed using syntactic structures in the grammar definition during recognition: domain-specific syntactic structures reduce the recognizer's search space and consequently improve system performance. Two grammar definitions were applied: gram1, with no restrictions, and gram2, with domain-specific structures. The speech recognition framework was implemented using the HMM-based toolkit HTK with five-state HMMs. A self-created connected-word speech dataset with a vocabulary of 240 Hindi words was used. Mel frequency cepstral coefficients (MFCCs), MFCCs with energy (MFCC_E), and perceptual linear prediction coefficients with energy (PLP_E) were used for feature extraction. Monophones were trained with and without silence fixing to check the impact of short pauses on the recognizer's performance. The system was tested in both speaker-dependent and speaker-independent modes. The hybrid model combined with gram2 and silence fixing gave the best results. Using MFCCs, gram2, phoneme-based modelling, and silence fixing, the system obtained an overall word accuracy of 80.28%, word correct of 80.28%, and a word error rate of 19.72%. Using PLP_E coefficients, the hybrid model, silence fixing, and gram2, the system obtained an overall word accuracy of 88.54%, word correct of 88.54%, and a word error rate of 11.46%. © 2021, Indian Academy of Sciences.
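The abstract does not state how its three scores are defined. The sketch below assumes the conventional HTK-style definitions, where word correct ignores insertion errors and word accuracy penalizes them. Under that assumption, identical word-correct and word-accuracy values correspond to zero insertions, and the word error rate is 100 minus the accuracy. The function name and the error counts are hypothetical, chosen only to reproduce the reported percentages; they are not taken from the paper.

```python
def htk_style_scores(n_ref, deletions, substitutions, insertions):
    """Return (word_correct, word_accuracy, word_error_rate) in percent.

    Assumed definitions (HTK-style, not confirmed by the abstract):
      correct  = (N - D - S) / N
      accuracy = (N - D - S - I) / N
      WER      = (D + S + I) / N
    """
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")
    hits = n_ref - deletions - substitutions
    word_correct = 100.0 * hits / n_ref
    word_accuracy = 100.0 * (hits - insertions) / n_ref
    word_error_rate = 100.0 * (deletions + substitutions + insertions) / n_ref
    return word_correct, word_accuracy, word_error_rate


# Hypothetical counts: 10,000 reference words, 500 deletions,
# 646 substitutions, and no insertions.
correct, accuracy, wer = htk_style_scores(10_000, 500, 646, 0)
print(f"correct={correct:.2f}% accuracy={accuracy:.2f}% WER={wer:.2f}%")
```

With any nonzero insertion count, word accuracy would fall below word correct while the word error rate would rise by the same amount.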