Toward growing modular deep neural networks for continuous speech recognition

Authors
Zohreh Ansari
Seyyed Ali Seyyedsalehi
Affiliation
[1] Amirkabir University of Technology (Tehran Polytechnic), Speech Processing Lab., Faculty of Biomedical Engineering
Keywords
Deep neural networks; Modular neural networks; Pre-training; Nonlinear filtering; Double spatiotemporal; Speaker adaptation; Continuous speech recognition
Abstract
The performance of typical automatic speech recognition systems drops in real applications because their structure and training procedure are not properly designed. This article introduces a growing modular deep neural network (MDNN) for speech recognition. The network is pre-trained in a special manner that follows its structure. The ability of the MDNN to grow enables it to incorporate, at the same time, the spatiotemporal information of the frame sequences at the input layer and of their labels at the output layer. A network trained with such a double spatiotemporal (DST) structure learns the subspace of valid phonetic sequences and can therefore filter out invalid output sequences within its own structure. To improve the performance of the proposed network under speaker variations, two speaker adaptation methods are also presented. In these methods, the network learns either to move distorted input representations nonlinearly to their optimal positions or to adapt itself based on the input information. The proposed MDNN structure and its modified versions are evaluated on two Persian speech datasets: FARSDAT and Large FARSDAT. Because no frame-level transcription is available for large-vocabulary speech datasets, a semi-supervised learning algorithm is explored to train the MDNN on Large FARSDAT. Experimental results on FARSDAT show that implementing the DST structure together with the speaker adaptation methods yields absolute phone accuracy rate improvements of up to 7.3% over the MDNN and up to 10.6% over a typical hidden Markov model. Likewise, semi-supervised training of the grown MDNN on Large FARSDAT improves its recognition performance by up to 5%.
Pages: 1177-1196
Page count: 19