Reverberant speech recognition combining deep neural networks and deep autoencoders augmented with a phone-class feature

Cited: 0
Authors
Masato Mimura
Shinsuke Sakai
Tatsuya Kawahara
Institutions
[1] Kyoto University,Academic Center for Computing and Media Studies
Keywords
Reverberant speech recognition; Deep Neural Networks (DNN); Deep Autoencoder (DAE);
DOI: none
Abstract
We propose an approach to reverberant speech recognition adopting deep learning in the front-end as well as back-end of a reverberant speech recognition system, and a novel method to improve the dereverberation performance of the front-end network using phone-class information. At the front-end, we adopt a deep autoencoder (DAE) for enhancing the speech feature parameters, and speech recognition is performed in the back-end using DNN-HMM acoustic models trained on multi-condition data. The system was evaluated through the ASR task in the Reverb Challenge 2014. The DNN-HMM system trained on the multi-condition training set achieved a conspicuously higher word accuracy compared to the MLLR-adapted GMM-HMM system trained on the same data. Furthermore, feature enhancement with the deep autoencoder contributed to the improvement of recognition accuracy, especially in the more adverse conditions. While the mapping between reverberant and clean speech in DAE-based dereverberation is conventionally conducted only with the acoustic information, we presume the mapping is also dependent on the phone information. Therefore, we propose a new scheme (pDAE), which augments a phone-class feature to the standard acoustic features as input. Two types of phone-class feature are investigated. One is the hard recognition result of monophones, and the other is a soft representation derived from the posterior outputs of a monophone DNN. The augmented feature of either type yields a significant improvement (7–8% relative) over the standard DAE.
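The input augmentation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimensions, the monophone inventory size, and all function names are assumptions chosen for the example. The two variants correspond to the "hard" (one-hot recognized monophone) and "soft" (monophone DNN posteriors) phone-class features.

```python
import numpy as np

N_ACOUSTIC = 40   # assumed acoustic feature dimension per frame
N_PHONES = 43     # assumed monophone inventory size

def soft_phone_feature(logits):
    """Soft phone-class feature: posterior probabilities obtained by
    applying a softmax to the monophone DNN output logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def hard_phone_feature(logits):
    """Hard phone-class feature: one-hot vector of the recognized
    monophone (argmax over the posteriors)."""
    onehot = np.zeros(len(logits))
    onehot[np.argmax(logits)] = 1.0
    return onehot

def pdae_input(acoustic_frame, phone_logits, soft=True):
    """Concatenate the acoustic features with the phone-class feature,
    yielding the augmented input vector fed to the DAE."""
    phone = soft_phone_feature(phone_logits) if soft else hard_phone_feature(phone_logits)
    return np.concatenate([acoustic_frame, phone])

# Example with random stand-in values for one frame:
frame = np.random.randn(N_ACOUSTIC)
logits = np.random.randn(N_PHONES)
x = pdae_input(frame, logits)
print(x.shape)  # (83,)
```

The DAE itself is unchanged apart from its enlarged input layer; only the mapping target (clean acoustic features) keeps the original dimensionality.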
Related Papers (50 total)
  • [31] Speech Emotion Recognition using Deep Dropout Autoencoders
    Pal, Arghya
    Baskar, S.
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON ENGINEERING AND TECHNOLOGY (ICETECH), 2015, : 124 - 129
  • [32] On Deep and Shallow Neural Networks in Speech Recognition from Speech Spectrum
    Zelinka, Jan
    Salajka, Petr
    Mueller, Ludek
    [J]. SPEECH AND COMPUTER (SPECOM 2015), 2015, 9319 : 301 - 308
  • [33] Audio-visual feature fusion via deep neural networks for automatic speech recognition
    Rahmani, Mohammad Hasan
    Almasganj, Farshad
    Seyyedsalehi, Seyyed Ali
    [J]. DIGITAL SIGNAL PROCESSING, 2018, 82 : 54 - 63
  • [34] Deep neural networks for emotion recognition combining audio and transcripts
    Cho, Jaejin
    Pappagari, Raghavendra
    Kulkarni, Purva
    Villalba, Jesus
    Carmiel, Yishay
    Dehak, Najim
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 247 - 251
  • [35] Improving continuous speech recognition in Spanish by phone-class semicontinuous HMMs with pausing and multiple pronunciations
    Ferreiros, J
    Pardo, JM
    [J]. SPEECH COMMUNICATION, 1999, 29 (01) : 65 - 76
  • [36] EXPLOITING LSTM STRUCTURE IN DEEP NEURAL NETWORKS FOR SPEECH RECOGNITION
    He, Tianxing
    Droppo, Jasha
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5445 - 5449
  • [37] VERY DEEP CONVOLUTIONAL NEURAL NETWORKS FOR ROBUST SPEECH RECOGNITION
    Qian, Yanmin
    Woodland, Philip C.
    [J]. 2016 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2016), 2016, : 481 - 488
  • [38] Automatic Recognition of Kazakh Speech Using Deep Neural Networks
    Mamyrbayev, Orken
    Turdalyuly, Mussa
    Mekebayev, Nurbapa
    Alimhan, Keylan
    Kydyrbekova, Aizat
    Turdalykyzy, Tolganay
    [J]. INTELLIGENT INFORMATION AND DATABASE SYSTEMS, ACIIDS 2019, PT II, 2019, 11432 : 465 - 474
  • [39] FSER: Deep Convolutional Neural Networks for Speech Emotion Recognition
    Dossou, Bonaventure F. P.
    Gbenou, Yeno K. S.
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3526 - 3531
  • [40] Acceleration Strategies for Speech Recognition based on Deep Neural Networks
    Tian, Chao
    Liu, Jia
    Peng, Zhaomeng
    [J]. MECHATRONICS ENGINEERING, COMPUTING AND INFORMATION TECHNOLOGY, 2014, 556-562 : 5181 - 5185