Reverberant speech recognition combining deep neural networks and deep autoencoders augmented with a phone-class feature

Cited by: 0
Authors
Masato Mimura
Shinsuke Sakai
Tatsuya Kawahara
Affiliations
[1] Kyoto University,Academic Center for Computing and Media Studies
Keywords
Reverberant speech recognition; Deep Neural Networks (DNN); Deep Autoencoder (DAE);
DOI
Not available
Abstract
We propose an approach to reverberant speech recognition that adopts deep learning in both the front-end and the back-end of a reverberant speech recognition system, together with a novel method to improve the dereverberation performance of the front-end network using phone-class information. At the front-end, we adopt a deep autoencoder (DAE) for enhancing the speech feature parameters, and speech recognition is performed in the back-end using DNN-HMM acoustic models trained on multi-condition data. The system was evaluated through the ASR task in the Reverb Challenge 2014. The DNN-HMM system trained on the multi-condition training set achieved a markedly higher word accuracy than the MLLR-adapted GMM-HMM system trained on the same data. Furthermore, feature enhancement with the deep autoencoder improved recognition accuracy, especially in the more adverse conditions. While the mapping between reverberant and clean speech in DAE-based dereverberation is conventionally conducted using only the acoustic information, we presume the mapping also depends on the phone information. We therefore propose a new scheme (pDAE), which augments the standard acoustic features with a phone-class feature as input. Two types of phone-class feature are investigated. One is the hard recognition result of monophones, and the other is a soft representation derived from the posterior outputs of a monophone DNN. Either type of augmented feature yields a significant improvement (7–8% relative) over the standard DAE.
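The pDAE idea in the abstract — concatenating a soft phone-class feature (monophone posteriors) onto the acoustic feature vector before feeding it to the dereverberation autoencoder — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the layer sizes, feature dimensions, and the single hidden layer are toy assumptions (the paper uses deep multi-layer autoencoders trained on reverberant/clean pairs), and `TinyDAE` and `augment_with_phone_posteriors` are hypothetical names.

```python
import numpy as np

def augment_with_phone_posteriors(acoustic, posteriors):
    """Concatenate frame-level acoustic features with monophone
    posteriors (the 'soft' phone-class feature in pDAE)."""
    assert acoustic.shape[0] == posteriors.shape[0]
    return np.concatenate([acoustic, posteriors], axis=1)

class TinyDAE:
    """Toy one-hidden-layer autoencoder mapping augmented reverberant
    features to enhanced (clean-like) acoustic features."""
    def __init__(self, in_dim, hid_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((in_dim, hid_dim)) * 0.01
        self.b1 = np.zeros(hid_dim)
        self.W2 = rng.standard_normal((hid_dim, out_dim)) * 0.01
        self.b2 = np.zeros(out_dim)

    def forward(self, x):
        h = np.tanh(x @ self.W1 + self.b1)   # hidden representation
        return h @ self.W2 + self.b2         # enhanced features

# Toy dimensions: 5 frames, 40-dim acoustic features, 43 monophones.
T, n_acoustic, n_phones = 5, 40, 43
acoustic = np.random.default_rng(1).standard_normal((T, n_acoustic))
logits = np.random.default_rng(2).standard_normal((T, n_phones))
# Softmax over monophone logits stands in for the monophone DNN posteriors.
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

x = augment_with_phone_posteriors(acoustic, posteriors)   # (5, 83)
dae = TinyDAE(n_acoustic + n_phones, 64, n_acoustic)
enhanced = dae.forward(x)
print(enhanced.shape)  # (5, 40): back to plain acoustic dimensionality
```

The "hard" variant described in the abstract would replace `posteriors` with a one-hot encoding of the recognized monophone per frame; the output dimensionality stays that of the acoustic features, since only the input is augmented.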
Related Papers
(50 total)
  • [1] Reverberant speech recognition combining deep neural networks and deep autoencoders augmented with a phone-class feature
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    [J]. EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, 2015,
  • [2] DEEP AUTOENCODERS AUGMENTED WITH PHONE-CLASS FEATURE FOR REVERBERANT SPEECH RECOGNITION
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4365 - 4369
  • [3] EXPLORING DEEP NEURAL NETWORKS AND DEEP AUTOENCODERS IN REVERBERANT SPEECH RECOGNITION
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    [J]. 2014 4TH JOINT WORKSHOP ON HANDS-FREE SPEECH COMMUNICATION AND MICROPHONE ARRAYS (HSCMA), 2014, : 197 - 201
  • [4] SPEECH FEATURE DENOISING AND DEREVERBERATION VIA DEEP AUTOENCODERS FOR NOISY REVERBERANT SPEECH RECOGNITION
    Feng, Xue
    Zhang, Yaodong
    Glass, James
    [J]. 2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [5] Deep Neural Networks with Linearly Augmented Rectifier Layers for Speech Recognition
    Toth, Laszlo
    [J]. 2018 IEEE 16TH WORLD SYMPOSIUM ON APPLIED MACHINE INTELLIGENCE AND INFORMATICS (SAMI 2018): DEDICATED TO THE MEMORY OF PIONEER OF ROBOTICS ANTAL (TONY) K. BEJCZY, 2018, : 189 - 193
  • [6] Deep Convolutional Neural Networks for Feature Extraction in Speech Emotion Recognition
    Heracleous, Panikos
    Mohammad, Yasser
    Yoneyama, Akio
    [J]. HUMAN-COMPUTER INTERACTION. RECOGNITION AND INTERACTION TECHNOLOGIES, HCI 2019, PT II, 2019, 11567 : 117 - 132
  • [7] Evaluation of Mixed Deep Neural Networks for Reverberant Speech Enhancement
    Gutierrez-Munoz, Michelle
    Gonzalez-Salazar, Astryd
    Coto-Jimenez, Marvin
    [J]. BIOMIMETICS, 2020, 5 (01)
  • [8] Binaural reverberant Speech separation based on deep neural networks
    Zhang, Xueliang
    Wang, DeLiang
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 2018 - 2022
  • [9] A Performance Evaluation of Several Deep Neural Networks for Reverberant Speech Separation
    Liu, Qingju
    Wang, Wenwu
    Jackson, Philip J. B.
    Safavi, Saeid
    [J]. 2018 CONFERENCE RECORD OF 52ND ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS, AND COMPUTERS, 2018, : 689 - 693
  • [10] PHONE RECOGNITION WITH DEEP SPARSE RECTIFIER NEURAL NETWORKS
    Toth, Laszlo
    [J]. 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 6985 - 6989