Reverberant speech recognition combining deep neural networks and deep autoencoders augmented with a phone-class feature

Cited by: 0
Authors
Masato Mimura
Shinsuke Sakai
Tatsuya Kawahara
Affiliations
[1] Kyoto University,Academic Center for Computing and Media Studies
Keywords
Reverberant speech recognition; Deep Neural Networks (DNN); Deep Autoencoder (DAE);
DOI
Not available
Abstract
We propose an approach to reverberant speech recognition that adopts deep learning in both the front-end and the back-end of a reverberant speech recognition system, together with a novel method to improve the dereverberation performance of the front-end network using phone-class information. At the front-end, we adopt a deep autoencoder (DAE) for enhancing the speech feature parameters, and speech recognition is performed in the back-end using DNN-HMM acoustic models trained on multi-condition data. The system was evaluated through the ASR task in the Reverb Challenge 2014. The DNN-HMM system trained on the multi-condition training set achieved a markedly higher word accuracy than the MLLR-adapted GMM-HMM system trained on the same data. Furthermore, feature enhancement with the deep autoencoder improved recognition accuracy, especially in the more adverse conditions. While the mapping between reverberant and clean speech in DAE-based dereverberation is conventionally conducted using only the acoustic information, we presume the mapping also depends on the phone information. We therefore propose a new scheme (pDAE), which augments the standard acoustic features with a phone-class feature as input. Two types of phone-class feature are investigated. One is the hard recognition result of monophones, and the other is a soft representation derived from the posterior outputs of a monophone DNN. Either type of augmented feature yields a significant improvement (7–8% relative) over the standard DAE.
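The pDAE idea in the abstract — concatenating a soft phone-class feature (monophone posteriors) onto the acoustic feature vector before feeding it to the dereverberation autoencoder — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the layer sizes, feature dimensions, and the single hidden layer are toy assumptions (the paper uses deep multi-layer autoencoders trained on reverberant/clean pairs), and `TinyDAE` and `augment_with_phone_posteriors` are hypothetical names.

```python
import numpy as np

def augment_with_phone_posteriors(acoustic, posteriors):
    """Concatenate frame-level acoustic features with monophone
    posteriors (the 'soft' phone-class feature in pDAE)."""
    assert acoustic.shape[0] == posteriors.shape[0]
    return np.concatenate([acoustic, posteriors], axis=1)

class TinyDAE:
    """Toy one-hidden-layer autoencoder mapping augmented reverberant
    features to enhanced (clean-like) acoustic features."""
    def __init__(self, in_dim, hid_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((in_dim, hid_dim)) * 0.01
        self.b1 = np.zeros(hid_dim)
        self.W2 = rng.standard_normal((hid_dim, out_dim)) * 0.01
        self.b2 = np.zeros(out_dim)

    def forward(self, x):
        h = np.tanh(x @ self.W1 + self.b1)   # hidden representation
        return h @ self.W2 + self.b2         # enhanced features

# Toy dimensions: 5 frames, 40-dim acoustic features, 43 monophones.
T, n_acoustic, n_phones = 5, 40, 43
acoustic = np.random.default_rng(1).standard_normal((T, n_acoustic))
logits = np.random.default_rng(2).standard_normal((T, n_phones))
# Softmax over monophone logits stands in for the monophone DNN posteriors.
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

x = augment_with_phone_posteriors(acoustic, posteriors)   # (5, 83)
dae = TinyDAE(n_acoustic + n_phones, 64, n_acoustic)
enhanced = dae.forward(x)
print(enhanced.shape)  # (5, 40): back to plain acoustic dimensionality
```

The "hard" variant described in the abstract would replace `posteriors` with a one-hot encoding of the recognized monophone per frame; the output dimensionality stays that of the acoustic features, since only the input is augmented.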
Related Papers
(50 total)
  • [1] Reverberant speech recognition combining deep neural networks and deep autoencoders augmented with a phone-class feature
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    [J]. EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, 2015,
  • [2] DEEP AUTOENCODERS AUGMENTED WITH PHONE-CLASS FEATURE FOR REVERBERANT SPEECH RECOGNITION
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4365 - 4369
  • [3] EXPLORING DEEP NEURAL NETWORKS AND DEEP AUTOENCODERS IN REVERBERANT SPEECH RECOGNITION
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    [J]. 2014 4TH JOINT WORKSHOP ON HANDS-FREE SPEECH COMMUNICATION AND MICROPHONE ARRAYS (HSCMA), 2014, : 197 - 201
  • [4] SPEECH FEATURE DENOISING AND DEREVERBERATION VIA DEEP AUTOENCODERS FOR NOISY REVERBERANT SPEECH RECOGNITION
    Feng, Xue
    Zhang, Yaodong
    Glass, James
    [J]. 2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [5] Deep Neural Networks with Linearly Augmented Rectifier Layers for Speech Recognition
    Toth, Laszlo
    [J]. 2018 IEEE 16TH WORLD SYMPOSIUM ON APPLIED MACHINE INTELLIGENCE AND INFORMATICS (SAMI 2018): DEDICATED TO THE MEMORY OF PIONEER OF ROBOTICS ANTAL (TONY) K. BEJCZY, 2018, : 189 - 193
  • [6] Deep Convolutional Neural Networks for Feature Extraction in Speech Emotion Recognition
    Heracleous, Panikos
    Mohammad, Yasser
    Yoneyama, Akio
    [J]. HUMAN-COMPUTER INTERACTION. RECOGNITION AND INTERACTION TECHNOLOGIES, HCI 2019, PT II, 2019, 11567 : 117 - 132
  • [7] Evaluation of Mixed Deep Neural Networks for Reverberant Speech Enhancement
    Gutierrez-Munoz, Michelle
    Gonzalez-Salazar, Astryd
    Coto-Jimenez, Marvin
    [J]. BIOMIMETICS, 2020, 5 (01)
  • [8] Binaural reverberant Speech separation based on deep neural networks
    Zhang, Xueliang
    Wang, DeLiang
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 2018 - 2022
  • [9] A Performance Evaluation of Several Deep Neural Networks for Reverberant Speech Separation
    Liu, Qingju
    Wang, Wenwu
    Jackson, Philip J. B.
    Safavi, Saeid
    [J]. 2018 CONFERENCE RECORD OF 52ND ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS, AND COMPUTERS, 2018, : 689 - 693
  • [10] PHONE RECOGNITION WITH DEEP SPARSE RECTIFIER NEURAL NETWORKS
    Toth, Laszlo
    [J]. 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 6985 - 6989