Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition

Cited by: 162

Authors:

Sainath, Tara N. ^{[1]}

Weiss, Ron J. ^{[1]}

Wilson, Kevin W. ^{[1]}

Li, Bo ^{[2]}

Narayanan, Arun ^{[2]}

Variani, Ehsan ^{[2]}

Bacchiani, Michiel ^{[1]}

Shafran, Izhak ^{[2]}

Senior, Andrew ^{[1]}

Chin, Kean ^{[2]}

Misra, Ananya ^{[2]}

Kim, Chanwoo ^{[2]}

Affiliations:

[1] Google, New York, NY 10011 USA

[2] Google Inc, Mountain View, CA 94043 USA

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2017 / Vol. 25 / No. 05

Keywords:

Beamforming; deep learning; noise-robust speech recognition; ROBUST;

DOI:

10.1109/TASLP.2017.2672401

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Multichannel automatic speech recognition (ASR) systems commonly separate speech enhancement, including localization, beamforming, and postfiltering, from acoustic modeling. In this paper, we perform multichannel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture, which performs multichannel filtering in the first layer of the network, and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single channel filter bank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel waveform model.

Pages: 965-979

Number of pages: 15

Related records (50 in total):

[1] Automatic Speech Recognition with Deep Neural Networks for Impaired Speech
Espana-Bonet, Cristina
Fonollosa, Jose A. R.
[J]. ADVANCES IN SPEECH AND LANGUAGE TECHNOLOGIES FOR IBERIAN LANGUAGES, IBERSPEECH 2016, 2016, 10077 : 97 - 107
[2] Automatic Recognition of Kazakh Speech Using Deep Neural Networks
Mamyrbayev, Orken
Turdalyuly, Mussa
Mekebayev, Nurbapa
Alimhan, Keylan
Kydyrbekova, Aizat
Turdalykyzy, Tolganay
[J]. INTELLIGENT INFORMATION AND DATABASE SYSTEMS, ACIIDS 2019, PT II, 2019, 11432 : 465 - 474
[3] Deep Spiking Neural Networks for Large Vocabulary Automatic Speech Recognition
Wu, Jibin
Yilmaz, Emre
Zhang, Malu
Li, Haizhou
Tan, Kay Chen
[J]. FRONTIERS IN NEUROSCIENCE, 2020, 14
[4] ADAPTATION OF CONTEXT-DEPENDENT DEEP NEURAL NETWORKS FOR AUTOMATIC SPEECH RECOGNITION
Yao, Kaisheng
Yu, Dong
Seide, Frank
Su, Hang
Deng, Li
Gong, Yifan
[J]. 2012 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2012), 2012, : 366 - 369
[5] DEEP NEURAL NETWORKS BASED AUTOMATIC SPEECH RECOGNITION FOR FOUR ETHIOPIAN LANGUAGES
Abate, Solomon Teferra
Tachbelie, Martha Ylfiru
Schultz, Tanja
[J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 8274 - 8278
[6] Automatic Speech Recognition Based on Neural Networks
Schlueter, Ralf
Doetsch, Patrick
Golik, Pavel
Kitza, Markus
Menne, Tobias
Irie, Kazuki
Tueske, Zoltan
Zeyer, Albert
[J]. SPEECH AND COMPUTER, 2016, 9811 : 3 - 17
[7] DEEP MAXOUT NEURAL NETWORKS FOR SPEECH RECOGNITION
Cai, Meng
Shi, Yongzhe
Liu, Jia
[J]. 2013 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2013, : 291 - 296
[8] Deep Segmental Neural Networks for Speech Recognition
Abdel-Hamid, Ossama
Deng, Li
Yu, Dong
Jiang, Hui
[J]. 14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 1848 - 1852
[9] Deep Neural Networks in Russian Speech Recognition
Markovnikov, Nikita
Kipyatkova, Irina
Karpov, Alexey
Filchenkov, Andrey
[J]. ARTIFICIAL INTELLIGENCE AND NATURAL LANGUAGE, 2018, 789 : 54 - 67
[10] Binary Deep Neural Networks for Speech Recognition
Xiang, Xu
Qian, Yanmin
Yu, Kai
[J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 533 - 537