Acoustic Modeling with Deep Neural Networks Using Raw Time Signal for LVCSR

Cited by: 0

Authors:

Tueske, Zoltan [1]

Golik, Pavel [1]

Schlueter, Ralf [1]

Ney, Hermann [1,2]

Affiliations:

[1] Rhein Westfal TH Aachen, Dept Comp Sci, Human Language Technol & Pattern Recognit, D-52056 Aachen, Germany

[2] LIMSI CNRS, Spoken Language Proc Grp, Paris, France

Source:

15TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2014), VOLS 1-4 | 2014

Keywords:

acoustic modeling; raw signal; neural networks;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper we investigate how much feature extraction a deep neural network (DNN) based acoustic model requires for automatic speech recognition (ASR). We decompose the feature extraction pipeline of a state-of-the-art ASR system step by step and evaluate acoustic models trained on standard MFCC features, critical band energies (CRBE), the FFT magnitude spectrum, and even the raw time signal. The focus is on the raw time signal as input, i.e. no feature extraction at all prior to DNN training. Notably, the gap in recognition accuracy between MFCC and the raw time signal narrows considerably once we switch from the sigmoid activation function to rectified linear units, making the raw signal a real alternative to standard signal processing. An analysis of the first-layer weights reveals that the DNN can discover multiple band-pass filters in the time domain. We therefore try to improve the raw time signal based system by initializing the first hidden layer weights with impulse responses of an audiologically motivated filter bank. Inspired by the multi-resolution analysis layer learned automatically from raw time signal input, we also train the DNN on a combination of multiple short-term features. This illustrates how the DNN can learn from the small differences between MFCC, PLP and Gammatone features, suggesting that it is useful to present the DNN with different views of the underlying audio.
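The first-layer initialization mentioned in the abstract can be pictured with a minimal sketch. It assumes the "audiologically motivated filter bank" is a gammatone filter bank, which the abstract does not state explicitly. The function names, the 16 kHz sampling rate, the 400-sample window, the log-spaced center frequencies, and the ERB constants are all illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def gammatone_impulse_response(fc, fs, n_samples, order=4):
    """Sampled gammatone impulse response with center frequency fc (Hz).

    Hypothetical helper: the bandwidth uses a common ERB approximation,
    which is an assumption and not a setting from the paper.
    """
    t = np.arange(n_samples) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)  # equivalent rectangular bandwidth
    b = 1.019 * erb                          # bandwidth scaling, assumed
    envelope = t ** (order - 1) * np.exp(-2.0 * np.pi * b * t)
    g = envelope * np.cos(2.0 * np.pi * fc * t)
    return g / np.linalg.norm(g)             # unit-norm weight vector


def init_first_layer(n_hidden, window, fs, f_min=100.0, f_max=None):
    """Build a (n_hidden, window) weight matrix, one band-pass filter per unit."""
    if f_max is None:
        f_max = 0.45 * fs                    # stay below the Nyquist frequency
    # Log-spaced center frequencies are an illustrative choice.
    centers = np.geomspace(f_min, f_max, n_hidden)
    weights = np.stack(
        [gammatone_impulse_response(fc, fs, window) for fc in centers]
    )
    return weights, centers


def relu(x):
    """Rectified linear unit, the activation the abstract contrasts with sigmoid."""
    return np.maximum(x, 0.0)


# Hypothetical setup: 8 hidden units over a 25 ms window at 16 kHz.
W, centers = init_first_layer(n_hidden=8, window=400, fs=16000)

# Random noise stands in for one window of raw audio samples.
rng = np.random.default_rng(0)
frame = rng.standard_normal(400)
hidden = relu(W @ frame)                     # first hidden layer activations
```

Each row of `W` acts as a time-domain band-pass filter applied to the raw samples, so training starts from filters resembling those the abstract reports the DNN discovering on its own. In an actual system these weights would only be a starting point and would be updated during training.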


Pages: 890-894

Number of pages: 5
