Acoustic Modeling with Deep Neural Networks Using Raw Time Signal for LVCSR

Authors
Tueske, Zoltan [1]
Golik, Pavel [1]
Schlueter, Ralf [1]
Ney, Hermann [1, 2]
Affiliations
[1] Rhein Westfal TH Aachen, Dept Comp Sci, Human Language Technol & Pattern Recognit, D-52056 Aachen, Germany
[2] LIMSI CNRS, Spoken Language Proc Grp, Paris, France
Keywords
acoustic modeling; raw signal; neural networks
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper we investigate how much feature extraction is required by a deep neural network (DNN) based acoustic model for automatic speech recognition (ASR). We decompose the feature extraction pipeline of a state-of-the-art ASR system step by step and evaluate acoustic models trained on standard MFCC features, critical band energies (CRBE), the FFT magnitude spectrum, and even the raw time signal. The focus is on the raw time signal as input, i.e. no feature extraction at all prior to DNN training. Notably, the gap in recognition accuracy between MFCC and the raw time signal shrinks considerably once we switch from the sigmoid activation function to rectified linear units, which makes raw-signal input a real alternative to standard signal processing. An analysis of the first-layer weights reveals that the DNN can discover multiple band-pass filters in the time domain. We therefore try to improve the raw time signal based system by initializing the first hidden layer weights with the impulse responses of an audiologically motivated filter bank. Inspired by the multi-resolution analysis layer learned automatically from raw time signal input, we also train the DNN on a combination of multiple short-term features. This illustrates how the DNN can learn from the small differences between MFCC, PLP and Gammatone features, suggesting that it is useful to present the DNN with different views of the underlying audio.
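The filter-bank initialization mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes the audiologically motivated filter bank is a gammatone bank (the abstract names Gammatone features but does not spell out the filter design), uses the standard Glasberg and Moore ERB formulas, and picks hypothetical sizes (50 filters, 256-sample window, 8 kHz sampling rate) that are not taken from the paper. The function names `erb`, `gammatone_init` and `relu` are illustrative only.

```python
import numpy as np

def erb(fc_hz):
    """Equivalent rectangular bandwidth (Glasberg & Moore) in Hz."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def gammatone_init(num_filters=50, window_len=256, sample_rate=8000,
                   f_min=100.0, f_max=3800.0, order=4):
    """Return a (num_filters, window_len) matrix of gammatone impulse responses.

    All default values are illustrative assumptions, not the paper's setup.
    """
    # Centre frequencies equally spaced on the ERB-rate scale (an assumption).
    to_erbs = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    from_erbs = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    fc = from_erbs(np.linspace(to_erbs(f_min), to_erbs(f_max), num_filters))

    t = np.arange(window_len) / sample_rate      # time axis in seconds
    b = 1.019 * erb(fc)[:, None]                 # per-filter bandwidth in Hz
    # Gammatone: t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)
    envelope = t[None, :] ** (order - 1) * np.exp(-2.0 * np.pi * b * t[None, :])
    weights = envelope * np.cos(2.0 * np.pi * fc[:, None] * t[None, :])

    # Scale each filter to unit L2 norm so all hidden units start on a similar scale.
    weights /= np.linalg.norm(weights, axis=1, keepdims=True)
    return weights

def relu(x):
    """Rectified linear unit, the activation the abstract contrasts with sigmoid."""
    return np.maximum(x, 0.0)

# First hidden layer applied to one raw-signal frame (random data as a stand-in).
W = gammatone_init()
frame = np.random.default_rng(0).standard_normal(256)
hidden = relu(W @ frame)
```

Each row of `W` acts as a band-pass filter in the time domain, which is the kind of structure the abstract reports the DNN discovering on its own; in training, these weights would serve only as a starting point and be updated like any other layer.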
Pages: 890 - 894
Number of pages: 5
Related papers
  • [1] Convolutional Neural Networks for Acoustic Modeling of Raw Time Signal in LVCSR
    Golik, Pavel
    Tueske, Zoltan
    Schlueter, Ralf
    Ney, Hermann
    [J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 26 - 30
  • [2] Improving Russian LVCSR Using Deep Neural Networks for Acoustic and Language Modeling
    Kipyatkova, Irina
    [J]. SPEECH AND COMPUTER (SPECOM 2018), 2018, 11096 : 291 - 300
  • [3] Phone duration modeling for LVCSR using neural networks
    Hadian, Hossein
    Povey, Daniel
    Sameti, Hossein
    Khudanpur, Sanjeev
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 518 - 522
  • [4] DEEP CONVOLUTIONAL NEURAL NETWORKS FOR LVCSR
    Sainath, Tara N.
    Mohamed, Abdel-rahman
    Kingsbury, Brian
    Ramabhadran, Bhuvana
    [J]. 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 8614 - 8618
  • [5] Distinct Triphone Acoustic Modeling Using Deep Neural Networks
    Chen, Dongpeng
    Mak, Brian
    [J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 2645 - 2649
  • [6] IMPROVEMENTS TO DEEP CONVOLUTIONAL NEURAL NETWORKS FOR LVCSR
    Sainath, Tara N.
    Kingsbury, Brian
    Mohamed, Abdel-rahman
    Dahl, George E.
    Saon, George
    Soltau, Hagen
    Beran, Tomas
    Aravkin, Aleksandr Y.
    Ramabhadran, Bhuvana
    [J]. 2013 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2013, : 315 - 320
  • [7] Improved Acoustic Feature Combination for LVCSR by Neural Networks
    Plahl, Christian
    Schlueter, Ralf
    Ney, Hermann
    [J]. 12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 1244 - 1247
  • [8] Very Deep Convolutional Neural Networks for LVCSR
    Bi, Mengxiao
    Qian, Yanmin
    Yu, Kai
    [J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 3259 - 3263
  • [9] IMPROVING DEEP NEURAL NETWORKS FOR LVCSR USING DROPOUT AND SHRINKING STRUCTURE
    Zhang, Shiliang
    Bao, Yebo
    Zhou, Pan
    Jiang, Hui
    Dai, Lirong
    [J]. 2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [10] ON THE COMPRESSION OF RECURRENT NEURAL NETWORKS WITH AN APPLICATION TO LVCSR ACOUSTIC MODELING FOR EMBEDDED SPEECH RECOGNITION
    Prabhavalkar, Rohit
    Alsharif, Ouais
    Bruguier, Antoine
    McGraw, Ian
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5970 - 5974