End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition

Cited by: 75
Authors
Palaz, Dimitri [1 ,2 ,3 ]
Magimai-Doss, Mathew [2 ]
Collobert, Ronan [2 ,4 ]
Affiliations
[1] Speech Graph Ltd, Edinburgh, Midlothian, Scotland
[2] Idiap Research Institute, Martigny, Switzerland
[3] École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
[4] Facebook AI Research, Menlo Park, CA, USA
Keywords
Automatic speech recognition; Hidden Markov models; Deep learning; Feature learning; Artificial neural networks; Convolutional neural networks; Hybrid HMM/ANN
DOI
10.1016/j.specom.2019.01.004
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
In hidden Markov model (HMM) based automatic speech recognition (ASR) systems, modeling the statistical relationship between the acoustic speech signal and the HMM states, which represent linguistically motivated subword units such as phonemes, is a crucial step. This is typically achieved by first extracting acoustic features from the speech signal based on prior knowledge, such as speech perception and/or speech production knowledge, and then training a classifier, such as an artificial neural network (ANN) or a Gaussian mixture model, that estimates the emission probabilities of the HMM states. This paper investigates an end-to-end acoustic modeling approach using convolutional neural networks (CNNs), where the CNN takes the raw speech signal as input and estimates the class conditional probabilities of the HMM states at its output. In other words, as opposed to a divide-and-conquer strategy (i.e., separate feature extraction and statistical modeling steps), in the proposed acoustic modeling approach the relevant features and the classifier are jointly learned from the raw speech signal. Through ASR studies and analyses on multiple languages and multiple tasks, we show that: (a) the proposed approach consistently yields a better system with fewer parameters than the conventional approach of cepstral feature extraction followed by ANN training; (b) unlike conventional speech processing methods, the proposed approach learns the relevant feature representations by first processing the input raw speech at the sub-segmental level (approximately 2 ms), and specifically, through an analysis, we show that the filters in the first convolution layer automatically learn "in parts" the formant-like information present in the sub-segmental speech; and (c) the intermediate feature representations obtained by subsequent filtering of the first convolution layer output are more discriminative than standard cepstral features and can be transferred across languages and domains.
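As a concrete illustration of the pipeline the abstract describes, the sketch below shows a CNN that maps a window of raw waveform directly to log posteriors over HMM states. It is a minimal, hypothetical PyTorch rendering, not the architecture reported in the paper: the class name RawSpeechCNN and all layer widths, strides, and pooling sizes are illustrative assumptions; only the sub-segmental (about 2 ms) first-layer kernel and the HMM-state posterior output follow the abstract.

import torch
import torch.nn as nn

class RawSpeechCNN(nn.Module):
    """Hypothetical raw-waveform CNN acoustic model (illustrative only)."""

    def __init__(self, n_hmm_states: int, sample_rate: int = 16000):
        super().__init__()
        # The first convolution works at the sub-segmental level: a ~2 ms
        # kernel is 32 samples at 16 kHz, so its filters can pick up
        # formant-like detail rather than frame-level spectra.
        k1 = int(sample_rate * 2 / 1000)
        self.features = nn.Sequential(
            nn.Conv1d(1, 80, kernel_size=k1, stride=10),  # learned "filterbank"
            nn.ReLU(),
            nn.MaxPool1d(3),
            nn.Conv1d(80, 60, kernel_size=7),  # intermediate representations
            nn.ReLU(),
            nn.MaxPool1d(3),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),  # collapse time within the input window
            nn.Flatten(),
            nn.Linear(60, n_hmm_states),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, n_samples), one context window of raw speech
        return torch.log_softmax(self.classifier(self.features(wav)), dim=-1)

model = RawSpeechCNN(n_hmm_states=3 * 40)  # e.g., 40 phonemes x 3 HMM states
window = torch.randn(8, 1, 4000)           # 8 windows of 250 ms at 16 kHz
log_posteriors = model(window)             # shape: (8, 120)

In a hybrid HMM/ANN system such as the one studied here, these posteriors would be divided by the HMM state priors to obtain scaled likelihoods, which replace the usual emission probabilities during decoding; that conversion is standard and independent of whether the features are learned or hand-crafted.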
Pages: 15-32
Page count: 18
Related papers
50 records in total
  • [1] Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition
    Parcollet, Titouan
    Zhang, Ying
    Morchid, Mohamed
    Trabelsi, Chiheb
    Linares, Georges
    De Mori, Renato
    Bengio, Yoshua
    INTERSPEECH 2018, 2018: 22-26
  • [2] Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
    Zhang, Ying
    Pezeshki, Mohammad
    Brakel, Philemon
    Zhang, Saizheng
    Laurent, Cesar
    Bengio, Yoshua
    Courville, Aaron
    INTERSPEECH 2016, 2016: 410-414
  • [3] Towards End-to-End Speech Recognition with Deep Multipath Convolutional Neural Networks
    Zhang, Wei
    Zhai, Minghao
    Huang, Zilong
    Liu, Chen
    Li, Wei
    Cao, Yi
    Intelligent Robotics and Applications (ICIRA 2019), Part VI, vol. 11745, 2019: 332-341
  • [4] End-to-End Text Recognition with Convolutional Neural Networks
    Wang, Tao
    Wu, David J.
    Coates, Adam
    Ng, Andrew Y.
    21st International Conference on Pattern Recognition (ICPR 2012), 2012: 3304-3308
  • [5] Insertion-Based Modeling for End-to-End Automatic Speech Recognition
    Fujita, Yuya
    Watanabe, Shinji
    Omachi, Motoi
    Chang, Xuankai
    INTERSPEECH 2020, 2020: 3660-3664
  • [6] Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition
    Park, Jinhwan
    Sung, Wonyong
    INTERSPEECH 2020, 2020: 46-50
  • [7] Very Deep Convolutional Networks for End-to-End Speech Recognition
    Zhang, Yu
    Chan, William
    Jaitly, Navdeep
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017: 4845-4849
  • [8] End-to-End Speech Emotion Recognition Using Deep Neural Networks
    Tzirakis, Panagiotis
    Zhang, Jiehao
    Schuller, Bjoern W.
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: 5089-5093
  • [9] End-To-End Label Uncertainty Modeling for Speech-based Arousal Recognition Using Bayesian Neural Networks
    Prabhu, Navin Raj
    Carbajal, Guillaume
    Lehmann-Willenbrock, Nale
    Gerkmann, Timo
    INTERSPEECH 2022, 2022: 151-155
  • [10] Towards End-to-End Speech Recognition with Recurrent Neural Networks
    Graves, Alex
    Jaitly, Navdeep
    International Conference on Machine Learning (ICML 2014), vol. 32, 2014: 1764-1772