Multi-level region-of-interest CNNs for end to end speech recognition

Cited by: 0
Authors
Shubhanshi Singhal
Vishal Passricha
Pooja Sharma
Rajesh Kumar Aggarwal
Affiliations
[1] Technology Education and Research Integrated Institute
[2] National Institute of Technology
[3] Government College for Women
Keywords
Cepstral features; End-to-end training; Feature extraction; Pooling; Raw speech; Spectral features
DOI
Not available
Abstract
Efficient and robust automatic speech recognition (ASR) systems are in high demand. Most ASR systems are fed cepstral features such as mel-frequency cepstral coefficients and perceptual linear prediction. However, some attempts have been made to shift to simpler features, such as critical band energies or spectrograms, using deep learning models. These approaches claim the ability to train directly on the raw signal. Such systems depend heavily on the discriminative power of ConvNet layers to separate two phonemes with nearly similar accents, but they do not offer a high recognition rate. The main reason for the limited recognition rate is stride-based pooling, which sharply reduces output dimensionality, by at least 75%. To improve performance, region-based convolutional neural networks (R-CNNs) and Fast R-CNN were proposed, but their performance did not meet the expected level. Therefore, a new pooling technique, multilevel region of interest (RoI) pooling, is proposed, which pools multilevel information from multiple ConvNet layers. The proposed architecture is named the multilevel RoI convolutional neural network (MR-CNN). It is designed by placing RoI pooling layers after up to four of the coarsest layers, and it improves the extracted features using additional information from the multilevel ConvNet layers. Its performance is evaluated on the TIMIT and Wall Street Journal (WSJ) datasets for phoneme recognition. On raw speech, the model achieves phoneme error rates of 16.4% on TIMIT and 17.1% on WSJ, slightly better than those obtained with spectral features.
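The idea of pooling one region of interest from several ConvNet layers can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `roi_max_pool_1d` and `multilevel_roi_pool`, the per-layer `strides` argument, the 1-D (time by channels) layout, and the toy feature maps are all assumptions made for illustration. For context on the 75% figure, a non-overlapping 2x2 pooling window keeps one of every four activations; the sketch instead pools each region into a fixed number of bins and concatenates the results across levels.

```python
def roi_max_pool_1d(feature_map, roi, bins):
    """Max-pool frames [start, end) of a (time x channels) map into `bins` bins.

    Assumes 0 <= start < end <= len(feature_map).
    """
    start, end = roi
    length = end - start
    pooled = []
    for b in range(bins):
        lo = start + (b * length) // bins                    # floor of bin start
        hi = start + ((b + 1) * length + bins - 1) // bins   # ceil of bin end
        window = feature_map[lo:hi]
        # Channel-wise maximum over the frames in this bin.
        pooled.append([max(channel) for channel in zip(*window)])
    return pooled


def multilevel_roi_pool(feature_maps, strides, roi, bins):
    """Pool the same input-level RoI from several layers and concatenate channels.

    `strides[i]` is the assumed temporal downsampling of layer i relative to the input.
    """
    levels = []
    for fmap, stride in zip(feature_maps, strides):
        start = roi[0] // stride
        end = max(start + 1, -(-roi[1] // stride))  # ceil division, at least one frame
        end = min(end, len(fmap))
        levels.append(roi_max_pool_1d(fmap, (start, end), bins))
    # For each bin, join the channel vectors from every level.
    return [sum((level[b] for level in levels), []) for b in range(bins)]


# Toy feature maps (hypothetical): a fine layer and a coarser layer at half resolution.
fine = [[float(t)] for t in range(8)]                 # 8 frames, 1 channel, stride 1
coarse = [[10.0 * t, float(-t)] for t in range(4)]    # 4 frames, 2 channels, stride 2

features = multilevel_roi_pool([fine, coarse], [1, 2], roi=(2, 6), bins=2)
```

Each output bin here carries three values, one from the fine layer and two from the coarse layer, so the pooled representation has a fixed size regardless of the region length.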
Pages: 4615 - 4624
Page count: 9