Multi-level region-of-interest CNNs for end to end speech recognition

Cited by: 0
Authors
Shubhanshi Singhal
Vishal Passricha
Pooja Sharma
Rajesh Kumar Aggarwal
Affiliations
[1] Technology Education and Research Integrated Institute
[2] National Institute of Technology
[3] Government College for Women
Keywords
Cepstral features; End-to-end training; Feature extraction; Pooling; Raw speech; Spectral features;
DOI
Not available
Abstract
Efficient and robust automatic speech recognition (ASR) systems are in high demand. ASR systems are typically fed cepstral features such as mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) coefficients. However, some deep-learning approaches have shifted to simpler representations such as critical-band energies or spectrograms, claiming the ability to train directly on the raw signal. Such systems depend heavily on the discriminative power of ConvNet layers to separate phonemes with nearly similar acoustics, yet they do not achieve high recognition rates. The main reason for the limited recognition rate is stride-based pooling, which sharply reduces output dimensionality, by at least 75%. Region-based convolutional neural networks (R-CNNs) and Fast R-CNN were proposed to improve performance, but their results did not meet expectations. Therefore, a new pooling technique, multilevel region-of-interest (RoI) pooling, is proposed, which pools information from multiple ConvNet layers. The resulting architecture, named the multilevel RoI convolutional neural network (MR-CNN), is built simply by placing RoI pooling layers after up to four of the coarsest layers; the additional information from the multilevel ConvNet layers improves the extracted features. Performance is evaluated on the TIMIT and Wall Street Journal (WSJ) datasets for phoneme recognition. On raw speech, the model achieves phoneme error rates of 16.4% on TIMIT and 17.1% on WSJ, slightly better than with spectral features.
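To make the pooling idea concrete, here is a minimal NumPy sketch of multilevel RoI pooling over a 1-D (time-axis) feature hierarchy. It is not the authors' implementation: the function names, the choice of max-pooling, the number of bins, and the per-layer strides are all illustrative assumptions. The key property it demonstrates is the one the abstract describes: the same region of interest is pooled from several ConvNet layers (each at its own temporal resolution) into fixed-size outputs, which are then concatenated, instead of relying on a single stride-based pooling path.

```python
import numpy as np

def roi_max_pool(feat, start, end, n_bins):
    """Adaptively max-pool one region [start, end) of a (channels, time)
    feature map into a fixed number of temporal bins."""
    region = feat[:, start:end]
    edges = np.linspace(0, region.shape[1], n_bins + 1).astype(int)
    # Each bin covers at least one frame so the max is always defined.
    return np.stack(
        [region[:, edges[i]:max(edges[i] + 1, edges[i + 1])].max(axis=1)
         for i in range(n_bins)],
        axis=1,
    )

def multilevel_roi_pool(feats, roi, n_bins=4):
    """Pool the same region of interest (in input-frame coordinates) from
    several ConvNet layers, each with its own cumulative temporal stride,
    and concatenate the fixed-size results channel-wise."""
    pooled = []
    for feat, stride in feats:
        start = roi[0] // stride
        end = max(start + 1, roi[1] // stride)  # keep the region non-empty
        pooled.append(roi_max_pool(feat, start, end, n_bins))
    return np.concatenate(pooled, axis=0)

# Two hypothetical layers: 8 channels at stride 1, 16 channels at stride 2.
feats = [(np.random.rand(8, 100), 1), (np.random.rand(16, 50), 2)]
out = multilevel_roi_pool(feats, roi=(10, 50), n_bins=4)
print(out.shape)  # (24, 4): 8 + 16 channels, 4 temporal bins each
```

Because every RoI is mapped to the same number of bins regardless of its length or the layer's stride, the concatenated output has a fixed dimensionality, so the downstream classifier sees multi-resolution evidence for the region without the sharp reduction that stride-based pooling alone would impose.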
Pages: 4615-4624
Page count: 9
Related Papers
(50 records in total)
  • [31] END-TO-END AUDIOVISUAL SPEECH RECOGNITION
    Petridis, Stavros
    Stafylakis, Themos
    Ma, Pingchuan
    Cai, Feipeng
    Tzimiropoulos, Georgios
    Pantic, Maja
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6548 - 6552
  • [32] End-to-end relation extraction based on bootstrapped multi-level distant supervision
    He, Ying
    Li, Zhixu
    Yang, Qiang
    Chen, Zhigang
    Liu, An
    Zhao, Lei
    Zhou, Xiaofang
    WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2020, 23 (05): : 2933 - 2956
  • [33] CNNs with Multi-Level Attention for Domain Generalization
    Ballas, Aristotelis
    Diou, Cristos
    PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, : 592 - 596
  • [34] End-to-end response selection based on multi-level context response matching
    Boussaha, Basma El Amel
    Hernandez, Nicolas
    Jacquin, Christine
    Morin, Emmanuel
    COMPUTER SPEECH AND LANGUAGE, 2020, 63
  • [37] END-TO-END MULTI-PERSON AUDIO/VISUAL AUTOMATIC SPEECH RECOGNITION
    Braga, Otavio
    Makino, Takaki
    Siohan, Olivier
    Liao, Hank
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6994 - 6998
  • [38] End-to-End Speech Recognition Technology Based on Multi-Stream CNN
    Xiao, Hao
    Qiu, Yuan
    Fei, Rong
    Chen, Xiongbo
    Liu, Zuo
    Wu, Zongling
    2022 IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, 2022, : 1310 - 1315
  • [39] END-TO-END MULTI-ACCENT SPEECH RECOGNITION WITH UNSUPERVISED ACCENT MODELLING
    Li, Song
    Ouyang, Beibei
    Liao, Dexin
    Xia, Shipeng
    Li, Lin
    Hong, Qingyang
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6418 - 6422
  • [40] Multi-Task End-to-End Model for Telugu Dialect and Speech Recognition
    Yadavalli, Aditya
    Mirishkar, Ganesh S.
    Vuppala, Anil Kumar
    INTERSPEECH 2022, 2022, : 1387 - 1391