Multi-level region-of-interest CNNs for end to end speech recognition

Cited by: 0

Authors:

Shubhanshi Singhal

Vishal Passricha

Pooja Sharma

Rajesh Kumar Aggarwal

Affiliations:

[1] Technology Education and Research Integrated Institute

[2] National Institute of Technology

[3] Government College for Women

Source:

Journal of Ambient Intelligence and Humanized Computing | 2019 / Vol. 10

Keywords:

Cepstral features; End-to-end training; Feature extraction; Pooling; Raw speech; Spectral features

DOI:

Not available

Abstract:

Efficient and robust automatic speech recognition (ASR) systems are in high demand. Most ASR systems are fed with cepstral features such as mel-frequency cepstral coefficients and perceptual linear prediction. However, some attempts have also been made to shift to simpler features, such as critical band energies or spectrograms, using deep learning models. These approaches claim the ability to train directly on the raw signal. Such systems depend heavily on the discriminative power of ConvNet layers to separate two phonemes with nearly similar accents, but they do not offer a high recognition rate. The main reason for the limited recognition rate is stride-based pooling, which sharply reduces the output dimensionality, i.e., by at least 75%. To improve performance, region-based convolutional neural networks (R-CNNs) and Fast R-CNN were proposed, but their performance did not meet the expected level. Therefore, a new pooling technique, multilevel region-of-interest (RoI) pooling, is proposed, which pools multilevel information from multiple ConvNet layers. The proposed architecture is named the multilevel RoI convolutional neural network (MR-CNN). It is designed by simply placing RoI pooling layers after up to four of the coarsest layers. It improves the extracted features using additional information from the multilevel ConvNet layers. Its performance is evaluated on the TIMIT and Wall Street Journal (WSJ) datasets for phoneme recognition. The phoneme error rate of this model on raw speech is 16.4% on TIMIT and 17.1% on WSJ, which is slightly better than that obtained with spectral features.
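The two pooling ideas in the abstract can be illustrated with a toy sketch. It is not the authors' MR-CNN implementation: the abstract gives no layer sizes or region definitions, so the 4x4 feature map, the region coordinates, the 2x2 output grid, and the function names `max_pool_2x2`, `roi_max_pool`, and `multilevel_roi_features` are all assumptions made for illustration. The sketch shows that 2x2 pooling with stride 2 keeps only a quarter of the values (a 75% reduction), and that RoI pooling maps a region of any size onto a fixed grid, so features from several ConvNet levels can be concatenated.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 over a 2-D feature map (list of rows)."""
    rows, cols = len(fmap), len(fmap[0])
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]


def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool the region roi = (r0, c0, r1, c1), end-exclusive,
    into a fixed out_h x out_w grid (hypothetical helper)."""
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Bin start uses floor and bin end uses ceil, so no bin is empty.
        rs = r0 + (i * h) // out_h
        re = r0 + -(-(i + 1) * h // out_h)
        row = []
        for j in range(out_w):
            cs = c0 + (j * w) // out_w
            ce = c0 + -(-(j + 1) * w // out_w)
            row.append(max(fmap[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled


def multilevel_roi_features(fmaps, rois, out_h, out_w):
    """Concatenate fixed-size RoI-pooled features from several levels.
    Assumption: one region per level, given in that level's coordinates."""
    features = []
    for fmap, roi in zip(fmaps, rois):
        for row in roi_max_pool(fmap, roi, out_h, out_w):
            features.extend(row)
    return features


# Toy "fine" feature map: entry (r, c) holds 4*r + c.
fine = [[4 * r + c for c in range(4)] for r in range(4)]
# A coarser level obtained by stride-2 pooling: 16 values shrink to 4.
coarse = max_pool_2x2(fine)
# Pool one region from each level into a 2x2 grid and concatenate.
features = multilevel_roi_features(
    [fine, coarse], [(1, 1, 4, 4), (0, 0, 2, 2)], out_h=2, out_w=2
)
```

Here `features` has eight entries, four per level, regardless of the region sizes. That fixed length is what lets a later layer consume information from more than one level at once.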

Pages: 4615-4624

Page count: 9
