Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition

Cited by: 1
Authors:
Park, Jinhwan [1]
Sung, Wonyong [1]
Affiliations:
[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul, South Korea
Source: INTERSPEECH 2020
Funding:
National Research Foundation of Singapore
Keywords:
speech recognition; convolutional networks; positional encoding;
DOI:
10.21437/Interspeech.2020-3163
CLC Numbers:
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes:
100104; 100213
Abstract:
Attention-based models with convolutional encoders enable faster training and inference than recurrent neural network-based ones. However, convolutional models often require a very large receptive field to achieve high recognition accuracy, which increases not only the parameter size but also the computational cost and run-time memory footprint. A convolutional encoder with a short receptive field can suffer from looping or skipping problems when the input utterance contains the same words in nearby sentences. We attribute this to the insufficient receptive field length and remedy the problem by adding positional information to the convolution-based encoder. We show that the word error rate (WER) of a convolutional encoder with a short receptive field can be reduced significantly by augmenting it with positional information. Visualization results are presented to demonstrate the effectiveness of adding positional information. The proposed method improves the accuracy of attention models with a convolutional encoder and achieves a WER of 10.60% on TED-LIUMv2 for an end-to-end speech recognition task.
Pages: 46-50
Number of Pages: 5
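
As a concrete illustration of the idea described in the abstract, the sketch below adds fixed Transformer-style sinusoidal positional embeddings to the frame-level output of a small convolutional encoder before an attention-based decoder would attend to it. This is a minimal, hypothetical example rather than the authors' implementation: the encoder depth, kernel sizes, hidden dimension, and the use of sinusoidal (rather than learned) encodings are assumptions made only for illustration.

import math
import torch
import torch.nn as nn


def sinusoidal_positions(num_frames: int, dim: int) -> torch.Tensor:
    """Fixed (num_frames, dim) sinusoidal position codes, as in the Transformer."""
    pos = torch.arange(num_frames, dtype=torch.float32).unsqueeze(1)      # (T, 1)
    inv_freq = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )                                                                     # (dim/2,)
    pe = torch.zeros(num_frames, dim)
    pe[:, 0::2] = torch.sin(pos * inv_freq)
    pe[:, 1::2] = torch.cos(pos * inv_freq)
    return pe


class ConvEncoderWithPE(nn.Module):
    """Toy convolutional encoder whose output frames are augmented with
    positional information; all hyperparameters here are illustrative."""

    def __init__(self, feat_dim: int = 80, hidden_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) log-mel filterbank features
        h = self.conv(feats.transpose(1, 2)).transpose(1, 2)   # (batch, time, hidden)
        pe = sinusoidal_positions(h.size(1), h.size(2)).to(h.device)
        return h + pe                                          # inject positional information


if __name__ == "__main__":
    enc = ConvEncoderWithPE()
    out = enc(torch.randn(2, 100, 80))  # 2 utterances, 100 frames, 80-dim features
    print(out.shape)                    # torch.Size([2, 100, 256])

Absolute sinusoidal encodings are only one option; the abstract's point is that giving the encoder an explicit notion of frame position helps a short-receptive-field convolutional encoder disambiguate repeated words, which would otherwise cause looping or skipping during attention decoding.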