Investigating Joint CTC-Attention Models for End-to-End Russian Speech Recognition

Cited: 3
Authors
Markovnikov, Nikita [1 ,2 ]
Kipyatkova, Irina [1 ,3 ]
Affiliations
[1] Russian Acad Sci SPIIRAS, St Petersburg Inst Informat & Automat, St Petersburg, Russia
[2] ITMO Univ, St Petersburg, Russia
[3] St Petersburg State Univ Aerosp Instrumentat SUAI, St Petersburg, Russia
Funding
Russian Foundation for Basic Research;
Keywords
End-to-end models; Attention mechanism; Deep learning; Russian speech; Speech recognition;
DOI
10.1007/978-3-030-26061-3_35
CLC Classification Number
O42 [Acoustics];
Subject Classification Number
070206 ; 082403 ;
Abstract
We propose an application of attention-based models to automatic recognition of continuous Russian speech. We experimented with three types of attention mechanisms, data augmentation based on tempo and pitch perturbations, and a beam search pruning method. Moreover, we propose using the sparsemax function as a probability distribution generator for the attention mechanism. We experimented with joint CTC-Attention encoder-decoders that use deep convolutional networks to compress input features or waveform spectrograms, as well as with a Highway LSTM model as the encoder. We performed experiments on a small dataset of Russian speech with a total duration of more than 60 h. The proposed methods improved recognition accuracy, and the beam search optimization method yielded faster speech decoding.
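The sparsemax function the abstract proposes as an alternative to softmax for attention weighting can be sketched as follows. This is a generic NumPy implementation of sparsemax (a Euclidean projection onto the probability simplex), not the authors' code; the function and variable names are illustrative.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: project the score vector z onto the probability simplex.
    Unlike softmax, it can assign exactly zero weight to some entries,
    producing sparse attention distributions over encoder frames."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]             # scores in descending order
    k = np.arange(1, z.size + 1)
    cssv = np.cumsum(z_sorted)              # cumulative sums of sorted scores
    support = k[1 + k * z_sorted > cssv]    # sizes k for which entry k is in the support
    k_max = support[-1]                     # support size
    tau = (cssv[k_max - 1] - 1.0) / k_max   # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)
```

The output always sums to 1 like softmax, but well-separated scores collapse to a sparse distribution: `sparsemax([3.0, 1.0, 0.1])` returns `[1.0, 0.0, 0.0]`, concentrating all attention mass on one frame.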
Pages: 337 - 347
Number of pages: 11
Related Papers
50 items total
  • [31] End-to-End Neural Segmental Models for Speech Recognition
    Tang, Hao
    Lu, Liang
    Kong, Lingpeng
    Gimpel, Kevin
    Livescu, Karen
    Dyer, Chris
    Smith, Noah A.
    Renals, Steve
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2017, 11 (08) : 1254 - 1264
  • [32] An End-to-end Speech Recognition Algorithm based on Attention Mechanism
    Chen, Jia-nan
    Gao, Shuang
    Sun, Han-zhe
    Liu, Xiao-hui
    Wang, Zi-ning
    Zheng, Yan
    [J]. PROCEEDINGS OF THE 39TH CHINESE CONTROL CONFERENCE, 2020, : 2935 - 2940
  • [33] Combination of end-to-end and hybrid models for speech recognition
    Wong, Jeremy H. M.
    Gaur, Yashesh
    Zhao, Rui
    Lu, Liang
    Sun, Eric
    Li, Jinyu
    Gong, Yifan
    [J]. INTERSPEECH 2020, 2020, : 1783 - 1787
  • [34] Self-Attention Transducers for End-to-End Speech Recognition
    Tian, Zhengkun
    Yi, Jiangyan
    Tao, Jianhua
    Bai, Ye
    Wen, Zhengqi
    [J]. INTERSPEECH 2019, 2019, : 4395 - 4399
  • [35] AN INVESTIGATION OF END-TO-END MODELS FOR ROBUST SPEECH RECOGNITION
    Prasad, Archiki
    Jyothi, Preethi
    Velmurugan, Rajbabu
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6893 - 6897
  • [36] Multi-channel Attention for End-to-End Speech Recognition
    Braun, Stefan
    Neil, Daniel
    Anumula, Jithendar
    Ceolini, Enea
    Liu, Shih-Chii
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 17 - 21
  • [37] Adversarial joint training with self-attention mechanism for robust end-to-end speech recognition
    Li, Lujun
    Kang, Yikai
    Shi, Yuchen
    Kurzinger, Ludwig
    Watzel, Tobias
    Rigoll, Gerhard
    [J]. EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2021, 2021 (01)
  • [39] SPEAKER ADAPTATION FOR END-TO-END CTC MODELS
    Li, Ke
    Li, Jinyu
    Zhao, Yong
    Kumar, Kshitiz
    Gong, Yifan
    [J]. 2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 542 - 549
  • [40] End-to-End Speech Recognition with Auditory Attention for Multi-Microphone Distance Speech Recognition
    Kim, Suyoun
    Lane, Ian
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 3867 - 3871