End-to-End Speech Recognition Technology Based on Multi-Stream CNN

Cited by: 0
Authors
Xiao, Hao [1]
Qiu, Yuan [1]
Fei, Rong [1]
Chen, Xiongbo [2]
Liu, Zuo [2]
Wu, Zongling [1]
Affiliations
[1] Xian Univ Technol, Coll Comp Sci & Engn, Xian, Peoples R China
[2] Xian Univ Technol, Guangxi CAIH Smart Telecom Tech Co Ltd, Xian, Guangxi, Peoples R China
Keywords
Speech Recognition; MCNN; Transformer; CTC;
DOI
10.1109/TrustCom56396.2022.00183
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
As end-to-end speech recognition grows in popularity, we survey several end-to-end speech technologies and study a Transformer-based speech framework. We find that its multi-head attention is not effective at capturing local features and that, when faced with noise in real-world scenes, training converges too slowly. To address these problems, we propose a new speech recognition framework based on the MCNN-Transformer-CTC method. In the pre-acoustic unit, an MCNN (multi-stream convolutional neural network) extracts local features through multiple parallel channels in terms of time width and spectral capability, which compensates for the weakness of the self-attention mechanism in local feature extraction. A CTC structure is added through multi-task learning to mitigate the slow training convergence. On the Aishell1 dataset, the model reaches a CER of 6.23%, a further improvement over the Transformer model.
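The two ideas named in the abstract, parallel convolution streams and a CTC term added through multi-task learning, can be illustrated with a toy sketch. This is not the authors' implementation. The function names, the fixed (unlearned) kernels, the kernel widths, and the `ctc_weight` value are all assumptions made for illustration. The weighted sum of the two losses is the interpolation commonly used in hybrid CTC/attention training, which the abstract does not spell out.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Slide an odd-width kernel over time with zero padding (output length == input length)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(padded[t + i] * kernel[i] for i in range(k))
        for t in range(len(signal))
    ]


def multi_stream_features(
    signal: Sequence[float], kernels: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Run each kernel as one parallel stream, then concatenate the streams per frame."""
    streams = [conv1d_same(signal, kernel) for kernel in kernels]
    return [list(frame) for frame in zip(*streams)]


def joint_loss(ctc_loss: float, attention_loss: float, ctc_weight: float = 0.3) -> float:
    """Multi-task objective: weighted sum of CTC and attention losses (weight is an assumption)."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Hypothetical example: one feature track, two streams with time widths 1 and 3.
signal = [1.0, 2.0, 3.0, 4.0]
kernels = [[1.0], [1.0, 1.0, 1.0]]
features = multi_stream_features(signal, kernels)
loss = joint_loss(ctc_loss=2.0, attention_loss=1.0, ctc_weight=0.5)
```

Each frame in `features` carries one value per stream, so a narrow kernel and a wide kernel contribute local context at different time widths. A real MCNN would presumably learn two-dimensional kernels over time and frequency rather than use fixed one-dimensional ones.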
Pages: 1310-1315
Number of pages: 6