End-to-End Speech Recognition Technology Based on Multi-Stream CNN

Cited by: 0

Authors:

Xiao, Hao [1]

Qiu, Yuan [1]

Fei, Rong [1]

Chen, Xiongbo [2]

Liu, Zuo [2]

Wu, Zongling [1]

Affiliations:

[1] Xian Univ Technol, Coll Comp Sci & Engn, Xian, Peoples R China

[2] Xian Univ Technol, Guangxi CAIH Smart Telecom Tech Co Ltd, Xian, Guangxi, Peoples R China

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM | 2022

Keywords:

Speech Recognition; MCNN; Transformer; CTC;

DOI:

10.1109/TrustCom56396.2022.00183

Chinese Library Classification number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

As end-to-end speech recognition grows in popularity, we survey several end-to-end speech technologies and study a Transformer-based speech framework. We find that its multi-head attention is not effective at acquiring local features and that, when faced with noise in real-world scenes, training converges too slowly. To address these problems, we propose a new speech recognition framework based on an MCNN-Transformer-CTC method. An MCNN (multi-stream convolutional neural network) in the pre-acoustic unit extracts local features through multiple parallel channels in terms of time width and spectral capability, which compensates for the self-attention mechanism's weakness in local feature extraction. A multi-task learning method adds a CTC structure to compensate for slow training convergence. On the Aishell1 dataset, the model reaches a character error rate (CER) of 6.23%, a further improvement over the Transformer model.
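The two ideas in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes each stream is a 1-D filter of a different time width applied in parallel to the same frame sequence, with fixed averaging kernels standing in for learned convolution weights. It also assumes the multi-task objective is a weighted sum of a CTC loss and an attention-decoder loss. The names `conv1d_same`, `multi_stream_features`, `joint_loss`, and the default weight 0.3 are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def conv1d_same(frames: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Slide the kernel over time with zero padding (output length == input length).

    No kernel flip is applied, i.e. this is cross-correlation, as in most
    deep learning libraries.
    """
    assert len(kernel) % 2 == 1, "odd widths keep the window centred on each frame"
    half = len(kernel) // 2
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < len(frames):  # positions outside the signal count as zero
                acc += w * frames[idx]
        out.append(acc)
    return out


def multi_stream_features(frames: Sequence[float],
                          widths: Sequence[int] = (3, 5, 7)) -> List[List[float]]:
    """Run one stream per time width in parallel and stack the outputs per frame.

    Assumption: averaging kernels replace the learned weights of a real MCNN.
    """
    streams = [conv1d_same(frames, [1.0 / w] * w) for w in widths]
    return [list(column) for column in zip(*streams)]


def joint_loss(ctc_loss: float, attention_loss: float, ctc_weight: float = 0.3) -> float:
    """Weighted multi-task objective; ctc_weight is a hypothetical hyperparameter."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


if __name__ == "__main__":
    features = multi_stream_features([0.0, 0.0, 1.0, 0.0, 0.0], widths=(3, 5))
    print(features)                   # one [narrow, wide] pair per frame
    print(joint_loss(2.0, 1.0, 0.5))  # equal weighting of both losses
```

Each frame gets one value per stream, so narrow and wide temporal contexts are available side by side before any attention layer sees them.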

Citation

Pages: 1310-1315

Page count: 6

