TWO-STAGE PRE-TRAINING FOR SEQUENCE TO SEQUENCE SPEECH RECOGNITION

Cited by: 0
Authors
Fan, Zhiyun [1 ,2 ]
Zhou, Shiyu [1 ]
Xu, Bo [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
Keywords
pre-training; speech recognition; encoder-decoder; sequence-to-sequence
DOI
10.1109/IJCNN52387.2021.9534170
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The attention-based encoder-decoder structure is popular in automatic speech recognition (ASR). However, it relies heavily on transcribed data. In this paper, we propose a novel pre-training strategy for the encoder-decoder sequence-to-sequence (seq2seq) model that utilizes unpaired speech and transcripts. The pre-training process consists of two stages: acoustic pre-training and linguistic pre-training. In the acoustic pre-training stage, we use a large amount of speech to pre-train the encoder by predicting masked speech feature chunks from their contexts. In the linguistic pre-training stage, we first generate synthesized speech from a large number of transcripts using a text-to-speech (TTS) system, and then use the synthesized paired data to pre-train the decoder. The two-stage pre-training is conducted on the AISHELL-2 dataset, and we apply the pre-trained model to multiple subsets of AISHELL-1 and HKUST for post-training. As the size of the subset increases, the relative character error rate reduction (CERR) ranges from 38.24% down to 7.88% on AISHELL-1 and from 12.00% down to 1.20% on HKUST.
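To make the acoustic pre-training stage concrete, here is a minimal PyTorch sketch (not the authors' released code) of a masked-chunk reconstruction objective: contiguous chunks of log-mel frames are zeroed out and a Transformer encoder is trained to reconstruct them from the surrounding context. The class name `ChunkMaskedPretrainer`, the hyperparameters `chunk_len` and `mask_ratio`, and the L1 loss on masked frames are all illustrative assumptions; the paper's exact masking policy and loss may differ.

```python
import torch
import torch.nn as nn

class ChunkMaskedPretrainer(nn.Module):
    """Illustrative masked-chunk pre-training for the acoustic encoder."""

    def __init__(self, feat_dim=80, d_model=256, n_layers=4,
                 chunk_len=10, mask_ratio=0.15):
        super().__init__()
        self.chunk_len = chunk_len
        self.mask_ratio = mask_ratio
        self.in_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, feat_dim)  # reconstruct input features

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) log-mel features
        B, T, _ = feats.shape
        mask = torch.zeros(B, T, dtype=torch.bool, device=feats.device)
        n_chunks = max(1, int(T * self.mask_ratio) // self.chunk_len)
        for b in range(B):
            starts = torch.randint(0, max(1, T - self.chunk_len), (n_chunks,))
            for s in starts.tolist():
                mask[b, s:s + self.chunk_len] = True  # mask a contiguous chunk
        corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)
        pred = self.out_proj(self.encoder(self.in_proj(corrupted)))
        # L1 reconstruction loss, computed only on the masked frames
        return (pred - feats).abs()[mask].mean()

# Usage: pre-train on unlabeled speech features.
model = ChunkMaskedPretrainer()
loss = model(torch.randn(2, 200, 80))  # 2 utterances, 200 frames each
loss.backward()
```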
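For the linguistic pre-training stage, a hedged sketch of the training loop: transcripts are rendered to synthetic speech by an external TTS system, and the resulting paired data trains the seq2seq model. Freezing the acoustically pre-trained encoder so the update concentrates on the decoder is one plausible choice assumed here; `synthesize` and the `model(feats, text)` interface are hypothetical placeholders, not an API from the paper.

```python
def linguistic_pretrain(model, transcripts, synthesize, optimizer):
    """Pre-train the decoder on (synthetic speech, text) pairs.

    Assumptions (illustrative, not from the paper): `model(feats, text)`
    returns the teacher-forced seq2seq cross-entropy loss, and
    `synthesize(text)` is an external TTS returning (1, T, feat_dim)
    speech features for one transcript.
    """
    for p in model.encoder.parameters():
        p.requires_grad = False  # assumed: keep the pre-trained encoder fixed
    for text in transcripts:
        feats = synthesize(text)     # text-only data -> synthetic speech
        loss = model(feats, text)    # seq2seq cross-entropy on the pair
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```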
Pages: 6