COMBINING UNSUPERVISED AND TEXT AUGMENTED SEMI-SUPERVISED LEARNING FOR LOW RESOURCED AUTOREGRESSIVE SPEECH RECOGNITION

Cited by: 1
Authors
Li, Chak-Fai [1 ]
Keith, Francis [1 ]
Hartmann, William [1 ]
Snover, Matthew [1 ]
Affiliations
[1] Raytheon BBN Technol, Cambridge, MA 02138 USA
Keywords
seq2seq; unsupervised learning; semi-supervised training; domain adaptation; representation
DOI
10.1109/ICASSP43922.2022.9747005
CLC number
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Recent advances in unsupervised representation learning have demonstrated the impact of pretraining on large amounts of read speech. We adapt these techniques for domain adaptation to conversational and broadcast domains that are low-resource in terms of both data and compute. Moving beyond CTC, we pretrain state-of-the-art Conformer models in an unsupervised manner. While the unsupervised approach outperforms traditional semi-supervised training, the two techniques are complementary: combining them yields a 5% absolute improvement in WER, averaged over all conditions, compared to semi-supervised training alone. Additional text data is incorporated through external language models. CTC-based decoding lets us take better advantage of this text data: used as a transcription model, it allows the Conformer model to incorporate knowledge from the language model through semi-supervised training more effectively than shallow fusion does. Final performance improves by a further 2% absolute when CTC-based decoding, rather than shallow fusion, is used for semi-supervised training.
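The shallow-fusion baseline that the abstract compares against is, in the standard formulation, a log-linear interpolation of the end-to-end model's token scores with an external language model's scores during beam search. A minimal sketch of that per-token scoring rule follows; the function name, the λ = 0.3 default, and the toy log-probabilities are illustrative assumptions, not values taken from the paper.

```python
def shallow_fusion_score(am_log_probs, lm_log_probs, lm_weight=0.3):
    """Score candidate tokens by shallow fusion: log p_AM(y) + lambda * log p_LM(y).

    am_log_probs: per-token log-probabilities from the seq2seq/acoustic model
    lm_log_probs: per-token log-probabilities from the external language model
    lm_weight:    interpolation weight lambda for the LM (tuned on held-out data)
    """
    return [am + lm_weight * lm for am, lm in zip(am_log_probs, lm_log_probs)]


# Toy example: two candidate tokens, fused with lambda = 0.5.
fused = shallow_fusion_score([-1.0, -2.0], [-0.5, -1.0], lm_weight=0.5)
```

In a decoder, these fused scores would rank beam hypotheses in place of the raw model scores; the paper's point is that injecting the LM via CTC-decoded transcripts for semi-supervised training transfers its knowledge into the Conformer more effectively than this decode-time interpolation.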
Pages: 6892-6896
Page count: 5