DEEP CONTEXTUALIZED ACOUSTIC REPRESENTATIONS FOR SEMI-SUPERVISED SPEECH RECOGNITION

Authors
Ling, Shaoshi [1]
Liu, Yuzong [1]
Salazar, Julian [1]
Kirchhoff, Katrin [1]
Affiliations
[1] Amazon AWS AI, Seattle, WA 98109 USA
Keywords
speech recognition; acoustic representation learning; semi-supervised learning; framework
DOI
10.1109/icassp40776.2020.9053176
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required; unsupervised training on LibriSpeech then supervision with 100 hours of labeled data achieves performance on par with training on all 960 hours directly.
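The slice-reconstruction idea in the abstract can be illustrated with a minimal sketch. It only shows how an utterance might be divided into past context, a target slice, and future context, plus one possible reconstruction loss. The function names, the exact context boundaries, and the choice of an L1 loss are assumptions made for illustration; the abstract does not specify them, and the encoder that maps context to a predicted slice is omitted.

```python
from typing import List, Tuple

Frame = List[float]  # one filterbank feature vector (hypothetical toy type)


def split_for_slice_reconstruction(
    frames: List[Frame], start: int, slice_len: int
) -> Tuple[List[Frame], List[Frame], List[Frame]]:
    """Split an utterance into (past context, target slice, future context).

    Assumption: the contexts exclude the target slice entirely.
    """
    if slice_len <= 0 or start < 0 or start + slice_len > len(frames):
        raise ValueError("slice must lie inside the utterance")
    past = frames[:start]
    target = frames[start:start + slice_len]
    future = frames[start + slice_len:]
    return past, target, future


def l1_reconstruction_loss(predicted: List[Frame], target: List[Frame]) -> float:
    """Sum of absolute differences over all frames and dimensions (assumed loss)."""
    return sum(
        abs(p - t)
        for pred_frame, tgt_frame in zip(predicted, target)
        for p, t in zip(pred_frame, tgt_frame)
    )


# Toy utterance: 6 frames with 2 feature dimensions each.
frames = [[float(i), float(i) + 0.5] for i in range(6)]
past, target, future = split_for_slice_reconstruction(frames, start=2, slice_len=2)

# A trivial all-zero "prediction" stands in for a model's output.
zero_prediction = [[0.0, 0.0] for _ in target]
loss = l1_reconstruction_loss(zero_prediction, target)
```

In a real system, a model would consume `past` and `future` and emit the prediction that is compared against `target`; the all-zero prediction here exists only to exercise the loss function.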
Pages: 6429-6433
Number of pages: 5