DEEP CONTEXTUALIZED ACOUSTIC REPRESENTATIONS FOR SEMI-SUPERVISED SPEECH RECOGNITION

被引:0
|
作者
Ling, Shaoshi [1 ]
Liu, Yuzong [1 ]
Salazar, Julian [1 ]
Kirchhoff, Katrin [1 ]
机构
[1] Amazon AWS AI, Seattle, WA 98109 USA
关键词
speech recognition; acoustic representation learning; semi-supervised learning; FRAMEWORK;
D O I
10.1109/icassp40776.2020.9053176
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required; unsupervised training on LibriSpeech then supervision with 100 hours of labeled data achieves performance on par with training on all 960 hours directly.
引用
收藏
页码:6429 / 6433
页数:5
相关论文
共 50 条
  • [21] Speech emotion recognition using semi-supervised discriminant analysis
    Zhao, L. (zhaoli@seu.edu.cn), 1600, Southeast University (30):
  • [22] Unsupervised and semi-supervised adaptation of a hybrid speech recognition system
    Trmal, Jan
    Zelinka, Jan
    Mueller, Ludek
    PROCEEDINGS OF 2012 IEEE 11TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING (ICSP) VOLS 1-3, 2012, : 527 - 530
  • [23] Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition
    Higuchi, Yosuke
    Moritz, Niko
    Le Roux, Jonathan
    Hori, Takaaki
    INTERSPEECH 2021, 2021, : 726 - 730
  • [24] Correction to: Semi-supervised Ladder Networks for Speech Emotion Recognition
    Jian-Hua Tao
    Jian Huang
    Ya Li
    Zheng Lian
    Ming-Yue Niu
    International Journal of Automation and Computing, 2021, 18 : 680 - 680
  • [25] Semi-supervised cross-lingual speech emotion recognition
    Agarla, Mirko
    Bianco, Simone
    Celona, Luigi
    Napoletano, Paolo
    Petrovsky, Alexey
    Piccoli, Flavio
    Schettini, Raimondo
    Shanin, Ivan
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 237
  • [26] Semi-supervised parallel shared encoders for speech emotion recognition
    Pourebrahim, Yousef
    Razzazi, Farbod
    Sameti, Hossein
    DIGITAL SIGNAL PROCESSING, 2021, 118
  • [27] Acoustic model training using committee-based active and semi-supervised learning for speech recognition
    Tsutaoka, Takuya
    Shinoda, Koichi
    2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2012,
  • [28] Improved low-resource Somali speech recognition by semi-supervised acoustic and language model training
    Biswas, Astik
    Menon, Raghav
    van der Westhuizen, Ewald
    Niesler, Thomas
    INTERSPEECH 2019, 2019, : 3008 - 3012
  • [29] Semi-supervised acoustic model training for speech with code-switching
    Yilmaz, Emre
    McLaren, Mitchell
    van den Heuvel, Henk
    van Leeuwen, David A.
    SPEECH COMMUNICATION, 2018, 105 : 12 - 22
  • [30] Lightly supervised vs. semi-supervised training of acoustic model on Luxembourgish for low-resource automatic speech recognition
    Vesely, Karel
    Segura, Carlos
    Szoke, Igor
    Luque, Jordi
    Cernocky, Jan Honza
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2883 - 2887