Deep Contextualized Acoustic Representations for Semi-Supervised Speech Recognition

Authors
Ling, Shaoshi [1]
Liu, Yuzong [1]
Salazar, Julian [1]
Kirchhoff, Katrin [1]
Affiliations
[1] Amazon AWS AI, Seattle, WA 98109 USA
Keywords
speech recognition; acoustic representation learning; semi-supervised learning; framework
DOI
10.1109/icassp40776.2020.9053176
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required; unsupervised training on LibriSpeech then supervision with 100 hours of labeled data achieves performance on par with training on all 960 hours directly.
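The slice-reconstruction idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the running-mean "encoders", the linear output map `weights`, the absolute-error loss, and all function names are hypothetical stand-ins chosen so the example stays dependency-free. It only shows the shape of the objective, in which a slice of feature frames is predicted from a summary of the frames before it and a summary of the frames after it.

```python
def running_means(frames):
    """Mean of frames[0..t] for every t (a stand-in for a learned encoder)."""
    totals = [0.0] * len(frames[0])
    means = []
    for count, frame in enumerate(frames, start=1):
        totals = [s + x for s, x in zip(totals, frame)]
        means.append([s / count for s in totals])
    return means


def slice_reconstruction_loss(frames, slice_len, weights):
    """Average absolute error when predicting each slice from its context.

    frames:    list of T feature vectors (e.g. filterbank frames), each of length D
    slice_len: number of consecutive frames K to reconstruct
    weights:   K*D rows of length 2*D; a hypothetical linear output layer
    """
    if len(frames) < slice_len + 2:
        raise ValueError("need at least slice_len + 2 frames")

    past = running_means(frames)                # past[t] summarizes frames 0..t
    future = running_means(frames[::-1])[::-1]  # future[t] summarizes frames t..T-1

    total, count = 0.0, 0
    for t in range(len(frames) - slice_len - 1):
        # Context: everything up to t, plus everything after the slice.
        context = past[t] + future[t + slice_len + 1]
        # Target: the K frames strictly between the two context summaries.
        target = [x for frame in frames[t + 1 : t + slice_len + 1] for x in frame]
        pred = [sum(c * w for c, w in zip(context, row)) for row in weights]
        total += sum(abs(p - y) for p, y in zip(pred, target))
        count += len(target)
    return total / count


# Usage: six constant 2-dimensional frames and a slice of two frames.
dim, slice_len = 2, 2
frames = [[1.0] * dim for _ in range(6)]
averaging = [[1.0 / (2 * dim)] * (2 * dim) for _ in range(slice_len * dim)]
zeros = [[0.0] * (2 * dim) for _ in range(slice_len * dim)]
loss_averaging = slice_reconstruction_loss(frames, slice_len, averaging)
loss_zeros = slice_reconstruction_loss(frames, slice_len, zeros)
```

In a real system the two running means would be replaced by trained sequence encoders reading the audio in each direction, and the encoder outputs, rather than the raw filterbank features, would then be fed to the supervised recognizer.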
Pages: 6429-6433
Number of pages: 5