Discriminative Learning of Filterbank Layer within Deep Neural Network Based Speech Recognition for Speaker Adaptation

Cited by: 4
Authors
Seki, Hiroshi [1]
Yamamoto, Kazumasa [2]
Akiba, Tomoyosi [1]
Nakagawa, Seiichi [1, 2]
Affiliations
[1] Toyohashi Univ Technol, Dept Comp Sci & Engn, Toyohashi, Aichi 4418580, Japan
[2] Chubu Univ, Dept Comp Sci, Kasugai, Aichi 4878501, Japan
Funding
Japan Society for the Promotion of Science;
Keywords
speech recognition; deep neural network; acoustic model; speaker adaptation; filterbank learning; FEATURES; MODEL;
DOI
10.1587/transinf.2018EDP7252
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Deep neural networks (DNNs) have achieved significant success in automatic speech recognition. One main advantage of DNNs is automatic feature extraction without human intervention. However, adaptation with limited data remains a major challenge for DNN-based systems because of their enormous number of free parameters. In this paper, we propose a filterbank-incorporated DNN that combines a filterbank layer, which represents the filter shape and center frequency, with a DNN-based acoustic model. Whereas most systems feed pre-defined mel-scale filterbank features to the DNN as input acoustic features, the filterbank layer and the subsequent network layers of the proposed model are trained jointly, exploiting the advantages of hierarchical feature extraction. The filters in the filterbank layer are parameterized to represent speaker characteristics while keeping the number of parameters small. Optimizing one type of parameter corresponds to Vocal Tract Length Normalization (VTLN), and optimizing another type corresponds to feature-space Maximum Likelihood Linear Regression (fMLLR) and feature-space Discriminative Linear Regression (fDLR). Because the filterbank layer consists of just a few parameters, it is well suited to adaptation with limited data. In experiments, filterbank-incorporated DNNs were effective for speaker and gender adaptation with limited adaptation data. Results on the Corpus of Spontaneous Japanese (CSJ) task show that adapting the proposed model with 10 utterances achieves a 5.8% word error reduction ratio relative to the unadapted model.
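The abstract does not spell out how the filters are parameterized, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes Gaussian-shaped filters on a mel-warped frequency axis, each described by three values (center, width, gain). The function names, the choice of 40 filters, 257 spectral bins, and a 16 kHz sampling rate are all hypothetical. One plausible reading of the abstract is that moving the centers acts like the frequency warping of VTLN, while per-filter gains act like a restricted feature-space linear transform.

```python
import numpy as np


def hz_to_mel(freq_hz):
    """Convert frequency in Hz to the mel scale (common 2595*log10 formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)


def gaussian_filterbank(centers_mel, widths_mel, gains, n_bins, sample_rate):
    """Build an (n_filters, n_bins) weight matrix from three values per filter.

    Assumption: each filter is a Gaussian bump on the mel axis. The abstract
    only says filters have a shape and a center frequency.
    """
    bin_mel = hz_to_mel(np.linspace(0.0, sample_rate / 2.0, n_bins))
    offset = bin_mel[None, :] - centers_mel[:, None]
    return gains[:, None] * np.exp(-0.5 * (offset / widths_mel[:, None]) ** 2)


def filterbank_layer(power_spec, centers_mel, widths_mel, gains, sample_rate):
    """Map (n_frames, n_bins) power spectra to (n_frames, n_filters) log energies."""
    weights = gaussian_filterbank(
        centers_mel, widths_mel, gains, power_spec.shape[-1], sample_rate
    )
    return np.log(power_spec @ weights.T + 1e-10)  # small floor avoids log(0)


# Hypothetical configuration, chosen only for illustration.
SAMPLE_RATE = 16000
N_BINS = 257
N_FILTERS = 40

# Initialize centers evenly on the mel scale, like a pre-defined mel filterbank.
edges = np.linspace(0.0, hz_to_mel(SAMPLE_RATE / 2.0), N_FILTERS + 2)
centers = edges[1:-1]
widths = np.full(N_FILTERS, edges[1] - edges[0])
gains = np.ones(N_FILTERS)

rng = np.random.default_rng(0)
frames = rng.random((5, N_BINS))  # stand-in for real power spectra
features = filterbank_layer(frames, centers, widths, gains, SAMPLE_RATE)

# Only these values would be updated during adaptation in this sketch.
n_adapt_params = centers.size + widths.size + gains.size
```

In this sketch the adaptable layer has 3 x 40 = 120 values, which illustrates why such a layer could be tuned from a handful of utterances while the much larger acoustic model stays fixed. In a real system the three parameter vectors would be trained by backpropagation together with the DNN.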
Pages: 364 - 374
Number of pages: 11
Related papers
50 in total
  • [1] RAPID SPEAKER ADAPTATION OF NEURAL NETWORK BASED FILTERBANK LAYER FOR AUTOMATIC SPEECH RECOGNITION
    Seki, Hiroshi
    Yamamoto, Kazumasa
    Akiba, Tomoyosi
    Nakagawa, Seiichi
    [J]. 2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 574 - 580
  • [2] A DEEP NEURAL NETWORK INTEGRATED WITH FILTERBANK LEARNING FOR SPEECH RECOGNITION
    Seki, Hiroshi
    Yamamoto, Kazumasa
    Nakagawa, Seiichi
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 5480 - 5484
  • [3] UNSUPERVISED SPEAKER ADAPTATION OF DEEP NEURAL NETWORK BASED ON THE COMBINATION OF SPEAKER CODES AND SINGULAR VALUE DECOMPOSITION FOR SPEECH RECOGNITION
    Xue, Shaofei
    Jiang, Hui
    Dai, Lirong
    Liu, Qingfeng
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4555 - 4559
  • [4] FAST SPEAKER ADAPTATION OF HYBRID NN/HMM MODEL FOR SPEECH RECOGNITION BASED ON DISCRIMINATIVE LEARNING OF SPEAKER CODE
    Abdel-Hamid, Ossama
    Jiang, Hui
    [J]. 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 7942 - 7946
  • [5] IMPROVEMENTS TO FILTERBANK AND DELTA LEARNING WITHIN A DEEP NEURAL NETWORK FRAMEWORK
    Sainath, Tara N.
    Kingsbury, Brian
    Mohamed, Abdel-rahman
    Saon, George
    Ramabhadran, Bhuvana
    [J]. 2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [6] Rapid and Effective Speaker Adaptation of Convolutional Neural Network Based Models for Speech Recognition
    Abdel-Hamid, Ossama
    Jiang, Hui
    [J]. 14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 1247 - 1251
  • [7] Discriminative speaker adaptation in Persian continuous speech recognition systems
    Pirhosseinloo, Shadi
    Ganj, Farshad Almas
    [J]. 4TH INTERNATIONAL CONFERENCE OF COGNITIVE SCIENCE, 2012, 32 : 296 - 301
  • [8] A unified approach to transfer learning of deep neural networks with applications to speaker adaptation in automatic speech recognition
    School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta
    GA
    30332, United States
    Unknown
    Sicily, Italy
    [J]. Neurocomputing, (448-459):
  • [9] A unified approach to transfer learning of deep neural networks with applications to speaker adaptation in automatic speech recognition
    Huang, Zhen
    Siniscalchi, Sabato Marco
    Lee, Chin-Hui
    [J]. NEUROCOMPUTING, 2016, 218 : 448 - 459
  • [10] Fast Adaptation of Deep Neural Network Based on Discriminant Codes for Speech Recognition
    Xue, Shaofei
    Abdel-Hamid, Ossama
    Jiang, Hui
    Dai, Lirong
    Liu, Qingfeng
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2014, 22 (12) : 1713 - 1725