Waveform Modeling and Generation Using Hierarchical Recurrent Neural Networks for Speech Bandwidth Extension

Cited by: 42
Authors
Ling, Zhen-Hua [1]
Ai, Yang [1]
Gu, Yu [1, 2]
Dai, Li-Rong [1]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei 230027, Anhui, Peoples R China
[2] Baidu Speech Dept, Baidu Technol Pk, Beijing 100193, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
Keywords
Speech bandwidth extension; recurrent neural networks; dilated convolutional neural networks; bottleneck features; voice conversion; enhancement
DOI
10.1109/TASLP.2018.2798811
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper presents a waveform modeling and generation method using hierarchical recurrent neural networks (HRNN) for speech bandwidth extension (BWE). Different from conventional BWE methods that predict spectral parameters for reconstructing wideband speech waveforms, this BWE method models and predicts waveform samples directly without using vocoders. Inspired by SampleRNN, which is an unconditional neural audio generator, the HRNN model represents the distribution of each wideband or high-frequency waveform sample conditioned on the input narrowband waveform samples using a neural network composed of long short-term memory (LSTM) layers and feed-forward layers. The LSTM layers form a hierarchical structure and each layer operates at a specific temporal resolution to efficiently capture long-span dependencies between temporal sequences. Furthermore, additional conditions, such as the bottleneck features derived from narrowband speech using a deep neural network based state classifier, are employed as auxiliary input to further improve the quality of generated wideband speech. The experimental results of comparing several waveform modeling methods show that the HRNN-based method can achieve better speech quality and run-time efficiency than the dilated convolutional neural network based method and the plain sample-level recurrent neural network based method. Our proposed method also outperforms the conventional vocoder-based BWE method using LSTM-RNNs in terms of the subjective quality of the reconstructed wideband speech.
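As a rough illustration of the phrase "each layer operates at a specific temporal resolution", the sketch below splits a narrowband sample sequence into non-overlapping frames of different sizes, one framing per tier. It covers only the input framing, not the LSTM or feed-forward layers. The function names and the frame sizes (16, 4, 1) are assumptions chosen for illustration; the abstract does not state the values used in the paper.

```python
def frame_sequence(samples, frame_size):
    """Split samples into consecutive non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    n_frames = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size]
            for i in range(n_frames)]


def tier_inputs(narrowband, frame_sizes=(16, 4, 1)):
    """Return, for each tier, the frames that tier would consume.

    frame_sizes is a hypothetical choice, coarsest tier first.
    A coarser tier sees fewer, longer frames, so a recurrent layer
    on that tier takes fewer steps to span the same stretch of audio.
    """
    return {size: frame_sequence(narrowband, size) for size in frame_sizes}


if __name__ == "__main__":
    waveform = list(range(32))  # stand-in for 32 narrowband samples
    for size, frames in tier_inputs(waveform).items():
        print(f"frame size {size}: {len(frames)} steps")
```

With these assumed sizes, the coarsest tier needs 2 recurrent steps to cover 32 samples, while a sample-level tier needs 32, which is the intuition behind capturing long-span dependencies efficiently.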
Pages: 883-894
Page count: 12
Related papers
50 items in total
  • [21] Arabic speech recognition using recurrent neural networks
    El Choubassi, MM
    El Khoury, HE
    Alagha, CEJ
    Skaf, JA
    Al-Alaoui, MA
    [J]. PROCEEDINGS OF THE 3RD IEEE INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND INFORMATION TECHNOLOGY, 2003, : 543 - 547
  • [22] Separation and deconvolution of speech using recurrent neural networks
    Li, Y
    Powers, D
    Wen, P
    [J]. IC-AI'2001: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOLS I-III, 2001, : 1303 - 1309
  • [23] Bandwidth extension of narrowband speech in log spectra domain using neural network
    Pourmohammadi, Sara
    Vali, Mansour
    Ghadyani, Mohsen
    [J]. TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2015, 23 (02) : 433 - 446
  • [24] CHARACTER-LEVEL LANGUAGE MODELING WITH HIERARCHICAL RECURRENT NEURAL NETWORKS
    Hwang, Kyuyeon
    Sung, Wonyong
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 5720 - 5724
  • [25] On Filter Generalization for Music Bandwidth Extension Using Deep Neural Networks
    Sulun, Serkan
    Davies, Matthew E. P.
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2021, 15 (01) : 132 - 142
  • [26] A Hierarchical Predictor of Synthetic Speech Naturalness Using Neural Networks
    Yoshimura, Takenori
    Henter, Gustav Eje
    Watts, Oliver
    Wester, Mirjam
    Yamagishi, Junichi
    Tokuda, Keiichi
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 342 - 346
  • [27] Automatic playlist generation using Convolutional Neural Networks and Recurrent Neural Networks
    Irene, Rosilde Tatiana
    Borrelli, Clara
    Zanoni, Massimiliano
    Buccoli, Michele
    Sarti, Augusto
    [J]. 2019 27TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2019
  • [28] Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks
    Yu, Haonan
    Wang, Jiang
    Huang, Zhiheng
    Yang, Yi
    Xu, Wei
    [J]. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 4584 - 4593
  • [29] Subcycle Waveform Modeling of Traffic Intersections Using Recurrent Attention Networks
    Karnati, Yashaswi
    Sengupta, Rahul
    Rangarajan, Anand
    Ranka, Sanjay
    [J]. IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2022, 23 (03) : 2538 - 2548
  • [30] A Novel Unified Framework for Speech Enhancement and Bandwidth Extension Based on Jointly Trained Neural Networks
    Liu, Bin
    Tao, Jianhua
    Zheng, Yibin
    [J]. 2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 11 - 15