Waveform Modeling and Generation Using Hierarchical Recurrent Neural Networks for Speech Bandwidth Extension

Cited by: 42

Authors:

Ling, Zhen-Hua [1]

Ai, Yang [1]

Gu, Yu [1,2]

Dai, Li-Rong [1]

Affiliations:

[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei 230027, Anhui, Peoples R China

[2] Baidu Speech Dept, Baidu Technol Pk, Beijing 100193, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2018 / Vol. 26 / No. 05

Funding:

National Natural Science Foundation of China; National Key R&D Program of China;

Keywords:

Speech bandwidth extension; recurrent neural networks; dilated convolutional neural networks; bottleneck features; VOICE CONVERSION; ENHANCEMENT;

DOI:

10.1109/TASLP.2018.2798811

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

This paper presents a waveform modeling and generation method using hierarchical recurrent neural networks (HRNN) for speech bandwidth extension (BWE). Different from conventional BWE methods that predict spectral parameters for reconstructing wideband speech waveforms, this BWE method models and predicts waveform samples directly without using vocoders. Inspired by SampleRNN, which is an unconditional neural audio generator, the HRNN model represents the distribution of each wideband or high-frequency waveform sample conditioned on the input narrowband waveform samples using a neural network composed of long short-term memory (LSTM) layers and feed-forward layers. The LSTM layers form a hierarchical structure and each layer operates at a specific temporal resolution to efficiently capture long-span dependencies between temporal sequences. Furthermore, additional conditions, such as the bottleneck features derived from narrowband speech using a deep neural network based state classifier, are employed as auxiliary input to further improve the quality of generated wideband speech. The experimental results of comparing several waveform modeling methods show that the HRNN-based method can achieve better speech quality and run-time efficiency than the dilated convolutional neural network based method and the plain sample-level recurrent neural network based method. Our proposed method also outperforms the conventional vocoder-based BWE method using LSTM-RNNs in terms of the subjective quality of the reconstructed wideband speech.
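
The idea that each recurrent layer "operates at a specific temporal resolution" can be illustrated with a small framing sketch. This is not the authors' implementation: the frame sizes (16, 4, 1), the toy waveform, and the helper names `frame_signal` and `tier_schedule` are assumptions chosen only for illustration. It shows how a coarser tier takes fewer update steps, each covering a longer span of samples, than a finer tier.

```python
# Illustrative only: frame sizes and function names are assumptions,
# not values or APIs taken from the paper.

def frame_signal(samples, frame_size):
    """Split samples into consecutive non-overlapping frames.

    Any incomplete tail shorter than frame_size is dropped.
    """
    n_frames = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(n_frames)]


def tier_schedule(n_samples, frame_sizes):
    """Return how many recurrent updates each tier makes over n_samples.

    A tier with frame size F advances its state once every F samples,
    so coarser tiers take fewer steps and cover a longer span per step.
    """
    return {f: n_samples // f for f in frame_sizes}


# A toy "narrowband waveform" of 32 samples (hypothetical values).
waveform = [i % 8 for i in range(32)]

coarse_frames = frame_signal(waveform, 16)  # input frames for a coarse tier
fine_frames = frame_signal(waveform, 4)     # input frames for a finer tier
schedule = tier_schedule(len(waveform), [16, 4, 1])
```

In a full model, each tier's frames would feed an LSTM layer whose output conditions the next finer tier; that part is omitted here because the record above does not specify the layer configuration.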


Pages: 883-894

Number of pages: 12
