High-quality Voice Conversion Using Spectrogram-Based WaveNet Vocoder

Cited by: 0
Authors
Chen, Kuan [1]
Chen, Bo [1]
Lai, Jiahao [1]
Yu, Kai [1]
Affiliation
[1] Shanghai Jiao Tong Univ, Key Lab Shanghai Educ Commiss Intelligent Interac, Brain Sci & Technol Res Ctr, SpeechLab, Dept Comp Sci & Engn, Shanghai, Peoples R China
Keywords
voice conversion; WaveNet vocoder; mel-frequency spectrogram; LSTM-RNN; SYSTEM; TIME
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The waveform generator is a key component in voice conversion. Recently, a WaveNet waveform generator conditioned on the Mel-cepstrum (Mcep) has shown better quality than standard vocoders. In this paper, an enhanced WaveNet model based on the spectrogram is proposed to further improve voice conversion performance. Here, the Mel-frequency spectrogram is converted from the source speaker to the target speaker using an LSTM-RNN based frame-to-frame feature mapping. To evaluate performance, the proposed approach is compared to an Mcep-based LSTM-RNN voice conversion system. Both the STRAIGHT vocoder and an Mcep-based WaveNet vocoder are selected to produce the converted speech for the Mcep conversion system. The fundamental frequency (F0) of the converted speech in the different systems is analyzed. Naturalness, similarity, and intelligibility are evaluated with subjective measures. Results show that the spectrogram-based WaveNet waveform generator achieves better voice conversion quality than traditional WaveNet approaches. The Mel-spectrogram based voice conversion achieves significant improvement in speaker similarity and inherent F0 conversion.
Pages: 1993-1997
Page count: 5