Frame Selection in SI-DNN Phonetic Space with WaveNet Vocoder for Voice Conversion without Parallel Training Data

Cited by: 0
Authors
Xie, Feng-Long [1 ,2 ,4 ]
Soong, Frank K. [2 ]
Wang, Xi [3 ]
He, Lei [3 ]
Li, Haifeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Heilongjiang, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
[3] Microsoft Cloud & AI, Beijing, Peoples R China
[4] Microsoft Res Asia, Speech Grp, Beijing, Peoples R China
Keywords
deep neural network; Kullback-Leibler divergence; WaveNet vocoder; voice conversion; neural networks
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In this paper, we propose a frame selection approach to voice conversion based on a speaker-independent deep neural network (SI-DNN) and the Kullback-Leibler divergence (KLD). The acoustic difference between the source and target speakers is equalized by the SI-DNN in the ASR senone phonetic space, and the KLD is used as the distortion measure for selecting the corresponding target frame for each source frame. The acoustic trajectory of the selected frames is rendered with a maximum-probability trajectory generation algorithm, and a WaveNet-based vocoder is applied to the converted trajectory to produce the final speech waveform. Subjective evaluation shows that: 1) the proposed method outperforms the phonetic-cluster-based selection method [16]; 2) the WaveNet vocoder significantly improves naturalness and speaker similarity over a linear predictive coding (LPC) based vocoder; and 3) a WaveNet vocoder trained only on spectral features, i.e., line spectrum pairs (LSPs), maintains the target speaker's pitch pattern better than one trained on both spectral features (LSPs) and prosodic features (F0 and an unvoiced/voiced flag).
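The core selection step described in the abstract can be illustrated with a minimal sketch. The snippet below is an assumed illustration, not the authors' implementation: source_post and target_post are hypothetical per-frame senone-posterior arrays produced by a speaker-independent ASR DNN, and a symmetric form of the KLD is used here as one plausible choice of divergence.

```python
# Minimal sketch (assumed) of KLD-based frame selection in the SI-DNN
# senone-posterior space. Arrays have shape (num_frames, num_senones),
# and each row is a posterior distribution that sums to 1.
import numpy as np

def kld(p, q, eps=1e-10):
    """Symmetric Kullback-Leibler divergence between two posterior vectors."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def select_target_frames(source_post, target_post):
    """For each source frame, pick the target-corpus frame whose senone
    posterior is closest in KLD; returns indices into target_post."""
    selected = []
    for p in source_post:
        dists = [kld(p, q) for q in target_post]
        selected.append(int(np.argmin(dists)))
    return selected

if __name__ == "__main__":
    # Illustration only: random Dirichlet posteriors stand in for DNN outputs.
    rng = np.random.default_rng(0)
    src = rng.dirichlet(np.ones(50), size=5)    # 5 source frames, 50 senones
    tgt = rng.dirichlet(np.ones(50), size=200)  # 200 candidate target frames
    print(select_target_frames(src, tgt))       # indices of selected target frames
```

In the full system, the spectral features (LSPs) of the selected target frames would then be smoothed by the maximum-probability trajectory generation step and passed to the WaveNet vocoder; neither of those steps is shown in this sketch.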
Pages: 56-60
Number of pages: 5
Related Papers
8 records
  • [1] Voice conversion with SI-DNN and KL divergence based mapping without parallel training data
    Xie, Feng-Long
    Soong, Frank K.
    Li, Haifeng
    [J]. SPEECH COMMUNICATION, 2019, 106 : 57 - 67
  • [2] WaveNet Vocoder with Limited Training Data for Voice Conversion
    Liu, Li-Juan
    Ling, Zhen-Hua
    Jiang, Yuan
    Zhou, Ming
    Dai, Li-Rong
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 1983 - 1987
  • [3] Jointly Trained Conversion Model and WaveNet Vocoder for Non-parallel Voice Conversion using Mel-spectrograms and Phonetic Posteriorgrams
    Liu, Songxiang
    Cao, Yuewen
    Wu, Xixin
    Sun, Lifa
    Liu, Xunying
    Meng, Helen
    [J]. INTERSPEECH 2019, 2019, : 714 - 718
  • [4] AN IMPROVED FRAME-UNIT-SELECTION BASED VOICE CONVERSION SYSTEM WITHOUT PARALLEL TRAINING DATA
    Xie, Feng-Long
    Li, Xin-Hui
    Liu, Bo
    Zheng, Yi-Bin
    Meng, Li
    Lu, Li
    Soong, Frank K.
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7754 - 7758
  • [5] PHONETIC POSTERIORGRAMS FOR MANY-TO-ONE VOICE CONVERSION WITHOUT PARALLEL DATA TRAINING
    Sun, Lifa
    Li, Kun
    Wang, Hao
    Kang, Shiyin
    Meng, Helen
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO (ICME), 2016,
  • [6] SPARSE REPRESENTATION OF PHONETIC FEATURES FOR VOICE CONVERSION WITH AND WITHOUT PARALLEL DATA
    Sisman, Berrak
    Li, Haizhou
    Tan, Kay Chen
    [J]. 2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2017, : 677 - 684
  • [7] A KL Divergence and DNN-based Approach to Voice Conversion without Parallel Training Sentences
    Xie, Feng-Long
    Soong, Frank K.
    Li, Haifeng
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 287 - 291
  • [8] ARVC: An Auto-Regressive Voice Conversion System Without Parallel Training Data
    Lian, Zheng
    Wen, Zhengqi
    Zhou, Xinyong
    Pu, Songbai
    Zhang, Shengkai
    Tao, Jianhua
    [J]. INTERSPEECH 2020, 2020, : 4706 - 4710