Voice conversion with SI-DNN and KL divergence based mapping without parallel training data

Cited by: 4
Authors
Xie, Feng-Long [1 ,2 ]
Soong, Frank K. [2 ]
Li, Haifeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Heilongjiang, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
Keywords
Voice conversion; Kullback-Leibler divergence; Deep neural nets; Neural networks; Adaptation
DOI
10.1016/j.specom.2018.11.007
CLC classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
We propose a Speaker Independent Deep Neural Net (SI-DNN) and Kullback-Leibler Divergence (KLD) based mapping approach to voice conversion without using parallel training data. The acoustic difference between source and target speakers is equalized with SI-DNN via its estimated output posteriors, which serve as a probabilistic mapping from acoustic input frames to the corresponding symbols in the phonetic space. KLD is chosen as an ideal distortion measure to find an appropriate mapping from each input source speaker's frame to that of the target speaker. The mapped acoustic segments of the target speaker form the construction bases for voice conversion. With or without word transcriptions of the target speaker's training data, the approach can be either supervised or unsupervised. In a supervised mode where adequate training data can be utilized to train a conventional, statistical parametric TTS of the target speaker, each input frame of the source speaker is converted to its nearest sub-phonemic "senone". In an unsupervised mode, the frame is converted to the nearest clustered phonetic centroid or a raw speech frame, in the minimum KLD sense. The acoustic trajectory of the converted voice is rendered with the maximum probability trajectory generation algorithm. Both objective and subjective measures used for evaluating voice conversion performance show that the new algorithm performs better than the sequential error minimization based DNN baseline trained with parallel training data.
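The core mapping step described in the abstract — assigning each source frame to the target unit (senone, clustered centroid, or raw frame) whose SI-DNN posterior is closest in the minimum-KLD sense — can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function names, the flooring constant `eps`, and the brute-force search over target units are assumptions for clarity.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two discrete posterior distributions.

    Posteriors are floored at eps to avoid log(0) / division by zero.
    """
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def map_frames_min_kld(source_posteriors, target_posteriors):
    """Map each source frame to the target unit with minimum KLD.

    source_posteriors: (T, K) array of SI-DNN output posteriors for
        T source-speaker frames over K phonetic symbols.
    target_posteriors: (N, K) array of posteriors for N target units
        (senones in the supervised mode; clustered phonetic centroids
        or raw target frames in the unsupervised mode).
    Returns a list of length T with the chosen target-unit index
    for each source frame.
    """
    mapping = []
    for p in source_posteriors:
        klds = [kl_divergence(p, q) for q in target_posteriors]
        mapping.append(int(np.argmin(klds)))
    return mapping
```

In the paper's pipeline the selected target units then serve as construction bases, and the final trajectory is smoothed with the maximum probability trajectory generation algorithm; that step is omitted here.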
Pages: 57-67 (11 pages)