DeepConversion: Voice conversion with limited parallel training data

Cited by: 13
Authors
Zhang, Mingyang [1, 2]
Sisman, Berrak [2, 3]
Zhao, Li [1]
Li, Haizhou [2]
Affiliations
[1] Southeast Univ, Sch Informat Sci & Engn, Nanjing, Peoples R China
[2] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore, Singapore
[3] Singapore Univ Technol & Design, Informat Syst Technol & Design Pillar, Singapore, Singapore
Funding
National Research Foundation, Singapore
Keywords
Voice conversion; Limited data; Deep learning; WaveNet; Speech synthesis; Sparse representation; Speaker adaptation; Compensation; Algorithms; Spectrum; Prosody
DOI
10.1016/j.specom.2020.05.004
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
A deep neural network approach to voice conversion usually depends on a large amount of parallel training data from the source and target speakers. In this paper, we propose a novel conversion pipeline, DeepConversion, that leverages a large amount of non-parallel, multi-speaker data but requires only a small amount of parallel training data. We believe that the shared characteristics of speakers can be represented by training a speaker-independent general model on a large amount of publicly available, non-parallel, multi-speaker speech data. Such a general model can then be used to learn the mapping between the source and target speakers more effectively from a limited amount of parallel training data. We also propose a strategy to make full use of the parallel data in all models along the pipeline. In particular, the parallel data is used to adapt the general model towards the source-target speaker pair to achieve a coarse-grained conversion, and to develop a compact Error Reduction Network (ERN) for a fine-grained conversion. The parallel data is also used to adapt the WaveNet vocoder towards the source-target pair. The experiments show that DeepConversion, which uses only a limited amount of parallel training data, consistently outperforms traditional approaches that use a large amount of parallel training data, in both objective and subjective evaluations. © 2020 The Authors. Published by Elsevier B.V.
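The three-stage idea in the abstract (general model, adaptation, error reduction) can be illustrated with a toy sketch. This is not the authors' implementation: linear least-squares models stand in for the deep networks, the data are synthetic, the WaveNet vocoder adaptation is omitted, and every name and size below (`dim`, `W_general`, `ern_features`, 2000 pool frames, 20 parallel frames) is an assumption made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # toy feature dimension (hypothetical)


def mse(pred, target):
    """Mean squared error over all frames and feature dimensions."""
    return float(np.mean((pred - target) ** 2))


# Stage 1: fit a "general model" on a large synthetic pool that stands in
# for the multi-speaker data (a linear map replaces the deep network).
W_shared = rng.normal(size=(dim, dim))
X_pool = rng.normal(size=(2000, dim))
Y_pool = X_pool @ W_shared + 0.1 * rng.normal(size=(2000, dim))
W_general, *_ = np.linalg.lstsq(X_pool, Y_pool, rcond=None)

# A small "parallel" set for one source-target pair. Its true mapping
# deviates from the shared one and has an offset the linear map cannot model.
W_pair = W_shared + 0.3 * rng.normal(size=(dim, dim))
bias = rng.normal(size=dim)
X_par = rng.normal(size=(20, dim))
Y_par = X_par @ W_pair + bias
err_general = mse(X_par @ W_general, Y_par)

# Stage 2: adapt the general model to the pair (coarse-grained conversion)
# with gradient descent started from W_general. The step size 1/L, where L
# is the largest eigenvalue of X_par^T X_par / n, guarantees the loss
# does not increase.
n = X_par.shape[0]
step = n / np.linalg.norm(X_par, 2) ** 2
W_adapted = W_general.copy()
for _ in range(50):
    grad = X_par.T @ (X_par @ W_adapted - Y_par) / n
    W_adapted -= step * grad
coarse = X_par @ W_adapted
err_adapted = mse(coarse, Y_par)


# Stage 3: a compact error-reduction model predicts the residual left by
# the coarse conversion (fine-grained conversion).
def ern_features(c):
    """Coarse output, a tanh nonlinearity of it, and a constant column."""
    return np.hstack([c, np.tanh(c), np.ones((c.shape[0], 1))])


W_ern, *_ = np.linalg.lstsq(ern_features(coarse), Y_par - coarse, rcond=None)
fine = coarse + ern_features(coarse) @ W_ern
err_fine = mse(fine, Y_par)

print(f"general: {err_general:.4f}  adapted: {err_adapted:.4f}  fine: {err_fine:.4f}")
```

The printed errors are measured on the same 20 parallel frames used for adaptation, so they only show that each stage reduces the fitting error in this toy setting. They say nothing about the held-out objective or subjective results reported in the paper.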
Pages: 31-43
Page count: 13