DeepConversion: Voice conversion with limited parallel training data

Cited by: 13
Authors
Zhang, Mingyang [1, 2]
Sisman, Berrak [2, 3]
Zhao, Li [1]
Li, Haizhou [2]
Affiliations
[1] Southeast Univ, Sch Informat Sci & Engn, Nanjing, Peoples R China
[2] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore, Singapore
[3] Singapore Univ Technol & Design, Informat Syst Technol & Design Pillar, Singapore, Singapore
Funding
National Research Foundation, Singapore;
Keywords
Voice conversion; Limited data; Deep learning; Wavenet; SPEECH SYNTHESIS; SPARSE REPRESENTATION; SPEAKER ADAPTATION; COMPENSATION; ALGORITHMS; SPECTRUM; PROSODY;
DOI
10.1016/j.specom.2020.05.004
Chinese Library Classification
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
A deep neural network approach to voice conversion usually depends on a large amount of parallel training data from source and target speakers. In this paper, we propose a novel conversion pipeline, DeepConversion, that leverages a large amount of non-parallel, multi-speaker data but requires only a small amount of parallel training data. We believe that the shared characteristics of speakers can be represented by training a speaker-independent general model on a large amount of publicly available, non-parallel, multi-speaker speech data. Such a general model can then be used to learn the mapping between source and target speakers more effectively from a limited amount of parallel training data. We also propose a strategy to make full use of the parallel data in all models along the pipeline. In particular, the parallel data is used to adapt the general model towards the source-target speaker pair to achieve a coarse-grained conversion, and to develop a compact Error Reduction Network (ERN) for a fine-grained conversion. The parallel data is also used to adapt the WaveNet vocoder towards the source-target pair. The experiments show that DeepConversion, which uses only a limited amount of parallel training data, consistently outperforms traditional approaches that use a large amount of parallel training data, in both objective and subjective evaluations. © 2020 The Authors. Published by Elsevier B.V.
Pages: 31-43
Number of pages: 13
Related papers
50 in total
  • [21] SPARSE REPRESENTATION OF PHONETIC FEATURES FOR VOICE CONVERSION WITH AND WITHOUT PARALLEL DATA
    Sisman, Berrak
    Li, Haizhou
    Tan, Kay Chen
    2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2017, : 677 - 684
  • [22] On the Use of I-vectors and Average Voice Model for Voice Conversion without Parallel Data
    Wu, Jie
    Wu, Zhizheng
    Xie, Lei
    2016 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA), 2016,
  • [23] Frame Selection in SI-DNN Phonetic Space with WaveNet Vocoder for Voice Conversion without Parallel Training Data
    Xie, Feng-Long
    Soong, Frank K.
    Wang, Xi
    He, Lei
    Li, Haifeng
    2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 56 - 60
  • [24] Many-to-Many Voice Conversion based on Bottleneck Features with Variational Autoencoder for Non-parallel Training Data
    Li, Yanping
    Lee, Kong Aik
    Yuan, Yougen
    Li, Haizhou
    Yang, Zhen
    2018 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2018, : 829 - 833
  • [25] Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training
    Zhou, Kun
    Sisman, Berrak
    Li, Haizhou
    INTERSPEECH 2021, 2021, : 811 - 815
  • [26] Non-Parallel Training in Voice Conversion Using an Adaptive Restricted Boltzmann Machine
    Nakashika, Toru
    Takiguchi, Tetsuya
    Minami, Yasuhiro
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2016, 24 (11) : 2032 - 2045
  • [27] TONGUE SHAPE CONVERSION WITH NON-PARALLEL TRAINING DATA
    Li, Hao
    Yang, Minghao
    Tao, Jianhua
    2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [28] Parallel data free singing voice conversion with cycle-consistent BEGAN
    Yousuf, Assila
    George, David Solomon
    MATERIALS TODAY-PROCEEDINGS, 2022, 58 : 157 - 161
  • [29] A Speaker-Dependent WaveNet for Voice Conversion with Non-Parallel Data
    Tian, Xiaohai
    Chng, Eng Siong
    Li, Haizhou
    INTERSPEECH 2019, 2019, : 201 - 205
  • [30] Taco-VC: A Single Speaker Tacotron based Voice Conversion with Limited Data
    Levy-Leshem, Roee
    Giryes, Raja
    28TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2020), 2021, : 391 - 395