DeepConversion: Voice conversion with limited parallel training data

Cited by: 13
Authors
Zhang, Mingyang [1 ,2 ]
Sisman, Berrak [2 ,3 ]
Zhao, Li [1 ]
Li, Haizhou [2 ]
Affiliations
[1] Southeast Univ, Sch Informat Sci & Engn, Nanjing, Peoples R China
[2] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore, Singapore
[3] Singapore Univ Technol & Design, Informat Syst Technol & Design Pillar, Singapore, Singapore
Funding
National Research Foundation, Singapore;
Keywords
Voice conversion; Limited data; Deep learning; Wavenet; SPEECH SYNTHESIS; SPARSE REPRESENTATION; SPEAKER ADAPTATION; COMPENSATION; ALGORITHMS; SPECTRUM; PROSODY;
DOI
10.1016/j.specom.2020.05.004
CLC classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
A deep neural network approach to voice conversion usually depends on a large amount of parallel training data from source and target speakers. In this paper, we propose a novel conversion pipeline, DeepConversion, that leverages a large amount of non-parallel, multi-speaker data but requires only a small amount of parallel training data. The shared characteristics of speakers can be represented by training a speaker-independent general model on a large amount of publicly available, non-parallel, multi-speaker speech data. Such a general model can then be used to learn the mapping between source and target speakers more effectively from a limited amount of parallel training data. We also propose a strategy to make full use of the parallel data in all models along the pipeline. In particular, the parallel data is used to adapt the general model towards the source-target speaker pair to achieve a coarse-grained conversion, and to develop a compact Error Reduction Network (ERN) for a fine-grained conversion. The parallel data is also used to adapt the WaveNet vocoder towards the source-target pair. The experiments show that DeepConversion, using only a limited amount of parallel training data, consistently outperforms traditional approaches that use a large amount of parallel training data, in both objective and subjective evaluations. (C) 2020 The Authors. Published by Elsevier B.V.
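The three-stage pipeline in the abstract (general model, pair adaptation for coarse-grained conversion, Error Reduction Network for fine-grained correction) can be illustrated with a deliberately simplified linear sketch. This is not the authors' implementation; the actual system uses deep neural networks and a WaveNet vocoder, and all names, shapes, and the ridge-style adaptation below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                      # toy spectral feature dimension
N_big, N_small = 1000, 20  # abundant multi-speaker data vs. limited parallel data

# Stage 1: a speaker-independent "general model" fit on abundant data.
X_multi = rng.normal(size=(N_big, D))
Y_multi = X_multi + 0.1 * (X_multi @ rng.normal(size=(D, D)))
W_gen, *_ = np.linalg.lstsq(X_multi, Y_multi, rcond=None)

# Stage 2: adapt toward the source-target pair with the small parallel set,
# regularizing toward the general model (coarse-grained conversion).
X_par = rng.normal(size=(N_small, D))
Y_par = X_par @ (W_gen + 0.3 * rng.normal(size=(D, D)))
lam = 1.0
W_adapt = np.linalg.solve(X_par.T @ X_par + lam * np.eye(D),
                          X_par.T @ Y_par + lam * W_gen)

# Stage 3: a compact corrector fit on the remaining residual, standing in
# for the Error Reduction Network (fine-grained conversion).
resid = Y_par - X_par @ W_adapt
W_ern, *_ = np.linalg.lstsq(X_par, resid, rcond=None)

def convert(x):
    coarse = x @ W_adapt       # coarse-grained conversion
    return coarse + x @ W_ern  # fine-grained correction

err_coarse = np.mean((Y_par - X_par @ W_adapt) ** 2)
err_fine = np.mean((Y_par - convert(X_par)) ** 2)
assert err_fine <= err_coarse  # residual correction cannot hurt the training fit
```

The point of the sketch is the division of labor: plentiful non-parallel data shapes the general mapping, while the scarce parallel data is reused at every later stage (adaptation, residual correction, and, in the paper, vocoder adaptation).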
Pages: 31-43
Number of pages: 13
Related papers
50 items total
  • [31] Voice Conversion Based Data Augmentation to Improve Children's Speech Recognition in Limited Data Scenario
    Shahnawazuddin, S.
    Adiga, Nagaraj
    Kumar, Kunal
    Poddar, Aayushi
    Ahmad, Waquar
    INTERSPEECH 2020, 2020, : 4382 - 4386
  • [32] A KL Divergence and DNN-based Approach to Voice Conversion without Parallel Training Sentences
    Xie, Feng-Long
    Soong, Frank K.
    Li, Haifeng
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 287 - 291
  • [33] SPEAKER ADAPTIVE MODEL BASED ON BOLTZMANN MACHINE FOR NON-PARALLEL TRAINING IN VOICE CONVERSION
    Nakashika, Toru
    Minami, Yasuhiro
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5530 - 5534
  • [34] TTS-Guided Training for Accent Conversion Without Parallel Data
    Zhou, Yi
    Wu, Zhizheng
    Zhang, Mingyang
    Tian, Xiaohai
    Li, Haizhou
    IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 533 - 537
  • [35] VOICE VERIFICATION USING I-VECTORS AND NEURAL NETWORKS WITH LIMITED TRAINING DATA
    Mamyrbayev, O. Zh.
    Othman, M.
    Akhmediyarova, A. T.
    Kydyrbekova, A. S.
    Mekebayev, N. O.
    BULLETIN OF THE NATIONAL ACADEMY OF SCIENCES OF THE REPUBLIC OF KAZAKHSTAN, 2019, (03): : 36 - 43
  • [36] Training data selection for voice conversion using speaker selection and vector field smoothing
    Hashimoto, M
    Higuchi, N
    ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 1397 - 1400
  • [37] Improving the performance of MGM-based voice conversion by preparing training data method
    Zuo, GY
    Liu, WJ
    Ruan, XG
    2004 INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, 2004, : 181 - 184
  • [38] Adaptive Training for Voice Conversion Based on Eigenvoices
    Ohtani, Yamato
    Toda, Tomoki
    Saruwatari, Hiroshi
    Shikano, Kiyohiro
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2010, E93D (06): : 1589 - 1598
  • [39] Non-parallel Voice Conversion with Fewer Labeled Data by Conditional Generative Adversarial Networks
    Chen, Minchuan
    Hou, Weijian
    Ma, Jun
    Wang, Shaojun
    Xiao, Jing
    INTERSPEECH 2020, 2020, : 4716 - 4720
  • [40] Parallel Voice Conversion Based on a Continuous Sinusoidal Model
    Al-Radhi, Mohammed Salah
    Csapo, Tamas Gabor
    Nemeth, Geza
    2019 10TH INTERNATIONAL CONFERENCE ON SPEECH TECHNOLOGY AND HUMAN-COMPUTER DIALOGUE (SPED), 2019,