IMPROVING RECOGNITION-SYNTHESIS BASED ANY-TO-ONE VOICE CONVERSION WITH CYCLIC TRAINING

Cited by: 5
Authors
Chen, Yan-Nian [1]
Liu, Li-Juan [2]
Hu, Ya-Jun [2]
Jiang, Yuan [1,2]
Ling, Zhen-Hua [1]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei, Peoples R China
[2] iFLYTEK Co Ltd, iFLYTEK Res, Hefei, Peoples R China
Keywords
Voice conversion; any-to-one; recognition-synthesis; cyclic training; NETWORKS
DOI
10.1109/ICASSP43922.2022.9747140
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
In recognition-synthesis based any-to-one voice conversion (VC), an automatic speech recognition (ASR) model is employed to extract content-related features, and a synthesizer is built to predict the acoustic features of the target speaker from the content-related features of any source speaker at the conversion stage. Since source speakers are unknown at the training stage, we have to estimate the parameters of the synthesizer from the content-related features of the target speaker alone. This inconsistency between the conversion and training stages limits the speaker similarity of the converted speech. To address this issue, this paper proposes a cyclic training method. The method constructs pseudo-source acoustic features by converting the training data of the target speaker towards multiple speakers in a reference corpus. These pseudo-source acoustic features are then used as the input of the synthesizer at the training stage to predict the acoustic features of the target speaker, from which a cyclic reconstruction loss is derived. Experimental results show that the proposed method achieved more consistent accuracy of acoustic feature prediction across various source speakers than the baseline method. It also achieved better speaker similarity of converted speech, especially for source-target speaker pairs with distant speaker characteristics.
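As a rough illustration of the cyclic reconstruction idea, the Python sketch below contrasts a baseline training objective with a cyclic one. Everything in it is an assumption for illustration, not the authors' implementation: the abstract does not specify the loss function, so mean squared error is used; `convert`, `asr` and `synthesizer` are hypothetical callables standing in for the model that converts target speech towards reference speakers, the ASR content extractor and the target-speaker synthesizer; and the toy models treat speaker identity as a constant offset that the toy "ASR" fails to remove.

```python
from typing import Callable, Dict, List, Sequence

Frame = List[float]       # one frame of acoustic features
Utterance = List[Frame]   # a sequence of frames


def mse(pred: Utterance, ref: Utterance) -> float:
    """Mean squared error over all frames and dimensions (assumed loss form)."""
    if len(pred) != len(ref):
        raise ValueError("utterances must have the same number of frames")
    total, count = 0.0, 0
    for fp, fr in zip(pred, ref):
        for vp, vr in zip(fp, fr):
            total += (vp - vr) ** 2
            count += 1
    return total / count


def baseline_loss(target: Utterance,
                  asr: Callable[[Utterance], Utterance],
                  synthesizer: Callable[[Utterance], Utterance]) -> float:
    # Baseline: content features come from the target speaker's own speech,
    # so the synthesizer never sees content features of other speakers.
    return mse(synthesizer(asr(target)), target)


def cyclic_loss(target: Utterance,
                asr: Callable[[Utterance], Utterance],
                convert: Callable[[Utterance, str], Utterance],
                synthesizer: Callable[[Utterance], Utterance],
                reference_speakers: Sequence[str]) -> float:
    # Cyclic: convert the target utterance towards each reference speaker
    # (pseudo-source), then let the synthesizer map it back to the target.
    losses = []
    for spk in reference_speakers:
        pseudo_source = convert(target, spk)
        prediction = synthesizer(asr(pseudo_source))
        losses.append(mse(prediction, target))  # compare with the original target
    return sum(losses) / len(losses)


# Hypothetical toy stand-ins: speaker identity is a constant offset added to
# every feature value, and the toy "ASR" leaks it by passing features through.
OFFSETS: Dict[str, float] = {"spk_a": 0.5, "spk_b": -1.0}


def toy_convert(utt: Utterance, spk: str) -> Utterance:
    return [[v + OFFSETS[spk] for v in frame] for frame in utt]


def toy_asr(utt: Utterance) -> Utterance:
    return [list(frame) for frame in utt]


def toy_synthesizer(content: Utterance) -> Utterance:
    return [list(frame) for frame in content]


target = [[1.0, 2.0], [3.0, 4.0]]
print(baseline_loss(target, toy_asr, toy_synthesizer))  # 0.0
print(cyclic_loss(target, toy_asr, toy_convert, toy_synthesizer,
                  ["spk_a", "spk_b"]))  # 0.625
```

In this toy setup the synthesizer reproduces the target perfectly from the target's own content features, so the baseline loss is zero, while the cyclic loss is nonzero because speaker information leaks through the content features of the pseudo-source speech. This is the kind of training/conversion mismatch the cyclic loss is meant to expose; how the paper weights it against the ordinary reconstruction loss is not stated in the abstract.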
Pages: 7007-7011
Number of pages: 5