Non-parallel Sequence-to-Sequence Voice Conversion for Arbitrary Speakers

Cited by: 1
Authors
Zhang, Ying [1 ]
Che, Hao [1 ]
Wang, Xiaorui [1 ]
Institutions
[1] Kwai, Beijing, Peoples R China
Keywords
voice conversion; sequence-to-sequence; connectionist temporal classification; limited data;
DOI
10.1109/ISCSLP49672.2021.9362095
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Voice conversion (VC) aims to modify a speaker's timbre while preserving the linguistic content. Recent works show that voice conversion has made great progress on non-parallel data by introducing phonetic posteriorgrams (PPGs). However, when the prosody of the source and target speakers differs significantly, the quality of the converted speech degrades noticeably. To alleviate the influence of the source speaker's prosody, we propose a sequence-to-sequence voice conversion (Seq2Seq-VC) method, which takes connectionist temporal classification PPGs (CTC-PPGs) as inputs and models the non-linear length mapping between CTC-PPGs and frame-level acoustic features. CTC-PPGs are extracted by a CTC-based automatic speech recognition (CTC-ASR) model and used to replace time-aligned PPGs. The blank token is introduced in CTC-ASR outputs to mark less informative frames and to separate consecutive repeated characters. After blank tokens are removed, the remaining CTC-PPGs contain only linguistic information, and the phone-duration information of the source speech is discarded. Thus, the phone durations of the converted speech are more faithful to the target speaker, which means higher similarity to the target and less interference from different source speakers. Experimental results show that our Seq2Seq-VC model achieves higher scores in similarity and naturalness tests than the baseline method. Furthermore, we extend our Seq2Seq-VC approach to voice conversion towards arbitrary speakers with limited data. The experimental results demonstrate that our Seq2Seq-VC model can transfer to a new speaker using only 100 utterances (about 5 minutes of speech).
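The blank-removal step the abstract relies on follows the standard CTC collapse rule: merge consecutive repeated tokens, then drop blanks, leaving only the linguistic sequence without duration. A minimal illustrative sketch (not the authors' implementation — it operates on hard frame-level token IDs rather than posteriorgrams, and assumes ID 0 is the blank):

```python
def ctc_collapse(frame_tokens, blank=0):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks.

    The result keeps only the linguistic content; how long each phone
    was held in the source speech (its frame count) is discarded.
    """
    collapsed = []
    prev = None
    for tok in frame_tokens:
        if tok != prev:          # merge runs of the same token
            collapsed.append(tok)
        prev = tok
    return [t for t in collapsed if t != blank]  # remove blank frames


# Ten frames of source speech reduce to three duration-free tokens:
print(ctc_collapse([0, 7, 7, 7, 0, 0, 3, 3, 0, 5]))  # [7, 3, 5]
```

Because the collapsed sequence no longer encodes source-phone durations, a Seq2Seq model consuming it must predict target-side durations itself, which is why the abstract stresses the non-linear length mapping to frame-level acoustic features.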
Pages: 5
Related Papers (50 in total)
  • [1] Non-Parallel Sequence-to-Sequence Voice Conversion With Disentangled Linguistic and Speaker Representations
    Zhang, Jing-Xuan
    Ling, Zhen-Hua
    Dai, Li-Rong
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 : 540 - 552
  • [2] Investigation of Text-to-Speech-based Synthetic Parallel Data for Sequence-to-Sequence Non-Parallel Voice Conversion
    Ma, Ding
    Huang, Wen-Chin
    Toda, Tomoki
    [J]. 2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 870 - 877
  • [3] Non-Autoregressive Sequence-to-Sequence Voice Conversion
    Hayashi, Tomoki
    Huang, Wen-Chin
    Kobayashi, Kazuhiro
    Toda, Tomoki
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7068 - 7072
  • [4] Sequence-to-Sequence Acoustic Modeling for Voice Conversion
    Zhang, Jing-Xuan
    Ling, Zhen-Hua
    Liu, Li-Juan
    Jiang, Yuan
    Dai, Li-Rong
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (03) : 631 - 644
  • [5] Pretraining Techniques for Sequence-to-Sequence Voice Conversion
    Huang, Wen-Chin
    Hayashi, Tomoki
    Wu, Yi-Chiao
    Kameoka, Hirokazu
    Toda, Tomoki
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 745 - 755
  • [6] An Investigation of Streaming Non-Autoregressive Sequence-to-Sequence Voice Conversion
    Hayashi, Tomoki
    Kobayashi, Kazuhiro
    Toda, Tomoki
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6802 - 6806
  • [7] An Overview & Analysis of Sequence-to-Sequence Emotional Voice Conversion
    Yang, Zijiang
    Jing, Xin
    Triantafyllopoulos, Andreas
    Song, Meishu
    Aslan, Ilhan
    Schuller, Bjoern W.
    [J]. INTERSPEECH 2022, 2022, : 4915 - 4919
  • [8] Sequence-to-Sequence Emotional Voice Conversion With Strength Control
    Choi, Heejin
    Hahn, Minsoo
    [J]. IEEE ACCESS, 2021, 9 : 42674 - 42687
  • [9] Distilling Sequence-to-Sequence Voice Conversion Models for Streaming Conversion Applications
    Tanaka, Kou
    Kameoka, Hirokazu
    Kaneko, Takuhiro
    Seki, Shogo
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 1022 - 1028
  • [10] Mandarin Electrolaryngeal Speech Voice Conversion with Sequence-to-Sequence Modeling
    Yen, Ming-Chi
    Huang, Wen-Chin
    Kobayashi, Kazuhiro
    Peng, Yu-Huai
    Tsai, Shu-Wei
    Tsao, Yu
    Toda, Tomoki
    Jang, Jyh-Shing Roger
    Wang, Hsin-Min
    [J]. 2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 650 - 657