Non-parallel Sequence-to-Sequence Voice Conversion for Arbitrary Speakers

Cited by: 1
Authors
Zhang, Ying [1 ]
Che, Hao [1 ]
Wang, Xiaorui [1 ]
Institutions
[1] Kwai, Beijing, Peoples R China
Keywords
voice conversion; sequence-to-sequence; connectionist temporal classification; limited data;
DOI
10.1109/ISCSLP49672.2021.9362095
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Voice conversion (VC) aims to modify a speaker's timbre while preserving the linguistic content. Recent works show that voice conversion has made great progress on non-parallel data by introducing phonetic posteriorgrams (PPGs). However, when the prosody of the source and target speakers differs significantly, the quality of the converted speech degrades noticeably. To alleviate the influence of the source speaker's prosody, we propose a sequence-to-sequence voice conversion (Seq2Seq-VC) method, which takes connectionist temporal classification PPGs (CTC-PPGs) as inputs and models the non-linear length mapping between CTC-PPGs and frame-level acoustic features. CTC-PPGs are extracted by a CTC-based automatic speech recognition (CTC-ASR) model and used to replace time-aligned PPGs. The blank token is introduced in CTC-ASR outputs to mark less informative frames and to separate consecutive repeated characters. After blank tokens are removed, the remaining CTC-PPGs contain only linguistic information, and the phone-duration information of the source speech is discarded. Thus, the phone durations of the converted speech are more faithful to the target speaker, which means higher similarity to the target and less interference from different source speakers. Experimental results show that our Seq2Seq-VC model achieves higher scores in similarity and naturalness tests than the baseline method. Furthermore, we extend our Seq2Seq-VC approach to voice conversion towards arbitrary speakers with limited data. The experimental results demonstrate that our Seq2Seq-VC model can transfer to a new speaker using only 100 utterances (about 5 minutes of speech).
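The blank-removal step the abstract relies on follows the standard CTC collapse rule: merge consecutive repeated tokens, then drop blanks, leaving only the linguistic sequence without duration. A minimal illustrative sketch (not the authors' implementation — it operates on hard frame-level token IDs rather than posteriorgrams, and assumes ID 0 is the blank):

```python
def ctc_collapse(frame_tokens, blank=0):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks.

    The result keeps only the linguistic content; how long each phone
    was held in the source speech (its frame count) is discarded.
    """
    collapsed = []
    prev = None
    for tok in frame_tokens:
        if tok != prev:          # merge runs of the same token
            collapsed.append(tok)
        prev = tok
    return [t for t in collapsed if t != blank]  # remove blank frames


# Ten frames of source speech reduce to three duration-free tokens:
print(ctc_collapse([0, 7, 7, 7, 0, 0, 3, 3, 0, 5]))  # [7, 3, 5]
```

Because the collapsed sequence no longer encodes source-phone durations, a Seq2Seq model consuming it must predict target-side durations itself, which is why the abstract stresses the non-linear length mapping to frame-level acoustic features.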
Pages: 5
Related Papers (50 in total)
  • [1] Non-Parallel Sequence-to-Sequence Voice Conversion With Disentangled Linguistic and Speaker Representations
    Zhang, Jing-Xuan
    Ling, Zhen-Hua
    Dai, Li-Rong
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 : 540 - 552
  • [2] Investigation of Text-to-Speech-based Synthetic Parallel Data for Sequence-to-Sequence Non-Parallel Voice Conversion
    Ma, Ding
    Huang, Wen-Chin
    Toda, Tomoki
    [J]. 2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 870 - 877
  • [3] Non-Autoregressive Sequence-to-Sequence Voice Conversion
    Hayashi, Tomoki
    Huang, Wen-Chin
    Kobayashi, Kazuhiro
    Toda, Tomoki
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7068 - 7072
  • [4] Sequence-to-Sequence Acoustic Modeling for Voice Conversion
    Zhang, Jing-Xuan
    Ling, Zhen-Hua
    Liu, Li-Juan
    Jiang, Yuan
    Dai, Li-Rong
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (03) : 631 - 644
  • [5] Pretraining Techniques for Sequence-to-Sequence Voice Conversion
    Huang, Wen-Chin
    Hayashi, Tomoki
    Wu, Yi-Chiao
    Kameoka, Hirokazu
    Toda, Tomoki
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 745 - 755
  • [6] An Investigation of Streaming Non-Autoregressive Sequence-to-Sequence Voice Conversion
    Hayashi, Tomoki
    Kobayashi, Kazuhiro
    Toda, Tomoki
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6802 - 6806
  • [7] An Overview & Analysis of Sequence-to-Sequence Emotional Voice Conversion
    Yang, Zijiang
    Jing, Xin
    Triantafyllopoulos, Andreas
    Song, Meishu
    Aslan, Ilhan
    Schuller, Bjoern W.
    [J]. INTERSPEECH 2022, 2022, : 4915 - 4919
  • [8] Sequence-to-Sequence Emotional Voice Conversion With Strength Control
    Choi, Heejin
    Hahn, Minsoo
    [J]. IEEE ACCESS, 2021, 9 : 42674 - 42687
  • [9] Distilling Sequence-to-Sequence Voice Conversion Models for Streaming Conversion Applications
    Tanaka, Kou
    Kameoka, Hirokazu
    Kaneko, Takuhiro
    Seki, Shogo
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 1022 - 1028
  • [10] Mandarin Electrolaryngeal Speech Voice Conversion with Sequence-to-Sequence Modeling
    Yen, Ming-Chi
    Huang, Wen-Chin
    Kobayashi, Kazuhiro
    Peng, Yu-Huai
    Tsai, Shu-Wei
    Tsao, Yu
    Toda, Tomoki
    Jang, Jyh-Shing Roger
    Wang, Hsin-Min
    [J]. 2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 650 - 657