GENERATING MULTILINGUAL VOICES USING SPEAKER SPACE TRANSLATION BASED ON BILINGUAL SPEAKER DATA

Cited by: 0

Authors:

Maiti, Soumi [1,2]

Marchi, Erik [1]

Conkie, Alistair [1]

Affiliations:

[1] Apple, Cupertino, CA USA

[2] CUNY, Grad Ctr, New York, NY 10021 USA

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020

Keywords:

cross-lingual transfer; d-vector; speaker space manipulation; bilingual speaker; text-to-speech synthesis;

DOI:

10.1109/icassp40776.2020.9054305

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

We present progress towards bilingual Text-to-Speech which is able to transform a monolingual voice to speak a second language while preserving speaker voice quality. We demonstrate that a bilingual speaker embedding space contains a separate distribution for each language and that a simple transform in speaker space generated by the speaker embedding can be used to control the degree of accent of a synthetic voice in a language. The same transform can be applied even to monolingual speakers. In our experiments speaker data from an English-Spanish (Mexican) bilingual speaker was used, and the goal was to enable English speakers to speak Spanish and Spanish speakers to speak English. We found that the simple transform was sufficient to convert a voice from one language to the other with a high degree of naturalness. In one case the transformed voice outperformed a native language voice in listening tests. Experiments further indicated that the transform preserved many of the characteristics of the original voice. The degree of accent present can be controlled and naturalness is relatively consistent across a range of accent values.
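
The abstract does not spell out the transform, so the following is only a minimal sketch of one plausible reading: a translation in speaker space along the difference between the bilingual speaker's per-language embedding centroids, scaled by a factor that sets the degree of accent. The function names, the `alpha` parameter, and the toy 2-D vectors are all hypothetical and are not taken from the paper; real d-vectors would come from a trained speaker encoder.

```python
from statistics import fmean
from typing import List, Sequence

Vector = List[float]


def centroid(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a set of equal-length embeddings."""
    return [fmean(column) for column in zip(*vectors)]


def language_offset(source_vecs: Sequence[Sequence[float]],
                    target_vecs: Sequence[Sequence[float]]) -> Vector:
    """Assumed transform: target-language centroid minus source-language
    centroid, both computed from one bilingual speaker's embeddings."""
    src = centroid(source_vecs)
    tgt = centroid(target_vecs)
    return [t - s for s, t in zip(src, tgt)]


def translate_speaker(embedding: Sequence[float],
                      offset: Sequence[float],
                      alpha: float) -> Vector:
    """Shift an embedding along the offset. In this sketch alpha = 0 leaves
    the voice unchanged and alpha = 1 applies the full language shift;
    values in between stand in for intermediate degrees of accent."""
    return [e + alpha * o for e, o in zip(embedding, offset)]


# Toy embeddings for the bilingual speaker (hypothetical values).
bilingual_english = [[0.0, 1.0], [0.5, 1.0]]
bilingual_spanish = [[1.0, 1.0], [1.5, 1.0]]
offset = language_offset(bilingual_english, bilingual_spanish)

# A monolingual English speaker's embedding (hypothetical values),
# shifted halfway and then fully toward the Spanish region.
monolingual_english = [0.25, 2.0]
half_shift = translate_speaker(monolingual_english, offset, alpha=0.5)
full_shift = translate_speaker(monolingual_english, offset, alpha=1.0)
```

Only the first coordinate changes in this toy case because that is the only axis on which the two language clusters differ; the second coordinate, standing in for speaker identity, is untouched, which mirrors the abstract's claim that the transform preserves many characteristics of the original voice.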


Pages: 7624-7628

Page count: 5
