Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS

被引：1

作者：

Udagawa, Kenta ^{[1
]}

Saito, Yuki ^{[1
]}

Saruwatari, Hiroshi ^{[1
]}

机构：

[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan

来源：

INTERSPEECH 2022 | 2022年

关键词：

DNN-based multi-speaker TTS; speaker adaptation; human-computer interaction; Bayesian optimization;

D O I：

10.21437/Interspeech.2022-257

中图分类号：

O42 [声学];

学科分类号：

070206 ; 082403 ;

摘要：

This paper proposes a human-in-the-loop speaker-adaptation method for multi-speaker text-to-speech. With a conventional speaker-adaptation method, a target speaker's embedding vector is extracted from his/her reference speech using a speaker encoder trained on a speaker-discriminative task. However, this method cannot obtain an embedding vector for the target speaker when the reference speech is unavailable. Our method is based on a human-in-the-loop optimization framework, which incorporates a user to explore the speaker-embedding space to find the target speaker's embedding. The proposed method uses a sequential line search algorithm that repeatedly asks a user to select a point on a line segment in the embedding space. To efficiently choose the best speech sample from multiple stimuli, we also developed a system in which a user can switch between multiple speakers' voices for each phoneme. Experimental results indicate that the proposed method can achieve comparable performance to the conventional one in objective and subjective evaluations even if reference speech is not used as the input of a speaker encoder directly.

引用

页码：2968 / 2972

页数：5

共 50 条

[21] TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS
Zhang, Xulong
Wang, Jianzong
Cheng, Ning
Xiao, Jing
[J]. 2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
[22] Keyword-based speaker localization: Localizing a target speaker in a multi-speaker environment
Sivasankaran, Sunit
Vincent, Emmanuel
Fohr, Dominique
[J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2703 - 2707
[23] PHONEME DEPENDENT SPEAKER EMBEDDING AND MODEL FACTORIZATION FOR MULTI-SPEAKER SPEECH SYNTHESIS AND ADAPTATION
Fu, Ruibo
Tao, Jianhua
Wen, Zhengqi
Zheng, Yibin
[J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6930 - 6934
[24] Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows
Valles-Perez, Ivan
Roth, Julian
Beringer, Grzegorz
Barra-Chicote, Roberto
Droppo, Jasha
[J]. INTERSPEECH 2021, 2021, : 3131 - 3135
[25] Sparse DNN-based speaker segmentation using side information
Ma, Yong
Bao, Chang-Chun
[J]. ELECTRONICS LETTERS, 2015, 51 (08) : 651 - 653
[26] An Investigation of DNN-Based Speech Synthesis Using Speaker Codes
Hojo, Nobukatsu
Ijima, Yusuke
Mizuno, Hideyuki
[J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 2278 - 2282
[27] Improving Multi-Speaker Tacotron with Speaker Gating Mechanisms
Zhao, Wei
Xu, Li
He, Ting
[J]. PROCEEDINGS OF THE 39TH CHINESE CONTROL CONFERENCE, 2020, : 7498 - 7503
[28] Zero-shot multi-speaker accent TTS with limited accent data
Zhang, Mingyang
Zhou, Yi
Wu, Zhizheng
Li, Haizhou
[J]. 2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1931 - 1936
[29] A hybrid approach to speaker recognition in multi-speaker environment
Trivedi, J
Maitra, A
Mitra, SK
[J]. PATTERN RECOGNITION AND MACHINE INTELLIGENCE, PROCEEDINGS, 2005, 3776 : 272 - 275
[30] CAN WE USE COMMON VOICE TO TRAIN A MULTI-SPEAKER TTS SYSTEM?
Ogun, Sewade
Colotte, Vincent
Vincent, Emmanuel
[J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 900 - 905

← 1 2 3 4 5 →