Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS

Cited by: 1
Authors
Udagawa, Kenta [1 ]
Saito, Yuki [1 ]
Saruwatari, Hiroshi [1 ]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan
Source
INTERSPEECH 2022
Keywords
DNN-based multi-speaker TTS; speaker adaptation; human-computer interaction; Bayesian optimization;
DOI
10.21437/Interspeech.2022-257
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
This paper proposes a human-in-the-loop speaker-adaptation method for multi-speaker text-to-speech. With a conventional speaker-adaptation method, a target speaker's embedding vector is extracted from his/her reference speech using a speaker encoder trained on a speaker-discriminative task. However, this method cannot obtain an embedding vector for the target speaker when the reference speech is unavailable. Our method is based on a human-in-the-loop optimization framework in which a user explores the speaker-embedding space to find the target speaker's embedding. The proposed method uses a sequential line search algorithm that repeatedly asks the user to select a point on a line segment in the embedding space. To let the user efficiently choose the best speech sample from multiple stimuli, we also developed a system in which the user can switch between multiple speakers' voices for each phoneme. Experimental results indicate that the proposed method achieves performance comparable to the conventional method in both objective and subjective evaluations, even though reference speech is never directly fed to a speaker encoder.
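The core loop described above can be illustrated with a toy sketch. The code below is not the paper's method: it replaces the Bayesian sequential line search with a plain grid search along random line segments, and it replaces the human listener with a simulated chooser that knows a hidden target embedding. Function names (`simulated_user`, `sequential_line_search`) and all parameters are illustrative assumptions.

```python
import numpy as np

def simulated_user(candidates, target):
    """Stand-in for the human listener: picks the candidate closest to a
    hidden target embedding. A real system would synthesize speech for
    each candidate and let the user choose the best-sounding one by ear."""
    dists = np.linalg.norm(candidates - target, axis=1)
    return int(np.argmin(dists))

def sequential_line_search(choose, dim, n_iters=50, n_points=11, seed=0):
    """Toy human-in-the-loop search over a speaker-embedding space.

    Each iteration draws a random direction, samples points on a line
    segment through the current estimate, and asks `choose` (the user)
    to pick one. The current point (t = 0) is always among the
    candidates, so the user's choice can never make things worse."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)                      # initial embedding guess
    for _ in range(n_iters):
        direction = rng.normal(size=dim)
        direction /= np.linalg.norm(direction)    # unit-length direction
        ts = np.linspace(-1.0, 1.0, n_points)     # includes t = 0
        candidates = x + ts[:, None] * direction  # points on the segment
        x = candidates[choose(candidates)]        # user picks; we move there
    return x
```

With the simulated chooser, the estimate drifts toward the hidden target over the iterations; in the paper's setting the "target" exists only in the user's ears, and a Bayesian model chooses each successive line segment far more sample-efficiently than random directions.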
Pages: 2968-2972 (5 pages)