Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS

Cited by: 1
Authors
Udagawa, Kenta [1]
Saito, Yuki [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan
Source
Keywords
DNN-based multi-speaker TTS; speaker adaptation; human-computer interaction; Bayesian optimization;
DOI
10.21437/Interspeech.2022-257
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206; 082403;
Abstract
This paper proposes a human-in-the-loop speaker-adaptation method for multi-speaker text-to-speech. With a conventional speaker-adaptation method, a target speaker's embedding vector is extracted from his/her reference speech using a speaker encoder trained on a speaker-discriminative task. However, this method cannot obtain an embedding vector for the target speaker when the reference speech is unavailable. Our method is based on a human-in-the-loop optimization framework, which incorporates a user to explore the speaker-embedding space to find the target speaker's embedding. The proposed method uses a sequential line search algorithm that repeatedly asks a user to select a point on a line segment in the embedding space. To efficiently choose the best speech sample from multiple stimuli, we also developed a system in which a user can switch between multiple speakers' voices for each phoneme. Experimental results indicate that the proposed method can achieve comparable performance to the conventional one in objective and subjective evaluations even if reference speech is not used as the input of a speaker encoder directly.
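The search loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Everything here is an assumption made for illustration: the function names, the 16-dimensional embedding space, and the simulated listener that picks the candidate closest to a hidden target. The sketch also draws each line segment's far endpoint at random, whereas the keywords indicate the paper uses Bayesian optimization for this, and it omits speech synthesis entirely.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def simulated_user_pick(candidates, target):
    """Hypothetical stand-in for the human listener.

    Returns the index of the candidate closest to a hidden target embedding.
    A real user would instead listen to speech synthesized from each candidate.
    """
    return min(range(len(candidates)), key=lambda i: distance(candidates[i], target))


def line_search_adaptation(dim=16, n_iters=30, n_points=11, seed=0):
    """Repeatedly ask the (simulated) user to pick a point on a line segment."""
    rng = random.Random(seed)
    target = [rng.uniform(-1, 1) for _ in range(dim)]  # hidden target embedding
    current = [0.0] * dim                              # starting embedding
    for _ in range(n_iters):
        # Assumption: a random endpoint replaces the Bayesian-optimization step.
        endpoint = [rng.uniform(-1, 1) for _ in range(dim)]
        candidates = []
        for k in range(n_points):
            t = k / (n_points - 1)  # t = 0 keeps the current best point
            candidates.append([c + t * (e - c) for c, e in zip(current, endpoint)])
        current = candidates[simulated_user_pick(candidates, target)]
    return current, target


if __name__ == "__main__":
    found, target = line_search_adaptation()
    print("initial distance:", round(distance([0.0] * len(target), target), 3))
    print("final distance:  ", round(distance(found, target), 3))
```

Because every segment starts at the current best point, the simulated listener can always keep it, so the distance to the target never increases from one query to the next.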
Pages: 2968-2972
Page count: 5
Related papers
50 in total
  • [21] TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS
    Zhang, Xulong
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    [J]. 2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [22] Keyword-based speaker localization: Localizing a target speaker in a multi-speaker environment
    Sivasankaran, Sunit
    Vincent, Emmanuel
    Fohr, Dominique
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2703 - 2707
  • [23] PHONEME DEPENDENT SPEAKER EMBEDDING AND MODEL FACTORIZATION FOR MULTI-SPEAKER SPEECH SYNTHESIS AND ADAPTATION
    Fu, Ruibo
    Tao, Jianhua
    Wen, Zhengqi
    Zheng, Yibin
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6930 - 6934
  • [24] Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows
    Valles-Perez, Ivan
    Roth, Julian
    Beringer, Grzegorz
    Barra-Chicote, Roberto
    Droppo, Jasha
    [J]. INTERSPEECH 2021, 2021, : 3131 - 3135
  • [25] Sparse DNN-based speaker segmentation using side information
    Ma, Yong
    Bao, Chang-Chun
    [J]. ELECTRONICS LETTERS, 2015, 51 (08) : 651 - 653
  • [26] An Investigation of DNN-Based Speech Synthesis Using Speaker Codes
    Hojo, Nobukatsu
    Ijima, Yusuke
    Mizuno, Hideyuki
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 2278 - 2282
  • [27] Improving Multi-Speaker Tacotron with Speaker Gating Mechanisms
    Zhao, Wei
    Xu, Li
    He, Ting
    [J]. PROCEEDINGS OF THE 39TH CHINESE CONTROL CONFERENCE, 2020, : 7498 - 7503
  • [28] Zero-shot multi-speaker accent TTS with limited accent data
    Zhang, Mingyang
    Zhou, Yi
    Wu, Zhizheng
    Li, Haizhou
    [J]. 2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1931 - 1936
  • [29] A hybrid approach to speaker recognition in multi-speaker environment
    Trivedi, J
    Maitra, A
    Mitra, SK
    [J]. PATTERN RECOGNITION AND MACHINE INTELLIGENCE, PROCEEDINGS, 2005, 3776 : 272 - 275
  • [30] CAN WE USE COMMON VOICE TO TRAIN A MULTI-SPEAKER TTS SYSTEM?
    Ogun, Sewade
    Colotte, Vincent
    Vincent, Emmanuel
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 900 - 905