Normalization Driven Zero-shot Multi-Speaker Speech Synthesis

被引:5
|
作者
Kumar, Neeraj [1 ,2 ]
Goel, Srishti [1 ]
Narang, Ankur [1 ]
Lall, Brejesh [2 ]
机构
[1] Hike Private Ltd, New Delhi, India
[2] Indian Inst Technol, Delhi, India
来源
关键词
Speech synthesis; normalization; transfer learning; wav2vec2.0 based speaker encoder; angular softmax;
D O I
10.21437/Interspeech.2021-441
中图分类号
R36 [病理学]; R76 [耳鼻咽喉科学];
学科分类号
100104 ; 100213 ;
摘要
In this paper, we present a novel zero-shot multi-speaker speech synthesis approach (ZSM-SS) that leverages the normalization architecture and speaker encoder with non-autoregressive multi-head attention driven encoder-decoder architecture.Given an input text and a reference speech sample of an unseen person, ZSM-SS can generate speech in that person's style in a zero-shot manner. Additionally, we demonstrate how the affine parameters of normalization help in capturing the prosodic features such as energy and fundamental frequency in a disentangled fashion and can be used to generate morphed speech output. We demonstrate the efficacy of our proposed architecture on multi-speaker VCTK[1] and LibriTTS [2] datasets, using multiple quantitative metrics that measure generated speech distortion and MOS, along with speaker embedding analysis of the proposed speaker encoder model.
引用
收藏
页码:1354 / 1358
页数:5
相关论文
共 50 条
  • [1] Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis
    Kumar, Neeraj
    Narang, Ankur
    Lall, Brejesh
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1679 - 1693
  • [2] Towards Zero-Shot Multi-Speaker Multi-Accent Text-to-Speech Synthesis
    Zhang, Mingyang
    Zhou, Xuehao
    Wu, Zhizheng
    Li, Haizhou
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 947 - 951
  • [3] Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations
    Jeon, Yejin
    Kim, Yunsu
    Lee, Gary Geunbae
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 18336 - 18344
  • [4] YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone
    Casanova, Edresson
    Weber, Julian
    Shulby, Christopher
    Candido Junior, Arnaldo
    Goelge, Eren
    Ponti, Moacir Antonelli
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [5] ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH WITH STATE-OF-THE-ART NEURAL SPEAKER EMBEDDINGS
    Cooper, Erica
    Lai, Cheng-, I
    Yasuda, Yusuke
    Fang, Fuming
    Wang, Xin
    Chen, Nanxin
    Yamagishi, Junichi
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6184 - 6188
  • [6] Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder
    Bang, Chae-Woon
    Chun, Chanjun
    [J]. SENSORS, 2023, 23 (23)
  • [7] SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
    Casanova, Edresson
    Shulby, Christopher
    Golge, Eren
    Muller, Nicolas Michael
    de Oliveira, Frederico Santos
    Candido Junior, Arnaldo
    Soares, Anderson da Silva
    Aluisio, Sandra Maria
    Ponti, Moacir Antonelli
    [J]. INTERSPEECH 2021, 2021, : 3645 - 3649
  • [8] Multi-Scale Speaker Vectors for Zero-Shot Speech Synthesis
    Cory, Tristin
    Iqbal, Razib
    [J]. 2022 IEEE 46TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE (COMPSAC 2022), 2022, : 496 - 501
  • [9] NNSPEECH: SPEAKER-GUIDED CONDITIONAL VARIATIONAL AUTOENCODER FOR ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH
    Zhao, Botao
    Zhang, Xulong
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4293 - 4297
  • [10] Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech
    Choi, Byoung Jin
    Jeong, Myeonghun
    Kim, Minchan
    Mun, Sung Hwan
    Kim, Nam Soo
    [J]. PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022, : 1708 - 1712