Towards Zero-Shot Multi-Speaker Multi-Accent Text-to-Speech Synthesis

被引:2
|
作者
Zhang, Mingyang [1 ]
Zhou, Xuehao [2 ]
Wu, Zhizheng [1 ]
Li, Haizhou [1 ,2 ]
机构
[1] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Sch Data Sci, Shenzhen 518172, Peoples R China
[2] Natl Univ Singapore, Singapore 117583, Singapore
基金
中国国家自然科学基金;
关键词
Accent speech synthesis; limited data; multi accent modelling; text-to-speech;
D O I
10.1109/LSP.2023.3292740
中图分类号
TM [电工技术]; TN [电子技术、通信技术];
学科分类号
0808 ; 0809 ;
摘要
This letter presents a framework towards multi-accent neural text-to-speech synthesis for zero-shot multi-speaker, which employs an encoder-decoder architecture and an accent classifier to control the pronunciation variation from the encoder. The encoder and decoder are pre-trained on a large-scale multi-speaker corpus. The accent-informed encoder outputs are taken by the attention-based decoder to generate accented prosody. This framework allows for fine-tuning with limited training data from multiple accents, and is able to generate accented speech for unseen speakers. Both objective and subjective evaluations confirm the effectiveness of the proposed framework.
引用
收藏
页码:947 / 951
页数:5
相关论文
共 50 条
  • [1] Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis
    Kumar, Neeraj
    Narang, Ankur
    Lall, Brejesh
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1679 - 1693
  • [2] ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH WITH STATE-OF-THE-ART NEURAL SPEAKER EMBEDDINGS
    Cooper, Erica
    Lai, Cheng-, I
    Yasuda, Yusuke
    Fang, Fuming
    Wang, Xin
    Chen, Nanxin
    Yamagishi, Junichi
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6184 - 6188
  • [3] SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
    Casanova, Edresson
    Shulby, Christopher
    Golge, Eren
    Muller, Nicolas Michael
    de Oliveira, Frederico Santos
    Candido Junior, Arnaldo
    Soares, Anderson da Silva
    Aluisio, Sandra Maria
    Ponti, Moacir Antonelli
    [J]. INTERSPEECH 2021, 2021, : 3645 - 3649
  • [4] Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder
    Bang, Chae-Woon
    Chun, Chanjun
    [J]. SENSORS, 2023, 23 (23)
  • [5] NNSPEECH: SPEAKER-GUIDED CONDITIONAL VARIATIONAL AUTOENCODER FOR ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH
    Zhao, Botao
    Zhang, Xulong
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4293 - 4297
  • [6] Normalization Driven Zero-shot Multi-Speaker Speech Synthesis
    Kumar, Neeraj
    Goel, Srishti
    Narang, Ankur
    Lall, Brejesh
    [J]. INTERSPEECH 2021, 2021, : 1354 - 1358
  • [7] Zero-shot multi-speaker accent TTS with limited accent data
    Zhang, Mingyang
    Zhou, Yi
    Wu, Zhizheng
    Li, Haizhou
    [J]. 2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1931 - 1936
  • [8] Transfer Learning for Low-Resource, Multi-Lingual, and Zero-Shot Multi-Speaker Text-to-Speech
    Jeong, Myeonghun
    Kim, Minchan
    Choi, Byoung Jin
    Yoon, Jaesam
    Jang, Won
    Kim, Nam Soo
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1519 - 1530
  • [9] Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech
    Choi, Byoung Jin
    Jeong, Myeonghun
    Kim, Minchan
    Mun, Sung Hwan
    Kim, Nam Soo
    [J]. PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022, : 1708 - 1712
  • [10] SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems
    Yoon, Hyungchan
    Kim, Changhwan
    Um, Seyun
    Yoon, Hyun-Wook
    Kang, Hong-Goo
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 593 - 597