Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis

Cited by: 2
Authors
Zhou, Yixuan [1 ,4 ]
Song, Changhe [1 ]
Li, Jingbei [1 ]
Wu, Zhiyong [1 ,2 ]
Bian, Yanyao [3 ]
Su, Dan [3 ]
Meng, Helen [2 ]
Affiliations
[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[3] Tencent, Tencent AI Lab, Shenzhen, Peoples R China
[4] Tencent, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
expressive speech synthesis; semantic representation enhancing; dependency parsing; graph neural network;
DOI
10.21437/Interspeech.2022-10061
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Exploiting the rich linguistic information in raw text is crucial for expressive text-to-speech (TTS). With the development of large-scale pre-trained text representations, bidirectional encoder representations from Transformers (BERT) has been shown to embody semantic information and has recently been applied to TTS. However, original or simply fine-tuned BERT embeddings still cannot provide the full semantic knowledge that expressive TTS models should take into account. In this paper, we propose a word-level semantic representation enhancing method based on dependency structure and pre-trained BERT embeddings. The BERT embedding of each word is reprocessed considering its specific dependencies and related words in the sentence, to generate a more effective semantic representation for TTS. To better utilize the dependency structure, a relational gated graph network (RGGN) is introduced to make semantic information flow and aggregate through the dependency structure. Experimental results show that the proposed method further improves the naturalness and expressiveness of synthesized speech on both Mandarin and English datasets.
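The core idea of the abstract can be illustrated with a minimal sketch: word embeddings (standing in for BERT vectors) are updated by passing messages along dependency edges, with one weight matrix per dependency relation and a sigmoid gate blending the aggregated message into the old state. This is an illustrative, simplified gated graph step, not the authors' actual RGGN implementation; the function name, edge format, and gating form are assumptions.

```python
import numpy as np

def rggn_layer(h, edges, rel_weights):
    """One gated message-passing step over a dependency graph (sketch).

    h           : (num_words, d) array of word embeddings (BERT stand-in).
    edges       : list of (head, dependent, relation) triples from a parser.
    rel_weights : dict mapping relation name -> (d, d) weight matrix,
                  i.e. one transform per dependency type.
    Messages flow head -> dependent; a simplified GRU-style update gate
    decides how much of the aggregated message replaces the old state.
    """
    msg = np.zeros_like(h)
    for head, dep, rel in edges:
        # Transform the head's embedding by the relation-specific matrix
        # and accumulate it at the dependent word.
        msg[dep] += h[head] @ rel_weights[rel]
    gate = 1.0 / (1.0 + np.exp(-msg))      # elementwise sigmoid gate
    return gate * msg + (1.0 - gate) * h   # gated blend of message and state
```

For example, for "the cat sleeps" one could pass edges `[(1, 0, "det"), (2, 1, "nsubj")]` so that the determiner and subject receive information from their heads, while the root word (which has no incoming edge) is only rescaled by the gate.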
Pages: 5518-5522
Page count: 5
Related Papers
47 records
  • [31] Enhancing Local Dependencies for Transformer-Based Text-to-Speech via Hybrid Lightweight Convolution
    Zhao, Wei
    He, Ting
    Xu, Li
    IEEE ACCESS, 2021, 9 : 42762 - 42770
  • [32] ZERO-SHOT TEXT-TO-SPEECH SYNTHESIS CONDITIONED USING SELF-SUPERVISED SPEECH REPRESENTATION MODEL
    Fujita, Kenichi
    Ashihara, Takanori
    Kanagawa, Hiroki
    Moriya, Takafumi
    Ijima, Yusuke
    2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW, 2023,
  • [33] Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation
    Tu, Tao
    Chen, Yuan-Jui
    Liu, Alexander H.
    Lee, Hung-yi
    INTERSPEECH 2020, 2020, : 3191 - 3195
  • [34] MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis
    Guan, Wenhao
    Li, Yishuang
    Li, Tao
    Huang, Hukai
    Wang, Feng
    Lin, Jiayan
    Huang, Lingyan
    Li, Lin
    Hong, Qingyang
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 18117 - 18125
  • [35] NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality
    Tan, Xu
    Chen, Jiawei
    Liu, Haohe
    Cong, Jian
    Zhang, Chen
    Liu, Yanqing
    Wang, Xi
    Leng, Yichong
    Yi, Yuanhao
    He, Lei
    Zhao, Sheng
    Qin, Tao
    Soong, Frank
    Liu, Tie-Yan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (06) : 4234 - 4245
  • [36] Low-level articulatory synthesis: A working text-to-speech solution and a linguistic tool
    Hill, David R.
    Taube-Schock, Craig R.
    Manzara, Leonard
    CANADIAN JOURNAL OF LINGUISTICS-REVUE CANADIENNE DE LINGUISTIQUE, 2017, 62 (03): : 371 - 410
  • [37] Learning emotions latent representation with CVAE for text-driven expressive audiovisual speech synthesis
    Dahmani, Sara
    Colotte, Vincent
    Girard, Valerian
    Ouni, Slim
    NEURAL NETWORKS, 2021, 141 (141) : 315 - 329
  • [38] Which Resemblance is Useful to Predict Phrase Boundary Rise Labels for Japanese Expressive Text-to-speech Synthesis, Numerically-Expressed Stylistic or Distribution-based Semantic?
    Nakajima, Hideharu
    Mizuno, Hideyuki
    Yoshioka, Osamu
    Takahashi, Satoshi
    14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 1046 - 1050
  • [39] Investigation of Using Continuous Representation of Various Linguistic Units in Neural Network Based Text-to-Speech Synthesis
    Wang, Xin
    Takaki, Shinji
    Yamagishi, Junichi
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2016, E99D (10): : 2471 - 2480
  • [40] Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis
    Ribeiro, Manuel Sam
    Watts, Oliver
    Yamagishi, Junichi
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 3186 - 3190