Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing

Cited by: 66
Authors
Tachibana, M [1 ]
Yamagishi, J [1 ]
Masuko, T [1 ]
Kobayashi, T [1 ]
Affiliations
[1] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan
Keywords
HMM-based speech synthesis; speaking style; emotional expression; style interpolation; style morphing; hidden semi-Markov model (HSMM);
DOI
10.1093/ietisy/e88-d.11.2484
Chinese Library Classification (CLC): TP [Automation and computer technology]
Discipline code: 0812
Abstract
This paper describes an approach to generating speech with emotional expressivity and speaking style variability. The approach is based on a speaking style and emotional expression modeling technique for HMM-based speech synthesis. We first model several representative styles, each of which is a speaking style and/or an emotional expression, in an HMM-based speech synthesis framework. Then, to generate synthetic speech with an intermediate style from representative ones, we synthesize speech from a model obtained by interpolating representative style models using a model interpolation technique. We assess the style interpolation technique with subjective evaluation tests using four representative styles, i.e., neutral, joyful, sad, and rough in read speech and synthesized speech from models obtained by interpolating models for all combinations of two styles. The results show that speech synthesized from the interpolated model has a style in between the two representative ones. Moreover, we can control the degree of expressivity for speaking styles or emotions in synthesized speech by changing the interpolation ratio in interpolation between neutral and other representative styles. We also show that we can achieve style morphing in speech synthesis, namely, changing style smoothly from one representative style to another by gradually changing the interpolation ratio.
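The style interpolation described in the abstract can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes each representative style model reduces to per-state Gaussian output distributions (mean and variance vectors), and it uses linear interpolation of means with squared-weight interpolation of variances, one common choice in HMM model interpolation. The model values below are made up for illustration.

```python
import numpy as np

def interpolate_styles(models, ratios):
    """Interpolate per-state Gaussian parameters of representative style models.

    models: list of (means, variances) NumPy arrays, one pair per style.
    ratios: interpolation weights; must sum to 1.
    Returns the (means, variances) of the interpolated style model.
    """
    ratios = np.asarray(ratios, dtype=float)
    assert np.isclose(ratios.sum(), 1.0), "interpolation ratios must sum to 1"
    # Means are combined linearly with the interpolation ratios.
    means = sum(a * m for a, (m, _) in zip(ratios, models))
    # Variances are combined with squared weights (one common convention).
    variances = sum(a**2 * v for a, (_, v) in zip(ratios, models))
    return means, variances

# Hypothetical per-state parameters for two representative styles.
neutral = (np.zeros(3), np.ones(3))
joyful = (np.full(3, 2.0), np.full(3, 4.0))

# Style morphing: sweep the ratio gradually from neutral to joyful.
for alpha in np.linspace(0.0, 1.0, 5):
    mu, var = interpolate_styles([neutral, joyful], [1.0 - alpha, alpha])
```

Changing `alpha` plays the role of the interpolation ratio in the paper: 0 yields the neutral model, 1 the joyful model, and intermediate values give models whose synthesized speech should fall perceptually between the two.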
Pages: 2484–2491 (8 pages)
Related papers (29 total)
  • [21] Filntisis, Panagiotis P.; Katsamanis, Athanasios; Maragos, Petros. Photorealistic Adaptation and Interpolation of Facial Expressions Using HMMs and AAMs for Audio-Visual Speech Synthesis. 2017 24th IEEE International Conference on Image Processing (ICIP), 2017: 2941–2945.
  • [22] Lei, Shun; Zhou, Yixuan; Chen, Liyang; Hu, Jiankun; Wu, Zhiyong; Kang, Shiyin; Meng, Helen. Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis. INTERSPEECH 2022, 2022: 5523–5527.
  • [23] Meng, Yi; Li, Xiang; Wu, Zhiyong; Li, Tingtian; Sun, Zixun; Xiao, Xinyu; Sun, Chi; Zhan, Hui; Meng, Helen. CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis. INTERSPEECH 2022, 2022: 5533–5537.
  • [24] Yamagishi, J.; Tachibana, M.; Masuko, T.; Kobayashi, T. Speaking Style Adaptation Using Context Clustering Decision Tree for HMM-Based Speech Synthesis. 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. I, 2004: 5–8.
  • [25] Kwon, Ohsung; Jang, Inseon; Ahn, ChungHyun; Kang, Hong-Goo. An Effective Style Token Weight Control Technique for End-to-End Emotional Speech Synthesis. IEEE Signal Processing Letters, 2019, 26(9): 1383–1387.
  • [26] Wu, Pengfei; Ling, Zhenhua; Liu, Lijuan; Jiang, Yuan; Wu, Hongchuan; Dai, Lirong. End-to-End Emotional Speech Synthesis Using Style Tokens and Semi-Supervised Training. 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019: 623–627.
  • [27] Li, Jingbei; Meng, Yi; Li, Chenyi; Wu, Zhiyong; Meng, Helen; Weng, Chao; Su, Dan. Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-Based Multi-Modal Context Modeling. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 7917–7921.
  • [28] Lin, Cheng-Yuan; Huang, Chien-Hung; Kuo, Chih-Chung. A Simple and Effective Pitch Re-Estimation Method for Rich Prosody and Speaking Styles in HMM-Based Speech Synthesis. 2012 8th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2012: 286–290.
  • [29] Moon, Sungwoo; Kim, Sunghyun; Choi, Yong-Hoon. MIST-Tacotron: End-to-End Emotional Speech Synthesis Using Mel-Spectrogram Image Style Transfer. IEEE Access, 2022, 10: 25455–25463.