Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing

Cited by: 66
Authors
Tachibana, M [1 ]
Yamagishi, J [1 ]
Masuko, T [1 ]
Kobayashi, T [1 ]
Affiliations
[1] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan
Keywords
HMM-based speech synthesis; speaking style; emotional expression; style interpolation; style morphing; hidden semi-Markov model (HSMM);
DOI
10.1093/ietisy/e88-d.11.2484
Chinese Library Classification (CLC): TP [Automation and computer technology]
Discipline code: 0812
Abstract
This paper describes an approach to generating speech with emotional expressivity and speaking style variability. The approach is based on a speaking style and emotional expression modeling technique for HMM-based speech synthesis. We first model several representative styles, each of which is a speaking style and/or an emotional expression, in an HMM-based speech synthesis framework. Then, to generate synthetic speech with an intermediate style from representative ones, we synthesize speech from a model obtained by interpolating representative style models using a model interpolation technique. We assess the style interpolation technique with subjective evaluation tests using four representative styles, i.e., neutral, joyful, sad, and rough in read speech and synthesized speech from models obtained by interpolating models for all combinations of two styles. The results show that speech synthesized from the interpolated model has a style in between the two representative ones. Moreover, we can control the degree of expressivity for speaking styles or emotions in synthesized speech by changing the interpolation ratio in interpolation between neutral and other representative styles. We also show that we can achieve style morphing in speech synthesis, namely, changing style smoothly from one representative style to another by gradually changing the interpolation ratio.
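The style interpolation described in the abstract can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes each representative style model reduces to per-state Gaussian output distributions (mean and variance vectors), and it uses linear interpolation of means with squared-weight interpolation of variances, one common choice in HMM model interpolation. The model values below are made up for illustration.

```python
import numpy as np

def interpolate_styles(models, ratios):
    """Interpolate per-state Gaussian parameters of representative style models.

    models: list of (means, variances) NumPy arrays, one pair per style.
    ratios: interpolation weights; must sum to 1.
    Returns the (means, variances) of the interpolated style model.
    """
    ratios = np.asarray(ratios, dtype=float)
    assert np.isclose(ratios.sum(), 1.0), "interpolation ratios must sum to 1"
    # Means are combined linearly with the interpolation ratios.
    means = sum(a * m for a, (m, _) in zip(ratios, models))
    # Variances are combined with squared weights (one common convention).
    variances = sum(a**2 * v for a, (_, v) in zip(ratios, models))
    return means, variances

# Hypothetical per-state parameters for two representative styles.
neutral = (np.zeros(3), np.ones(3))
joyful = (np.full(3, 2.0), np.full(3, 4.0))

# Style morphing: sweep the ratio gradually from neutral to joyful.
for alpha in np.linspace(0.0, 1.0, 5):
    mu, var = interpolate_styles([neutral, joyful], [1.0 - alpha, alpha])
```

Changing `alpha` plays the role of the interpolation ratio in the paper: 0 yields the neutral model, 1 the joyful model, and intermediate values give models whose synthesized speech should fall perceptually between the two.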
Pages: 2484–2491 (8 pages)
Related papers (29 total)
  • [21] Filntisis, Panagiotis P.; Katsamanis, Athanasios; Maragos, Petros. Photorealistic Adaptation and Interpolation of Facial Expressions Using HMMs and AAMs for Audio-Visual Speech Synthesis. 2017 24th IEEE International Conference on Image Processing (ICIP), 2017: 2941–2945.
  • [22] Lei, Shun; Zhou, Yixuan; Chen, Liyang; Hu, Jiankun; Wu, Zhiyong; Kang, Shiyin; Meng, Helen. Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis. INTERSPEECH 2022, 2022: 5523–5527.
  • [23] Meng, Yi; Li, Xiang; Wu, Zhiyong; Li, Tingtian; Sun, Zixun; Xiao, Xinyu; Sun, Chi; Zhan, Hui; Meng, Helen. CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis. INTERSPEECH 2022, 2022: 5533–5537.
  • [24] Yamagishi, J.; Tachibana, M.; Masuko, T.; Kobayashi, T. Speaking Style Adaptation Using Context Clustering Decision Tree for HMM-Based Speech Synthesis. 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. I, 2004: 5–8.
  • [25] Kwon, Ohsung; Jang, Inseon; Ahn, ChungHyun; Kang, Hong-Goo. An Effective Style Token Weight Control Technique for End-to-End Emotional Speech Synthesis. IEEE Signal Processing Letters, 2019, 26(9): 1383–1387.
  • [26] Wu, Pengfei; Ling, Zhenhua; Liu, Lijuan; Jiang, Yuan; Wu, Hongchuan; Dai, Lirong. End-to-End Emotional Speech Synthesis Using Style Tokens and Semi-Supervised Training. 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019: 623–627.
  • [27] Li, Jingbei; Meng, Yi; Li, Chenyi; Wu, Zhiyong; Meng, Helen; Weng, Chao; Su, Dan. Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-Based Multi-Modal Context Modeling. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 7917–7921.
  • [28] Lin, Cheng-Yuan; Huang, Chien-Hung; Kuo, Chih-Chung. A Simple and Effective Pitch Re-Estimation Method for Rich Prosody and Speaking Styles in HMM-Based Speech Synthesis. 2012 8th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2012: 286–290.
  • [29] Moon, Sungwoo; Kim, Sunghyun; Choi, Yong-Hoon. MIST-Tacotron: End-to-End Emotional Speech Synthesis Using Mel-Spectrogram Image Style Transfer. IEEE Access, 2022, 10: 25455–25463.