WHISPERED AND LOMBARD NEURAL SPEECH SYNTHESIS

Cited by: 8

Authors:

Hu, Qiong [1]

Bleisch, Tobias [1]

Petkov, Petko [1]

Raitio, Tuomo [1]

Marchi, Erik [1]

Lakshminarasimhan, Varun [1]

Affiliations:

[1] Apple Inc, Cupertino, CA 95014 USA

Source:

2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021

Keywords:

speech synthesis; speaker adaptation; multi-speaker training; Lombard speech; whisper speech; text-to-speech

DOI:

10.1109/SLT48900.2021.9383454

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely, normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) Pre-training and fine-tuning a model for each style. 2) Lombard and whisper speech conversion through a signal processing based approach. 3) Multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that: 1) We can generate high-quality speech through the pre-training/fine-tuning approach for all speaking styles. 2) Although our speaker verification (SV) model is not explicitly trained to discriminate between speaking styles, and no Lombard or whisper voice is used to pre-train this system, the SV model can be used as a style encoder to generate different style embeddings as input for the Tacotron system. We also show that the resulting synthetic Lombard speech has a significant positive impact on intelligibility gain.

Pages: 454-461

Page count: 8

50 items in total

[21] Segregation of whispered speech interleaved with noise or speech maskers
Iyer, Nandini
Brungart, Douglas S.
Simpson, Brian D.
12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011: 36 - +
[22] An algorithm for formant estimation of whispered speech
Gong Chenghui
Zhao Heming
Lu Gang
Liu Jianxin
2006 8TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, VOLS 1-4, 2006: 724 - +
[23] Kinematics of Loud, Soft, and Whispered Speech
Dromey, Christopher
Peacock, Mendocino
FOLIA PHONIATRICA ET LOGOPAEDICA, 2024, 76 (05)
[24] Design of a whispered Chinese speech database
Department of Electronic Engineering, Beijing Institute of Technology, Beijing 100081, China
Qinghua Daxue Xuebao, 2008, SUPPL. 1 (725-729):
[25] REALIZATION OF PROSODIC FEATURES IN WHISPERED SPEECH
MEYER-EPPLER, W
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 1956, 28 (04): 760 - 760
[26] Acoustic analysis of consonants in whispered speech
Jovicic, Slobodan T.
Saric, Zoran
JOURNAL OF VOICE, 2008, 22 (03) : 263 - 274
[27] Acoustic analysis and recognition of whispered speech
Itoh, T
Takeda, K
Itakura, F
2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002: 389 - 392
[28] VERTICAL LARYNX POSITION IN WHISPERED SPEECH
RIORDAN, C
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 1977, 61 : S32 - S32
[29] Nasal coarticulation in Lombard speech
Lo, Justin J. H.
SPEECH COMMUNICATION, 2025, 169
[30] Study on the Emotion Recognition of Whispered Speech
Jin, Yun
Zhao, Yan
Huang, Chengwei
Zhao, Li
PROCEEDINGS OF THE 2009 WRI GLOBAL CONGRESS ON INTELLIGENT SYSTEMS, VOL III, 2009: 242 - 246