WHISPERED AND LOMBARD NEURAL SPEECH SYNTHESIS

Cited by: 8

Authors:

Hu, Qiong [1]

Bleisch, Tobias [1]

Petkov, Petko [1]

Raitio, Tuomo [1]

Marchi, Erik [1]

Lakshminarasimhan, Varun [1]

Affiliations:

[1] Apple Inc, Cupertino, CA 95014 USA

Source:

2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021

Keywords:

speech synthesis; speaker adaptation; multi-speaker training; Lombard speech; whisper speech; text-to-speech

DOI:

10.1109/SLT48900.2021.9383454

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

It is desirable for a text-to-speech system to take into account the environment in which synthetic speech is presented and to provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) pre-training and fine-tuning a model for each style; 2) Lombard and whisper speech conversion through a signal-processing-based approach; 3) multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that 1) high-quality speech can be generated through the pre-training/fine-tuning approach for all speaking styles; and 2) although our speaker verification (SV) model is not explicitly trained to discriminate between speaking styles, and no Lombard or whisper voice is used to pre-train this system, the SV model can be used as a style encoder to generate different style embeddings as input to the Tacotron system. We also show that the resulting synthetic Lombard speech has a significant positive impact on intelligibility gain.

Pages: 454-461

Page count: 8
