WHISPERED AND LOMBARD NEURAL SPEECH SYNTHESIS

Cited by: 8
Authors
Hu, Qiong [1]
Bleisch, Tobias [1]
Petkov, Petko [1]
Raitio, Tuomo [1]
Marchi, Erik [1]
Lakshminarasimhan, Varun [1]
Affiliation
[1] Apple Inc, Cupertino, CA 95014 USA
Keywords
speech synthesis; speaker adaptation; multi-speaker training; Lombard speech; whisper speech; text-to-speech
DOI
10.1109/SLT48900.2021.9383454
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
It is desirable for a text-to-speech system to take into account the environment in which synthetic speech is presented and to provide appropriate context-dependent output to the user. In this paper, we present and compare several approaches for generating different speaking styles, namely normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) pre-training and fine-tuning a model for each style; 2) Lombard and whisper speech conversion through a signal-processing-based approach; 3) multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that 1) the pre-training/fine-tuning approach can generate high-quality speech for all speaking styles, and 2) although our speaker verification (SV) model is not explicitly trained to discriminate between speaking styles, and no Lombard or whisper speech is used to pre-train it, the SV model can serve as a style encoder that generates style embeddings as input to the Tacotron system. We also show that the resulting synthetic Lombard speech significantly improves intelligibility.
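As a rough illustration of the third system, the sketch below shows one way an embedding from a speaker verification model could condition a Tacotron-style encoder. The abstract does not say how the style embeddings are computed or injected, so everything here is an assumption made for illustration: averaging per-utterance embeddings, L2 normalization, concatenation to every encoder time step, and the function names `style_embedding` and `condition_encoder_outputs` are all hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def style_embedding(utterance_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Assumed recipe: average the SV embeddings of one style, then normalize."""
    if not utterance_embeddings:
        raise ValueError("need at least one utterance embedding")
    count = len(utterance_embeddings)
    dim = len(utterance_embeddings[0])
    mean = [sum(e[i] for e in utterance_embeddings) / count for i in range(dim)]
    return l2_normalize(mean)


def condition_encoder_outputs(
    encoder_outputs: Sequence[Sequence[float]], style: Sequence[float]
) -> List[List[float]]:
    """Assumed injection point: append the style vector to every time step."""
    return [list(frame) + list(style) for frame in encoder_outputs]


# Toy data: two 2-dimensional "SV embeddings" for whisper utterances and
# two encoder time steps with 3 features each (all values are made up).
whisper_utterances = [[3.0, 4.0], [3.0, 4.0]]
style = style_embedding(whisper_utterances)
encoder_outputs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
conditioned = condition_encoder_outputs(encoder_outputs, style)
```

In this toy setup each conditioned frame has 3 + 2 = 5 features, and switching style amounts to swapping the embedding passed to `condition_encoder_outputs`.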
Pages: 454-461
Page count: 8