A unit selection text-to-speech-and-singing synthesis framework from neutral speech: proof of concept

Cited by: 1
Authors
Freixes, Marc [1 ]
Alias, Francesc [1 ]
Claudi Socoro, Joan [1 ]
Affiliations
[1] La Salle Univ Ramon Llull, Grup Recerca Tecnol Media GTM, Quatre Camins 30, Barcelona 08022, Spain
Keywords
Text-to-speech; Unit selection; Speech synthesis; Singing synthesis; Speech-to-singing; VOICE SYNTHESIS SYSTEM; PLUS NOISE MODEL; QUALITY;
DOI
10.1186/s13636-019-0163-y
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Text-to-speech (TTS) synthesis systems are widely used in general-purpose applications based on the generation of speech. Nonetheless, some domains, such as storytelling or voice output aid devices, may also require singing. To enable a corpus-based TTS system to sing, a supplementary singing database would have to be recorded, a solution that may be too costly for occasional singing needs, or even unfeasible if the original speaker is unavailable or unable to sing properly. This work introduces a unit selection-based text-to-speech-and-singing (US-TTS&S) synthesis framework that integrates speech-to-singing (STS) conversion to generate both speech and singing from an input text and a score, respectively, using the same neutral speech corpus. The viability of the proposal is evaluated on a proof-of-concept implementation with a 2.6-h Spanish neutral speech corpus, considering three vocal ranges and two tempos. The experiments show that challenging STS transformation factors are required to sing beyond the corpus vocal range and/or with notes longer than 150 ms. While score-driven unit selection (US) configurations reduce the pitch-scale factors, the time-scale factors are not reduced because of the short duration of the spoken vowels. Moreover, in the MUSHRA test, text-driven and score-driven US configurations obtain similar naturalness scores of around 40 for all the analysed scenarios. Although these naturalness scores are far from those of Vocaloid, the singing scores of around 60 validate that the framework could reasonably address occasional singing needs.
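The pitch-scale and time-scale factors discussed in the abstract can be pictured as ratios between a target note and the spoken unit selected from the corpus. A minimal sketch of this idea follows; the function names, signatures, and example values are illustrative assumptions, not the paper's implementation:

```python
from math import log2

def semitones(target_f0, source_f0):
    """Pitch-scale factor, in semitones, between two f0 values in Hz."""
    return 12 * log2(target_f0 / source_f0)

def sts_factors(note_f0, note_dur, unit_f0, unit_vowel_dur):
    """Return (pitch-scale in semitones, time-scale ratio) for one note.

    A time-scale ratio above 1 means the spoken vowel must be stretched
    to cover the note; short spoken vowels make this ratio large.
    """
    pitch_scale = semitones(note_f0, unit_f0)
    time_scale = note_dur / unit_vowel_dur
    return pitch_scale, time_scale

# A 440 Hz, 400 ms note sung from a spoken vowel at 196 Hz lasting 90 ms:
ps, ts = sts_factors(440.0, 0.400, 196.0, 0.090)
print(round(ps, 1), round(ts, 2))  # prints: 14.0 4.44
```

The example illustrates the abstract's finding: even when score-driven unit selection picks a unit close in pitch, a note much longer than 150 ms still demands a large time-scale stretch of a short spoken vowel.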
Pages: 14
Related papers
50 in total (items 21-30 shown)
  • [21] Speech synthesis from text
    Sagisaka, Y.
    IEEE COMMUNICATIONS MAGAZINE, 1990, 28 (01) : 35 - +
  • [22] Recording and annotation of speech corpus for Czech unit selection speech synthesis
    Matousek, Jindrich
    Romportl, Jan
    TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2007, 4629 : 326 - +
  • [23] Polish unit selection speech synthesis with BOSS: extensions and speech corpora
    Demenko, Grazyna
    Klessa, Katarzyna
    Szymanski, Marcin
    Breuer, Stefan
    Hess, Wolfgang
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2010, 13 (02) : 85 - 99
  • [24] An efficient unit-selection method for concatenative Text-to-speech synthesis systems
    Gros, Jerneja Zganec
    Zganec, Mario
    Journal of Computing and Information Technology, 2008, 16 (01) : 69 - 78
  • [25] Multimodal Unit Selection for 2D Audiovisual Text-to-Speech Synthesis
    Mattheyses, Wesley
    Latacz, Lukas
    Verhelst, Werner
    Sahli, Hichem
    MACHINE LEARNING FOR MULTIMODAL INTERACTION, PROCEEDINGS, 2008, 5237 : 125 - 136
  • [26] A Unit Selection Text-to-Speech Synthesis System Optimized for Use with Screen Readers
    Chalamandaris, Aimilios
    Karabetsos, Sotiris
    Tsiakoulis, Pirros
    Raptis, Spyros
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2010, 56 (03) : 1890 - 1897
  • [27] PERCEPTUAL CLUSTERING BASED UNIT SELECTION OPTIMIZATION FOR CONCATENATIVE TEXT-TO-SPEECH SYNTHESIS
    Jiang, Tao
    Wu, Zhiyong
    Jia, Jia
    Cai, Lianhong
    2012 8TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, 2012, : 64 - 68
  • [28] Unit Selection Model in Arabic Speech Synthesis
    Al-Saiyd, Nedhal A.
    Hijjawi, Mohammad
    INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2018, 18 (04): : 126 - 131
  • [29] Unit-centric feature mapping for inventory pruning in unit selection text-to-speech synthesis
    Bellegarda, Jerome R.
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2008, 16 (01): : 74 - 82
  • [30] Speech unit selection based on target values driven by speech data in concatenative speech synthesis
    Hirai, T
    Tenpaku, S
    Shikano, K
    PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, 2002, : 43 - 46