Concatenative speech synthesis based on the plural unit selection and fusion method

Cited by: 16
Authors
Mizutani, T [1 ]
Kagoshima, T [1 ]
Institutions
[1] Toshiba Co Ltd, Ctr Corp Res & Dev, Kawasaki, Kanagawa 2128582, Japan
Keywords
speech synthesis; plural unit selection; unit fusion; unit training; sense of stability and sense of voice;
DOI
10.1093/ietisy/e88-d.11.2565
Chinese Library Classification (CLC)
TP [Automation and Computer Technology];
Discipline Classification Code
0812 ;
Abstract
This paper proposes a novel speech synthesis method to generate human-like natural speech. The conventional unit-selection-based synthesis method selects speech units from a large database and concatenates them, with or without prosody modification, to generate synthetic speech. This method features highly human-like voice quality. However, it does not guarantee that a suitable speech unit is selected for every segment. Since an unsuitable selection causes discontinuity between consecutive speech units, the quality of the synthesized speech deteriorates. One might expect that the conventional method could attain higher speech quality if the database size were increased. However, preparing a larger database requires a longer recording time, and the narrator's voice quality does not remain constant throughout the recording period. This degrades the database quality and leaves the problem of unsuitable selection unresolved. We propose the plural unit selection and fusion method, which avoids this problem. This method integrates the unit fusion used in the unit-training-based method into the conventional unit-selection-based method. The proposed method selects plural speech units for each segment, fuses the selected speech units for each segment, modifies the prosody of the fused speech units, and concatenates them to generate synthetic speech. This unit fusion creates speech units that connect to one another with much less voice discontinuity, realizing high-quality speech. A subjective evaluation test showed that the proposed method greatly improves speech quality compared with the conventional method. It also showed that the speech quality of the proposed method remains high regardless of the database size, from small (10 minutes) to large (40 minutes). The proposed method constitutes a new framework in the sense that it is a hybrid of the unit-selection-based method and the unit-training-based method.
Within this framework, the unit selection and unit fusion algorithms can be exchanged for more efficient techniques. Thus, the framework is expected to lead to new synthesis methods.
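The pipeline described in the abstract (select plural units per segment, fuse them, then concatenate) can be sketched as follows. This is a minimal illustration only: all function names are hypothetical, the cost function is a toy distance, and "fusion" is reduced to frame-wise averaging, whereas the paper's actual unit fusion operates on pitch-cycle waveforms and is followed by prosody modification.

```python
# Illustrative sketch of plural unit selection and fusion.
# Assumptions: units are equal-length lists of frame values (plain floats),
# and selection cost is the distance between a unit's mean and a target value.

def select_plural_units(candidates, target, n=3):
    """Select the n candidate units with the lowest cost to the target."""
    ranked = sorted(candidates, key=lambda u: abs(sum(u) / len(u) - target))
    return ranked[:n]

def fuse_units(units):
    """Fuse plural units of equal length by frame-wise averaging."""
    length = len(units[0])
    return [sum(u[i] for u in units) / len(units) for i in range(length)]

def synthesize(segments, database, n=3):
    """For each target segment, select plural units, fuse them,
    and concatenate the fused units into one output sequence."""
    speech = []
    for target in segments:
        selected = select_plural_units(database, target, n)
        speech.extend(fuse_units(selected))
    return speech
```

Averaging several well-matched units smooths out the idiosyncrasies of any single unit, which is the intuition behind the reduced discontinuity the abstract reports.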
Pages: 2565 - 2572
Page count: 8
Related Papers
50 records in total
  • [1] Scalable concatenative speech synthesis based on the plural unit selection and fusion method
    Tamura, M
    Mizutani, T
    Kagoshima, T
    [J]. 2005 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-5: SPEECH PROCESSING, 2005, : 361 - 364
  • [2] Fast concatenative speech synthesis using pre-fused speech units based on the plural unit selection and fusion method
    Tamura, Masatsune
    Mizutani, Tatsuya
    Kagoshima, Takehiko
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2007, E90D (02): : 544 - 553
  • [3] Speech synthesis based on the plural unit selection and fusion method using FWF model
    Morinaka, Ryo
    Tamura, Masatsune
    Morita, Masahiro
    Kagoshima, Takehiko
    [J]. INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2019 - 2022
  • [4] Triphone based unit selection for concatenative visual speech synthesis
    Huang, FJ
    Cosatto, E
    Graf, HP
    [J]. 2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 2037 - 2040
  • [5] An efficient unit-selection method for embedded concatenative speech synthesis
    Gros, Jerneja Zganec
    Zganec, Mario
    [J]. INFORMACIJE MIDEM-JOURNAL OF MICROELECTRONICS ELECTRONIC COMPONENTS AND MATERIALS, 2007, 37 (03): : 158 - 164
  • [6] A short latency unit selection method with redundant search for concatenative speech synthesis
    Nishizawa, Nobuyuki
    Kawai, Hisashi
    [J]. 2006 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-13, 2006, : 757 - 760
  • [7] Speech unit selection based on target values driven by speech data in concatenative speech synthesis
    Hirai, T
    Tenpaku, S
    Shikano, K
    [J]. PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, 2002, : 43 - 46
  • [8] Joint prosody prediction and unit selection for concatenative speech synthesis
    Bulyko, I
    Ostendorf, M
    [J]. 2001 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-VI, PROCEEDINGS, 2001, : 781 - 784
  • [9] An efficient unit-selection method for concatenative Text-to-speech synthesis systems
    Gros, Jerneja Zganec
    Zganec, Mario
    [J]. Journal of Computing and Information Technology, 2008, 16 (01) : 69 - 78
  • [10] PERCEPTUAL CLUSTERING BASED UNIT SELECTION OPTIMIZATION FOR CONCATENATIVE TEXT-TO-SPEECH SYNTHESIS
    Jiang, Tao
    Wu, Zhiyong
    Jia, Jia
    Cai, Lianhong
    [J]. 2012 8TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, 2012, : 64 - 68