Concatenative speech synthesis based on the plural unit selection and fusion method

Cited by: 16
Authors
Mizutani, T [1 ]
Kagoshima, T [1 ]
Institutions
[1] Toshiba Co Ltd, Ctr Corp Res & Dev, Kawasaki, Kanagawa 2128582, Japan
Keywords
speech synthesis; plural unit selection; unit fusion; unit training; sense of stability and sense of voice;
DOI
10.1093/ietisy/e88-d.11.2565
Chinese Library Classification (CLC)
TP [Automation and Computer Technology];
Discipline Classification Code
0812 ;
Abstract
This paper proposes a novel speech synthesis method to generate human-like natural speech. The conventional unit-selection-based synthesis method selects speech units from a large database and concatenates them, with or without prosody modification, to generate synthetic speech. This method features highly human-like voice quality. However, it does not guarantee that a suitable speech unit is selected for every segment. Since an unsuitable selection causes discontinuity between consecutive speech units, the quality of the synthesized speech deteriorates. One might expect that the conventional method could attain higher speech quality if the database size were increased. However, preparing a larger database requires a longer recording time, and the narrator's voice quality does not remain constant throughout the recording period. This degrades the database quality and leaves the problem of unsuitable selection unresolved. We propose the plural unit selection and fusion method, which avoids this problem. This method integrates the unit fusion used in the unit-training-based method into the conventional unit-selection-based method. The proposed method selects plural speech units for each segment, fuses the selected speech units for each segment, modifies the prosody of the fused speech units, and concatenates them to generate synthetic speech. This unit fusion creates speech units that connect to one another with much less voice discontinuity, realizing high-quality speech. A subjective evaluation test showed that the proposed method greatly improves speech quality compared with the conventional method. It also showed that the speech quality of the proposed method remains high regardless of the database size, from small (10 minutes) to large (40 minutes). The proposed method constitutes a new framework in the sense that it is a hybrid of the unit-selection-based method and the unit-training-based method.
Within this framework, the unit selection and unit fusion algorithms can be exchanged for more efficient techniques. Thus, the framework is expected to lead to new synthesis methods.
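The pipeline described in the abstract (select plural units per segment, fuse them, then concatenate) can be sketched as follows. This is a minimal illustration only: all function names are hypothetical, the cost function is a toy distance, and "fusion" is reduced to frame-wise averaging, whereas the paper's actual unit fusion operates on pitch-cycle waveforms and is followed by prosody modification.

```python
# Illustrative sketch of plural unit selection and fusion.
# Assumptions: units are equal-length lists of frame values (plain floats),
# and selection cost is the distance between a unit's mean and a target value.

def select_plural_units(candidates, target, n=3):
    """Select the n candidate units with the lowest cost to the target."""
    ranked = sorted(candidates, key=lambda u: abs(sum(u) / len(u) - target))
    return ranked[:n]

def fuse_units(units):
    """Fuse plural units of equal length by frame-wise averaging."""
    length = len(units[0])
    return [sum(u[i] for u in units) / len(units) for i in range(length)]

def synthesize(segments, database, n=3):
    """For each target segment, select plural units, fuse them,
    and concatenate the fused units into one output sequence."""
    speech = []
    for target in segments:
        selected = select_plural_units(database, target, n)
        speech.extend(fuse_units(selected))
    return speech
```

Averaging several well-matched units smooths out the idiosyncrasies of any single unit, which is the intuition behind the reduced discontinuity the abstract reports.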
Pages: 2565 - 2572
Page count: 8
Related Papers
50 records in total
  • [1] Scalable concatenative speech synthesis based on the plural unit selection and fusion method
    Tamura, M
    Mizutani, T
    Kagoshima, T
    [J]. 2005 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-5: SPEECH PROCESSING, 2005, : 361 - 364
  • [2] Fast concatenative speech synthesis using pre-fused speech units based on the plural unit selection and fusion method
    Tamura, Masatsune
    Mizutani, Tatsuya
    Kagoshima, Takehiko
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2007, E90D (02): : 544 - 553
  • [3] Speech synthesis based on the plural unit selection and fusion method using FWF model
    Morinaka, Ryo
    Tamura, Masatsune
    Morita, Masahiro
    Kagoshima, Takehiko
    [J]. INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2019 - 2022
  • [4] Triphone based unit selection for concatenative visual speech synthesis
    Huang, FJ
    Cosatto, E
    Graf, HP
    [J]. 2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 2037 - 2040
  • [5] An efficient unit-selection method for embedded concatenative speech synthesis
    Gros, Jerneja Zganec
    Zganec, Mario
    [J]. INFORMACIJE MIDEM-JOURNAL OF MICROELECTRONICS ELECTRONIC COMPONENTS AND MATERIALS, 2007, 37 (03): : 158 - 164
  • [6] A short latency unit selection method with redundant search for concatenative speech synthesis
    Nishizawa, Nobuyuki
    Kawai, Hisashi
    [J]. 2006 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-13, 2006, : 757 - 760
  • [7] Speech unit selection based on target values driven by speech data in concatenative speech synthesis
    Hirai, T
    Tenpaku, S
    Shikano, K
    [J]. PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, 2002, : 43 - 46
  • [8] Joint prosody prediction and unit selection for concatenative speech synthesis
    Bulyko, I
    Ostendorf, M
    [J]. 2001 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-VI, PROCEEDINGS, 2001, : 781 - 784
  • [9] An efficient unit-selection method for concatenative Text-to-speech synthesis systems
    Gros, Jerneja Zganec
    Zganec, Mario
    [J]. Journal of Computing and Information Technology, 2008, 16 (01) : 69 - 78
  • [10] PERCEPTUAL CLUSTERING BASED UNIT SELECTION OPTIMIZATION FOR CONCATENATIVE TEXT-TO-SPEECH SYNTHESIS
    Jiang, Tao
    Wu, Zhiyong
    Jia, Jia
    Cai, Lianhong
    [J]. 2012 8TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, 2012, : 64 - 68