Segmental intelligibility of four currently used text-to-speech synthesis methods

Cited by: 11
Author
Venkatagiri, HS [1]
Affiliation
[1] Iowa State Univ, Dept Psychol, Ames, IA 50011 USA
DOI
10.1121/1.1558356
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
The study investigated the segmental intelligibility of four currently available text-to-speech (TTS) products under 0-dB and 5-dB signal-to-noise ratios (SNRs). The products were IBM ViaVoice(TM) version 5.1, which uses formant coding; Festival version 1.4.2, a diphone-based LPC TTS product; AT&T Next-Gen(TM), a half-phone-based TTS product that uses the harmonic-plus-noise method for synthesis; and FlexVoice(TM) 2, a hybrid TTS product that combines concatenative and formant coding techniques. Overall, concatenative techniques were more intelligible than formant or hybrid techniques, with formant coding slightly better at modeling vowels and concatenative techniques marginally better at synthesizing consonants. No TTS product was better at resisting noise interference than the others, although all were more intelligible at 5-dB than at 0-dB SNR. The better TTS products in this study were, on average, 22% less intelligible and had about 3 times more phoneme errors than human voice under comparable listening conditions. The hybrid TTS technology of FlexVoice had the lowest intelligibility and highest error rates. There were discernible patterns of errors for stops, fricatives, and nasals. Unrestricted TTS output (e-mail messages, news reports, and so on) under the high-noise conditions prevalent in automobiles, airports, etc. will likely challenge listeners. (C) 2003 Acoustical Society of America.
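To make the two listening conditions concrete, the sketch below shows one common way to scale a noise signal so that a speech-plus-noise mixture has a chosen SNR. It is only an illustration: the abstract does not say what kind of noise was used or how the stimuli were mixed, so the RMS-based SNR definition, the function names (`rms`, `mix_at_snr`, `measured_snr_db`), and the synthetic tone and Gaussian noise standing in for real recordings are all assumptions.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise RMS ratio equals `snr_db`, then add it.

    Assumes SNR_dB = 20 * log10(rms_speech / rms_noise).
    Returns (mixture, scaled_noise).
    """
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


def measured_snr_db(speech, noise):
    """SNR in dB computed from the separate speech and noise signals."""
    return 20 * math.log10(rms(speech) / rms(noise))


# Hypothetical stand-ins: a 200 Hz tone for "speech", Gaussian noise as the masker.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 200 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

for snr in (0, 5):
    _, scaled = mix_at_snr(speech, noise, snr)
    print(f"target {snr} dB, measured {measured_snr_db(speech, scaled):.3f} dB")
```

Under this definition, 0 dB means the noise has the same RMS level as the speech, and 5 dB means the speech RMS is about 1.78 times the noise RMS.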
Pages: 2095-2104
Number of pages: 10