Interacting with computers by voice: Automatic speech recognition and synthesis

Cited by: 45
Author: O'Shaughnessy, D [1]
Affiliation: [1] INRS Telecommun, Montreal, PQ H5A 1K6, Canada
Keywords: continuous speech recognition; distance measures; hidden Markov models (HMMs); human-computer dialogues; language models (LMs); linear predictive coding (LPC); spectral analysis; speech synthesis; text-to-speech (TTS)
DOI: 10.1109/JPROC.2003.817117
Chinese Library Classification: TM (Electrical Engineering); TN (Electronics and Communication Technology)
Discipline codes: 0808; 0809
Abstract
This paper examines how people communicate with computers using speech. Automatic speech recognition (ASR) transforms speech into text, while automatic speech synthesis [or text-to-speech (TTS)] performs the reverse task. ASR has largely developed from speech coding theory, while simulating certain spectral analyses performed by the ear. Typically, a Fourier transform is employed, following the auditory Bark scale and simplifying the spectral representation through decorrelation into cepstral coefficients. Current ASR provides good accuracy and performance on limited practical tasks, but exploits only the most rudimentary knowledge about human production and perception phenomena. The popular mathematical model called the hidden Markov model (HMM) is examined; first-order HMMs are efficient but ignore long-range correlations in actual speech. Common language models use a time window of three successive words in their syntactic-semantic analysis. Speech synthesis is the automatic generation of a speech waveform, typically from an input text. As with ASR, TTS starts from a database of information previously established by analysis of much training data, both speech and text. Previously analyzed speech is stored in small units in the database for concatenation in the proper sequence at runtime. TTS systems first perform text processing, including "letter-to-sound" conversion, to generate the phonetic transcription. Intonation must be properly specified to approximate the naturalness of human speech. Modern synthesizers using large databases of stored spectral patterns or waveforms output highly intelligible synthetic speech, but naturalness remains to be improved.
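The "time window of three successive words" mentioned in the abstract describes a standard trigram language model, which scores a word given its two predecessors. A minimal maximum-likelihood sketch is shown below; the toy corpus and the smoothing-free estimate are illustrative assumptions, not the paper's actual system (production LMs use far larger corpora and smoothing for unseen trigrams):

```python
from collections import Counter

# Toy corpus; real language models are trained on millions of words.
corpus = "the cat sat on the mat the cat ate".split()

# Count trigrams and their two-word histories.
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def trigram_prob(w1, w2, w3):
    """Maximum-likelihood P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)."""
    if bigrams[(w1, w2)] == 0:
        return 0.0  # unseen history; a real LM would back off or smooth
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

# "the cat" occurs twice, followed once by "sat" and once by "ate".
print(trigram_prob("the", "cat", "sat"))  # → 0.5
```

An ASR decoder multiplies such probabilities along a word sequence to rank competing transcription hypotheses.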
Pages: 1272-1305 (34 pages)