An Experimental Analysis on Integrating Multi-Stream Spectro-Temporal, Cepstral and Pitch Information for Mandarin Speech Recognition

Cited by: 7
Authors
Wang, Yow-Bang [1]
Li, Shang-Wen [2]
Lee, Lin-shan [3]
Affiliations
[1] Natl Taiwan Univ, Grad Inst Elect Engn, Taipei 10617, Taiwan
[2] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA
[3] Natl Taiwan Univ, Dept Elect Engn, Taipei 10617, Taiwan
Keywords
Pitch; spectro-temporal features; tandem system; toneme; CHINESE-LANGUAGE; FEATURES; DICTATION
DOI
10.1109/TASL.2013.2263803
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Gabor features have been proposed for extracting spectro-temporal modulation information from speech signals, and have been shown to yield large improvements in recognition accuracy. We use a flexible Tandem system framework that integrates multi-stream information, including Gabor, MFCC, and pitch features, in various ways by modeling the tone variations, the phoneme variations, or both in Mandarin speech recognition. We use either phonemes or tonal phonemes (tonemes) as the target classes of MLP posterior estimation, the acoustic units of HMM recognition, or both. The experiments yield a comprehensive analysis of the contribution each feature set makes to recognition accuracy. We discuss their complementarities in tone, phoneme, and toneme classification. We show that Gabor features are better for recognition of vowels and unvoiced consonants, while MFCCs are better for voiced consonants. Gabor features are also capable of capturing the changes in signals across time and frequency bands caused by Mandarin tone patterns, while pitch features offer additional tonal information. This explains why the integration of Gabor, MFCC, and pitch features offers such significant improvements.
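The Tandem framework named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The feature dimensions, the single-hidden-layer MLP with random (untrained) weights, the early concatenation of the three streams, and the helper names `mlp_posteriors` and `tandem_features` are all assumptions made for illustration. The sketch shows only the general Tandem idea: an MLP estimates per-frame class posteriors, the log posteriors are decorrelated and reduced with PCA, and the result is appended to a base feature stream before being passed to an HMM recognizer.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def mlp_posteriors(x, w1, b1, w2, b2):
    """One-hidden-layer MLP mapping frames to class posteriors (hypothetical helper)."""
    hidden = np.tanh(x @ w1 + b1)
    return softmax(hidden @ w2 + b2)


def tandem_features(posteriors, base_features, n_components):
    """Log posteriors -> PCA reduction -> concatenation with base features."""
    log_post = np.log(posteriors + 1e-10)
    centered = log_post - log_post.mean(axis=0, keepdims=True)
    # PCA via SVD: rows of vt are principal directions, strongest first.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    return np.hstack([reduced, base_features])


# Synthetic stand-ins for real features; all sizes below are assumptions.
rng = np.random.default_rng(0)
n_frames = 200
gabor = rng.normal(size=(n_frames, 40))  # spectro-temporal stream
mfcc = rng.normal(size=(n_frames, 39))   # cepstral stream
pitch = rng.normal(size=(n_frames, 3))   # pitch stream
n_classes = 30                           # e.g. phoneme or toneme targets
n_hidden = 64

# Early concatenation is only one possible way to combine the streams.
mlp_input = np.hstack([gabor, mfcc, pitch])
w1 = rng.normal(scale=0.1, size=(mlp_input.shape[1], n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

post = mlp_posteriors(mlp_input, w1, b1, w2, b2)
feats = tandem_features(post, mfcc, n_components=25)
print(feats.shape)  # (200, 64): 25 reduced posterior dims + 39 MFCC dims
```

In a real system the MLP would be trained on labeled frames, and the choice of phoneme or toneme targets for the MLP could differ from the acoustic units used by the HMM, which is the design space the abstract describes.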
Pages: 2006-2014
Number of pages: 9