An Experimental Analysis on Integrating Multi-Stream Spectro-Temporal, Cepstral and Pitch Information for Mandarin Speech Recognition

Cited by: 7

Authors:

Wang, Yow-Bang [1]

Li, Shang-Wen [2]

Lee, Lin-shan [3]

Affiliations:

[1] Natl Taiwan Univ, Grad Inst Elect Engn, Taipei 10617, Taiwan

[2] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA

[3] Natl Taiwan Univ, Dept Elect Engn, Taipei 10617, Taiwan

Source:

IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2013, Vol. 21, No. 10

Keywords:

Pitch; spectro-temporal features; tandem system; toneme; CHINESE-LANGUAGE; FEATURES; DICTATION;

DOI:

10.1109/TASL.2013.2263803

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Gabor features have been proposed for extracting spectro-temporal modulation information from speech signals, and have been shown to yield large improvements in recognition accuracy. We use a flexible Tandem system framework that integrates multi-stream information including Gabor, MFCC, and pitch features in various ways, by modeling either or both of the tone and phoneme variations in Mandarin speech recognition. We use either phonemes or tonal phonemes (tonemes) as the target classes of MLP posterior estimation, as the acoustic units of HMM recognition, or as both. The experiments yield a comprehensive analysis of the contributions to recognition accuracy made by each of the feature sets. We discuss their complementarities in tone, phoneme, and toneme classification. We show that Gabor features are better for recognition of vowels and unvoiced consonants, while MFCCs are better for voiced consonants. Also, Gabor features are capable of capturing changes in signals across time and frequency bands caused by Mandarin tone patterns, while pitch features offer extra tonal information. This explains why the integration of Gabor, MFCC, and pitch features offers such significant improvements.


Pages: 2006-2014

Page count: 9

8 items in total

[1] Multi-Stream Spectro-Temporal Features for Robust Speech Recognition
Zhao, Sherry Y.
Morgan, Nelson
[J]. INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 898 - 901
[2] Localized spectro-temporal cepstral analysis of speech
Bouvrie, Jake
Ezzat, Tony
Poggio, Tomaso
[J]. 2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-12, 2008, : 4733 - 4736
[3] MULTI-STREAM SPECTRO-TEMPORAL AND CEPSTRAL FEATURES BASED ON DATA-DRIVEN HIERARCHICAL PHONEME CLUSTERS
Li, Shang-wen
Sun, Liang-che
Lee, Lin-shan
[J]. 2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2011, : 5196 - 5199
[4] Multi-Stream to Many-Stream: Using Spectro-Temporal Features for ASR
Zhao, Sherry Y.
Ravuri, Suman
Morgan, Nelson
[J]. INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2935 - 2938
[5] Improved Phoneme Recognition by Integrating Evidence from Spectro-temporal and Cepstral Features
Li, Shang-wen
Sun, Liang-che
Lee, Lin-shan
[J]. 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, 2010, : 1177 - 1180
[6] Improved Tonal Language Speech Recognition by Integrating Spectro-temporal Evidence and Pitch Information with Properly Chosen Tonal Acoustic Units
Li, Shang-wen
Wang, Yow-bang
Sun, Liang-che
Lee, Lin-shan
[J]. 12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 2304 - +
[7] Combining Information from Multi-Stream Features Using Deep Neural Network in Speech Recognition
Zhou, Pan
Dai, Lirong
Liu, Qingfeng
Jiang, Hui
[J]. PROCEEDINGS OF 2012 IEEE 11TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING (ICSP) VOLS 1-3, 2012, : 557 - +
[8] The use of temporal speech and lip information for multi-modal speaker identification via multi-stream HMM's
Wark, T
Sridharan, S
Chandran, V
[J]. 2000 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS, VOLS I-VI, 2000, : 2389 - 2392
