Proficiency Assessment of ESL Learner's Sentence Prosody with TTS Synthesized Voice as Reference

Cited by: 7

Authors:

Xiao, Yujia ^{[1,2]}

Soong, Frank K. ^{[2]}

Affiliations:

[1] South China Univ Technol, Guangzhou, Guangdong, Peoples R China

[2] Microsoft Res Asia, Beijing, Peoples R China

Source:

18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017

Keywords:

Nativeness; Dynamic Time Warping (DTW); Prosody; Gaussian mixture model; Deep Neural Network;

DOI:

10.21437/Interspeech.2017-64

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We investigate how to assess the prosody quality of an ESL learner's spoken sentence against a native speaker's natural recording or a TTS synthesized voice. A spoken English utterance read by an ESL learner is compared with the recording of a native speaker, or with a TTS voice. The corresponding F0 contours (with voicings) and breaks are compared at the mapped syllable level via DTW. The correlations between the prosody patterns of the learner and the native speaker (or TTS voice) for the same sentence are computed after the speech rates and F0 distributions of the speakers are equalized. Based upon collected native and non-native speakers' databases and the correlation coefficients, we use Gaussian mixtures to model them as continuous distributions for training a two-class (native vs. non-native) neural net classifier. We found that the classification accuracy is close whether a native speaker's recording or a TTS voice is used as the reference, i.e., 91.2% vs. 88.1%. To assess the prosody proficiency of an ESL learner from one input sentence, the prosody patterns of our high-quality TTS are almost as effective as those of native speakers' recordings, which are more expensive and inconvenient to collect.
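
The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it works on raw frame-level F0 values rather than mapped syllables, ignores voicing and breaks, and omits the Gaussian mixture and neural net stages. The function names (`z_normalize`, `dtw_path`, `pearson`, `prosody_correlation`) and the toy contours are hypothetical, and z-score normalization is only one assumed way of equalizing F0 distributions between speakers.

```python
import math
from statistics import mean, pstdev


def z_normalize(values):
    """Assumed equalization: shift and scale F0 to zero mean, unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def dtw_path(a, b):
    """Return the minimum-cost DTW alignment as a list of (i, j) index pairs."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Backtrack from the last cell to recover the aligned index pairs.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is flat."""
    mx, my = mean(x), mean(y)
    num = sum((p - mx) * (q - my) for p, q in zip(x, y))
    den = math.sqrt(
        sum((p - mx) ** 2 for p in x) * sum((q - my) ** 2 for q in y)
    )
    return num / den if den else 0.0


def prosody_correlation(learner_f0, reference_f0):
    """Correlate two F0 contours after normalization and DTW alignment."""
    a = z_normalize(learner_f0)
    b = z_normalize(reference_f0)
    path = dtw_path(a, b)
    return pearson([a[i] for i, _ in path], [b[j] for _, j in path])


# Toy contours in Hz (invented for illustration): the same rise-fall shape,
# spoken at half the rate and in a different pitch range.
learner = [100, 120, 140, 120, 100]
reference = [200, 200, 240, 240, 280, 280, 240, 240, 200, 200]
score = prosody_correlation(learner, reference)
```

In this sketch the DTW alignment absorbs the difference in speech rate and the normalization absorbs the difference in pitch range, so contours with the same shape score close to 1. In the approach the abstract describes, such correlation coefficients would then serve as input to the native vs. non-native classifier.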

Pages: 1755-1759

Page count: 5