Improving human scoring of prosody using parametric speech synthesis

Cited by: 5
Authors
Prafianto, Hafiyan [1]
Nose, Takashi [1]
Chiba, Yuya [1]
Ito, Akinori [1]
Affiliation
[1] Tohoku Univ, Grad Sch Engn, Aoba Ku, Aramaki Aza Aoba 6-6-05, Sendai, Miyagi 9808579, Japan
Keywords
Computer assisted language learning (CALL); Computer assisted pronunciation training (CAPT); Automatic pronunciation evaluation system; Parametric speech synthesis; Average voice model; Recognition
DOI
10.1016/j.specom.2019.06.001
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
This paper proposes a method that uses parametric speech synthesis to improve human scoring of non-native speakers' utterances. Rather than having raters assess each prosodic feature by listening directly to the original utterance, the method substitutes the features not under assessment with those of native speakers, so that raters can focus only on the target prosodic feature. The substitute features are generated by parametric speech synthesis; in this study, HMM-based speech synthesis from an average voice model of native speakers was used. Experimental results show that the proposed method improves scoring reliability, as confirmed by an increase in inter-rater correlation. We also built automatic pronunciation evaluation systems trained on non-native speech databases scored with either the conventional or the proposed method, and compared their performance. The predicted pronunciation scores agreed well with the human ratings: the human-machine correlation was 0.87 with the proposed scoring method, compared with 0.74 with the conventional scoring method.
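Both reliability measures mentioned in the abstract are correlations between lists of scores: inter-rater correlation compares raters with one another, and human-machine correlation compares human scores with system predictions. The abstract does not state the exact formula or how multiple rater pairs are combined, so the following Python sketch assumes Pearson's correlation coefficient averaged over all rater pairs. The function names, rater labels, and example ratings are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance and variances, all left unnormalized; the 1/n factors cancel.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def mean_inter_rater_correlation(scores_by_rater: dict[str, list[float]]) -> float:
    """Average Pearson correlation over every pair of raters (an assumed aggregation)."""
    pairs = list(combinations(scores_by_rater.values(), 2))
    if not pairs:
        raise ValueError("need at least two raters")
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical 1-5 scores given by three raters to the same five utterances.
    ratings = {
        "rater_a": [3, 4, 2, 5, 1],
        "rater_b": [3, 5, 2, 4, 1],
        "rater_c": [2, 4, 3, 5, 1],
    }
    print(round(mean_inter_rater_correlation(ratings), 3))  # prints 0.867
```

The same `pearson()` function yields a human-machine correlation when given human scores and system-predicted scores for the same utterances. Averaging raw correlation values is a simplification; applying a Fisher z-transform before averaging is a common alternative.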
Pages: 14-21
Page count: 8