Multimodal Unsupervised Speech Translation for Recognizing and Evaluating Second Language Speech

Cited by: 4
Authors
Lee, Yun Kyung [1 ]
Park, Jeon Gue [1 ]
Affiliation
[1] Elect & Telecommun Res Inst ETRI, Artificial Intelligence Res Lab, Daejeon 34129, South Korea
Source
APPLIED SCIENCES-BASEL | 2021, Vol. 11, No. 6
Keywords
fluency evaluation; speech recognition; data augmentation; variational autoencoder; speech conversion; NONPARALLEL VOICE CONVERSION; BLIND SEPARATION; RECOGNITION;
DOI
10.3390/app11062642
Chinese Library Classification
O6 [Chemistry];
Discipline Classification Code
0703;
Abstract
This paper addresses automatic proficiency evaluation and speech recognition for second language (L2) speech. The proposed method recognizes the speech uttered by the L2 speaker, measures a variety of fluency scores, and evaluates the proficiency of the speaker's spoken English. Stress and rhythm scores are among the important factors used to evaluate fluency in spoken English and are computed by comparing the stress patterns and the rhythm distributions to those of native speakers. In order to compute the stress and rhythm scores even when the phonemic sequence of the L2 speaker's English sentence differs from that of the native speaker, we align the phonemic sequences based on a dynamic time-warping approach. We also improve the performance of the speech recognition system for non-native speakers and compute fluency features more accurately by augmenting the non-native training dataset and training an acoustic model with the augmented dataset. In this work, we augment the non-native speech by converting some speech signal characteristics (style) while preserving its linguistic information. The proposed variational autoencoder (VAE)-based speech conversion network trains the conversion model by decomposing the spectral features of the speech into a speaker-invariant content factor and a speaker-specific style factor to estimate diverse and robust speech styles. Experimental results show that the proposed method effectively measures the fluency scores and generates diverse output signals. In addition, in the proficiency evaluation and speech recognition tests, the proposed method improves the proficiency score performance and speech recognition accuracy for all proficiency areas compared to a method employing conventional acoustic models.
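The dynamic time-warping (DTW) alignment of phonemic sequences mentioned in the abstract can be sketched as follows. This is a minimal illustration with an assumed 0/1 substitution cost and toy phoneme labels, not the authors' implementation, which may use richer phoneme-distance measures.

```python
# Minimal DTW alignment of two phoneme sequences.
# The 0/1 cost function and the example phoneme labels are assumptions
# for illustration, not taken from the paper.

def dtw_align(ref, hyp, cost=lambda a, b: 0 if a == b else 1):
    """Return (total_cost, aligned index pairs) for sequences ref and hyp."""
    n, m = len(ref), len(hyp)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(ref[i - 1], hyp[j - 1])
            D[i][j] = c + min(D[i - 1][j],      # stay on hypothesis phoneme
                              D[i][j - 1],      # stay on reference phoneme
                              D[i - 1][j - 1])  # advance both (match/substitute)
    # Backtrack the optimal warping path from the end of both sequences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, (i, j) = min((D[i - 1][j - 1], (i - 1, j - 1)),
                        (D[i - 1][j], (i - 1, j)),
                        (D[i][j - 1], (i, j - 1)))
    return D[n][m], path[::-1]

# Example: a native phoneme sequence vs. an L2 realization with a
# substitution ("th" -> "d") and an inserted vowel at the end.
native = ["th", "ih", "s"]
learner = ["d", "ih", "s", "ah"]
total, pairs = dtw_align(native, learner)
```

With the 0/1 cost above, `total` counts the mismatched alignment steps (here the substitution and the insertion), and `pairs` gives the index correspondence along which stress and rhythm features of the two utterances can be compared frame by frame.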
Pages: 16
Related Papers
50 records total
  • [1] Multimodal input in second-language speech processing
    Hardison, Debra M.
    LANGUAGE TEACHING, 2021, 54 (02) : 206 - 220
  • [2] An objective method for evaluating speech translation system: Using a second language learner's corpus
    Yasuda, K
    Sugaya, F
    Takezawa, T
    Kikui, G
    Yamamoto, S
    Yanagida, M
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2005, E88D (03): : 569 - 577
  • [3] Language Identification for Speech-to-Speech Translation
    Lim, Daniel Chung Yong
    Lane, Ian
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 204 - 207
  • [4] Simple and Effective Unsupervised Speech Translation
    Wang, Changhan
    Inaguma, Hirofumi
    Chen, Peng-Jen
    Kulikov, Ilia
    Tang, Yun
    Hsu, Wei-Ning
    Auli, Michael
    Pino, Juan
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 10771 - 10784
  • [5] Speech enabled Integrated AR-based Multimodal Language Translation
    Bhargava, Mahesh
    Dhote, Pavan
    Srivastava, Amit
    Kumar, Ajai
    2016 CONFERENCE ON ADVANCES IN SIGNAL PROCESSING (CASP), 2016, : 226 - 230
  • [6] Unsupervised features from text for speech synthesis in a speech-to-speech translation system
    Watts, Oliver
    Zhou, Bowen
    12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 2164 - 2167
  • [7] Unsupervised Methods for Evaluating Speech Representations
    Gump, Michael
    Hsu, Wei-Ning
    Glass, James
    INTERSPEECH 2020, 2020, : 170 - 174
  • [8] Applications of Language Modeling in Speech-To-Speech Translation
    Liu, Fu-Hua
    Gu, Liang
    Gao, Yuqing
    Picheny, Michael
    International Journal of Speech Technology, 2004, 7 (2-3) : 221 - 229
  • [9] Unsupervised training for Farsi-English speech-to-speech translation
    Xiang, Bing
    Deng, Yonggang
    Gao, Yuqing
    2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-12, 2008, : 4977 - 4980
  • [10] TOWARDS UNSUPERVISED SPEECH-TO-TEXT TRANSLATION
    Chung, Yu-An
    Weng, Wei-Hung
    Tong, Schrasing
    Glass, James
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7170 - 7174