Multimodal Unsupervised Speech Translation for Recognizing and Evaluating Second Language Speech

Cited by: 4
Authors
Lee, Yun Kyung [1]
Park, Jeon Gue [1]
Affiliation
[1] Elect & Telecommun Res Inst ETRI, Artificial Intelligence Res Lab, Daejeon 34129, South Korea
Source
APPLIED SCIENCES-BASEL | 2021 / Vol. 11 / No. 6
Keywords
fluency evaluation; speech recognition; data augmentation; variational autoencoder; speech conversion; NONPARALLEL VOICE CONVERSION; BLIND SEPARATION; RECOGNITION;
DOI
10.3390/app11062642
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract
This paper addresses automatic proficiency evaluation and speech recognition for second language (L2) speech. The proposed method recognizes speech uttered by an L2 speaker, measures a variety of fluency scores, and evaluates the proficiency of the speaker's spoken English. Stress and rhythm scores are among the important factors used to evaluate fluency in spoken English; they are computed by comparing the speaker's stress patterns and rhythm distributions with those of native speakers. To compute the stress and rhythm scores even when the phonemic sequence of the L2 speaker's English sentence differs from the native speaker's, we align the phonemic sequences using a dynamic time warping (DTW) approach. We also improve speech recognition performance for non-native speakers, and compute fluency features more accurately, by augmenting the non-native training dataset and training an acoustic model with the augmented dataset. In this work, we augment non-native speech by converting some characteristics of the speech signal (its style) while preserving its linguistic information. The proposed variational autoencoder (VAE)-based speech conversion network trains the conversion model by decomposing the spectral features of speech into a speaker-invariant content factor and a speaker-specific style factor, in order to estimate diverse and robust speech styles. Experimental results show that the proposed method effectively measures the fluency scores and generates diverse output signals. In addition, in the proficiency evaluation and speech recognition tests, the proposed method improves proficiency-score performance and speech recognition accuracy in all proficiency areas compared with a method using conventional acoustic models.
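The abstract does not state the local cost or the exact procedure used for the DTW alignment of phonemic sequences. As a rough illustration only, the following sketch aligns a native reference phoneme sequence with an L2 learner's sequence using classic DTW and an assumed 0/1 phoneme-mismatch cost. The function names `phone_cost` and `dtw_align` and the ARPAbet-style example phonemes are hypothetical and are not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def phone_cost(a: str, b: str) -> float:
    # Hypothetical local cost (assumption): 0 for identical phonemes, 1 otherwise.
    return 0.0 if a == b else 1.0


def dtw_align(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two phoneme sequences with classic DTW.

    Returns the total accumulated cost and the warping path as
    (ref_index, hyp_index) pairs from the start to the end of both sequences.
    """
    n, m = len(ref), len(hyp)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # D[i][j] = minimal accumulated cost of aligning ref[:i] with hyp[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = phone_cost(ref[i - 1], hyp[j - 1]) + min(
                D[i - 1][j - 1],  # advance both sequences
                D[i - 1][j],      # advance ref only (hyp phoneme reused)
                D[i][j - 1],      # advance hyp only (ref phoneme reused)
            )
    # Backtrack from (n, m) to (1, 1), always taking the cheapest predecessor.
    path = [(n - 1, m - 1)]
    i, j = n, m
    while (i, j) != (1, 1):
        candidates = []
        if i > 1 and j > 1:
            candidates.append((D[i - 1][j - 1], i - 1, j - 1))
        if i > 1:
            candidates.append((D[i - 1][j], i - 1, j))
        if j > 1:
            candidates.append((D[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return D[n][m], path


if __name__ == "__main__":
    native = ["DH", "AH", "K", "AE", "T"]   # "the cat" (illustrative phonemes)
    learner = ["D", "AH", "K", "AE", "T"]   # learner substitutes D for DH
    cost, path = dtw_align(native, learner)
    print(cost, path)  # 1.0 [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
```

Each (reference, learner) index pair in the returned path could then be used to compare per-phoneme quantities, such as stress or duration, between the two speakers even when the sequences differ in length. A phonetically informed substitution cost (for example, smaller penalties for similar phonemes) would likely give more useful alignments than the 0/1 cost assumed here.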
Pages: 16