Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab and Convolutional Recurrent Neural Networks

Cited: 0
Authors
Gao, Yingming [1 ]
Birkholz, Peter [2 ]
Li, Ya [1 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Beijing 100876, Peoples R China
[2] Tech Univ Dresden, Inst Acoust & Speech Commun, D-01069 Dresden, Germany
Keywords
Acoustics; Speech processing; Trajectory; Training; Synthesizers; Long short-term memory; Data models; Speech inversion; Copy synthesis; Articulatory synthesis; VocalTractLab (VTL); Convolutional recurrent neural networks; Features
DOI
10.1109/TASLP.2024.3372874
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Articulatory copy synthesis (ACS) refers to the synthetic reproduction of natural utterances. Existing ACS methods suffer from poor generalizability to unknown speakers, high computational costs, and a lack of systematic evaluation. Here we propose an ACS method based on the articulatory speech synthesizer VocalTractLab (VTL) and convolutional recurrent neural networks. We first created paired articulatory-acoustic samples using VTL, and then trained neural-network-based ACS models with acoustic features as inputs and articulatory trajectories as outputs. The basic training approach relied on fully synthetic data and was later supplemented with natural speech and corresponding synthetic articulatory data. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by varying the phonation type, speaking effort, and vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on a smoothness loss on the articulatory trajectories and another based on an acoustic loss between the original and estimated acoustic features. For new utterances of arbitrary length, the trained ACS models estimate articulatory trajectories that are then fed into VTL to synthesize new speech. Experiments showed that the proposed ACS method achieved an average correlation coefficient of 0.983 between the reference and estimated VTL articulatory parameters for speaker-dependent German utterances. When applied to speaker-independent German, English, and Mandarin Chinese utterances, the copy-synthesized speech achieved recognition rates of 73.88%, 52.92%, and 52.41%, respectively, using the automatic speech recognizer Google Speech-to-Text.
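To illustrate the kind of acoustic-to-articulatory mapping and smoothness regularization described in the abstract, the sketch below shows a small convolutional recurrent network that maps acoustic feature frames to VTL-style articulatory trajectories and adds a frame-difference penalty to the regression loss. It is a minimal sketch only: the use of PyTorch, the 40-dimensional acoustic features, the 30 articulatory parameters, the layer sizes, and the loss weight lambda_smooth are all assumptions for illustration, not the authors' exact configuration, and the paper's second regularizer (the acoustic loss between original and estimated features) is not reproduced here because it would require a differentiable acoustic model of the synthesizer.

import torch
import torch.nn as nn

class ConvRecurrentACS(nn.Module):
    # Convolution layers capture local spectral patterns; a bidirectional
    # LSTM models the temporal evolution of the articulatory trajectories.
    def __init__(self, n_acoustic=40, n_articulatory=30, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_acoustic, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.rnn = nn.LSTM(128, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_articulatory)

    def forward(self, acoustic):                      # acoustic: (batch, time, n_acoustic)
        x = self.conv(acoustic.transpose(1, 2)).transpose(1, 2)
        x, _ = self.rnn(x)
        return self.out(x)                            # (batch, time, n_articulatory)

def smoothness_loss(traj):
    # Penalize large frame-to-frame differences in the predicted trajectories.
    return (traj[:, 1:, :] - traj[:, :-1, :]).pow(2).mean()

def training_loss(pred, target, lambda_smooth=0.1):   # lambda_smooth: hypothetical weight
    # Regression loss on the VTL parameters plus the smoothness regularizer.
    return nn.functional.mse_loss(pred, target) + lambda_smooth * smoothness_loss(pred)

# Usage with random tensors of the assumed shapes:
model = ConvRecurrentACS()
acoustic = torch.randn(2, 100, 40)                    # 2 utterances, 100 frames each
target = torch.randn(2, 100, 30)                      # reference VTL trajectories
loss = training_loss(model(acoustic), target)

At inference time, the predicted trajectories would be passed to VTL to resynthesize the utterance, mirroring the copy-synthesis pipeline the abstract describes.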
Pages: 1845-1858
Number of pages: 14
Related Papers (50 total)
  • [1] Mitra, Vikramjit; Sivaraman, Ganesh; Nam, Hosung; Espy-Wilson, Carol; Saltzman, Elliot; Tiede, Mark. Hybrid convolutional neural networks for articulatory and acoustic information based speech recognition. Speech Communication, 2017, 89: 103-112.
  • [2] Rahim, M. G.; Goodyear, C. C.; Kleijn, W. B.; Schroeter, J.; Sondhi, M. M. On the use of neural networks in articulatory speech synthesis. Journal of the Acoustical Society of America, 1993, 93(2): 1109-1121.
  • [3] Zhao, Han; Zarar, Shuayb; Tashev, Ivan; Lee, Chin-Hui. Convolutional-recurrent neural networks for speech enhancement. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: 2401-2405.
  • [4] Birkholz, Peter. Control of an articulatory speech synthesizer based on dynamic approximation of spatial articulatory targets. Interspeech 2007: 8th Annual Conference of the International Speech Communication Association, Vols 1-4, 2007: 629-632.
  • [5] Meyer, Patrick; Xu, Ziyi; Fingscheidt, Tim. Improving convolutional recurrent neural networks for speech emotion recognition. 2021 IEEE Spoken Language Technology Workshop (SLT), 2021: 365-372.
  • [6] Ushio, Takashi; Shi, Hongjie; Endo, Mitsuru; Yamagami, Katsuyoshi; Horii, Noriaki. Recurrent convolutional neural networks for structured speech act tagging. 2016 IEEE Workshop on Spoken Language Technology (SLT 2016), 2016: 518-524.
  • [7] Lim, Wootaek; Jang, Daeyoung; Lee, Taejin. Speech emotion recognition using convolutional and recurrent neural networks. 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016.
  • [8] Zhang, Jing; Zhu, Wenting; Li, Bing; Hu, Weiming; Yang, Jinfeng. Image copy detection based on convolutional neural networks. Pattern Recognition (CCPR 2016), Pt II, 2016, 663: 111-121.
  • [9] Du Guiming; Wang Xia; Wang Guangyan; Zhang Yan; Li Dan. Speech recognition based on convolutional neural networks. 2016 IEEE International Conference on Signal and Image Processing (ICSIP), 2016: 708-711.
  • [10] Moliner Juanpere, Eloi; Csapo, Tamas Gabor. Ultrasound-based silent speech interface using convolutional and recurrent neural networks. Acta Acustica united with Acustica, 2019, 105(4): 587-590.