The First Vietnamese FOSD-Tacotron-2-based Text-to-Speech Model Dataset

Cited by: 4
Authors
Tran, Duc Chung [1]
Affiliations
[1] FPT Univ, Comp Fundamental Dept, Hoa Lac Hi Tech Pk, Hanoi 155300, Vietnam
Source
DATA IN BRIEF | 2020 / Vol. 31
Keywords
Text-to-speech; Natural language processing; Natural language generation; Vietnamese; Speech; Dataset; Tacotron; WaveNet
DOI
10.1016/j.dib.2020.105775
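Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]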
Discipline classification code
07; 0710; 09
Abstract
Recent trends in voicebot application development have enabled the use of both speech-to-text and text-to-speech (TTS) generation techniques. To generate a voice response to a given utterance, one needs a TTS engine. Recently developed TTS engines are shifting towards end-to-end approaches that use models such as Tacotron, Tacotron2, WaveNet, and WaveGlow. This shift lets a TTS service provider focus on developing training and validation datasets comprising labelled texts and recorded speech, instead of designing an entirely new model that outperforms the others, which is time-consuming and costly. In this context, this work introduces the first Vietnamese FPT Open Speech Data (FOSD)-Tacotron-2-based TTS model dataset. The dataset comprises a configuration file in *.json format; training and validation text input files (in *.csv format); a 225,000-step checkpoint of the trained model; and several sample generated audio files. The published dataset is valuable as a reference model for benchmarking against other newly developed TTS models and engines. In addition, it opens an entirely new TTS research optimization problem: how to effectively generate speech from text given a black-box (trained) TTS model and its training and validation input texts. (C) 2020 The Author(s). Published by Elsevier Inc.
Pages: 5